Initial production-ready Gemma 3 vLLM ROCm stack
Co-Authored-By: Oz <oz-agent@warp.dev>
docs/ARCHITECTURE.md (new file, +72 lines)
# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
        |
        | HTTPS (terminated externally)
        v
[Host reverse proxy (external to this repo)]
        |
        | HTTP -> localhost:3000
        v
[chat-ui container: Open WebUI]
        |
        | HTTP (docker internal network)
        v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
        |
        | reads model weights/cache
        v
[Hugging Face cache + local models dir]
        |
        | ROCm runtime
        v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### `gemma3-vllm`

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: serve the Gemma 3 instruction-tuned model through vLLM's OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`

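The settings above can be sketched as a Compose service. This is a hedged sketch (the option values mirror the bullets above, but the repo's actual compose file may differ):

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"
    devices:
      - /dev/kfd
      - /dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    group_add:
      - video
```
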
### `chat-ui`

### `chat-ui`

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat front end with local persistence in a mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`

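These settings can likewise be sketched as a Compose service (again a sketch, not the repo's definitive compose file):

```yaml
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT:-3000}:8080"
    environment:
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```
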
## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports:
  - API: `localhost:8000`
  - UI: `localhost:3000`

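The UI reaches the backend with standard OpenAI-style chat-completions calls. A minimal sketch that builds (but does not send) such a request; the model id is illustrative, not taken from this repo:

```python
import json
import urllib.request

# On the compose network the service name resolves; from the host,
# substitute http://localhost:8000 for the base URL.
BASE_URL = "http://gemma3-vllm:8000/v1"

payload = {
    "model": "google/gemma-3-4b-it",  # illustrative model id
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)      # http://gemma3-vllm:8000/v1/chat/completions
print(req.get_method())  # POST (a request with a body defaults to POST)
# urllib.request.urlopen(req) would send it once the stack is up.
```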
## Storage

- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`

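As Compose bind mounts, the mappings above look roughly like this (a sketch; the actual compose file may phrase them differently):

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```
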
## Scaling notes

This repository is designed for **single-node deployment** on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.
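
For the image-pinning point, the change is a one-line swap; the tag below is a placeholder, not a recommendation:

```yaml
services:
  gemma3-vllm:
    # Pin an explicit release instead of the moving `latest` tag.
    image: vllm/vllm-openai-rocm:<pinned-version>
```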