# Architecture
## Component flow
```text
[Browser @ chat.bhatfamily.in]
        |
        | HTTPS (terminated externally)
        v
[Host reverse proxy (external to this repo)]
        |
        | HTTP -> localhost:3000
        v
[chat-ui container: Open WebUI]
        |
        | HTTP (Docker internal network)
        v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
        |
        | reads model weights/cache
        v
[Hugging Face cache + local models dir]
        |
        | ROCm runtime
        v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```
## Services
### `gemma3-vllm`
- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: Serve the Gemma 3 instruction-tuned model through an OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`
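As a rough sketch, the bullets above would map onto a Compose service definition along these lines (standard Compose keys; the `:-8000` default mirrors the stated default mapping, though the repo's actual file may spell it differently):

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"   # host:container, default 8000:8000
    devices:                           # ROCm device passthrough
      - /dev/kfd
      - /dev/dri
    group_add:
      - video                          # GPU access for the container user
    cap_add:
      - SYS_PTRACE                     # ROCm debugging compatibility
    security_opt:
      - seccomp=unconfined
```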
### `chat-ui`
- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: Browser-based chat interface with local persistence in a mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the Docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`
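A corresponding sketch for this service (the `depends_on` ordering hint is an assumption, not stated above):

```yaml
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT:-3000}:8080"  # host:container, default 3000:8080
    environment:
      # Resolved by service name on the Compose network
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    depends_on:                        # assumed start-order hint
      - gemma3-vllm
```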
## Networking
- The default Docker Compose bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports (defaults shown):
  - API: `localhost:8000`
  - UI: `localhost:3000`
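Compose resolves `${VAR:-default}`-style interpolation the same way POSIX shells do, which is how the host ports above fall back to their defaults when `BACKEND_PORT`/`FRONTEND_PORT` are unset. A minimal shell sketch of that mechanic:

```shell
# Same expansion rule Compose applies to "${BACKEND_PORT:-8000}":
# use the environment value if set and non-empty, otherwise the default.
BACKEND_PORT="${BACKEND_PORT:-8000}"   # API host port, defaults to 8000
FRONTEND_PORT="${FRONTEND_PORT:-3000}" # UI host port, defaults to 3000
echo "API: localhost:${BACKEND_PORT}"
echo "UI:  localhost:${FRONTEND_PORT}"
```

Note the doc's `.env`-driven values (`${BACKEND_PORT}` without a default) rely on the same substitution; the `:-` form just makes the fallback explicit.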
## Storage
- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`
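As Compose `volumes` entries, these mounts would look roughly like the fragment below. Assigning the model mounts to `gemma3-vllm` and the data mount to `chat-ui` is an assumption inferred from the component flow diagram, not stated explicitly above:

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface  # HF download cache
      - ./models:/models                                   # optional local models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data           # chat history/config
```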
## Scaling notes
This repository is designed for **single-node deployment** on one AMD APU/GPU host.
For larger deployments later:
- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.