# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
        |
        | HTTPS (terminated externally)
        v
[Host reverse proxy (external to this repo)]
        |
        | HTTP -> localhost:3000
        v
[chat-ui container: Open WebUI]
        |
        | HTTP (docker internal network)
        v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
        |
        | reads model weights/cache
        v
[Hugging Face cache + local models dir]
        |
        | ROCm runtime
        v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### `gemma3-vllm`

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: run the Gemma 3 instruction model behind an OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`

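As a rough sketch, the bullets above correspond to a Compose service along these lines. The keys shown are taken from the list; the `${BACKEND_PORT:-8000}` default-fallback syntax is an assumption, and any `command`/model arguments the real compose file passes to vLLM are omitted here:

```yaml
# Sketch of the gemma3-vllm service; only keys documented above are shown.
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"   # host:container, default 8000:8000
    devices:
      - /dev/kfd                        # ROCm compute interface
      - /dev/dri                        # GPU render nodes
    group_add:
      - video
    cap_add:
      - SYS_PTRACE                      # ROCm debugging compatibility
    security_opt:
      - seccomp=unconfined
```
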
### `chat-ui`

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat experience with local persistence in a mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`

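A minimal sketch of the corresponding Compose service, assuming the variable names documented above; the `depends_on` ordering and the `${FRONTEND_PORT:-3000}` fallback are illustrative assumptions:

```yaml
# Sketch of the chat-ui service; treat as illustrative, not the actual file.
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT:-3000}:8080"   # host:container, default 3000:8080
    environment:
      # Points Open WebUI at the vLLM backend by Compose service name.
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
    depends_on:
      - gemma3-vllm                     # assumption: start backend first
```
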
## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports:
  - API: `localhost:8000`
  - UI: `localhost:3000`

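Assuming the stack is running with the default host ports above, reachability can be sanity-checked from the host; `/v1/models` is the standard model-listing route of an OpenAI-compatible API:

```shell
# Backend: list the models served by vLLM's OpenAI-compatible API.
curl -s http://localhost:8000/v1/models

# Frontend: Open WebUI should answer on the mapped host port.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/
```

Both commands require the containers to be up; on a fresh start the backend may take a while to load model weights before responding.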
## Storage

- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`

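The mounts above map onto Compose `volumes` entries roughly as follows. Which service carries each mount is inferred from the component flow (the model cache on the backend, the data directory on the UI), so treat this as a sketch:

```yaml
# Sketch of the bind mounts; host paths come from the environment variables above.
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface  # downloaded weights/cache
      - ./models:/models                                   # optional local models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data           # chats, settings, users
```
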
## Scaling notes

This repository is designed for **single-node deployment** on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.