Initial production-ready Gemma 3 vLLM ROCm stack
Co-Authored-By: Oz <oz-agent@warp.dev>
docs/ARCHITECTURE.md (new file, +72 lines)
# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
        |
        | HTTPS (terminated externally)
        v
[Host reverse proxy (external to this repo)]
        |
        | HTTP -> localhost:3000
        v
[chat-ui container: Open WebUI]
        |
        | HTTP (docker internal network)
        v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
        |
        | reads model weights/cache
        v
[Hugging Face cache + local models dir]
        |
        | ROCm runtime
        v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### `gemma3-vllm`

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: serve the Gemma 3 instruction-tuned model through vLLM's OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`

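The settings above can be sketched as a Compose service. This is a hedged sketch (the option values mirror the bullets above, but the repo's actual compose file may differ):

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"
    devices:
      - /dev/kfd
      - /dev/dri
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    group_add:
      - video
```
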
### `chat-ui`

### `chat-ui`

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat front end with local persistence in a mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`

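These settings can likewise be sketched as a Compose service (again a sketch, not the repo's definitive compose file):

```yaml
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT:-3000}:8080"
    environment:
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```
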
## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports:
  - API: `localhost:8000`
  - UI: `localhost:3000`

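The UI reaches the backend with standard OpenAI-style chat-completions calls. A minimal sketch that builds (but does not send) such a request; the model id is illustrative, not taken from this repo:

```python
import json
import urllib.request

# On the compose network the service name resolves; from the host,
# substitute http://localhost:8000 for the base URL.
BASE_URL = "http://gemma3-vllm:8000/v1"

payload = {
    "model": "google/gemma-3-4b-it",  # illustrative model id
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)      # http://gemma3-vllm:8000/v1/chat/completions
print(req.get_method())  # POST (a request with a body defaults to POST)
# urllib.request.urlopen(req) would send it once the stack is up.
```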
## Storage

- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`

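As Compose bind mounts, the mappings above look roughly like this (a sketch; the actual compose file may phrase them differently):

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```
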
## Scaling notes

This repository is designed for **single-node deployment** on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.
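
For the image-pinning point, the change is a one-line swap; the tag below is a placeholder, not a recommendation:

```yaml
services:
  gemma3-vllm:
    # Pin an explicit release instead of the moving `latest` tag.
    image: vllm/vllm-openai-rocm:<pinned-version>
```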