Initial production-ready Gemma 3 vLLM ROCm stack
Co-Authored-By: Oz <oz-agent@warp.dev>

docs/ARCHITECTURE.md (new file, 72 lines)
# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
    |
    | HTTPS (terminated externally)
    v
[Host reverse proxy (external to this repo)]
    |
    | HTTP -> localhost:3000
    v
[chat-ui container: Open WebUI]
    |
    | HTTP (docker internal network)
    v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
    |
    | reads model weights/cache
    v
[Hugging Face cache + local models dir]
    |
    | ROCm runtime
    v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### `gemma3-vllm`

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: run the Gemma 3 instruction model behind an OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`
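
Taken together, these settings correspond to a compose fragment along the following lines (a sketch for orientation only; the repository's actual `docker-compose.yml` is authoritative):

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
```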

### `chat-ui`

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat experience with local persistence in the mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`

## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports:
  - API: `localhost:8000`
  - UI: `localhost:3000`
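
A quick way to confirm the host-port half of this picture (a sketch; it assumes `curl` is installed and the stack is already up):

```shell
# Probe both host-side endpoints; prints one status line per URL.
for url in http://localhost:8000/v1/models http://localhost:3000; do
  if curl -s --max-time 5 -o /dev/null "$url"; then
    echo "reachable: $url"
  else
    echo "NOT reachable: $url"
  fi
done
```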
## Storage

- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`
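
In compose terms, those mounts map roughly as follows (a sketch; the variable names come from this document's `.env` references):

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```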

## Scaling notes

This repository is designed for **single-node deployment** on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.

docs/README.md (new file, 14 lines)
# Documentation Index

This folder contains operational and lifecycle documentation for the `gemma3-vllm-stack` repository.

## Files

- `ARCHITECTURE.md`: Component topology, networking, runtime dependencies, and scaling notes.
- `TROUBLESHOOTING.md`: Common failures and copy-paste diagnostics/fixes for ROCm, Docker, vLLM, and UI issues.
- `UPGRADE_NOTES.md`: Safe upgrade, rollback, and backup guidance.

## Recommended reading order

1. `ARCHITECTURE.md`
2. `TROUBLESHOOTING.md`
3. `UPGRADE_NOTES.md`

For quick start and day-1 usage, use the repository root `README.md`.

docs/TROUBLESHOOTING.md (new file, 172 lines)
# Troubleshooting

## ROCm devices not visible on the host

Symptoms:

- `/dev/kfd` missing
- `/dev/dri` missing
- vLLM fails to start with ROCm device errors

Checks:

```bash
ls -l /dev/kfd /dev/dri
id
getent group video
```

Expected:

- `/dev/kfd` exists
- `/dev/dri` directory exists
- the user belongs to the `video` group

Fixes:

```bash
sudo usermod -aG video "$USER"
newgrp video
```

Then verify ROCm tools:

```bash
rocminfo | sed -n '1,120p'
```

If ROCm is not healthy, fix the host ROCm installation first.

---

## Docker and Compose not available

Symptoms:

- `docker: command not found`
- `docker compose version` fails

Checks:

```bash
docker --version
docker compose version
```

Fix using the install script (Ubuntu):

```bash
./scripts/install.sh
```

Manual fallback:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
```

Log out and back in after the group change.

---

## vLLM container exits or fails healthchecks

Symptoms:

- `gemma3-vllm` keeps restarting
- API endpoint unavailable

Checks:

```bash
docker compose ps
docker compose logs --tail=200 gemma3-vllm
```

Common causes and fixes:

1. Missing/invalid Hugging Face token:

   ```bash
   grep -E '^(HF_TOKEN|GEMMA_MODEL_ID)=' .env
   ```

   Ensure `HF_TOKEN` is set to a valid token with access to Gemma 3.

2. Model ID typo:

   ```bash
   grep '^GEMMA_MODEL_ID=' .env
   ```

   Use a valid model ID, e.g. `google/gemma-3-1b-it`.

3. ROCm runtime/device issues:

   ```bash
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:22.04 bash -lc 'ls -l /dev/kfd /dev/dri'
   ```

   If the devices are not visible inside the container, fix the host ROCm setup first (see the ROCm section above).

4. API key mismatch between the backend and the UI/tests:

   ```bash
   grep -E '^(VLLM_API_KEY|OPENAI_API_BASE_URL)=' .env frontend/config/frontend.env 2>/dev/null || true
   ```

   Keep the keys consistent.

---

## Out-of-memory (OOM) or low-VRAM errors

Symptoms:

- startup failure referencing memory allocation
- runtime generation failures

Checks:

```bash
docker compose logs --tail=300 gemma3-vllm | grep -Ei 'out of memory|oom|memory|cuda|hip|rocm'
```

Mitigations:

1. Reduce the context length in `.env`:

   ```bash
   VLLM_MAX_MODEL_LEN=2048
   ```

2. Lower the GPU memory utilization target:

   ```bash
   VLLM_GPU_MEMORY_UTILIZATION=0.75
   ```

3. Use a smaller Gemma 3 variant in `.env`.

4. Restart the stack:

   ```bash
   ./scripts/restart.sh
   ```

---

## UI loads but cannot reach the vLLM backend

Symptoms:

- The browser opens the UI, but chat requests fail.

Checks:

```bash
docker compose ps
docker compose logs --tail=200 chat-ui
docker compose logs --tail=200 gemma3-vllm
```

Verify the frontend's backend URL:

```bash
grep -E '^OPENAI_API_BASE_URL=' frontend/config/frontend.env
```

Expected value:

```text
OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```

Verify the API directly from the host:

```bash
./scripts/test_api.sh
```

If the API works from the host but not from the UI, recreate the frontend:

```bash
docker compose up -d --force-recreate chat-ui
```

---

## Health checks and endpoint validation

Run all smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

If one fails, inspect the corresponding service logs and then restart:

```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
./scripts/restart.sh
```
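
If the helper scripts themselves are in doubt, a raw probe of the OpenAI-compatible endpoint works too (the `VLLM_API_KEY` variable name is an assumption; match whatever your `.env` uses):

```shell
# Lists the served models if the backend is up; prints a notice otherwise.
curl -s --max-time 5 \
  -H "Authorization: Bearer ${VLLM_API_KEY:-none}" \
  http://localhost:8000/v1/models \
  || echo "backend not reachable on localhost:8000"
```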

docs/UPGRADE_NOTES.md (new file, 50 lines)
# Upgrade Notes

## Standard safe upgrade path

From the repository root:

```bash
git pull
docker compose pull
./scripts/restart.sh
```

Then run the smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

## Versioning guidance

- Prefer pinning image tags in `docker-compose.yml` once your deployment is stable.
- Upgrading vLLM may change runtime defaults or engine behavior; check the vLLM release notes before major version jumps.
- Keep `GEMMA_MODEL_ID` explicit in `.env` to avoid unintentional model drift.
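
Pinning can be as simple as replacing the floating tag (the version shown is purely illustrative; pick the tag you actually tested):

```yaml
services:
  gemma3-vllm:
    # Pinned instead of :latest so upgrades are deliberate.
    image: vllm/vllm-openai-rocm:v0.6.3
```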
## Model upgrade considerations

When changing Gemma 3 variants (for example, from 1B to larger sizes):

- Verify host RAM and GPU memory capacity.
- Expect a re-download of model weights and larger disk usage.
- Re-tune:
  - `VLLM_MAX_MODEL_LEN`
  - `VLLM_GPU_MEMORY_UTILIZATION`
- Re-run the validation scripts after restart.

## Backup recommendations

Before major upgrades, back up local persistent data:

```bash
mkdir -p backups
tar -czf backups/hf-cache-$(date +%Y%m%d-%H%M%S).tar.gz "${HOME}/.cache/huggingface"
tar -czf backups/open-webui-data-$(date +%Y%m%d-%H%M%S).tar.gz frontend/data/open-webui
```

If you use locally predownloaded models:

```bash
tar -czf backups/models-$(date +%Y%m%d-%H%M%S).tar.gz models
```
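
To restore, reverse the operation (the archive name is a placeholder; stop the stack first so nothing is writing to the data directory):

```shell
# Extract the relative paths back into the repository root; this
# re-creates frontend/data/open-webui as it was at backup time.
tar -xzf backups/open-webui-data-YYYYMMDD-HHMMSS.tar.gz -C .
```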
## Rollback approach

If a new image/model combination fails:

1. Revert `docker-compose.yml` and `.env` to the previous known-good values.
2. Pull the previous pinned images (if pinned by tag/digest).
3. Restart:

   ```bash
   ./scripts/restart.sh
   ```

4. Re-run the smoke tests.