Initial production-ready Gemma 3 vLLM ROCm stack
Co-Authored-By: Oz <oz-agent@warp.dev>

docs/ARCHITECTURE.md (new file, 72 lines)
# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
    |
    | HTTPS (terminated externally)
    v
[Host reverse proxy (external to this repo)]
    |
    | HTTP -> localhost:3000
    v
[chat-ui container: Open WebUI]
    |
    | HTTP (docker internal network)
    v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
    |
    | reads model weights/cache
    v
[Hugging Face cache + local models dir]
    |
    | ROCm runtime
    v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### `gemma3-vllm`

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: run the Gemma 3 instruction model behind an OpenAI-compatible API.
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`
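
Taken together, these settings correspond to a compose fragment along the following lines (a sketch for orientation only; the repository's actual `docker-compose.yml` is authoritative):

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT:-8000}:8000"
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
```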

### `chat-ui`

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat experience with local persistence in the mounted data directory.
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`

## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports:
  - API: `localhost:8000`
  - UI: `localhost:3000`
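
A quick way to confirm the host-port half of this picture (a sketch; it assumes `curl` is installed and the stack is already up):

```shell
# Probe both host-side endpoints; prints one status line per URL.
for url in http://localhost:8000/v1/models http://localhost:3000; do
  if curl -s --max-time 5 -o /dev/null "$url"; then
    echo "reachable: $url"
  else
    echo "NOT reachable: $url"
  fi
done
```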
## Storage

- Hugging Face cache bind mount:
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`
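
In compose terms, those mounts map roughly as follows (a sketch; the variable names come from this document's `.env` references):

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
```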

## Scaling notes

This repository is designed for **single-node deployment** on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning.
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.

docs/README.md (new file, 14 lines)
# Documentation Index

This folder contains operational and lifecycle documentation for the `gemma3-vllm-stack` repository.

## Files

- `ARCHITECTURE.md`: Component topology, networking, runtime dependencies, and scaling notes.
- `TROUBLESHOOTING.md`: Common failures and copy-paste diagnostics/fixes for ROCm, Docker, vLLM, and UI issues.
- `UPGRADE_NOTES.md`: Safe upgrade, rollback, and backup guidance.

## Recommended reading order

1. `ARCHITECTURE.md`
2. `TROUBLESHOOTING.md`
3. `UPGRADE_NOTES.md`

For quick start and day-1 usage, use the repository root `README.md`.

docs/TROUBLESHOOTING.md (new file, 172 lines)
# Troubleshooting

## ROCm devices not visible on the host

Symptoms:

- `/dev/kfd` missing
- `/dev/dri` missing
- vLLM fails to start with ROCm device errors

Checks:

```bash
ls -l /dev/kfd /dev/dri
id
getent group video
```

Expected:

- `/dev/kfd` exists
- `/dev/dri` directory exists
- the user belongs to the `video` group

Fixes:

```bash
sudo usermod -aG video "$USER"
newgrp video
```

Then verify ROCm tools:

```bash
rocminfo | sed -n '1,120p'
```

If ROCm is not healthy, fix the host ROCm installation first.

---

## Docker and Compose not available

Symptoms:

- `docker: command not found`
- `docker compose version` fails

Checks:

```bash
docker --version
docker compose version
```

Fix using the install script (Ubuntu):

```bash
./scripts/install.sh
```

Manual fallback:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
```

Log out and back in after the group change.

---

## vLLM container exits or fails healthchecks

Symptoms:

- `gemma3-vllm` keeps restarting
- API endpoint unavailable

Checks:

```bash
docker compose ps
docker compose logs --tail=200 gemma3-vllm
```

Common causes and fixes:

1. Missing/invalid Hugging Face token:

   ```bash
   grep -E '^(HF_TOKEN|GEMMA_MODEL_ID)=' .env
   ```

   Ensure `HF_TOKEN` is set to a valid token with access to Gemma 3.

2. Model ID typo:

   ```bash
   grep '^GEMMA_MODEL_ID=' .env
   ```

   Use a valid model ID, e.g. `google/gemma-3-1b-it`.

3. ROCm runtime/device issues:

   ```bash
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:22.04 bash -lc 'ls -l /dev/kfd /dev/dri'
   ```

   If the devices are not visible inside the container, fix the host ROCm setup first (see the ROCm section above).

4. API key mismatch between the backend and the UI/tests:

   ```bash
   grep -E '^(VLLM_API_KEY|OPENAI_API_BASE_URL)=' .env frontend/config/frontend.env 2>/dev/null || true
   ```

   Keep the keys consistent.

---

## Out-of-memory (OOM) or low-VRAM errors

Symptoms:

- startup failure referencing memory allocation
- runtime generation failures

Checks:

```bash
docker compose logs --tail=300 gemma3-vllm | grep -Ei 'out of memory|oom|memory|cuda|hip|rocm'
```

Mitigations:

1. Reduce the context length in `.env`:

   ```bash
   VLLM_MAX_MODEL_LEN=2048
   ```

2. Lower the GPU memory utilization target:

   ```bash
   VLLM_GPU_MEMORY_UTILIZATION=0.75
   ```

3. Use a smaller Gemma 3 variant in `.env`.

4. Restart the stack:

   ```bash
   ./scripts/restart.sh
   ```

---

## UI loads but cannot reach the vLLM backend

Symptoms:

- The browser opens the UI, but chat requests fail.

Checks:

```bash
docker compose ps
docker compose logs --tail=200 chat-ui
docker compose logs --tail=200 gemma3-vllm
```

Verify the frontend's backend URL:

```bash
grep -E '^OPENAI_API_BASE_URL=' frontend/config/frontend.env
```

Expected value:

```text
OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```

Verify the API directly from the host:

```bash
./scripts/test_api.sh
```

If the API works from the host but not from the UI, recreate the frontend:

```bash
docker compose up -d --force-recreate chat-ui
```

---

## Health checks and endpoint validation

Run all smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

If one fails, inspect the corresponding service logs and then restart:

```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
./scripts/restart.sh
```
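
If the helper scripts themselves are in doubt, a raw probe of the OpenAI-compatible endpoint works too (the `VLLM_API_KEY` variable name is an assumption; match whatever your `.env` uses):

```shell
# Lists the served models if the backend is up; prints a notice otherwise.
curl -s --max-time 5 \
  -H "Authorization: Bearer ${VLLM_API_KEY:-none}" \
  http://localhost:8000/v1/models \
  || echo "backend not reachable on localhost:8000"
```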

docs/UPGRADE_NOTES.md (new file, 50 lines)
# Upgrade Notes

## Standard safe upgrade path

From the repository root:

```bash
git pull
docker compose pull
./scripts/restart.sh
```

Then run the smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

## Versioning guidance

- Prefer pinning image tags in `docker-compose.yml` once your deployment is stable.
- Upgrading vLLM may change runtime defaults or engine behavior; check the vLLM release notes before major version jumps.
- Keep `GEMMA_MODEL_ID` explicit in `.env` to avoid unintentional model drift.
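
Pinning can be as simple as replacing the floating tag (the version shown is purely illustrative; pick the tag you actually tested):

```yaml
services:
  gemma3-vllm:
    # Pinned instead of :latest so upgrades are deliberate.
    image: vllm/vllm-openai-rocm:v0.6.3
```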
## Model upgrade considerations

When changing Gemma 3 variants (for example, from 1B to larger sizes):

- Verify host RAM and GPU memory capacity.
- Expect a re-download of model weights and larger disk usage.
- Re-tune:
  - `VLLM_MAX_MODEL_LEN`
  - `VLLM_GPU_MEMORY_UTILIZATION`
- Re-run the validation scripts after restart.

## Backup recommendations

Before major upgrades, back up local persistent data:

```bash
mkdir -p backups
tar -czf backups/hf-cache-$(date +%Y%m%d-%H%M%S).tar.gz "${HOME}/.cache/huggingface"
tar -czf backups/open-webui-data-$(date +%Y%m%d-%H%M%S).tar.gz frontend/data/open-webui
```

If you use locally predownloaded models:

```bash
tar -czf backups/models-$(date +%Y%m%d-%H%M%S).tar.gz models
```
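
To restore, reverse the operation (the archive name is a placeholder; stop the stack first so nothing is writing to the data directory):

```shell
# Extract the relative paths back into the repository root; this
# re-creates frontend/data/open-webui as it was at backup time.
tar -xzf backups/open-webui-data-YYYYMMDD-HHMMSS.tar.gz -C .
```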
## Rollback approach

If a new image/model combination fails:

1. Revert `docker-compose.yml` and `.env` to the previous known-good values.
2. Pull the previous pinned images (if pinned by tag/digest).
3. Restart:

   ```bash
   ./scripts/restart.sh
   ```

4. Re-run the smoke tests.