# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
          |
          | HTTPS (terminated externally)
          v
[Host reverse proxy (external to this repo)]
          |
          | HTTP -> localhost:3000
          v
[chat-ui container: Open WebUI]
          |
          | HTTP (docker internal network)
          v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
          |
          | reads model weights/cache
          v
[Hugging Face cache + local models dir]
          |
          | ROCm runtime
          v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### gemma3-vllm

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: run the Gemma 3 instruction-tuned model behind an OpenAI-compatible API (a compose sketch of this service follows the list).
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`
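
A minimal `docker-compose.yml` sketch of this service, assembled from the bullets above plus the mounts listed under Storage; the `command` line and the `google/gemma-3-4b-it` model id are illustrative placeholders, not values taken from this repo:

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT}:8000"              # default 8000:8000
    devices:
      - /dev/kfd                            # ROCm compute device
      - /dev/dri                            # GPU render nodes
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    group_add:
      - video
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
    command: ["--model", "google/gemma-3-4b-it"]   # placeholder model id and flags
```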

### chat-ui

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat experience with local persistence in a mounted data directory (see the compose sketch after this list).
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the Docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`
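
A matching sketch for the UI service; `depends_on` is an assumption about start ordering and is not stated above:

```yaml
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT}:8080"             # default 3000:8080
    environment:
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
    depends_on:
      - gemma3-vllm                         # assumption: start the backend first
```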

## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports (see the sketch after this list):
  - API: `localhost:8000`
  - UI: `localhost:3000`
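
To make the two access paths concrete, a sketch of the relevant compose fragments using the default port values quoted in the Services section:

```yaml
services:
  gemma3-vllm:
    ports:
      - "8000:8000"     # external path: localhost:8000 -> vLLM API
  chat-ui:
    ports:
      - "3000:8080"     # external path: localhost:3000 -> Open WebUI
    environment:
      # internal path: service-name DNS over the default bridge network
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```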

## Storage

- Hugging Face cache bind mount (see the sketch after this list):
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`
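
A sketch of the three bind mounts as compose `volumes` entries; which service owns each mount follows the component flow diagram rather than an explicit statement in this list:

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface   # Hugging Face download cache
      - ./models:/models                                    # optional local model weights
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data            # Open WebUI state
```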

## Scaling notes

This repository is designed for single-node deployment on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning (a sketch follows this list).
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.
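
As a sketch of the pinned-tag and engine-tuning point, using vLLM's standard `--max-model-len` and `--gpu-memory-utilization` flags; the tag placeholder, model id, and values are illustrative, not recommendations from this repo:

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:<pinned-version>   # pin a tested release instead of :latest
    command:
      - "--model"
      - "google/gemma-3-4b-it"          # placeholder model id
      - "--max-model-len"
      - "8192"                          # cap context length to fit available VRAM
      - "--gpu-memory-utilization"
      - "0.90"                          # fraction of VRAM vLLM may reserve
```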