# Architecture

## Component flow

```text
[Browser @ chat.bhatfamily.in]
          |
          | HTTPS (terminated externally)
          v
[Host reverse proxy (external to this repo)]
          |
          | HTTP -> localhost:3000
          v
[chat-ui container: Open WebUI]
          |
          | HTTP (docker internal network)
          v
[gemma3-vllm container: vLLM OpenAI API @ :8000/v1]
          |
          | reads model weights/cache
          v
[Hugging Face cache + local models dir]
          |
          | ROCm runtime
          v
[AMD Radeon 780M (RDNA3 iGPU) via /dev/kfd + /dev/dri]
```

## Services

### gemma3-vllm

- Image: `vllm/vllm-openai-rocm:latest`
- Purpose: run the Gemma 3 instruction-tuned model behind an OpenAI-compatible API (a compose sketch of this service follows the list).
- Host port mapping: `${BACKEND_PORT}:8000` (default `8000:8000`)
- Device passthrough:
  - `/dev/kfd`
  - `/dev/dri`
- Security/capabilities for ROCm debugging compatibility:
  - `cap_add: SYS_PTRACE`
  - `security_opt: seccomp=unconfined`
  - `group_add: video`
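
A minimal `docker-compose.yml` sketch of this service, assembled from the bullets above plus the mounts listed under Storage; the `command` line and the `google/gemma-3-4b-it` model id are illustrative placeholders, not values taken from this repo:

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:latest
    ports:
      - "${BACKEND_PORT}:8000"              # default 8000:8000
    devices:
      - /dev/kfd                            # ROCm compute device
      - /dev/dri                            # GPU render nodes
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    group_add:
      - video
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface
      - ./models:/models
    command: ["--model", "google/gemma-3-4b-it"]   # placeholder model id and flags
```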

### chat-ui

- Image: `ghcr.io/open-webui/open-webui:main`
- Purpose: browser chat experience with local persistence in a mounted data directory (see the compose sketch after this list).
- Host port mapping: `${FRONTEND_PORT}:8080` (default `3000:8080`)
- Upstream model endpoint on the Docker network:
  - `OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1`
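
A matching sketch for the UI service; `depends_on` is an assumption about start ordering and is not stated above:

```yaml
services:
  chat-ui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${FRONTEND_PORT}:8080"             # default 3000:8080
    environment:
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data
    depends_on:
      - gemma3-vllm                         # assumption: start the backend first
```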

## Networking

- The Docker Compose default bridge network is used.
- `chat-ui` resolves `gemma3-vllm` by service name.
- External access is via host ports (see the sketch after this list):
  - API: `localhost:8000`
  - UI: `localhost:3000`
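
To make the two access paths concrete, a sketch of the relevant compose fragments using the default port values quoted in the Services section:

```yaml
services:
  gemma3-vllm:
    ports:
      - "8000:8000"     # external path: localhost:8000 -> vLLM API
  chat-ui:
    ports:
      - "3000:8080"     # external path: localhost:3000 -> Open WebUI
    environment:
      # internal path: service-name DNS over the default bridge network
      - OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```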

## Storage

- Hugging Face cache bind mount (see the sketch after this list):
  - Host: `${HUGGINGFACE_CACHE_DIR}`
  - Container: `/root/.cache/huggingface`
- Optional local models directory:
  - Host: `./models`
  - Container: `/models`
- Open WebUI data:
  - Host: `${OPEN_WEBUI_DATA_DIR}`
  - Container: `/app/backend/data`
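
A sketch of the three bind mounts as compose `volumes` entries; which service owns each mount follows the component flow diagram rather than an explicit statement in this list:

```yaml
services:
  gemma3-vllm:
    volumes:
      - ${HUGGINGFACE_CACHE_DIR}:/root/.cache/huggingface   # Hugging Face download cache
      - ./models:/models                                    # optional local model weights
  chat-ui:
    volumes:
      - ${OPEN_WEBUI_DATA_DIR}:/app/backend/data            # Open WebUI state
```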

## Scaling notes

This repository is designed for single-node deployment on one AMD APU/GPU host.

For larger deployments later:

- Move to dedicated GPUs with larger VRAM.
- Use pinned vLLM image tags and explicit engine tuning (a sketch follows this list).
- Consider externalized model storage and distributed orchestration (Kubernetes/Swarm/Nomad).
- Add request routing, autoscaling, and centralized observability.
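
As a sketch of the pinned-tag and engine-tuning point, using vLLM's standard `--max-model-len` and `--gpu-memory-utilization` flags; the tag placeholder, model id, and values are illustrative, not recommendations from this repo:

```yaml
services:
  gemma3-vllm:
    image: vllm/vllm-openai-rocm:<pinned-version>   # pin a tested release instead of :latest
    command:
      - "--model"
      - "google/gemma-3-4b-it"          # placeholder model id
      - "--max-model-len"
      - "8192"                          # cap context length to fit available VRAM
      - "--gpu-memory-utilization"
      - "0.90"                          # fraction of VRAM vLLM may reserve
```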