# gemma3-vllm-stack
A production-ready, self-hosted stack for running **Gemma 3** with **vLLM** on AMD ROCm, plus a browser chat UI suitable for publishing at `chat.bhatfamily.in`.
## What this stack provides
- Dockerized **vLLM OpenAI-compatible API** (`/v1`) backed by Gemma 3 on ROCm.
- Dockerized **Open WebUI** chat frontend connected to the local vLLM endpoint.
- Non-interactive scripts for install, restart, uninstall, and smoke testing.
- Documentation for operations, upgrades, and troubleshooting.
## Repository layout
```text
gemma3-vllm-stack/
├── .env.example
├── .gitignore
├── docker-compose.yml
├── README.md
├── backend/
│   ├── Dockerfile
│   └── config/
│       └── model.env.example
├── frontend/
│   ├── Dockerfile
│   └── config/
│       └── frontend.env.example
├── scripts/
│   ├── install.sh
│   ├── restart.sh
│   ├── test_api.sh
│   ├── test_python_client.py
│   ├── test_ui.sh
│   └── uninstall.sh
└── docs/
    ├── ARCHITECTURE.md
    ├── TROUBLESHOOTING.md
    └── UPGRADE_NOTES.md
```
## Architecture summary
- `gemma3-vllm` service runs `vllm/vllm-openai-rocm` and exposes `http://localhost:${BACKEND_PORT}/v1`.
- `chat-ui` service runs Open WebUI and exposes `http://localhost:${FRONTEND_PORT}`.
- Open WebUI calls `http://gemma3-vllm:8000/v1` on the internal Docker network.
Detailed architecture: `docs/ARCHITECTURE.md`.
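Both host-exposed endpoints can be probed with a short loop. This is a sketch, not part of the shipped scripts; it assumes the default port values documented below and the `VLLM_API_KEY` variable from `.env` (the header is simply ignored by endpoints that do not require it):

```shell
#!/bin/sh
# Probe both services from the host; curl reports HTTP 000 when a
# service is not reachable on the given port.
BACKEND_PORT="${BACKEND_PORT:-8000}"
FRONTEND_PORT="${FRONTEND_PORT:-3000}"
for url in "http://localhost:${BACKEND_PORT}/v1/models" "http://localhost:${FRONTEND_PORT}"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -H "Authorization: Bearer ${VLLM_API_KEY:-changeme}" "$url")
  echo "${url} -> HTTP ${code}"
done
```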
## Prerequisites
- Ubuntu 22.04 LTS (amd64)
- AMD ROCm-compatible GPU setup with:
  - `/dev/kfd`
  - `/dev/dri`
- Docker Engine and the docker compose plugin (the install script installs them automatically on Ubuntu if missing)
- A Hugging Face token with access to the Gemma 3 model (set as `HF_TOKEN`)
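The prerequisites above can be checked non-destructively with a short preflight sketch. It only reports and installs nothing; the device paths and commands it checks are the ones named in this README:

```shell
#!/bin/sh
# Preflight sketch: report on the prerequisites without changing anything.
CHECKED=0
for item in /dev/kfd /dev/dri; do
  CHECKED=$((CHECKED + 1))
  [ -e "$item" ] && echo "ok: $item" || echo "missing: $item (ROCm device node)"
done
for cmd in docker git; do
  CHECKED=$((CHECKED + 1))
  command -v "$cmd" >/dev/null 2>&1 && echo "ok: $cmd" || echo "missing: $cmd"
done
[ -n "${HF_TOKEN:-}" ] && echo "ok: HF_TOKEN is set" || echo "note: HF_TOKEN is not set"
echo "preflight: ${CHECKED} items checked"
```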
## Quickstart
1. Clone from your Gitea server:
```bash
git clone ssh://git@git.bhatfamily.in/rbhat/gemma3-vllm-stack.git
cd gemma3-vllm-stack
```
2. Create the main configuration file:
```bash
cp .env.example .env
```
3. Edit `.env` and set at least:
   - `HF_TOKEN`
   - `VLLM_API_KEY` (recommended even on LAN)
   - `GEMMA_MODEL_ID`

   `backend/config/model.env` and `frontend/config/frontend.env` are auto-synced from `.env` by `scripts/install.sh` and `scripts/restart.sh`.
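A minimal `.env` might look like the sketch below. Only the variable names are taken from this README; every value is a placeholder, and the model id shown is an assumption — `.env.example` is the authoritative template:

```text
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
VLLM_API_KEY=change-me-even-on-lan
# the model id below is an illustrative placeholder
GEMMA_MODEL_ID=google/gemma-3-4b-it
BACKEND_PORT=8000
FRONTEND_PORT=3000
```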
4. Install/start stack:
```bash
./scripts/install.sh
```
5. Run smoke tests:
```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```
6. Open the UI in a browser:
   - `http://localhost:3000`
   - Reverse-proxy externally to `https://chat.bhatfamily.in`
## Operations
- Restart stack:
```bash
./scripts/restart.sh
```
- View logs:
```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
```
- Stop and remove stack resources:
```bash
./scripts/uninstall.sh
```
- Stop/remove stack and purge local cache/model/UI data:
```bash
./scripts/uninstall.sh --purge
```
## Upgrade workflow
```bash
git pull
docker compose pull
./scripts/restart.sh
```
More details: `docs/UPGRADE_NOTES.md`.
## Default endpoints
- API base URL: `http://localhost:8000/v1`
- UI URL: `http://localhost:3000`
Adjust using `.env`:
- `BACKEND_PORT`
- `FRONTEND_PORT`
- `GEMMA_MODEL_ID`
## Notes for `chat.bhatfamily.in`
This repository intentionally does not terminate TLS. Bindings are plain HTTP on host ports and are designed for external reverse proxy + TLS handling (nginx/Caddy/Cloudflare Tunnel).
Current homelab edge mapping (verified 2026-04-19):
- `https://chat.bhatfamily.in` is served on `443/tcp` by the shared Caddy edge.
- Direct host ports `3000/tcp` (UI) and `8000/tcp` (API) are currently publicly reachable; restrict firewall/NAT exposure if direct internet access is not intended.
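One way to restrict that exposure is sketched below. It assumes `ufw` is the host firewall and uses `192.168.1.0/24` as a stand-in for the trusted LAN — both are assumptions to adapt. The script only prints the candidate rules; review them, then run each line as root:

```shell
#!/bin/sh
# Dry-run sketch: print ufw rules that would limit the UI/API host ports
# to a trusted LAN subnet. Nothing is applied.
LAN_CIDR="192.168.1.0/24"   # assumption: replace with your actual LAN subnet
RULES=""
for port in 3000 8000; do
  RULES="${RULES}ufw allow from ${LAN_CIDR} to any port ${port} proto tcp
ufw deny ${port}/tcp
"
done
printf '%s' "$RULES"
```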