Initial production-ready Gemma 3 vLLM ROCm stack
Co-Authored-By: Oz <oz-agent@warp.dev>

docs/TROUBLESHOOTING.md (new file, 172 lines)

@@ -0,0 +1,172 @@
# Troubleshooting

## ROCm devices not visible on the host

Symptoms:

- `/dev/kfd` missing
- `/dev/dri` missing
- vLLM fails to start with ROCm device errors

Checks:

```bash
ls -l /dev/kfd /dev/dri
id
getent group video
```

Expected:

- `/dev/kfd` exists
- `/dev/dri` directory exists
- user belongs to the `video` group

Fixes:

```bash
sudo usermod -aG video "$USER"
newgrp video
```

Note that `newgrp video` only affects the current shell; log out and back in to apply the group change everywhere.

Then verify ROCm tools:

```bash
rocminfo | sed -n '1,120p'
```

If ROCm is not healthy, fix the host ROCm installation first.
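The checks above can be rolled into a single helper. This is a sketch, not part of this repo's scripts; the device paths are parameterised only so the function can be exercised elsewhere, and the output format is illustrative:

```shell
# Sketch: summarise ROCm device and group prerequisites in one pass.
# kfd/dri paths default to the real device nodes but can be overridden.
check_rocm_devices() {
  local kfd="${1:-/dev/kfd}" dri="${2:-/dev/dri}" status=0
  [ -e "$kfd" ] && echo "ok: $kfd present" || { echo "missing: $kfd"; status=1; }
  [ -d "$dri" ] && echo "ok: $dri present" || { echo "missing: $dri"; status=1; }
  if id -nG | tr ' ' '\n' | grep -qx video; then
    echo "ok: current user is in the video group"
  else
    echo "missing: current user is not in the video group"
    status=1
  fi
  return "$status"
}
# Usage: check_rocm_devices   (non-zero exit means a prerequisite is missing)
```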

---

## Docker and Compose not available

Symptoms:

- `docker: command not found`
- `docker compose version` fails

Checks:

```bash
docker --version
docker compose version
```
Fix using the install script (Ubuntu):

```bash
./scripts/install.sh
```

Manual fallback:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
```

Log out and back in after the group change.
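To confirm the group change actually took effect in the current shell, a small illustrative helper (not part of the repo's scripts):

```shell
# Sketch: check whether a group is active for the current shell session.
in_group() { id -nG | tr ' ' '\n' | grep -qx "$1"; }

if in_group docker; then
  echo "docker group active"
else
  echo "re-login (or run 'newgrp docker') before using docker without sudo"
fi
```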

---

## vLLM container exits or fails health checks

Symptoms:

- `gemma3-vllm` keeps restarting
- API endpoint unavailable

Checks:

```bash
docker compose ps
docker compose logs --tail=200 gemma3-vllm
```
Common causes and fixes:

1. Missing or invalid Hugging Face token:

   ```bash
   grep -E '^(HF_TOKEN|GEMMA_MODEL_ID)=' .env
   ```

   Ensure `HF_TOKEN` is set to a valid token with access to Gemma 3.

2. Model ID typo:

   ```bash
   grep '^GEMMA_MODEL_ID=' .env
   ```

   Use a valid model ID, e.g. `google/gemma-3-1b-it`.

3. ROCm runtime/device issues:

   ```bash
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:22.04 bash -lc 'ls -l /dev/kfd /dev/dri'
   ```

4. API key mismatch between the backend and the UI/tests:

   ```bash
   grep -E '^(VLLM_API_KEY|OPENAI_API_BASE_URL)=' .env frontend/config/frontend.env 2>/dev/null || true
   ```

   Keep the keys consistent across services.
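The mismatch check can be made mechanical. This sketch compares `VLLM_API_KEY` across two env files (the file layout is assumed from the grep above; output format is illustrative):

```shell
# Sketch: read a variable from an env file and compare it across two files.
env_value() { grep -E "^$2=" "$1" 2>/dev/null | head -n1 | cut -d= -f2-; }

check_key_match() {
  local a b
  a=$(env_value "$1" VLLM_API_KEY)
  b=$(env_value "$2" VLLM_API_KEY)
  if [ "$a" = "$b" ] && [ -n "$a" ]; then
    echo "match"
  else
    echo "mismatch: '$a' vs '$b'"
  fi
}
# Usage: check_key_match .env frontend/config/frontend.env
```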

---

## Out-of-memory (OOM) or low-VRAM errors

Symptoms:

- startup failure referencing memory allocation
- runtime generation failures

Checks:

```bash
docker compose logs --tail=300 gemma3-vllm | grep -Ei 'out of memory|oom|memory|cuda|hip|rocm'
```
Mitigations:

1. Reduce the context length in `.env`:

   ```bash
   VLLM_MAX_MODEL_LEN=2048
   ```

2. Lower the GPU memory utilization target:

   ```bash
   VLLM_GPU_MEMORY_UTILIZATION=0.75
   ```

3. Use a smaller Gemma 3 variant in `.env`.

4. Restart the stack:

   ```bash
   ./scripts/restart.sh
   ```
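Assuming the compose service forwards these variables to vLLM's OpenAI-compatible server (an assumption about this stack's entrypoint, not confirmed by the compose file shown here), they correspond to the standard vLLM flags:

```bash
# Illustrative only: the equivalent vLLM server invocation inside the container.
python -m vllm.entrypoints.openai.api_server \
  --model "$GEMMA_MODEL_ID" \
  --max-model-len "$VLLM_MAX_MODEL_LEN" \
  --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION"
```

A smaller `--max-model-len` shrinks the KV cache, and a lower `--gpu-memory-utilization` leaves VRAM headroom for other processes.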

---

## UI loads but cannot reach the vLLM backend

Symptoms:

- The browser opens the UI, but chat requests fail.

Checks:

```bash
docker compose ps
docker compose logs --tail=200 chat-ui
docker compose logs --tail=200 gemma3-vllm
```

Verify the frontend's backend URL:

```bash
grep -E '^OPENAI_API_BASE_URL=' frontend/config/frontend.env
```

Expected value:

```text
OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```

Note that `gemma3-vllm` resolves only on the compose network, so this URL works from the UI container, not from the host.

Verify the API directly from the host:

```bash
./scripts/test_api.sh
```

If the API works from the host but not from the UI, recreate the frontend container:

```bash
docker compose up -d --force-recreate chat-ui
```
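A common failure mode is an `OPENAI_API_BASE_URL` missing the `/v1` suffix or the port. This sketch (hypothetical helper, not part of the repo) validates the expected shape before restarting anything:

```shell
# Sketch: validate the base-URL shape the UI expects (host, port, /v1 suffix).
valid_base_url() {
  printf '%s\n' "$1" | grep -Eq '^https?://[A-Za-z0-9._-]+:[0-9]+/v1$'
}

if valid_base_url "http://gemma3-vllm:8000/v1"; then
  echo "URL shape ok"
fi
```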

---

## Health checks and endpoint validation

Run all smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

If one fails, inspect the corresponding service logs, then restart:

```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
./scripts/restart.sh
```
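The three smoke tests can be driven by a small wrapper that reports a pass/fail summary. A sketch (the `run_checks` helper is hypothetical; the script paths come from the commands above):

```shell
# Sketch: run each check as a simple command and summarise pass/fail.
run_checks() {
  local failed=0 cmd
  for cmd in "$@"; do
    if $cmd >/dev/null 2>&1; then
      echo "PASS: $cmd"
    else
      echo "FAIL: $cmd"
      failed=$((failed + 1))
    fi
  done
  echo "$failed check(s) failed"
  return "$failed"
}

# Usage: run_checks ./scripts/test_api.sh ./scripts/test_ui.sh
```

The non-zero exit count makes the wrapper usable from CI as well as interactively.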