# Troubleshooting

## ROCm devices not visible on the host

Symptoms:

- `/dev/kfd` missing
- `/dev/dri` missing
- vLLM fails to start with ROCm device errors

Checks:

```bash
ls -l /dev/kfd /dev/dri
id
getent group video
```

Expected:

- `/dev/kfd` exists
- `/dev/dri` directory exists
- the current user belongs to the `video` group

Fixes:

```bash
sudo usermod -aG video "$USER"
newgrp video
```

On some distributions the `render` group is also required for access to `/dev/dri/renderD*`; add it the same way if device access still fails.

Then verify ROCm tools:

```bash
rocminfo | sed -n '1,120p'
```

If ROCm is not healthy, fix the host ROCm installation first.
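The device and group checks above can be rolled into a single pass; a minimal sketch (`check_rocm_access` is a hypothetical helper, not a repo script):

```bash
# Sketch: report each required device node plus video group membership in one pass.
check_rocm_access() {
  for dev in /dev/kfd /dev/dri; do
    if [ -e "$dev" ]; then echo "OK: $dev"; else echo "MISSING: $dev"; fi
  done
  if id -nG | tr ' ' '\n' | grep -qx video; then
    echo "OK: video group"
  else
    echo "MISSING: video group"
  fi
}
check_rocm_access
```

Any `MISSING` line points at the corresponding fix above.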

---

## Docker and Compose not available

Symptoms:

- `docker: command not found`
- `docker compose version` fails

Checks:

```bash
docker --version
docker compose version
```

Fix using the install script (Ubuntu):

```bash
./scripts/install.sh
```

Manual fallback:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# "jammy" is the Ubuntu 22.04 codename; substitute your release (e.g. $(lsb_release -cs))
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
```

Log out and back in for the group change to take effect.
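The new group is not visible to shells that were already open; a small sketch to confirm the current shell's state (`docker_group_status` is a hypothetical helper):

```bash
# Sketch: report whether the docker group is active in the current shell.
# (`newgrp docker` is a per-shell workaround until the next login.)
docker_group_status() {
  if id -nG | tr ' ' '\n' | grep -qx docker; then
    echo "docker group: active"
  else
    echo "docker group: not active yet (log out and back in)"
  fi
}
docker_group_status
```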

---

## vLLM container exits or fails healthchecks

Symptoms:

- `gemma3-vllm` restarting
- API endpoint unavailable

Checks:

```bash
docker compose ps
docker compose logs --tail=200 gemma3-vllm
```

Common causes and fixes:

1. Missing/invalid Hugging Face token:

   ```bash
   grep -E '^(HF_TOKEN|GEMMA_MODEL_ID)=' .env
   ```

   Ensure `HF_TOKEN` is set to a valid token with access to Gemma 3.

2. Model ID typo:

   ```bash
   grep '^GEMMA_MODEL_ID=' .env
   ```

   Use a valid model, e.g. `google/gemma-3-1b-it`.
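Both of the `.env` checks above can fail silently if an entry is present but empty; a hedged sketch of a validator (`check_env_var` is a hypothetical helper, not part of the repo's scripts):

```bash
# Hypothetical helper: report whether a .env entry exists and is non-empty.
check_env_var() {
  name="$1"; file="${2:-.env}"
  value=$(grep -E "^${name}=" "$file" 2>/dev/null | head -n1 | cut -d= -f2-)
  if [ -z "$value" ]; then
    echo "MISSING: $name"
    return 1
  fi
  echo "OK: $name"
}
```

Usage: `check_env_var HF_TOKEN` and `check_env_var GEMMA_MODEL_ID` (an optional second argument selects a different env file).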

3. ROCm runtime/device issues:

   ```bash
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:22.04 bash -lc 'ls -l /dev/kfd /dev/dri'
   ```

4. API key mismatch between backend and UI/tests:

   ```bash
   grep -E '^(VLLM_API_KEY|OPENAI_API_BASE_URL)=' .env frontend/config/frontend.env 2>/dev/null || true
   ```

   Keep the keys consistent across both files.

---

## Out-of-memory (OOM) or low VRAM errors

Symptoms:

- startup failure referencing memory allocation
- runtime generation failures

Checks:

```bash
docker compose logs --tail=300 gemma3-vllm | grep -Ei 'out of memory|oom|memory|cuda|hip|rocm'
```

Mitigations:

1. Reduce the context length in `.env`:

   ```bash
   VLLM_MAX_MODEL_LEN=2048
   ```

2. Lower the GPU memory utilization target:

   ```bash
   VLLM_GPU_MEMORY_UTILIZATION=0.75
   ```

3. Use a smaller Gemma 3 variant in `.env`.

4. Restart the stack:

   ```bash
   ./scripts/restart.sh
   ```
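To see why reducing the context length helps, a rough back-of-envelope estimate of KV-cache size per sequence; the layer/head/dimension numbers below are illustrative placeholders, not taken from any specific Gemma 3 config:

```bash
# Rough sketch: per-sequence KV-cache size, assuming fp16 (2 bytes per value).
# layers/kv_heads/head_dim are placeholder values -- substitute your model's.
awk -v layers=26 -v kv_heads=4 -v head_dim=256 -v ctx=2048 'BEGIN {
  bytes = 2 * 2 * layers * kv_heads * head_dim * ctx   # K and V tensors, fp16
  printf "KV cache per sequence: %.1f MiB\n", bytes / (1024 * 1024)
}'
```

With these example numbers it prints `KV cache per sequence: 208.0 MiB`; the figure scales linearly with `ctx`, so halving `VLLM_MAX_MODEL_LEN` halves it.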

---

## UI loads but cannot reach the vLLM backend

Symptoms:

- The browser opens the UI, but chat requests fail.

Checks:

```bash
docker compose ps
docker compose logs --tail=200 chat-ui
docker compose logs --tail=200 gemma3-vllm
```

Verify the frontend's backend URL:

```bash
grep -E '^OPENAI_API_BASE_URL=' frontend/config/frontend.env
```

Expected value:

```text
OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```
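The eyeball comparison above can be scripted; a minimal sketch (`check_backend_url` is a hypothetical helper; the expected URL and file path come from this section):

```bash
# Sketch: compare the configured backend URL against the expected in-network address.
check_backend_url() {
  file="${1:-frontend/config/frontend.env}"
  expected='http://gemma3-vllm:8000/v1'
  actual=$(grep -E '^OPENAI_API_BASE_URL=' "$file" 2>/dev/null | head -n1 | cut -d= -f2-)
  if [ "$actual" = "$expected" ]; then
    echo "OK: backend URL matches"
  else
    echo "MISMATCH: got '${actual:-<unset>}', expected '$expected'"
  fi
}
```

Usage: `check_backend_url` (or pass a different env-file path as the first argument).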

Verify the API directly from the host:

```bash
./scripts/test_api.sh
```

If the API works from the host but not from the UI, recreate the frontend container:

```bash
docker compose up -d --force-recreate chat-ui
```

---

## Health checks and endpoint validation

Run all smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```
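To run all three and see which ones fail in one go, a sketch (`run_smoke_tests` is a hypothetical wrapper around the repo's scripts, not part of the repo itself):

```bash
# Sketch: run each smoke test, suppress its output, and list the failures.
run_smoke_tests() {
  failed=""
  for t in "./scripts/test_api.sh" "./scripts/test_ui.sh" "python3 scripts/test_python_client.py"; do
    sh -c "$t" >/dev/null 2>&1 || failed="$failed [$t]"
  done
  if [ -z "$failed" ]; then echo "all smoke tests passed"; else echo "FAILED:$failed"; fi
}
run_smoke_tests
```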
If one fails, inspect the corresponding service logs, then restart:

```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
./scripts/restart.sh
```