# Troubleshooting

## ROCm devices not visible on the host

Symptoms:

- `/dev/kfd` missing
- `/dev/dri` missing
- vLLM fails to start with ROCm device errors

Checks:

```bash
ls -l /dev/kfd /dev/dri
id
getent group video
```

Expected:

- `/dev/kfd` exists
- `/dev/dri` directory exists
- the current user belongs to the `video` group

Fixes:

```bash
sudo usermod -aG video "$USER"
newgrp video
```

Then verify the ROCm tools:

```bash
rocminfo | sed -n '1,120p'
```

If ROCm is not healthy, fix the host ROCm installation first.

---

## Docker and Compose not available

Symptoms:

- `docker: command not found`
- `docker compose version` fails

Checks:

```bash
docker --version
docker compose version
```

Fix using the install script (Ubuntu):

```bash
./scripts/install.sh
```

Manual fallback:

```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
```

Log out and back in after the group change.

---

## vLLM container exits or fails health checks

Symptoms:

- `gemma3-vllm` keeps restarting
- API endpoint unavailable

Checks:

```bash
docker compose ps
docker compose logs --tail=200 gemma3-vllm
```

Common causes and fixes:

1. Missing or invalid Hugging Face token:

   ```bash
   grep -E '^(HF_TOKEN|GEMMA_MODEL_ID)=' .env
   ```

   Ensure `HF_TOKEN` is set to a valid token with access to Gemma 3.

2. Model ID typo:

   ```bash
   grep '^GEMMA_MODEL_ID=' .env
   ```

   Use a valid model ID, e.g. `google/gemma-3-1b-it`.

3. ROCm runtime/device issues:

   ```bash
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:22.04 bash -lc 'ls -l /dev/kfd /dev/dri'
   ```

4. API key mismatch between the backend and the UI/tests:

   ```bash
   grep -E '^(VLLM_API_KEY|OPENAI_API_BASE_URL)=' .env frontend/config/frontend.env 2>/dev/null || true
   ```

   Keep the keys consistent across both files.

---

## Out-of-memory (OOM) or low VRAM errors

Symptoms:

- startup failure referencing memory allocation
- runtime generation failures

Checks:

```bash
docker compose logs --tail=300 gemma3-vllm | grep -Ei 'out of memory|oom|memory|cuda|hip|rocm'
```

Mitigations:

1. Reduce the context length in `.env`:

   ```bash
   VLLM_MAX_MODEL_LEN=2048
   ```

2. Lower the GPU memory utilization target:

   ```bash
   VLLM_GPU_MEMORY_UTILIZATION=0.75
   ```

3. Use a smaller Gemma 3 variant in `.env`.

4. Restart the stack:

   ```bash
   ./scripts/restart.sh
   ```
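After restarting, it helps to confirm that the new limits actually reached the engine and that VRAM headroom improved. The sketch below is a minimal check, assuming the compose service forwards `VLLM_MAX_MODEL_LEN` and `VLLM_GPU_MEMORY_UTILIZATION` to vLLM's engine arguments (vLLM normally echoes its configuration at startup) and that `rocm-smi` is installed on the host:

```bash
# Look for the applied limits in the engine startup log (assumes the service
# passes the .env values through to vLLM's configuration).
docker compose logs gemma3-vllm | grep -Ei 'max_model_len|gpu_memory_utilization'

# Check host-side VRAM usage with the ROCm SMI tool.
rocm-smi --showmeminfo vram
```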
---

## UI loads but cannot reach the vLLM backend

Symptoms:

- The browser opens the UI, but chat requests fail.

Checks:

```bash
docker compose ps
docker compose logs --tail=200 chat-ui
docker compose logs --tail=200 gemma3-vllm
```

Verify the frontend's backend URL:

```bash
grep -E '^OPENAI_API_BASE_URL=' frontend/config/frontend.env
```

Expected value:

```text
OPENAI_API_BASE_URL=http://gemma3-vllm:8000/v1
```

Verify the API directly from the host:

```bash
./scripts/test_api.sh
```

If the API works from the host but not from the UI, recreate the frontend:

```bash
docker compose up -d --force-recreate chat-ui
```

---

## Health checks and endpoint validation

Run all smoke tests:

```bash
./scripts/test_api.sh
./scripts/test_ui.sh
python3 scripts/test_python_client.py
```

If one fails, inspect the corresponding service logs and then restart:

```bash
docker compose logs --tail=200 gemma3-vllm chat-ui
./scripts/restart.sh
```
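If the smoke test scripts are unavailable, the OpenAI-compatible API can also be probed by hand. The following is a minimal sketch, assuming the vLLM port is published to the host as 8000 (the container-internal URL above uses 8000; adjust the host and port to match your compose file) and that `VLLM_API_KEY` and `GEMMA_MODEL_ID` are set in `.env`:

```bash
# Load the key and model ID from .env (variable names come from this repo's .env).
source <(grep -E '^(VLLM_API_KEY|GEMMA_MODEL_ID)=' .env)

# List the served models; a successful response confirms the server and API key.
curl -fsS -H "Authorization: Bearer ${VLLM_API_KEY}" \
  http://localhost:8000/v1/models

# Send a short chat completion as an end-to-end check.
curl -fsS -H "Authorization: Bearer ${VLLM_API_KEY}" -H 'Content-Type: application/json' \
  -d "{\"model\": \"${GEMMA_MODEL_ID}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 8}" \
  http://localhost:8000/v1/chat/completions
```

If these succeed but the UI still fails, the problem is most likely in the frontend configuration rather than the backend.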