Ollama issues¶
Ollama always handles embedding (model bge-m3) and, in pure
sovereign mode, also synthesis. An Ollama failure therefore blocks
the entire RAG search; it is the most sensitive component.
Quick diagnostic¶
# Loaded models
podman exec ollama ollama list
# Test embedding
curl -s http://localhost:11434/api/embed -d '{"model":"bge-m3","input":"test"}' | jq
# Test synthesis (pure sovereign)
curl -s http://localhost:11434/api/generate -d '{"model":"mistral-nemo","prompt":"Hello","stream":false}' | jq
# GPU state (if applicable)
nvidia-smi
/health on the Myeline side must show "ollama": "ok". If
degraded or down, see below.
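For a direct check of the component status (port 8000 is an assumption; substitute your Myeline binding):
# Hypothetical port; adjust to your deployment
curl -s http://localhost:8000/health | jq '.ollama'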
Model not found¶
Symptom: 404 from Ollama: model 'bge-m3' not found.
Cause: the model wasn't pulled or was deleted.
Fix:
podman exec ollama ollama pull bge-m3
podman exec ollama ollama pull mistral-nemo:latest # sovereign synthesis
To pull the models automatically at startup, override the
entrypoint in docker-compose.yml:
ollama:
  image: ollama/ollama:latest
  # The image's entrypoint is /bin/ollama, so set a shell explicitly
  entrypoint: ["/bin/sh", "-c"]
  command:
    - |
      /bin/ollama serve &
      sleep 5
      /bin/ollama pull bge-m3
      /bin/ollama pull mistral-nemo
      wait
OOM (Out of Memory)¶
Symptom: the kernel OOM killer terminates the Ollama process, the
container restarts in a loop, and the logs show OOMKilled.
Causes:
- Model too big for available RAM/VRAM (Mixtral 8×7B = 47 GB, Llama 3.1 70B = 40 GB in Q4).
- Several models loaded in parallel.
Fix:
- Check available VRAM (nvidia-smi) or RAM (free -h).
- Pick a quantised model (Q4_K_M instead of Q8_0).
- Set OLLAMA_NUM_PARALLEL=1 to serve one request at a time per model.
- Set OLLAMA_MAX_LOADED_MODELS=2 to cap the models kept in memory and auto-unload the rest (applied in the sketch below).
See Server prerequisites for sizing.
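A minimal sketch of applying these limits, assuming the stack is managed with podman-compose (values to tune to your hardware):
# In docker-compose.yml, under the ollama service:
#   environment:
#     - OLLAMA_NUM_PARALLEL=1
#     - OLLAMA_MAX_LOADED_MODELS=2
# Recreate the container, then confirm the variables are active
podman-compose up -d ollama
podman exec ollama env | grep -E 'OLLAMA_NUM_PARALLEL|OLLAMA_MAX_LOADED_MODELS'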
Excess latency¶
Symptom: /health degraded, RAG queries > 30 s.
Causes:
- CPU-only on a synthesis model (Mistral-Nemo on CPU = 20-40 s).
- Underpowered GPU (RTX 3060 12 GB for Llama 70B = inevitable CPU swap).
- Saturated disk (models reloaded from slow SSD on every request because they don't fit in RAM).
Fix:
# Ollama startup GPU profile
podman exec ollama env | grep OLLAMA
# OLLAMA_HOST=0.0.0.0:11434
# OLLAMA_NUM_GPU=1 # number of GPUs to use
# OLLAMA_KEEP_ALIVE=24h # keep models loaded
Setting OLLAMA_KEEP_ALIVE=24h avoids repeated reloads. For
CPU-only, accept the latency or add a GPU.
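To tell model loading apart from token generation, the /api/generate response carries its own timing fields (in nanoseconds); a consistently large load_duration points at the reload-from-disk cause above:
# Timings reported by Ollama itself, independent of the RAG pipeline
curl -s http://localhost:11434/api/generate -d '{"model":"mistral-nemo","prompt":"Hello","stream":false}' | jq '{total_duration, load_duration, eval_count, eval_duration}'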
GPU not detected¶
Symptom: despite a card being present, Ollama runs on CPU.
Verification:
# Host side
nvidia-smi
podman info | grep -i nvidia
# Test the runtime
podman run --rm --device nvidia.com/gpu=all nvidia/cuda:12.5.0-base-ubuntu22.04 nvidia-smi
Common causes:
- nvidia-container-toolkit not installed or misconfigured (see the CDI sketch after this list).
- SELinux denials on device access. Check sudo ausearch -m AVC -ts recent and adjust the context with chcon or a dedicated policy module; do not disable SELinux.
- Host driver too old for the CUDA version of the Ollama image.
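When the toolkit is the culprit, regenerating the CDI spec it provides is usually enough (default paths shown; adjust to your distribution):
# Regenerate the CDI spec consumed by podman's --device nvidia.com/gpu=all
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# The GPU devices exposed to containers should now be listed
nvidia-ctk cdi list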
Verbose logs¶
To debug a session:
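OLLAMA_DEBUG must be set when the container starts; a minimal sketch, assuming the compose service above:
# In docker-compose.yml, under the ollama service:
#   environment:
#     - OLLAMA_DEBUG=1
# Recreate the container and follow the verbose output
podman-compose up -d ollama
podman logs -f ollama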
Restore OLLAMA_DEBUG=0 afterwards to avoid filling the disk.