Server prerequisites¶
Recommended sizing to run Myeline with an interactive experience (perceived RAG response under 5 s).
Key takeaway
A single RAG query saturates 2-4 cores for 0.5-2 s (embedding), then 4-8 cores for 5-30 s (local synthesis). 8 cores is the practical floor; 4 cores cannot sustain interactive use.
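A quick worked latency budget using those figures (they are the estimates from this page, not benchmarks):

```python
# Back-of-envelope per-query latency budget on CPU (sovereign profile).
# Stage durations are the estimates quoted above, not measurements.
embedding_s = (0.5, 2.0)    # bge-m3 embedding on 2-4 cores
search_s = (0.1, 0.1)       # ChromaDB HNSW search, ~100 ms
synthesis_s = (5.0, 30.0)   # local LLM synthesis on 4-8 cores

best = embedding_s[0] + search_s[0] + synthesis_s[0]
worst = embedding_s[1] + search_s[1] + synthesis_s[1]
print(f"Perceived RAG response on CPU: {best:.1f}-{worst:.1f} s")
# -> 5.6-32.1 s: hitting the < 5 s target needs either a GPU
#    or the sovereign-hybrid (BYOK API) profile for synthesis.
```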
TL;DR by profile¶
| Profile | vCPU | RAM | Disk | GPU | Notes |
|---|---|---|---|---|---|
| Demo / 1 user | 8 | 16 GB | 50 GB SSD | – | Local CPU embedding |
| Sovereign ≤ 20 users | 16 | 32 GB | 200 GB NVMe | – | Mistral-Nemo CPU = 15-40 s/query |
| Sovereign ≤ 200 users | 24 | 64 GB | 500 GB NVMe | RTX 4090 24 GB or L40S | GPU strongly recommended |
| Sovereign large / Llama 70B | 32 | 128 GB | 1 TB NVMe | 2× L40S 48 GB | Llama 3.1 70B Q4 or Mixtral 8×7B |
| Sovereign-hybrid ≤ 100 users | 8 | 16 GB | 100 GB SSD | – | Synthesis offloaded to BYOK API, embedding stays local |
Detailed consumption¶
CPU¶
| Load | Demand |
|---|---|
| Web + worker (idle) | 1-2 cores |
| Embedding (bge-m3 CPU) | 2-4 cores at 100 % for 0.5-2 s per query |
| ChromaDB HNSW search | 1-2 cores for ~100 ms |
| External API synthesis (sovereign-hybrid) | 0 (network-bound) |
| Ollama CPU LLM synthesis | 4-8 cores at 100 % for 5-30 s |
| MariaDB | 1 core (2+ at peak) |
| Cron (most jobs < 30 s) | bursts only |
Single-user reality: even an otherwise idle server saturates 4+ cores for the duration of a request (embedding + HNSW search + synthesis run back-to-back).
Multi-user reality: with 8 cores you handle 1-2 active concurrent requests comfortably; beyond that they queue. For 4+ active concurrent users plan 16+ cores.
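To turn those bursts into core counts, here is a back-of-envelope core-seconds sketch; the per-stage midpoints come from the table above, and the 30 s latency target is an illustrative assumption:

```python
# Back-of-envelope concurrency model: core-seconds consumed per query,
# using midpoints of the per-stage figures in the table above.
CORE_SECONDS_PER_QUERY = (
    3 * 1.25     # embedding: 2-4 cores for 0.5-2 s
    + 1.5 * 0.1  # HNSW search: 1-2 cores for ~100 ms
    + 6 * 17.5   # CPU synthesis: 4-8 cores for 5-30 s
)

def cores_needed(active_users: int, target_latency_s: float = 30.0) -> float:
    """Cores needed for active_users concurrent queries within the target."""
    return active_users * CORE_SECONDS_PER_QUERY / target_latency_s

for users in (1, 2, 4):
    print(f"{users} active user(s): ~{cores_needed(users):.0f} cores")
# -> ~4, ~7, ~15 cores: consistent with the 8-core floor for 1-2
#    concurrent requests and 16+ cores for 4 concurrent users.
```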
Memory (sovereign — local LLM)¶
Add the resident size of the chosen Ollama model:
| Local model | Quant. | RAM resident | Notes |
|---|---|---|---|
| mistral-nemo (12 B) | Q4_K_M | ~7 GB | Default. Decent quality, slow on CPU |
| mistral-nemo | Q8 | ~13 GB | Better quality |
| mixtral-8x7b | Q4 | ~26 GB | CPU ≥ 30 s/answer, GPU recommended |
| llama3.1:70b | Q4 | ~40 GB | Top-tier local, GPU mandatory |
With a GPU, models live in VRAM rather than system RAM; the figures above remain roughly valid for the overall memory budget, but latency improves 5-20× depending on the card.
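A minimal RAM-budget sketch: model sizes come from the table above, while the base-service footprint and the 1.5× headroom factor are illustrative assumptions, not measured values.

```python
# Rough RAM budget for a sovereign host running the LLM on CPU.
# Base-service footprint and 1.5x headroom are assumptions.
BASE_SERVICES_GB = 7   # OS + web/worker + MariaDB + ChromaDB (assumed)
MODEL_RESIDENT_GB = {  # from the table above
    "mistral-nemo Q4_K_M": 7,
    "mistral-nemo Q8": 13,
    "mixtral-8x7b Q4": 26,
    "llama3.1:70b Q4": 40,
}

for model, gb in MODEL_RESIDENT_GB.items():
    total = BASE_SERVICES_GB + gb
    print(f"{model}: ~{total} GB resident -> provision ~{total * 1.5:.0f} GB")
# llama3.1:70b lands around ~70 GB provisioned, comfortably inside
# the 128 GB "Sovereign large" profile with room for cache and spikes.
```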
Disk¶
| Usage | Size | Notes |
|---|---|---|
| OS + container images | 5-10 GB | Python slim image ~600 MB; Ollama models dominate |
| Ollama models | bge-m3 600 MB · mistral-nemo Q4 7 GB | Stored in data/ollama/ |
| ChromaDB | ~10 KB / indexed chunk | 100k chunks ≈ 1 GB; grows linearly |
| MariaDB | 100 MB → 5 GB | Audit log + conversations dominate |
| Uploads | unbounded | Capped per plan (max_file_size) |
| Backups (30 d retention) | 2-5× live data | backup_databases cron |
| Logs | ~10 MB / day | Rotation via Podman / journald |
NVMe vs SATA SSD: NVMe gives noticeably better p99 latency for ChromaDB (HNSW seeks) and MariaDB.
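A quick sizing sketch from the per-item figures above; the 500k chunk count and the MariaDB midpoint are illustrative assumptions:

```python
# Disk growth estimate from the per-item figures above.
CHROMA_KB_PER_CHUNK = 10    # ~10 KB per indexed chunk
BACKUP_FACTOR = (2, 5)      # backups are 2-5x live data (30 d retention)

chunks = 500_000            # assumed corpus size
chroma_gb = chunks * CHROMA_KB_PER_CHUNK / 1_000_000
mariadb_gb = 2              # assumed midpoint of the 100 MB -> 5 GB range
live_gb = chroma_gb + mariadb_gb

print(f"ChromaDB: ~{chroma_gb:.1f} GB for {chunks:,} chunks")
print(f"Backups: {live_gb * BACKUP_FACTOR[0]:.0f}-"
      f"{live_gb * BACKUP_FACTOR[1]:.0f} GB on top of live data")
```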
Network¶
- Outbound (sovereign-hybrid only) — AI provider + Brevo + GHCR: ~10 Mbps sustained, more during image pulls.
- Inbound: 100 Mbps comfortable for a few dozen concurrent users.
- Latency to the AI provider (Mistral Paris/Frankfurt, Anthropic / OpenAI / Gemini US): aim for < 100 ms (see the quick check after this list).
- In pure sovereign: no outbound traffic (air-gap).
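A quick way to sanity-check that latency from the host (a sketch; TCP connect time is a lower bound on API latency, and `api.mistral.ai` is only an example hostname, substitute your provider's endpoint):

```python
# Measure TCP connect time to an AI provider endpoint (sovereign-hybrid
# only). Connect time is a lower bound on real API latency.
import socket
import time

host, port = "api.mistral.ai", 443   # example endpoint; substitute yours
samples = []
for _ in range(5):
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=3):
        samples.append((time.perf_counter() - start) * 1000)

print(f"{host}: min {min(samples):.0f} ms / median {sorted(samples)[2]:.0f} ms")
# Aim for < 100 ms, as noted above.
```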
OS and runtime¶
- Linux: Rocky / AlmaLinux 9, Debian 12, Ubuntu 22.04 LTS+
- Containers: Podman 4.6+ rootless recommended; Docker works too (`docker-compose` v2)
- systemd: required for unit-managed Podman pods (sovereign)
- Python 3.11+ if you run outside containers (the official path is containers)
- SELinux Enforcing: OK, the compose file labels volumes with `:z`
- GPU (sovereign with GPU): NVIDIA drivers 535+ and the NVIDIA Container Toolkit installed; Ollama auto-detects the GPU (a preflight sketch follows this list)
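A minimal preflight sketch for those prerequisites, assuming it runs on the host with Python 3.11+; the probed tools mirror the list above:

```python
# Preflight check for the runtime prerequisites listed above.
# Tools are only probed if present on PATH.
import shutil
import subprocess
import sys

def version(cmd: list[str]) -> str:
    """Return the first line of a tool's --version output."""
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout.strip().splitlines()[0] if out.stdout else "unknown"

assert sys.version_info >= (3, 11), "Python 3.11+ required outside containers"
for tool in ("podman", "docker", "nvidia-smi"):
    if shutil.which(tool):
        print(f"{tool}: {version([tool, '--version'])}")
    else:
        print(f"{tool}: not installed")
```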