Docker Compose AI Full-Stack Deployment: One-Click Orchestration from LLM to Vector Database
Setting Up an AI Dev Environment Still Takes Three Days?
In 2026, AI development environment setup remains a developer's nightmare. You need to install Ollama for LLM serving, configure Qdrant for vector storage, set up embedding services, build an API gateway for authentication, and handle GPU drivers, CUDA versions, and model downloads... Three days gone, and you haven't written a single line of code.
Docker Compose AI full-stack deployment describes the whole stack in a single file, so that docker compose up -d launches every service in minutes once the images and models are cached. This article is a complete hands-on guide covering 7 core patterns, 5 common pitfalls, fixes for 10 common errors, and production hardening strategies.
Key Takeaways
- Docker Compose AI full-stack deployment = LLM + Vector DB + Embedding + API Gateway + Monitoring, one file to rule them all
- Ollama + OpenWebUI is the most mature local LLM serving solution
- Qdrant/Milvus are the go-to vector databases with dead-simple Docker deployment
- GPU passthrough is critical for AI deployment: configure it via deploy.resources.reservations.devices
- Production requires authentication, rate limiting, monitoring, and backup hardening
Table of Contents
- AI Full-Stack Architecture Overview
- Pattern 1: Ollama + OpenWebUI LLM Serving
- Pattern 2: Qdrant/Milvus Vector Databases
- Pattern 3: Embedding Services and Model Management
- Pattern 4: API Gateway and Authentication
- Pattern 5: GPU Passthrough and Resource Limits
- Pattern 6: Monitoring and Observability
- Pattern 7: Production Hardening and Security
- 5 Common Pitfalls and Solutions
- 10 Common Error Troubleshooting
- Advanced Optimization Tips
- Comparison: Docker Compose vs K8s vs Docker Swarm
- Recommended Online Tools
- Summary
AI Full-Stack Architecture Overview
Docker Compose AI full-stack deployment is built on a 7-layer architecture, from GPU at the bottom to the API gateway at the top:
┌─────────────────────────────────────────────────────┐
│                     API Gateway                     │
│                  (Traefik / Nginx)                  │
│          Auth · Rate Limit · Routing · TLS          │
├──────────┬──────────┬──────────┬────────────────────┤
│ OpenWebUI│ RAG App  │  Agent   │    Admin Panel     │
│  (Chat)  │ (Search) │ (Proxy)  │    (Management)    │
├──────────┴──────────┴──────────┴────────────────────┤
│                  Embedding Service                  │
│            (TEI / Infinity / FastEmbed)             │
├──────────────────┬──────────────────────────────────┤
│    Ollama LLM    │            vLLM / TGI            │
│ (Model Serving)  │      (High-Perf Inference)       │
├──────────────────┴──────────────────────────────────┤
│                   Vector Database                   │
│            (Qdrant / Milvus / Weaviate)             │
├─────────────────────────────────────────────────────┤
│                   Infrastructure                    │
│       Redis · PostgreSQL · MinIO · Prometheus       │
├─────────────────────────────────────────────────────┤
│                  GPU / CPU Runtime                  │
│          NVIDIA CUDA · ROCm · CPU Fallback          │
└─────────────────────────────────────────────────────┘
Project directory structure for Docker Compose AI full-stack deployment:
ai-stack/
├── docker-compose.yml
├── docker-compose.gpu.yml
├── docker-compose.prod.yml
├── .env
├── ollama/
│   └── Modelfile
├── qdrant/
│   └── config.yaml
├── traefik/
│   ├── traefik.yml
│   └── acme.json
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── init-models.sh
    └── backup-vectors.sh
Pattern 1: Ollama + OpenWebUI LLM Serving
Ollama is the most mature local LLM serving solution in 2026, supporting Llama 4, Qwen 3, DeepSeek V3, and other mainstream models. OpenWebUI provides a ChatGPT-style web interface.
Basic Configuration
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "3"
    healthcheck:
      # The ollama image may not include curl, so probe with the bundled CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      WEBUI_SECRET_KEY: "${WEBUI_SECRET_KEY}"
      ENABLE_SIGNUP: "false"
      DEFAULT_USER_ROLE: "user"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
volumes:
  ollama_data:
  open_webui_data:
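Once the stack is up, a quick way to confirm that Ollama answers is to call its HTTP API from the host. The following is a minimal sketch using only the Python standard library. It assumes the published port 11434 from the configuration above and that the qwen3:8b model has already been pulled; the helper names are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on the host port published in the Compose file.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, base_url: str = OLLAMA_URL,
             timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running stack)."""
    request = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling generate("qwen3:8b", "Say hello in one sentence.") from a Python shell on the host should return a short completion if the service is healthy. From another container on the same Compose network, pass base_url="http://ollama:11434" instead.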
Automatic Model Pulling
After starting Ollama, you need to pull models manually. Automate this with an init script:
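#!/bin/sh
# scripts/init-models.sh
# Runs inside the curlimages/curl container, so it must be POSIX sh (no bash arrays)
# and must reach Ollama by its service name rather than localhost.
OLLAMA_URL="${OLLAMA_URL:-http://ollama:11434}"
# Check that each tag exists in the Ollama library before relying on it.
MODELS="qwen3:8b llama4:8b deepseek-v3:8b nomic-embed-text"
for model in $MODELS; do
  echo "Pulling model: $model"
  until curl -s "$OLLAMA_URL/api/pull" -d "{\"name\":\"$model\"}" | grep -q "success"; do
    echo "  Retrying $model..."
    sleep 5
  done
  echo "  ✓ $model ready"
done
echo "All models pulled successfully!"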
Add the init service to Docker Compose:
  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
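The init script treats a pull as finished when the word "success" appears in the response. A stricter check can be sketched in Python, assuming the pull endpoint streams one JSON object per line, ends with a status of "success", and reports failures through an "error" field. The function name is hypothetical.

```python
import json
from typing import Iterable


def pull_succeeded(stream_lines: Iterable[str]) -> bool:
    """Return True if the pull stream ends with status "success" and reports no error."""
    last_status = None
    for raw in stream_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(event, dict):
            continue
        if "error" in event:
            return False  # assumed error shape: {"error": "..."}
        last_status = event.get("status", last_status)
    return last_status == "success"
```

Unlike a plain substring match, this does not report success when the word only appears inside an error message.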
Custom Modelfile
# ollama/Modelfile
FROM qwen3:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """
You are a professional AI assistant. When answering questions:
1. Give a concise conclusion first
2. Then provide detailed explanation
3. If uncertain, say so explicitly
"""
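Build the custom model. The Modelfile lives on the host and is not mounted into the container, so copy it in first:
docker cp ollama/Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create my-assistant -f /tmp/Modelfile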
Pattern 2: Qdrant/Milvus Vector Databases
Vector databases are the core of RAG architecture. Docker Compose AI full-stack deployment typically uses Qdrant (lightweight) or Milvus (large-scale).
Qdrant Configuration (Recommended for Small-Medium Projects)
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
      - ./qdrant/config.yaml:/qdrant/config/production.yaml:ro
    environment:
      QDRANT__SERVICE__GRPC_PORT: "6334"
      QDRANT__LOG_LEVEL: "INFO"
    healthcheck:
      # Requires curl inside the image; if your qdrant image does not ship it,
      # probe the port another way (for example with bash's /dev/tcp)
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Qdrant configuration file:
# qdrant/config.yaml
storage:
  performance:
    max_search_threads: 4
  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0
  optimizers:
    # Thresholds in the server config file are expressed in kilobytes
    indexing_threshold_kb: 20000
    memmap_threshold_kb: 50000
service:
  max_request_size_mb: 64
  enable_cors: true
telemetry_disabled: true
Milvus Configuration (Recommended for Large-Scale Projects)
  etcd:
    image: quay.io/coreos/etcd:v3.5.16
    container_name: milvus-etcd
    environment:
      ETCD_AUTO_COMPACTION_MODE: "revision"
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd
    # --data-dir points etcd at the mounted volume so its state survives restarts
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: unless-stopped
  minio:
    image: minio/minio:latest
    container_name: milvus-minio
    environment:
      MINIO_ACCESS_KEY: "${MINIO_ACCESS_KEY}"
      MINIO_SECRET_KEY: "${MINIO_SECRET_KEY}"
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - minio_data:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: unless-stopped
  milvus:
    image: milvusdb/milvus:v2.5-latest
    container_name: milvus
    # Standalone mode has to be requested explicitly
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    environment:
      ETCD_ENDPOINTS: "etcd:2379"
      MINIO_ADDRESS: "minio:9000"
    depends_on:
      - etcd
      - minio
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 90s
    restart: unless-stopped
Vector Database Comparison
| Feature | Qdrant | Milvus | Weaviate | ChromaDB |
|---|---|---|---|---|
| Deployment Complexity | Very Low (1 container) | High (3+ containers) | Low (1 container) | Very Low (1 container) |
| Performance (Millions) | Excellent | Excellent | Good | Fair |
| Performance (Billions) | Good | Excellent | Fair | N/A |
| Filtered Search | ✅ Powerful | ✅ Powerful | ✅ Good | ⚠️ Basic |
| Persistence | ✅ | ✅ | ✅ | ⚠️ Default in-memory |
| Multi-Replica | ✅ | ✅ | ✅ | ❌ |
| gRPC Support | ✅ | ✅ | ❌ | ❌ |
| Docker Compose Fit | ✅ Best | ⚠️ Heavy | ✅ Good | ✅ Dev only |
| Production Ready | ✅ | ✅ | ✅ | ❌ Dev only |
Recommendation: For Docker Compose AI full-stack deployment, Qdrant is the first choice — simple deployment, excellent performance. Consider Milvus when vector count exceeds 100 million. ChromaDB is only suitable for prototyping.
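To see why vector count drives the choice, it helps to estimate how much RAM the raw vectors need. The sketch below uses a common rule of thumb (vectors × dimensions × 4 bytes for float32, times an overhead factor of about 1.5 for indexes and metadata). The overhead factor is an assumption, and real usage depends on quantization, on-disk storage, and payload size.

```python
def estimate_vector_memory_bytes(num_vectors: int, dimension: int,
                                 bytes_per_value: int = 4,
                                 overhead: float = 1.5) -> int:
    """Rough RAM estimate for keeping float32 vectors fully in memory."""
    return int(num_vectors * dimension * bytes_per_value * overhead)


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 1024 ** 3


# 1 million bge-m3 vectors (1024 dimensions) versus 100 million
small = estimate_vector_memory_bytes(1_000_000, 1024)
large = estimate_vector_memory_bytes(100_000_000, 1024)
print(f"1M vectors:   {to_gib(small):.1f} GiB")
print(f"100M vectors: {to_gib(large):.1f} GiB")
```

Under these assumptions a million 1024-dimensional vectors fit comfortably on a single machine, while a hundred million need hundreds of GiB unless they are quantized or memory-mapped, which is roughly where a distributed setup starts to pay off.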
Pattern 3: Embedding Services and Model Management
Embedding services convert text to vectors, a critical step in the RAG pipeline. Docker Compose AI container orchestration offers three mainstream solutions.
Hugging Face TEI (Recommended)
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    container_name: tei
    ports:
      - "8080:80"
    volumes:
      - tei_cache:/data
    environment:
      MODEL_ID: "BAAI/bge-m3"
      REVISION: "main"
      MAX_BATCH_TOKENS: "16384"
      MAX_CLIENT_BATCH_SIZE: "32"
      HF_TOKEN: "${HF_TOKEN}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s
    restart: unless-stopped
Infinity Embedding Service
  infinity:
    image: michaelf34/infinity:latest
    container_name: infinity
    ports:
      - "7997:7997"
    volumes:
      - infinity_cache:/app/.cache
    environment:
      MODEL_ID: "BAAI/bge-m3"
      ENGINE: "optimum"
      BATCH_SIZE: "32"
    # The Infinity CLI expects the v2 subcommand before its options
    command: >
      v2
      --model-id BAAI/bge-m3
      --engine optimum
      --port 7997
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7997/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Embedding Service Usage Example
import httpx


async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with the TEI service."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://tei:80/embed",
            json={"inputs": texts}
        )
        response.raise_for_status()
        return response.json()


async def search_similar(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query and return the top_k nearest points from Qdrant."""
    query_embedding = await get_embeddings([query])
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://qdrant:6333/collections/documents/points/search",
            json={
                "vector": query_embedding[0],
                "limit": top_k,
                "with_payload": True
            }
        )
        response.raise_for_status()
        return response.json()["result"]
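The TEI configuration above sets MAX_CLIENT_BATCH_SIZE to 32, so a request carrying more inputs than that is likely to be rejected. When indexing a large corpus, the texts can be split into batches first. This is a minimal sketch; the helper name is illustrative.

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of texts with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Example: 70 documents become batches of 32, 32, and 6
documents = [f"document {i}" for i in range(70)]
print([len(batch) for batch in batched(documents)])
```

Each batch can then be passed to an embedding call such as the get_embeddings function shown earlier, and the results concatenated in order.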
Embedding Service Comparison
| Feature | TEI | Infinity | FastEmbed |
|---|---|---|---|
| GPU Acceleration | ✅ Native | ✅ Native | ❌ CPU only |
| Batch Inference | ✅ Efficient | ✅ Efficient | ⚠️ Fair |
| Multi-Model | ✅ | ✅ | ✅ |
| Docker Image Size | ~2GB | ~4GB | ~500MB |
| Production Ready | ✅ | ✅ | ⚠️ Dev only |
| OpenAI-Compatible API | ✅ | ✅ | ❌ |
Pattern 4: API Gateway and Authentication
Production Docker Compose AI full-stack deployment requires an API gateway for unified authentication, rate limiting, and routing.
Traefik Configuration
  traefik:
    image: traefik:v3.2
    container_name: traefik
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - traefik_certs:/etc/traefik/certs
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.directory=/etc/traefik/dynamic"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    labels:
      traefik.enable: "true"
      traefik.http.routers.traefik.rule: "Host(`traefik.ai-stack.local`)"
      traefik.http.routers.traefik.entrypoints: "websecure"
      traefik.http.routers.traefik.tls: "true"
      # The dashboard is served by Traefik's built-in api@internal service
      traefik.http.routers.traefik.service: "api@internal"
    restart: unless-stopped
OpenWebUI with Traefik
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
    labels:
      traefik.enable: "true"
      traefik.http.routers.webui.rule: "Host(`chat.ai-stack.local`)"
      traefik.http.routers.webui.entrypoints: "websecure"
      traefik.http.routers.webui.tls: "true"
      traefik.http.services.webui.loadbalancer.server.port: "8080"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
Authentication Middleware
# traefik/dynamic/auth.yml
http:
  middlewares:
    auth-middleware:
      forwardAuth:
        address: "http://auth-service:8000/verify"
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Id"
          - "X-User-Role"
    rate-limit:
      rateLimit:
        average: 30
        burst: 60
        period: 1m
  routers:
    api-router:
      rule: "Host(`api.ai-stack.local`)"
      entryPoints:
        - "websecure"
      tls: {}
      middlewares:
        - "auth-middleware"
        - "rate-limit"
      service: "ollama-api"
  services:
    # Backend for the router above, pointing at the Ollama container
    ollama-api:
      loadBalancer:
        servers:
          - url: "http://ollama:11434"
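The forwardAuth middleware delegates the decision to auth-service, which the article does not define. As an illustration only, the sketch below shows what a minimal verifier could look like using the Python standard library: it answers 200 with the X-User-Id and X-User-Role headers when a known bearer token is presented, and 401 otherwise. The token table, names, and port are hypothetical, and it assumes Traefik calls the endpoint with a GET request carrying the original Authorization header.

```python
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical token table; a real service would load this from a secret store.
API_TOKENS = {"example-token-123": ("user-1", "admin")}


def verify_authorization(header_value, tokens=API_TOKENS):
    """Return (status_code, response_headers) for a forwardAuth check."""
    if not header_value or not header_value.startswith("Bearer "):
        return 401, {}
    presented = header_value[len("Bearer "):]
    for token, (user_id, role) in tokens.items():
        # Constant-time comparison avoids leaking token contents via timing
        if hmac.compare_digest(presented.encode(), token.encode()):
            return 200, {"X-User-Id": user_id, "X-User-Role": role}
    return 401, {}


class VerifyHandler(BaseHTTPRequestHandler):
    """Answers every GET (for example /verify) with the auth decision."""

    def do_GET(self):
        status, headers = verify_authorization(self.headers.get("Authorization"))
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the verifier; port 8000 matches the forwardAuth address above."""
    HTTPServer(("0.0.0.0", port), VerifyHandler).serve_forever()
```

Any 2xx response lets the request through, and the headers listed under authResponseHeaders are copied onto the request that reaches the backend.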
Pattern 5: GPU Passthrough and Resource Limits
GPU is the core of AI deployment. Docker Compose AI full-stack deployment enables GPU passthrough via deploy.resources.reservations.devices.
NVIDIA GPU Passthrough
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      OLLAMA_KEEP_ALIVE: "24h"
    healthcheck:
      # The ollama image may not include curl, so probe with the bundled CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
Multi-GPU Allocation
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    # Do not set CUDA_VISIBLE_DEVICES to the host index here: each container sees
    # only the GPU selected by device_ids, and inside the container that GPU is
    # enumerated as device 0. Setting CUDA_VISIBLE_DEVICES: "1" would hide it.
CPU Fallback Configuration
  ollama-cpu:
    image: ollama/ollama:latest
    profiles: ["cpu-only"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "1"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4.0"
  ollama-gpu:
    image: ollama/ollama:latest
    profiles: ["gpu"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 16G
Launch commands:
# GPU mode
docker compose --profile gpu up -d
# CPU mode
docker compose --profile cpu-only up -d
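Choosing the profile by hand is easy to get wrong on shared machines. A small launcher can pick it automatically; the sketch below assumes that a working nvidia-smi on the host is a good enough signal that GPU passthrough will work, which also requires the NVIDIA container toolkit. The function names are illustrative.

```python
import shutil
import subprocess


def has_nvidia_gpu(which=shutil.which, run=subprocess.run) -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if which("nvidia-smi") is None:
        return False
    try:
        result = run(["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def compose_command(gpu_available: bool) -> list[str]:
    """Build the docker compose command for the matching profile."""
    profile = "gpu" if gpu_available else "cpu-only"
    return ["docker", "compose", "--profile", profile, "up", "-d"]


def launch() -> None:
    """Start the stack with the detected profile."""
    subprocess.run(compose_command(has_nvidia_gpu()), check=True)
```

The which and run parameters exist only so the detection can be exercised without a GPU; calling launch() with no arguments uses the real tools.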
GPU Resource Monitoring Script
import subprocess
import time


def monitor_gpu_usage(interval: int = 60):
    """Print per-GPU memory and utilization every `interval` seconds."""
    while True:
        # check=True raises if nvidia-smi is missing a GPU or exits with an error
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True
        )
        for line in result.stdout.strip().split("\n"):
            idx, name, mem_used, mem_total, util = line.split(", ")
            print(f"GPU {idx} ({name}): {mem_used}/{mem_total}MB, Util: {util}%")
        time.sleep(interval)


if __name__ == "__main__":
    monitor_gpu_usage()
Pattern 6: Monitoring and Observability
Docker Compose AI full-stack deployment monitoring needs to cover GPU utilization, inference latency, vector database performance, and other AI-specific metrics.
Prometheus + Grafana
  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=10GB"
    ports:
      - "9090:9090"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      GF_SECURITY_ADMIN_USER: "${GRAFANA_USER}"
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    restart: unless-stopped
  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped
Prometheus Configuration
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Assumes Prometheus metrics for Ollama are available on this path, possibly
  # through an exporter or proxy; verify that /metrics responds before relying on it
  - job_name: "ollama"
    static_configs:
      - targets: ["ollama:11434"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "dcgm"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 10s
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8080"]
Key Alert Rules
# monitoring/alerts.yml
# Metric names depend on what each service or exporter actually publishes;
# confirm them against the /metrics output before enabling these rules.
groups:
  - name: ai-stack
    rules:
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency too high"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage exceeds 90%"
      - alert: QdrantHighLatency
        expr: histogram_quantile(0.95, rate(qdrant_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant query latency too high"
      - alert: OllamaContainerDown
        expr: up{job="ollama"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama service unavailable"
Security is the baseline for production Docker Compose AI full-stack deployment.
Secrets Management
services:
  ollama:
    image: ollama/ollama:latest
    secrets:
      - hf_token
    environment:
      # *_FILE variables only work if the application reads them; verify per image
      HF_TOKEN_FILE: /run/secrets/hf_token
  qdrant:
    image: qdrant/qdrant:latest
    secrets:
      - qdrant_api_key
    environment:
      # If this variable is not honored, pass the key through the config file instead
      QDRANT__SERVICE__API_KEY_FILE: /run/secrets/qdrant_api_key
secrets:
  hf_token:
    file: ./secrets/hf_token.txt
  qdrant_api_key:
    file: ./secrets/qdrant_api_key.txt
  db_password:
    file: ./secrets/db_password.txt
Network Isolation
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
  monitoring:
    driver: bridge
    internal: true
services:
  traefik:
    networks:
      - frontend
      - backend
  open-webui:
    networks:
      - frontend
      - backend
  ollama:
    networks:
      - backend
  qdrant:
    networks:
      - backend
  tei:
    networks:
      - backend
  prometheus:
    networks:
      - monitoring
      - backend
  grafana:
    networks:
      - frontend
      - monitoring
Backup Strategy
#!/bin/bash
# scripts/backup-vectors.sh
BACKUP_DIR="/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Backing up Qdrant..."
curl -s -X POST "http://localhost:6333/snapshots" | jq .
echo "Backing up Ollama models list..."
curl -s "http://localhost:11434/api/tags" | jq . > "$BACKUP_DIR/ollama_models.json"
echo "Backing up environment config..."
cp .env "$BACKUP_DIR/.env.backup"
cp docker-compose.yml "$BACKUP_DIR/docker-compose.yml.backup"
echo "Backup completed: $BACKUP_DIR"
Full Production Configuration
# docker-compose.prod.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
    read_only: true
    tmpfs:
      - /tmp
  qdrant:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"
5 Common Pitfalls and Solutions
Pitfall 1: Ollama Model Pull Timeout
Symptom: After docker compose up, Ollama gets stuck pulling models. Large models (e.g., Llama 4 70B) can take over an hour to download.
Solution: Use the model-init service for async pulling. The Ollama service itself doesn't need to wait for models.
  model-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
    deploy:
      resources:
        limits:
          memory: 256M
Pitfall 2: Qdrant Container OOM
Symptom: As vector data grows, the Qdrant container gets OOM Killed.
Solution: Set memory limits and enable memory mapping.
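  qdrant:
    image: qdrant/qdrant:latest
    deploy:
      resources:
        limits:
          memory: 8G
    environment:
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: "4"
      QDRANT__STORAGE__WAL__WAL_CAPACITY_MB: "64"
      # Memory-map segments larger than roughly 50 MB instead of holding them in RAM
      QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB: "50000"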
Pitfall 3: GPU Driver Version Mismatch
Symptom: docker compose up errors with CUDA driver version is insufficient.
Solution: Ensure host NVIDIA driver ≥ 535, install nvidia-container-toolkit.
# Check driver version
nvidia-smi | head -3
# Install nvidia-container-toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Pitfall 4: Embedding Service and LLM Competing for GPU
Symptom: GPU memory insufficient when TEI and Ollama run simultaneously; model loading fails.
Solution: Use device_ids for precise GPU allocation, or run embedding service on CPU.
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tei:
    # Run embedding service in CPU mode
    environment:
      MODEL_ID: "BAAI/bge-m3"
    # No GPU allocation, uses CPU
Pitfall 5: Inter-Container DNS Resolution Failure
Symptom: OpenWebUI reports ollama: Name or service not known.
Solution: Ensure all services are on the same network, use container_name or service name as hostname.
networks:
  ai-network:
    driver: bridge
services:
  ollama:
    container_name: ollama
    networks:
      - ai-network
  open-webui:
    container_name: open-webui
    networks:
      - ai-network
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
10 Common Error Troubleshooting
1. could not select device driver — NVIDIA runtime not installed
# Install nvidia-container-toolkit and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtime
# Should see nvidia runtime
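2. OOM Killed: container exceeded its memory limit (host RAM, not GPU memory)
# Confirm that the kernel OOM killer stopped the container
docker inspect --format '{{.State.OOMKilled}}' ollama
# Compare live memory usage against the configured limit
docker stats --no-stream ollama
# Raise deploy.resources.limits.memory, or use a smaller or quantized model
docker exec ollama ollama run qwen3:4b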
3. Connection refused to Ollama — Service not ready
# Check Ollama health status
docker compose ps
docker compose logs ollama
# Wait for healthcheck to pass before connecting
4. permission denied on Docker socket
sudo usermod -aG docker $USER
newgrp docker
5. Qdrant collection not found — Collection not created
curl -X PUT "http://localhost:6333/collections/documents" \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 1024, "distance": "Cosine"}}'
6. model not found — Ollama model not pulled
docker exec ollama ollama pull qwen3:8b
7. CUDA out of memory — GPU memory overflow during inference
# Reduce parallel requests
# Set in docker-compose.yml
environment:
  OLLAMA_NUM_PARALLEL: "1"
  OLLAMA_MAX_LOADED_MODELS: "1"
8. TLS handshake error — Traefik certificate issue
# Check certificate file permissions
chmod 600 traefik/acme.json
# Check Traefik logs
docker compose logs traefik
9. too many open files — File descriptor limit
# Temporarily increase limit
ulimit -n 65536
# Permanent setting (/etc/security/limits.conf)
# * soft nofile 65536
# * hard nofile 65536
10. vector dimension mismatch — Embedding dimension inconsistency
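# Ensure Qdrant collection dimension matches embedding model output
# bge-m3: 1024 dimensions
# nomic-embed-text: 768 dimensions
# A collection's vector size is fixed, so delete and recreate it after switching models
curl -X DELETE "http://localhost:6333/collections/documents"
curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'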
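A dimension mismatch is cheaper to catch in the application than to debug from a failed upsert. The sketch below validates vectors before they are sent. The dimension table repeats the two values listed above, and the function name is illustrative.

```python
from typing import Sequence

# Output sizes of the two embedding models mentioned above
EMBEDDING_DIMENSIONS = {"bge-m3": 1024, "nomic-embed-text": 768}


def check_dimensions(vectors: Sequence[Sequence[float]], expected_size: int) -> None:
    """Raise ValueError if any vector does not match the collection's size."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_size:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"but the collection expects {expected_size}"
            )


# Passes silently: bge-m3 output against a 1024-dimension collection
check_dimensions([[0.0] * 1024], EMBEDDING_DIMENSIONS["bge-m3"])
```

Running the check once at startup with a single test embedding is usually enough to catch a model swap that was not followed by recreating the collection.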
Advanced Optimization Tips
Multi-Stage Model Warmup
  model-warmer:
    image: curlimages/curl:latest
    container_name: model-warmer
    depends_on:
      ollama:
        condition: service_healthy
    # Chat models warm up through /api/generate; embedding models use /api/embed
    entrypoint: >
      /bin/sh -c "
      echo 'Warming up models...' &&
      curl -s http://ollama:11434/api/generate -d '{\"model\":\"qwen3:8b\",\"prompt\":\"hi\",\"stream\":false}' > /dev/null &&
      curl -s http://ollama:11434/api/embed -d '{\"model\":\"nomic-embed-text\",\"input\":\"test\"}' > /dev/null &&
      echo 'Models warmed up!'
      "
    restart: "no"
Smart Model Unloading
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "5m"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
OLLAMA_KEEP_ALIVE: "5m" automatically unloads models idle for 5 minutes, freeing GPU memory.
Chained Health Check Dependencies
  tei:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 120s
  qdrant:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
  rag-app:
    depends_on:
      ollama:
        condition: service_healthy
      tei:
        condition: service_healthy
      qdrant:
        condition: service_healthy
Docker Compose Watch for Development
# docker-compose.yml
services:
  rag-app:
    build: .
    develop:
      watch:
        - action: rebuild
          path: ./app
          target: /app
        - action: sync
          path: ./app/static
          target: /app/static
Comparison: Docker Compose vs K8s vs Docker Swarm
| Dimension | Docker Compose | Kubernetes | Docker Swarm |
|---|---|---|---|
| AI Stack Deployment Complexity | ⭐ Very Low | ⭐⭐⭐⭐⭐ Very High | ⭐⭐ Low |
| GPU Scheduling | ✅ Native | ✅ Device Plugin | ⚠️ Needs config |
| Auto-Scaling | ❌ | ✅ HPA | ⚠️ Manual |
| Service Discovery | ✅ DNS | ✅ CoreDNS | ✅ DNS |
| Rolling Updates | ⚠️ Needs scripts | ✅ Native | ✅ Native |
| Config Management | ✅ .env | ✅ ConfigMap | ⚠️ Config |
| Secret Management | ✅ Docker Secret | ✅ K8s Secret | ⚠️ Basic |
| Monitoring Ecosystem | ✅ Prometheus | ✅ Complete | ⚠️ Limited |
| Multi-Node Orchestration | ❌ Single node | ✅ Core capability | ✅ Native |
| Learning Curve | Low | High | Low |
| Community Activity | ✅ Active | ✅ Very Active | ❌ Declining |
| Suitable AI Project Scale | 1-5 GPUs | 10+ GPUs | 2-5 GPUs |
Recommendation: Docker Compose AI full-stack deployment is ideal for single-machine setups with 1-5 GPUs, which makes it the best choice for AI development and small-scale production. For more than 5 GPUs or multi-node needs, consider Kubernetes + KServe/vLLM. Docker Swarm is not recommended for AI deployment.
Recommended Online Tools
- JSON Formatter - Format Docker Compose and API response JSON data
- Base64 Encode - Encode Secrets and API Key configurations
- cURL to Code - Convert Qdrant/Ollama cURL commands to Python/JS code
Summary
Docker Compose AI full-stack deployment transforms AI dev environments from "three days to set up" to "one command to launch." Ollama + OpenWebUI handles LLM serving, Qdrant handles vector storage, TEI handles embeddings, Traefik handles the gateway, Prometheus + Grafana handles monitoring, and GPU passthrough makes inference fly. 7 patterns cover the full chain from development to production, 5 common pitfalls and 10 error troubleshooting steps help you avoid detours. For 1-5 GPU AI projects, Docker Compose is the most practical AI container orchestration solution in 2026.
External References
- Ollama Official Documentation - Complete Ollama model serving documentation
- Qdrant Official Documentation - Vector database deployment and optimization guide