Docker Compose AI Full-Stack Deployment: One-Click Orchestration from LLM to Vector Database

Setting Up an AI Dev Environment Still Takes Three Days?

In 2026, AI development environment setup remains a developer's nightmare. You need to install Ollama for LLM serving, configure Qdrant for vector storage, set up embedding services, build an API gateway for authentication, and handle GPU drivers, CUDA versions, and model downloads... Three days gone, and you haven't written a single line of code.

Docker Compose AI full-stack deployment orchestrates everything in a single file: `docker compose up -d` launches the entire AI stack in minutes. This article is a hands-on guide covering 7 core patterns, 5 common pitfalls, fixes for 10 common errors, and production hardening strategies.

Key Takeaways

Docker Compose AI full-stack deployment = LLM + Vector DB + Embedding + API Gateway + Monitoring, one file to rule them all
Ollama + OpenWebUI is the most mature local LLM serving solution
Qdrant/Milvus are the go-to vector databases with dead-simple Docker deployment
GPU passthrough is critical for AI deployment — configure via deploy.resources.reservations.devices
Production requires authentication, rate limiting, monitoring, and backup hardening

AI Full-Stack Architecture Overview
Pattern 1: Ollama + OpenWebUI LLM Serving
Pattern 2: Qdrant/Milvus Vector Databases
Pattern 3: Embedding Services and Model Management
Pattern 4: API Gateway and Authentication
Pattern 5: GPU Passthrough and Resource Limits
Pattern 6: Monitoring and Observability
Pattern 7: Production Hardening and Security
5 Common Pitfalls and Solutions
10 Common Error Troubleshooting
Advanced Optimization Tips
Comparison: Docker Compose vs K8s vs Docker Swarm
Recommended Online Tools
Summary

AI Full-Stack Architecture Overview

Docker Compose AI full-stack deployment is built on a 7-layer architecture, from GPU at the bottom to the API gateway at the top:

┌─────────────────────────────────────────────────────┐
│                   API Gateway                        │
│              (Traefik / Nginx)                       │
│         Auth · Rate Limit · Routing · TLS            │
├──────────┬──────────┬──────────┬────────────────────┤
│ OpenWebUI│  RAG App │  Agent   │  Admin Panel       │
│  (Chat)  │ (Search) │  (Proxy) │  (Management)      │
├──────────┴──────────┴──────────┴────────────────────┤
│              Embedding Service                       │
│       (TEI / Infinity / FastEmbed)                  │
├──────────────────┬──────────────────────────────────┤
│   Ollama LLM     │    vLLM / TGI                    │
│  (Model Serving) │  (High-Perf Inference)           │
├──────────────────┴──────────────────────────────────┤
│           Vector Database                            │
│     (Qdrant / Milvus / Weaviate)                    │
├─────────────────────────────────────────────────────┤
│              Infrastructure                          │
│   Redis · PostgreSQL · MinIO · Prometheus            │
├─────────────────────────────────────────────────────┤
│              GPU / CPU Runtime                       │
│     NVIDIA CUDA · ROCm · CPU Fallback               │
└─────────────────────────────────────────────────────┘

Project directory structure for Docker Compose AI full-stack deployment:

ai-stack/
├── docker-compose.yml
├── docker-compose.gpu.yml
├── docker-compose.prod.yml
├── .env
├── ollama/
│   └── Modelfile
├── qdrant/
│   └── config.yaml
├── traefik/
│   ├── traefik.yml
│   └── acme.json
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── init-models.sh
    └── backup-vectors.sh

Pattern 1: Ollama + OpenWebUI LLM Serving

Ollama is the most mature local LLM serving solution in 2026, supporting Llama 4, Qwen 3, DeepSeek V3, and other mainstream models. OpenWebUI provides a ChatGPT-style web interface.

Basic Configuration

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "3"
    healthcheck:
      test: ["CMD", "ollama", "list"]  # the ollama image does not ship curl
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      WEBUI_SECRET_KEY: "${WEBUI_SECRET_KEY}"
      ENABLE_SIGNUP: "false"
      DEFAULT_USER_ROLE: "user"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:

Automatic Model Pulling

After starting Ollama, you need to pull models manually. Automate this with an init script:

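#!/bin/sh
# scripts/init-models.sh
# Written for POSIX sh: the curlimages/curl image has no bash, so no arrays.

# Reach Ollama by its Compose service name; localhost would be this container.
OLLAMA_URL="${OLLAMA_URL:-http://ollama:11434}"
MAX_RETRIES=5

# Check that each tag exists in the Ollama library before relying on it.
MODELS="qwen3:8b llama4:8b deepseek-v3:8b nomic-embed-text"

for model in $MODELS; do
  echo "Pulling model: $model"
  attempt=1
  until curl -s "$OLLAMA_URL/api/pull" -d "{\"name\":\"$model\"}" | grep -q "success"; do
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "  ✗ $model failed after $attempt attempts" >&2
      exit 1
    fi
    attempt=$((attempt + 1))
    echo "  Retrying $model..."
    sleep 5
  done
  echo "  ✓ $model ready"
done

echo "All models pulled successfully!"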

Add the init service to Docker Compose:

  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
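
The init script decides success by grepping the pull response for the word "success", which would also match an error message or model name containing that word. A stricter check can be sketched in Python, assuming the pull endpoint streams newline-delimited JSON objects with a status field and reports failures under an error key. The function name pull_succeeded is hypothetical.

```python
import json
from typing import Iterable


def pull_succeeded(stream_lines: Iterable[str]) -> bool:
    """Return True only if a status of "success" was seen and no error object appeared."""
    saw_success = False
    for raw in stream_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if "error" in event:
            return False  # any error object fails the pull
        if event.get("status") == "success":
            saw_success = True
    return saw_success


if __name__ == "__main__":
    sample = ['{"status":"pulling manifest"}', '{"status":"success"}']
    print(pull_succeeded(sample))
```

In a real init container the lines would come from the streamed HTTP response body instead of a hard-coded list.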

Custom Modelfile

# ollama/Modelfile
FROM qwen3:8b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"

SYSTEM """
You are a professional AI assistant. When answering questions:
1. Give a concise conclusion first
2. Then provide detailed explanation
3. If uncertain, say so explicitly
"""

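Build a custom model. The Modelfile lives on the host, so copy it into the container first:

docker cp ollama/Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create my-assistant -f /tmp/Modelfile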

Pattern 2: Qdrant/Milvus Vector Databases

Vector databases are the core of RAG architecture. Docker Compose AI full-stack deployment typically uses Qdrant (lightweight) or Milvus (large-scale).

Qdrant Configuration (Recommended for Small-Medium Projects)

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
      - ./qdrant/config.yaml:/qdrant/config/production.yaml:ro
    environment:
      QDRANT__SERVICE__GRPC_PORT: "6334"
      QDRANT__LOG_LEVEL: "INFO"
    healthcheck:
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]  # the qdrant image does not ship curl
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

Qdrant configuration file:

# qdrant/config.yaml
storage:
  performance:
    max_search_threads: 4
  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0
  optimizers:
    indexing_threshold_kb: 20000
    memmap_threshold_kb: 50000
service:
  max_request_size_mb: 64
  enable_cors: true
telemetry_disabled: true

Milvus Configuration (Recommended for Large-Scale Projects)

  etcd:
    image: quay.io/coreos/etcd:v3.5.16
    container_name: milvus-etcd
    environment:
      ETCD_AUTO_COMPACTION_MODE: "revision"
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    container_name: milvus-minio
    environment:
      # Recent MinIO releases read the root credentials from these variables
      MINIO_ROOT_USER: "${MINIO_ACCESS_KEY}"
      MINIO_ROOT_PASSWORD: "${MINIO_SECRET_KEY}"
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - minio_data:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: unless-stopped

  milvus:
    image: milvusdb/milvus:v2.5-latest
    container_name: milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
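    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: "etcd:2379"
      MINIO_ADDRESS: "minio:9000"
      # Milvus assumes MinIO's default credentials; if you change them in .env, configure Milvus to match.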
    depends_on:
      - etcd
      - minio
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 90s
    restart: unless-stopped

Vector Database Comparison

Feature	Qdrant	Milvus	Weaviate	ChromaDB
Deployment Complexity	Very Low (1 container)	High (3+ containers)	Low (1 container)	Very Low (1 container)
Performance (Millions)	Excellent	Excellent	Good	Fair
Performance (Billions)	Good	Excellent	Fair	N/A
Filtered Search	✅ Powerful	✅ Powerful	✅ Good	⚠️ Basic
Persistence	✅	✅	✅	⚠️ Default in-memory
Multi-Replica	✅	✅	✅	❌
gRPC Support	✅	✅	✅	❌
Docker Compose Fit	✅ Best	⚠️ Heavy	✅ Good	✅ Dev only
Production Ready	✅	✅	✅	❌ Dev only

Recommendation: For Docker Compose AI full-stack deployment, Qdrant is the first choice — simple deployment, excellent performance. Consider Milvus when vector count exceeds 100 million. ChromaDB is only suitable for prototyping.

Pattern 3: Embedding Services and Model Management

Embedding services convert text to vectors, a critical step in the RAG pipeline. Docker Compose AI container orchestration offers three mainstream solutions.

Hugging Face TEI (Recommended)

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    container_name: tei
    ports:
      - "8080:80"
    volumes:
      - tei_cache:/data
    environment:
      MODEL_ID: "BAAI/bge-m3"
      REVISION: "main"
      MAX_BATCH_TOKENS: "16384"
      MAX_CLIENT_BATCH_SIZE: "32"
      HF_TOKEN: "${HF_TOKEN}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s
    restart: unless-stopped

Infinity Embedding Service

  infinity:
    image: michaelf34/infinity:latest
    container_name: infinity
    ports:
      - "7997:7997"
    volumes:
      - infinity_cache:/app/.cache
    environment:
      MODEL_ID: "BAAI/bge-m3"
      ENGINE: "optimum"
      BATCH_SIZE: "32"
    command: >
      --model-id BAAI/bge-m3
      --engine optimum
      --port 7997
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7997/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

Embedding Service Usage Example

import httpx

TEI_URL = "http://tei:80"          # Compose service name and container port
QDRANT_URL = "http://qdrant:6333"

async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with TEI; returns one vector per input text."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(f"{TEI_URL}/embed", json={"inputs": texts})
        response.raise_for_status()
        return response.json()

async def search_similar(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query, then return the top_k nearest points from the `documents` collection."""
    query_embedding = await get_embeddings([query])
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            f"{QDRANT_URL}/collections/documents/points/search",
            json={
                "vector": query_embedding[0],
                "limit": top_k,
                "with_payload": True,
            },
        )
        response.raise_for_status()
        return response.json()["result"]
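
The TEI service above is configured with MAX_CLIENT_BATCH_SIZE set to 32, so a client that sends more inputs in one request should expect a rejection. A minimal sketch of client-side batching follows; the helper name batched is hypothetical, and each yielded batch would be passed to the embedding call one at a time.

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each no longer than batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


if __name__ == "__main__":
    docs = [f"doc {i}" for i in range(70)]
    print([len(batch) for batch in batched(docs)])
```

Keeping the batch size in one place makes it easy to change together with the server setting.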

Embedding Service Comparison

Feature	TEI	Infinity	FastEmbed
GPU Acceleration	✅ Native	✅ Native	⚠️ Optional (fastembed-gpu)
Batch Inference	✅ Efficient	✅ Efficient	⚠️ Fair
Multi-Model	✅	✅	✅
Docker Image Size	~2GB	~4GB	~500MB
Production Ready	✅	✅	⚠️ Dev only
OpenAI-Compatible API	✅	✅	❌

Pattern 4: API Gateway and Authentication

Production Docker Compose AI full-stack deployment requires an API gateway for unified authentication, rate limiting, and routing.

Traefik Configuration

  traefik:
    image: traefik:v3.2
    container_name: traefik
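    ports:
      # The dashboard is reached through the websecure router below; publishing 8080
      # would expose it directly and clash with the TEI host port.
      - "80:80"
      - "443:443"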
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - traefik_certs:/etc/traefik/certs
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.directory=/etc/traefik/dynamic"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    labels:
      traefik.enable: "true"
      traefik.http.routers.traefik.rule: "Host(`traefik.ai-stack.local`)"
      traefik.http.routers.traefik.entrypoints: "websecure"
      traefik.http.routers.traefik.tls: "true"
      traefik.http.routers.traefik.service: "api@internal"
    restart: unless-stopped

OpenWebUI with Traefik

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
    labels:
      traefik.enable: "true"
      traefik.http.routers.webui.rule: "Host(`chat.ai-stack.local`)"
      traefik.http.routers.webui.entrypoints: "websecure"
      traefik.http.routers.webui.tls: "true"
      traefik.http.services.webui.loadbalancer.server.port: "8080"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

Authentication Middleware

# traefik/dynamic/auth.yml
http:
  middlewares:
    auth-middleware:
      forwardAuth:
        address: "http://auth-service:8000/verify"
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Id"
          - "X-User-Role"

    rate-limit:
      rateLimit:
        average: 30
        burst: 60
        period: 1m

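  routers:
    api-router:
      rule: "Host(`api.ai-stack.local`)"
      entryPoints:
        - "websecure"
      tls: {}
      middlewares:
        - "auth-middleware"
        - "rate-limit"
      service: "ollama-api"

  # The router above needs a matching service definition
  services:
    ollama-api:
      loadBalancer:
        servers:
          - url: "http://ollama:11434"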
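
To make the rate-limit numbers concrete: average 30 with period 1m means a sustained rate of one request every two seconds, while burst 60 is the size of the bucket that absorbs short spikes. The following is a generic token-bucket model of those three parameters, not Traefik's own implementation, and the class name TokenBucket is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    average: int = 30        # requests refilled per period
    period_s: float = 60.0   # length of the period in seconds
    burst: int = 60          # bucket capacity
    tokens: float = field(init=False, default=0.0)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        rate = self.average / self.period_s
        self.tokens = min(float(self.burst), self.tokens + (now - self.last) * rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket()
    allowed = sum(bucket.allow(0.0) for _ in range(100))
    print(f"allowed {allowed} of 100 simultaneous requests")
```

Under this model a client can fire 60 requests at once, after which it is held to the refill rate of 0.5 requests per second.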

Pattern 5: GPU Passthrough and Resource Limits

GPU is the core of AI deployment. Docker Compose AI full-stack deployment enables GPU passthrough via deploy.resources.reservations.devices.

NVIDIA GPU Passthrough

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      OLLAMA_KEEP_ALIVE: "24h"
    healthcheck:
      test: ["CMD", "ollama", "list"]  # the ollama image does not ship curl
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

Multi-GPU Allocation

  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    environment:
      CUDA_VISIBLE_DEVICES: "0"

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
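    environment:
      # Inside the container the single assigned GPU is renumbered to index 0,
      # so "1" here would hide it from CUDA.
      CUDA_VISIBLE_DEVICES: "0"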

CPU Fallback Configuration

  ollama-cpu:
    image: ollama/ollama:latest
    profiles: ["cpu-only"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "1"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4.0"

  ollama-gpu:
    image: ollama/ollama:latest
    profiles: ["gpu"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 16G

Launch commands:

# GPU mode
docker compose --profile gpu up -d

# CPU mode
docker compose --profile cpu-only up -d
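
Choosing between the two profiles can be automated. The sketch below assumes that finding nvidia-smi on the PATH is a good enough signal for a usable GPU, which is a simplification; the function names are hypothetical.

```python
import shutil


def detect_gpu() -> bool:
    """Treat the presence of nvidia-smi on PATH as a hint that a GPU is usable."""
    return shutil.which("nvidia-smi") is not None


def compose_command(has_gpu: bool) -> list[str]:
    """Build the docker compose invocation for the matching profile."""
    profile = "gpu" if has_gpu else "cpu-only"
    return ["docker", "compose", "--profile", profile, "up", "-d"]


if __name__ == "__main__":
    # Print the command instead of running it so the sketch has no side effects
    print(" ".join(compose_command(detect_gpu())))
```

Passing the resulting list to subprocess.run would start the stack with the selected profile.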

GPU Resource Monitoring Script

import subprocess
import time

def monitor_gpu_usage(interval: int = 60):
    """Print per-GPU memory and utilization every `interval` seconds."""
    while True:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            # Driver not loaded or GPU error: report it and try again next cycle
            print(f"nvidia-smi failed: {result.stderr.strip()}")
        else:
            for line in result.stdout.strip().split("\n"):
                idx, name, mem_used, mem_total, util = line.split(", ")
                print(f"GPU {idx} ({name}): {mem_used}/{mem_total}MB, Util: {util}%")
        time.sleep(interval)

if __name__ == "__main__":
    monitor_gpu_usage()

Pattern 6: Monitoring and Observability

Docker Compose AI full-stack deployment monitoring needs to cover GPU utilization, inference latency, vector database performance, and other AI-specific metrics.

Prometheus + Grafana

  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=10GB"
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      GF_SECURITY_ADMIN_USER: "${GRAFANA_USER}"
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped

Prometheus Configuration

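# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules; mount monitoring/alerts.yml at this path in the prometheus service.
rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  # Ollama has no built-in Prometheus endpoint at the time of writing; point this
  # job at an exporter or a metrics-emitting proxy in front of Ollama instead.
  - job_name: "ollama"
    static_configs:
      - targets: ["ollama:11434"]
    metrics_path: "/metrics"
    scrape_interval: 30s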

  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: "/metrics"
    scrape_interval: 30s

  - job_name: "dcgm"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 10s

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8080"]

Key Alert Rules

# monitoring/alerts.yml
groups:
  - name: ai-stack
    rules:
      # Assumes an exporter publishes this histogram; Ollama itself does not expose it.
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency too high"

      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage exceeds 90%"

      - alert: QdrantHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(rest_responses_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant query latency too high"

      - alert: OllamaContainerDown
        expr: up{job="ollama"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama service unavailable"
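
The latency alerts rely on histogram_quantile, which estimates a percentile from cumulative bucket counts instead of from raw samples. The sketch below is a simplified version of that estimation for classic histograms, written to show one practical consequence: when the requested quantile falls into the +Inf bucket, the result is capped at the highest finite bucket boundary. The bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an infinite upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # capped at the last finite boundary
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    sample = [(1.0, 10.0), (5.0, 50.0), (30.0, 90.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, sample))
```

With these sample buckets the 95th percentile lands in the +Inf bucket, so a rule with a threshold of `> 30` could never fire; the exporter's buckets need to extend past the alert threshold.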

Pattern 7: Production Hardening and Security

Security is the baseline for production Docker Compose AI full-stack deployment.

Secrets Management

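services:
  ollama:
    image: ollama/ollama:latest
    secrets:
      - hf_token
    environment:
      # Compose mounts the secret as a file; this variable only helps if the
      # process in the container reads it. Ollama does not document one.
      HF_TOKEN_FILE: /run/secrets/hf_token

  qdrant:
    image: qdrant/qdrant:latest
    secrets:
      - qdrant_api_key
    # Qdrant reads its key from QDRANT__SERVICE__API_KEY, so load the secret file
    # into that variable before starting (a sketch; assumes the image's default
    # command is ./entrypoint.sh).
    entrypoint: ["/bin/sh", "-c", "export QDRANT__SERVICE__API_KEY=\"$$(cat /run/secrets/qdrant_api_key)\" && exec ./entrypoint.sh"]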

secrets:
  hf_token:
    file: ./secrets/hf_token.txt
  qdrant_api_key:
    file: ./secrets/qdrant_api_key.txt
  db_password:
    file: ./secrets/db_password.txt

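Network Isolation

Networks marked internal: true have no route to the outside world. Services attached only to backend (Ollama, Qdrant, TEI) therefore cannot download models once isolation is on, so pull models first or attach a network with outbound access during setup.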

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
  monitoring:
    driver: bridge
    internal: true

services:
  traefik:
    networks:
      - frontend
      - backend

  open-webui:
    networks:
      - frontend
      - backend

  ollama:
    networks:
      - backend

  qdrant:
    networks:
      - backend

  tei:
    networks:
      - backend

  prometheus:
    networks:
      - monitoring
      - backend

  grafana:
    networks:
      - frontend
      - monitoring
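
A quick way to reason about this layout: two services can talk only if they share at least one network, and a service has outbound access only if it sits on a network that is not internal. The sketch below encodes the assignments from the configuration above; the helper names are hypothetical.

```python
NETWORKS = {
    "traefik": {"frontend", "backend"},
    "open-webui": {"frontend", "backend"},
    "ollama": {"backend"},
    "qdrant": {"backend"},
    "tei": {"backend"},
    "prometheus": {"monitoring", "backend"},
    "grafana": {"frontend", "monitoring"},
}
INTERNAL = {"backend", "monitoring"}  # networks declared with internal: true


def can_reach(a: str, b: str) -> bool:
    """Two services can connect only if they share at least one network."""
    return bool(NETWORKS[a] & NETWORKS[b])


def has_outbound_access(service: str) -> bool:
    """A service can reach the outside only via a non-internal network."""
    return bool(NETWORKS[service] - INTERNAL)


if __name__ == "__main__":
    for name in sorted(NETWORKS):
        print(f"{name}: outbound={has_outbound_access(name)}")
```

Running it shows that Grafana can query Prometheus but not Ollama directly, and that Ollama, Qdrant, TEI, and Prometheus have no outbound route.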

Backup Strategy

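#!/bin/bash
# scripts/backup-vectors.sh
set -euo pipefail  # stop on the first failed command

BACKUP_DIR="/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

echo "Backing up Qdrant..."
# Creates a full-storage snapshot on the Qdrant side; the saved response records its name for later download
curl -fsS -X POST "http://localhost:6333/snapshots" | jq . > "$BACKUP_DIR/qdrant_snapshot.json"

echo "Backing up Ollama models list..."
curl -fsS "http://localhost:11434/api/tags" | jq . > "$BACKUP_DIR/ollama_models.json"

echo "Backing up environment config..."
# .env contains credentials: keep the backup directory access-restricted
cp .env "$BACKUP_DIR/.env.backup"
cp docker-compose.yml "$BACKUP_DIR/docker-compose.yml.backup"

echo "Backup completed: $BACKUP_DIR"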

Full Production Configuration

# docker-compose.prod.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
    read_only: true
    tmpfs:
      - /tmp

  qdrant:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"

5 Common Pitfalls and Solutions

Pitfall 1: Ollama Model Pull Timeout

Symptom: After docker compose up, Ollama gets stuck pulling models. Large models (e.g., a 70B-parameter model) can take over an hour to download.

Solution: Use the model-init service for async pulling. The Ollama service itself doesn't need to wait for models.

  model-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
    deploy:
      resources:
        limits:
          memory: 256M

Pitfall 2: Qdrant Container OOM

Symptom: As vector data grows, the Qdrant container gets OOM Killed.

Solution: Set memory limits and enable memory mapping.

  qdrant:
    image: qdrant/qdrant:latest
    deploy:
      resources:
        limits:
          memory: 8G
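    environment:
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: "4"
      QDRANT__STORAGE__WAL__WAL_CAPACITY_MB: "64"
      # Segments larger than this (in KB) are stored as memory-mapped files instead of RAM
      QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB: "50000"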
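
Before picking a memory limit, it helps to estimate how much RAM the vectors need when held fully in memory. The sketch below uses a common rule of thumb: number of vectors times dimensions times 4 bytes per float32 value, multiplied by an overhead factor of 1.5 for indexes and metadata. The overhead factor is an assumption to tune against real measurements, and the function names are hypothetical.

```python
def estimate_vector_ram_bytes(
    num_vectors: int,
    dim: int,
    bytes_per_value: int = 4,   # float32
    overhead: float = 1.5,      # assumed allowance for index and metadata
) -> int:
    """Rough RAM estimate for vectors held fully in memory."""
    return int(num_vectors * dim * bytes_per_value * overhead)


def fits_in_limit(num_vectors: int, dim: int, limit_gib: float) -> bool:
    """Compare the estimate with a container memory limit given in GiB."""
    return estimate_vector_ram_bytes(num_vectors, dim) <= limit_gib * 1024**3


if __name__ == "__main__":
    needed = estimate_vector_ram_bytes(1_000_000, 1024)
    print(f"{needed / 1024**3:.2f} GiB for 1M vectors of 1024 dimensions")
```

If the estimate exceeds the limit, either raise the limit or move vectors to memory-mapped storage as configured above.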

Pitfall 3: GPU Driver Version Mismatch

Symptom: docker compose up errors with CUDA driver version is insufficient.

Solution: Ensure the host NVIDIA driver is recent enough for the CUDA version used by the image (≥ 535 here), and install nvidia-container-toolkit.

# Check driver version
nvidia-smi | head -3

# Install nvidia-container-toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Pitfall 4: Embedding Service and LLM Competing for GPU

Symptom: GPU memory insufficient when TEI and Ollama run simultaneously; model loading fails.

Solution: Use device_ids for precise GPU allocation, or run embedding service on CPU.

  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  tei:
    # Run the embedding service in CPU mode: no GPU reservation, and use a CPU
    # build of the TEI image (the project publishes tags prefixed with cpu-).
    environment:
      MODEL_ID: "BAAI/bge-m3"

Pitfall 5: Inter-Container DNS Resolution Failure

Symptom: OpenWebUI reports ollama: Name or service not known.

Solution: Ensure both services share a network, and use the service name (or container_name) as the hostname.

networks:
  ai-network:
    driver: bridge

services:
  ollama:
    container_name: ollama
    networks:
      - ai-network

  open-webui:
    container_name: open-webui
    networks:
      - ai-network
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"

10 Common Error Troubleshooting

1. `could not select device driver` — NVIDIA runtime not installed

# Install nvidia-container-toolkit and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtime
# Should see nvidia runtime

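2. `OOM Killed` — Container memory insufficient

# OOM Killed means the container exceeded its RAM limit (GPU exhaustion is error 7)
docker inspect --format '{{.State.OOMKilled}}' ollama
docker stats --no-stream
# Raise deploy.resources.limits.memory, or use smaller or quantized models
docker exec ollama ollama run qwen3:4b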

3. `Connection refused` to Ollama — Service not ready

# Check Ollama health status
docker compose ps
docker compose logs ollama
# Wait for healthcheck to pass before connecting

4. `permission denied` on Docker socket

sudo usermod -aG docker $USER
newgrp docker

5. `Qdrant collection not found` — Collection not created

curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'

6. `model not found` — Ollama model not pulled

docker exec ollama ollama pull qwen3:8b

7. `CUDA out of memory` — GPU memory overflow during inference

# Reduce parallel requests
# Set in docker-compose.yml
environment:
  OLLAMA_NUM_PARALLEL: "1"
  OLLAMA_MAX_LOADED_MODELS: "1"

8. `TLS handshake error` — Traefik certificate issue

# Check certificate file permissions
chmod 600 traefik/acme.json
# Check Traefik logs
docker compose logs traefik

9. `too many open files` — File descriptor limit

# Temporarily increase limit
ulimit -n 65536
# Permanent setting (/etc/security/limits.conf)
# * soft nofile 65536
# * hard nofile 65536

10. `vector dimension mismatch` — Embedding dimension inconsistency

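# Ensure the Qdrant collection dimension matches the embedding model output
# bge-m3: 1024 dimensions
# nomic-embed-text: 768 dimensions
# An existing collection cannot change size: delete it, recreate it, and re-index
curl -X DELETE "http://localhost:6333/collections/documents"
curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'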
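
The same mismatch can be caught in application code before a vector is sent to the database. A minimal sketch, using the two dimensions listed above; the lookup table and function name are hypothetical.

```python
# Output dimensions of the embedding models mentioned in this article
MODEL_DIMS = {"bge-m3": 1024, "nomic-embed-text": 768}


def dimension_matches(vector: list[float], collection_size: int) -> bool:
    """Return True if the vector length equals the collection's configured size."""
    return len(vector) == collection_size


if __name__ == "__main__":
    collection_size = 1024  # as created in the curl command above
    for model, dim in MODEL_DIMS.items():
        ok = dimension_matches([0.0] * dim, collection_size)
        print(f"{model}: {'ok' if ok else 'dimension mismatch'}")
```

Running the check at startup with one test embedding catches a swapped model before any data is written.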

Advanced Optimization Tips

Multi-Stage Model Warmup

  model-warmer:
    image: curlimages/curl:latest
    container_name: model-warmer
    depends_on:
      ollama:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c "
        echo 'Warming up models...' &&
        curl -s http://ollama:11434/api/generate -d '{\"model\":\"qwen3:8b\",\"prompt\":\"hi\",\"stream\":false}' > /dev/null &&
        curl -s http://ollama:11434/api/embed -d '{\"model\":\"nomic-embed-text\",\"input\":\"test\"}' > /dev/null &&
        echo 'Models warmed up!'
      "
    restart: "no"

Smart Model Unloading

  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "5m"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"

OLLAMA_KEEP_ALIVE: "5m" automatically unloads models idle for 5 minutes, freeing GPU memory.

Chained Health Check Dependencies

  tei:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 120s

  qdrant:
    healthcheck:
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]  # the qdrant image does not ship curl
      interval: 10s
      timeout: 5s
      retries: 3

  rag-app:
    depends_on:
      ollama:
        condition: service_healthy
      tei:
        condition: service_healthy
      qdrant:
        condition: service_healthy
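
The depends_on entries form a dependency graph, and the start order is a topological sort of it. The sketch below shows only the ordering for the services above; Compose additionally waits for each dependency to report healthy before starting the dependent service. The helper name startup_order is hypothetical.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it depends on
DEPENDS_ON = {
    "ollama": set(),
    "tei": set(),
    "qdrant": set(),
    "rag-app": {"ollama", "tei", "qdrant"},
}


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return services ordered so that every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(startup_order(DEPENDS_ON)))
```

A circular dependency makes the sorter raise graphlib.CycleError, which mirrors the configuration error Compose would report.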

Docker Compose Watch for Development

# docker-compose.yml
services:
  rag-app:
    build: .
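    develop:
      watch:
        # Rebuild the image when application code changes
        - action: rebuild
          path: ./app
          ignore:
            - static/
        # Copy static assets into the running container without a rebuild
        - action: sync
          path: ./app/static
          target: /app/static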

Comparison: Docker Compose vs K8s vs Docker Swarm

Dimension	Docker Compose	Kubernetes	Docker Swarm
AI Stack Deployment Complexity	⭐ Very Low	⭐⭐⭐⭐⭐ Very High	⭐⭐ Low
GPU Scheduling	✅ Native	✅ Device Plugin	⚠️ Needs config
Auto-Scaling	❌	✅ HPA	⚠️ Manual
Service Discovery	✅ DNS	✅ CoreDNS	✅ DNS
Rolling Updates	⚠️ Needs scripts	✅ Native	✅ Native
Config Management	✅ .env	✅ ConfigMap	⚠️ Config
Secret Management	⚠️ File-based secrets	✅ K8s Secret	✅ Native Docker Secret
Monitoring Ecosystem	✅ Prometheus	✅ Complete	⚠️ Limited
Multi-Node Orchestration	❌ Single node	✅ Core capability	✅ Native
Learning Curve	Low	High	Low
Community Activity	✅ Active	✅ Very Active	❌ Declining
Suitable AI Project Scale	1-5 GPUs	10+ GPUs	2-5 GPUs

Recommendation: Docker Compose AI full-stack deployment is ideal for single-machine 1-5 GPU scenarios — the best choice for AI development and small-scale production. For 5+ GPU or multi-node needs, consider Kubernetes + KServe/vLLM. Docker Swarm is not recommended for AI deployment.

Recommended Online Tools

JSON Formatter - Format Docker Compose and API response JSON data
Base64 Encode - Encode Secrets and API Key configurations
cURL to Code - Convert Qdrant/Ollama cURL commands to Python/JS code

Summary

Docker Compose AI full-stack deployment transforms AI dev environments from "three days to set up" to "one command to launch." Ollama + OpenWebUI handles LLM serving, Qdrant handles vector storage, TEI handles embeddings, Traefik handles the gateway, Prometheus + Grafana handles monitoring, and GPU passthrough makes inference fly. The 7 patterns cover the full chain from development to production, while the 5 common pitfalls and 10 troubleshooting entries help you avoid detours. For 1-5 GPU AI projects, Docker Compose is the most practical AI container orchestration solution in 2026.
