Docker Compose AI Full-Stack Deployment: One-Command Orchestration from LLM to Vector Database

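Still can't get an AI development environment running after three days?

It is 2026, and setting up an AI development environment is still a nightmare for developers. You need to install Ollama to run LLMs, configure Qdrant to store vectors, stand up an embedding service, put an API gateway in front for authentication, and then deal with GPU drivers, CUDA versions, and model downloads. Three days go by and you have not written a line of code.

A Docker Compose AI full-stack deployment puts all of this into one file, so docker compose up -d brings up the whole AI stack. This article is a hands-on guide to orchestrating AI containers with Docker Compose. It covers 7 core patterns, 5 common pitfalls, 10 error messages and how to troubleshoot them, and a production hardening plan.

Key Takeaways

Docker Compose AI full-stack deployment = LLM + vector database + embedding service + API gateway + monitoring, all in one file
Ollama + OpenWebUI is a mature, widely used option for serving LLMs locally
Qdrant and Milvus are the first-choice vector databases, and both are simple to run under Docker
GPU passthrough is the key to AI deployments; it is configured under deploy.resources.reservations.devices
Production environments need the full hardening set: authentication, rate limiting, monitoring, and backups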

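AI Full-Stack Architecture Overview
Pattern 1: Ollama + OpenWebUI LLM Service
Pattern 2: Qdrant/Milvus Vector Database
Pattern 3: Embedding Service and Model Management
Pattern 4: API Gateway and Authentication
Pattern 5: GPU Passthrough and Resource Limits
Pattern 6: Monitoring and Observability
Pattern 7: Production Hardening and Security
5 Common Pitfalls and Their Fixes
Troubleshooting 10 Common Errors
Advanced Optimization Tips
Comparison: Docker Compose vs K8s vs Docker Swarm
Recommended Online Tools
Summary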

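AI Full-Stack Architecture Overview

The core of a Docker Compose AI full-stack deployment is a 7-layer architecture. From the GPU at the bottom to the gateway at the top, each layer maps to its own container services: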

┌─────────────────────────────────────────────────────┐
│                     API Gateway                     │
│                  (Traefik / Nginx)                  │
│        Auth · Rate limiting · Routing · TLS         │
├──────────┬──────────┬──────────┬────────────────────┤
│ OpenWebUI│  RAG App │  Agent   │  Admin Panel       │
│  (Chat)  │ (Search) │ (Agents) │  (Management)      │
├──────────┴──────────┴──────────┴────────────────────┤
│                  Embedding Service                  │
│            (TEI / Infinity / FastEmbed)             │
├──────────────────┬──────────────────────────────────┤
│   Ollama LLM     │    vLLM / TGI                    │
│ (Model serving)  │   (High-throughput inference)    │
├──────────────────┴──────────────────────────────────┤
│                   Vector Database                   │
│            (Qdrant / Milvus / Weaviate)             │
├─────────────────────────────────────────────────────┤
│                   Infrastructure                    │
│       Redis · PostgreSQL · MinIO · Prometheus       │
├─────────────────────────────────────────────────────┤
│                  GPU / CPU Runtime                  │
│          NVIDIA CUDA · ROCm · CPU Fallback          │
└─────────────────────────────────────────────────────┘

Directory layout of the complete Docker Compose AI full-stack deployment:

ai-stack/
├── docker-compose.yml
├── docker-compose.gpu.yml
├── docker-compose.prod.yml
├── .env
├── ollama/
│   └── Modelfile
├── qdrant/
│   └── config.yaml
├── traefik/
│   ├── traefik.yml
│   └── acme.json
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── init-models.sh
    └── backup-vectors.sh

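Pattern 1: Ollama + OpenWebUI LLM Service

As of 2026, Ollama is one of the most mature ways to serve LLMs locally, with support for mainstream model families such as Llama 4, Qwen 3, and DeepSeek V3. OpenWebUI provides a ChatGPT-style web interface on top of it.

Basic configuration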

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "3"
    healthcheck:
      # The ollama image may not ship curl, so probe the server with its own CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      WEBUI_SECRET_KEY: "${WEBUI_SECRET_KEY}"
      ENABLE_SIGNUP: "false"
      DEFAULT_USER_ROLE: "user"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:

Automatic model pulls

Models normally have to be pulled by hand after Ollama starts, but an initialization script can automate this:

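#!/bin/sh
# scripts/init-models.sh
# Runs inside the model-init container, so Ollama is reached by its service name.
# POSIX sh only: the container starts this script with /bin/sh, so no bash arrays.
# OLLAMA_URL is an optional override introduced here for illustration.
OLLAMA_URL="${OLLAMA_URL:-http://ollama:11434}"

# Check that every tag exists in the Ollama library; a missing tag retries forever.
MODELS="qwen3:8b llama4:8b deepseek-v3:8b nomic-embed-text"

for model in $MODELS; do
  echo "Pulling model: $model"
  # /api/pull streams JSON status lines; the final one reports "success"
  until curl -s "$OLLAMA_URL/api/pull" -d "{\"name\":\"$model\"}" | grep -q "success"; do
    echo "  Retrying $model..."
    sleep 5
  done
  echo "  ✓ $model ready"
done

echo "All models pulled successfully!"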
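The init script above decides that a pull finished by grepping the response for the word success, which would also match that word inside an error message. As a minimal sketch of a stricter check, the function below parses the streamed response line by line as JSON and only accepts a line whose status field equals success. It assumes the stream is newline-delimited JSON with status and error fields; pull_succeeded is a hypothetical helper name, not part of Ollama.

```python
import json
from typing import Iterable


def pull_succeeded(stream_lines: Iterable[str]) -> bool:
    """Return True only if a streamed line reports status == "success".

    Assumes each line is a standalone JSON object (NDJSON). A line that
    carries an "error" field ends the check with False.
    """
    for raw in stream_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("error"):
            return False
        if event.get("status") == "success":
            return True
    return False
```

Fed with the lines of a pull response, this returns False for an error payload even when the error text happens to contain the word success.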

Add an initialization service to Docker Compose:

  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"

Custom Modelfile

# ollama/Modelfile
FROM qwen3:8b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"

SYSTEM """
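You are a professional AI assistant. When answering questions:
1. Give a concise conclusion first
2. Then provide a detailed explanation
3. If you are unsure, say so explicitly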
"""

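Build the custom model. The compose file above mounts only the ollama_data volume, so copy the Modelfile into the container first:

docker cp ollama/Modelfile ollama:/root/.ollama/Modelfile
docker exec ollama ollama create my-assistant -f /root/.ollama/Modelfile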

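Pattern 2: Qdrant/Milvus Vector Database

The vector database is the core of a RAG architecture. In a Docker Compose AI full-stack deployment the usual choice is Qdrant (lightweight) or Milvus (large scale).

Qdrant configuration (recommended for small and medium projects)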

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
      - ./qdrant/config.yaml:/qdrant/config/production.yaml:ro
    environment:
      QDRANT__SERVICE__GRPC_PORT: "6334"
      QDRANT__LOG_LEVEL: "INFO"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

Qdrant configuration file:

# qdrant/config.yaml
storage:
  performance:
    max_search_threads: 4
  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0
  optimizers:
    indexing_threshold: 20000
    memmap_threshold: 50000
service:
  max_request_size_mb: 64
  enable_cors: true
telemetry_disabled: true

Milvus configuration (recommended for large-scale projects)

  etcd:
    image: quay.io/coreos/etcd:v3.5.16
    container_name: milvus-etcd
    environment:
      ETCD_AUTO_COMPACTION_MODE: "revision"
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd
    # --data-dir points etcd at the mounted volume so its data survives restarts
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    container_name: milvus-minio
    environment:
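      # MINIO_ACCESS_KEY / MINIO_SECRET_KEY are deprecated names; current MinIO images use the ROOT variables
      MINIO_ROOT_USER: "${MINIO_ACCESS_KEY}"
      MINIO_ROOT_PASSWORD: "${MINIO_SECRET_KEY}"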
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - minio_data:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: unless-stopped

  milvus:
    image: milvusdb/milvus:v2.5-latest
    # Standalone mode has to be requested explicitly
    command: ["milvus", "run", "standalone"]
    container_name: milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    environment:
      ETCD_ENDPOINTS: "etcd:2379"
      MINIO_ADDRESS: "minio:9000"
    depends_on:
      - etcd
      - minio
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 90s
    restart: unless-stopped

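Vector database comparison

Feature	Qdrant	Milvus	Weaviate	ChromaDB
Deployment complexity	Very low (single container)	High (3+ containers)	Low (single container)	Very low (single container)
Performance (millions of vectors)	Excellent	Excellent	Good	Fair
Performance (hundreds of millions)	Good	Excellent	Fair	Not applicable
Filtered search	✅ Strong	✅ Strong	✅ Good	⚠️ Basic
Persistence	✅	✅	✅	⚠️ In-memory by default
Replication	✅	✅	✅	❌
gRPC support	✅	✅	✅	❌
Docker Compose fit	✅ Best	⚠️ Heavy	✅ Good	✅ Dev use
Production-ready	✅	✅	✅	❌ Dev only

Recommendation: Qdrant is the first choice for a Docker Compose AI full-stack deployment because it is simple to deploy and performs well. Consider Milvus once the dataset grows past 100 million vectors. ChromaDB is only suitable for prototyping.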
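To put the 100-million-vector threshold in perspective, here is a rough sizing sketch. It assumes float32 vectors (4 bytes per value) and an overhead multiplier of 1.5 for index and metadata. That multiplier is a rule-of-thumb assumption, not a measured figure, and estimate_vector_memory_gb is a hypothetical helper.

```python
def estimate_vector_memory_gb(
    num_vectors: int,
    dim: int,
    bytes_per_value: int = 4,   # float32
    overhead: float = 1.5,      # assumed rule-of-thumb factor for index and metadata
) -> float:
    """Rough RAM estimate in GB (10^9 bytes) for keeping all vectors in memory."""
    if num_vectors < 0 or dim <= 0:
        raise ValueError("num_vectors must be >= 0 and dim must be > 0")
    return num_vectors * dim * bytes_per_value * overhead / 1e9
```

Under these assumptions, one million 1024-dimensional vectors need about 6 GB, while 100 million need about 614 GB. That is well beyond a typical single host unless on-disk or memory-mapped storage is used.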

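Pattern 3: Embedding Service and Model Management

The embedding service turns text into vectors and is a key step in the RAG pipeline. There are three mainstream options when orchestrating AI containers with Docker Compose.

Hugging Face TEI (recommended)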

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    container_name: tei
    ports:
      - "8080:80"
    volumes:
      - tei_cache:/data
    environment:
      MODEL_ID: "BAAI/bge-m3"
      REVISION: "main"
      MAX_BATCH_TOKENS: "16384"
      MAX_CLIENT_BATCH_SIZE: "32"
      HF_TOKEN: "${HF_TOKEN}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s
    restart: unless-stopped

Infinity embedding service

  infinity:
    image: michaelf34/infinity:latest
    container_name: infinity
    ports:
      - "7997:7997"
    volumes:
      - infinity_cache:/app/.cache
    environment:
      MODEL_ID: "BAAI/bge-m3"
      ENGINE: "optimum"
      BATCH_SIZE: "32"
    # The Infinity CLI expects the v2 subcommand before its options
    command: >
      v2
      --model-id BAAI/bge-m3
      --engine optimum
      --port 7997
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7997/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

Example: calling the embedding service

import httpx

# Service names resolve through Docker's internal DNS on the shared network
TEI_URL = "http://tei:80/embed"
QDRANT_SEARCH_URL = "http://qdrant:6333/collections/documents/points/search"

async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with TEI; returns one vector per input text."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(TEI_URL, json={"inputs": texts})
        response.raise_for_status()
        return response.json()

async def search_similar(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query, then return the top_k nearest points from Qdrant."""
    query_embedding = await get_embeddings([query])
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            QDRANT_SEARCH_URL,
            json={
                "vector": query_embedding[0],
                "limit": top_k,
                "with_payload": True
            }
        )
        response.raise_for_status()
        return response.json()["result"]
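The TEI service above is configured with MAX_CLIENT_BATCH_SIZE: "32", so a single request carrying more inputs than that is expected to be rejected. A minimal sketch of client-side batching follows; batched is a hypothetical helper, and the default of 32 simply mirrors that setting.

```python
from typing import Iterator


def batched(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]
```

Each yielded batch can then be passed to get_embeddings in turn and the results concatenated in order.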

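Embedding service comparison

Feature	TEI	Infinity	FastEmbed
GPU acceleration	✅ Native	✅ Native	❌ CPU
Batch inference	✅ Efficient	✅ Efficient	⚠️ Fair
Multiple models	✅	✅	✅
Docker image size	~2GB	~4GB	~500MB
Production-ready	✅	✅	⚠️ Dev use
OpenAI-compatible API	✅	✅	❌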

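Pattern 4: API Gateway and Authentication

A production Docker Compose AI full-stack deployment must have an API gateway that centralizes authentication, rate limiting, and routing.

Traefik configuration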

  traefik:
    image: traefik:v3.2
    container_name: traefik
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - traefik_certs:/etc/traefik/certs
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
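    # Traefik treats the static config file and CLI flags as mutually exclusive sources,
    # so define these options in only one place (here, or in the mounted traefik.yml)
    command: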
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.directory=/etc/traefik/dynamic"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    labels:
      traefik.enable: "true"
      traefik.http.routers.traefik.rule: "Host(`traefik.ai-stack.local`)"
      traefik.http.routers.traefik.entrypoints: "websecure"
      traefik.http.routers.traefik.tls: "true"
      # The dashboard is served by Traefik's built-in api@internal service
      traefik.http.routers.traefik.service: "api@internal"
    restart: unless-stopped

Connecting OpenWebUI to Traefik

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
    labels:
      traefik.enable: "true"
      traefik.http.routers.webui.rule: "Host(`chat.ai-stack.local`)"
      traefik.http.routers.webui.entrypoints: "websecure"
      traefik.http.routers.webui.tls: "true"
      traefik.http.services.webui.loadbalancer.server.port: "8080"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

Authentication middleware

# traefik/dynamic/auth.yml
http:
  middlewares:
    auth-middleware:
      forwardAuth:
        address: "http://auth-service:8000/verify"
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Id"
          - "X-User-Role"

    rate-limit:
      rateLimit:
        average: 30
        burst: 60
        period: 1m

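  routers:
    api-router:
      rule: "Host(`api.ai-stack.local`)"
      entryPoints:
        - "websecure"
      tls: {}
      middlewares:
        - "auth-middleware"
        - "rate-limit"
      service: "ollama-api"

  # The router above needs a matching service definition
  services:
    ollama-api:
      loadBalancer:
        servers:
          - url: "http://ollama:11434"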
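The rate-limit middleware above allows an average of 30 requests per minute with bursts of up to 60. The sketch below illustrates the token-bucket idea behind limits of this kind: the bucket holds at most burst tokens and refills at average/period tokens per second. It is an independent illustration with a caller-supplied clock, not Traefik's implementation, and TokenBucket is a hypothetical class.

```python
class TokenBucket:
    """Minimal token bucket: capacity tokens at most, refilled at rate_per_sec."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity   # start full, so an initial burst is allowed
        self.last = 0.0          # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# average: 30 per 1m -> 0.5 tokens per second; burst: 60 -> bucket capacity
bucket = TokenBucket(rate_per_sec=30 / 60, capacity=60)
```

With these numbers a client can send 60 requests at once, after which it earns one more request every two seconds.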

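Pattern 5: GPU Passthrough and Resource Limits

The GPU is the heart of an AI deployment. A Docker Compose AI full-stack deployment passes GPUs through to containers with deploy.resources.reservations.devices.

NVIDIA GPU passthrough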

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      OLLAMA_KEEP_ALIVE: "24h"
    healthcheck:
      # The ollama image may not ship curl, so probe the server with its own CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

Multi-GPU allocation

  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    environment:
      CUDA_VISIBLE_DEVICES: "0"

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
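    # No CUDA_VISIBLE_DEVICES here: the container only sees the GPU reserved above,
    # and inside the container that GPU is enumerated as device 0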

CPU fallback configuration

  ollama-cpu:
    image: ollama/ollama:latest
    profiles: ["cpu-only"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "1"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4.0"

  ollama-gpu:
    image: ollama/ollama:latest
    profiles: ["gpu"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 16G

How to start:

# GPU mode
docker compose --profile gpu up -d

# CPU mode
docker compose --profile cpu-only up -d
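Choosing between the two profiles can be automated. The sketch below picks the gpu profile when an nvidia-smi binary is found on the PATH and falls back to cpu-only otherwise. Treating the presence of nvidia-smi as proof of a usable GPU is a simplifying assumption, and both function names are hypothetical.

```python
import shutil


def detect_gpu() -> bool:
    """Heuristic: assume a usable NVIDIA GPU if nvidia-smi is on the PATH."""
    return shutil.which("nvidia-smi") is not None


def compose_command(gpu_available: bool) -> list[str]:
    """Build the docker compose invocation for the matching profile."""
    profile = "gpu" if gpu_available else "cpu-only"
    return ["docker", "compose", "--profile", profile, "up", "-d"]
```

The resulting list can be handed to subprocess.run(compose_command(detect_gpu()), check=True) from a launcher script.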

GPU resource monitoring script

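import subprocess
import time

# Fields requested from nvidia-smi, in the order they are unpacked below
QUERY = "index,name,memory.used,memory.total,utilization.gpu"

def monitor_gpu_usage(interval: int = 60) -> None:
    """Print per-GPU memory use and utilization every `interval` seconds."""
    while True:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            # Non-zero exit (for example, the driver is not loaded): report and retry
            print(f"nvidia-smi failed: {result.stderr.strip()}")
        else:
            for line in result.stdout.strip().split("\n"):
                idx, name, mem_used, mem_total, util = line.split(", ")
                print(f"GPU {idx} ({name}): {mem_used}/{mem_total}MB, Util: {util}%")
        time.sleep(interval)

if __name__ == "__main__":
    monitor_gpu_usage()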
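The monitoring script above only prints each line. As a sketch of how the same CSV output could drive a simple check, the helpers below parse one line into a typed record and flag GPUs whose memory use exceeds a threshold. The 0.9 default is an illustrative choice, and the names GpuSample, parse_gpu_line, and over_threshold are hypothetical. Fields reported as [N/A] would make the integer conversion fail, so real use needs extra handling.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    index: int
    name: str
    mem_used_mb: int
    mem_total_mb: int
    util_pct: int

    @property
    def mem_ratio(self) -> float:
        """Fraction of GPU memory in use, between 0 and 1."""
        return self.mem_used_mb / self.mem_total_mb


def parse_gpu_line(line: str) -> GpuSample:
    """Parse one 'csv,noheader,nounits' line with the five queried fields."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    idx, name, used, total, util = parts
    return GpuSample(int(idx), name, int(used), int(total), int(util))


def over_threshold(samples: list[GpuSample], threshold: float = 0.9) -> list[GpuSample]:
    """Return the samples whose memory ratio is above the threshold."""
    return [s for s in samples if s.mem_ratio > threshold]
```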

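Pattern 6: Monitoring and Observability

Monitoring a Docker Compose AI full-stack deployment has to cover AI-specific metrics such as GPU utilization, inference latency, and vector database performance.

Prometheus + Grafana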

  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=10GB"
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      GF_SECURITY_ADMIN_USER: "${GRAFANA_USER}"
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped

Prometheus configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
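  # Verify that your Ollama version serves /metrics. If it does not, point this job
  # at an exporter sidecar, otherwise up{job="ollama"} stays 0 and the
  # OllamaContainerDown alert fires permanently.
  - job_name: "ollama"
    static_configs:
      - targets: ["ollama:11434"]
    metrics_path: "/metrics"
    scrape_interval: 30s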

  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: "/metrics"
    scrape_interval: 30s

  - job_name: "dcgm"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 10s

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8080"]

Key alerting rules

# monitoring/alerts.yml
groups:
  - name: ai-stack
    rules:
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency is too high"

      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage is above 90%"

      - alert: QdrantHighLatency
        expr: histogram_quantile(0.95, rate(qdrant_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant query latency is too high"

      - alert: OllamaContainerDown
        expr: up{job="ollama"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama service is unavailable"
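The latency alerts above rely on histogram_quantile(0.95, ...), which estimates a percentile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that idea on plain numbers. It is a simplified illustration, not Prometheus' code: it assumes buckets are sorted by upper bound, counts are cumulative, the last bucket is +Inf, and the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an infinite upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data: 100 requests, bucket bounds in seconds
example = [(1.0, 10), (5.0, 60), (30.0, 95), (math.inf, 100)]
```

Because the estimate is interpolated, its accuracy depends on the bucket boundaries. An alert threshold such as 30 seconds is only meaningful if a bucket boundary sits at or near that value.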

Pattern 7: Production Hardening and Security

Once a Docker Compose AI full-stack deployment goes to production, security is the baseline.

Secrets management

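# Note: the *_FILE variables below only take effect if the image implements that
# convention; check each image's documentation before relying on them
services: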
  ollama:
    image: ollama/ollama:latest
    secrets:
      - hf_token
    environment:
      HF_TOKEN_FILE: /run/secrets/hf_token

  qdrant:
    image: qdrant/qdrant:latest
    secrets:
      - qdrant_api_key
    environment:
      QDRANT__SERVICE__API_KEY_FILE: /run/secrets/qdrant_api_key

secrets:
  hf_token:
    file: ./secrets/hf_token.txt
  qdrant_api_key:
    file: ./secrets/qdrant_api_key.txt
  db_password:
    file: ./secrets/db_password.txt

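Network isolation

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    # internal networks have no outbound internet access, so services attached only
    # to them cannot download models; pull models before enabling isolation
    internal: true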
  monitoring:
    driver: bridge
    internal: true

services:
  traefik:
    networks:
      - frontend
      - backend

  open-webui:
    networks:
      - frontend
      - backend

  ollama:
    networks:
      - backend

  qdrant:
    networks:
      - backend

  tei:
    networks:
      - backend

  prometheus:
    networks:
      - monitoring
      - backend

  grafana:
    networks:
      - frontend
      - monitoring
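The network layout above can be sanity-checked before deployment. The sketch below encodes the service-to-network mapping from this section and answers two questions: can two services talk to each other (they share a network), and does a service have outbound access (it sits on at least one non-internal network). The dictionary is copied by hand from the YAML, and both helpers are hypothetical.

```python
# Service -> attached networks, copied from the compose snippet in this section
NETWORKS: dict[str, set[str]] = {
    "traefik": {"frontend", "backend"},
    "open-webui": {"frontend", "backend"},
    "ollama": {"backend"},
    "qdrant": {"backend"},
    "tei": {"backend"},
    "prometheus": {"monitoring", "backend"},
    "grafana": {"frontend", "monitoring"},
}

# Networks declared with internal: true
INTERNAL: set[str] = {"backend", "monitoring"}


def can_reach(a: str, b: str) -> bool:
    """Two services can talk directly only if they share a network."""
    return bool(NETWORKS[a] & NETWORKS[b])


def has_egress(service: str) -> bool:
    """A service can reach outside hosts only via a non-internal network."""
    return bool(NETWORKS[service] - INTERNAL)
```

With this mapping, grafana cannot reach ollama directly, and ollama, qdrant, tei, and prometheus have no outbound access.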

Backup strategy

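#!/bin/bash
# scripts/backup-vectors.sh
set -euo pipefail

BACKUP_DIR="/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

echo "Backing up Qdrant..."
# Create a full-storage snapshot, then download it; otherwise it stays inside the container
SNAPSHOT=$(curl -fsS -X POST "http://localhost:6333/snapshots" | jq -r '.result.name')
curl -fsS -o "$BACKUP_DIR/$SNAPSHOT" "http://localhost:6333/snapshots/$SNAPSHOT"

echo "Backing up Ollama models list..."
curl -fsS "http://localhost:11434/api/tags" | jq . > "$BACKUP_DIR/ollama_models.json"

echo "Backing up environment config..."
# .env holds secrets, so restrict access to the copy
cp .env "$BACKUP_DIR/.env.backup"
chmod 600 "$BACKUP_DIR/.env.backup"
cp docker-compose.yml "$BACKUP_DIR/docker-compose.yml.backup"

echo "Backup completed: $BACKUP_DIR"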
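The backup script creates a new timestamped directory on every run, so old backups accumulate. A minimal retention sketch follows. It assumes directory names use the same YYYYMMDD_HHMMSS format as BACKUP_DIR, which sorts chronologically as plain text. The keep count of 7 and the function name are illustrative assumptions.

```python
import re

# Matches names produced by: date +%Y%m%d_%H%M%S
TIMESTAMP_DIR = re.compile(r"^\d{8}_\d{6}$")


def backups_to_prune(dir_names: list[str], keep: int = 7) -> list[str]:
    """Return timestamped backup names to delete, keeping the newest `keep`."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    stamped = sorted((n for n in dir_names if TIMESTAMP_DIR.match(n)), reverse=True)
    return stamped[keep:]
```

Names that do not match the timestamp pattern are ignored, so unrelated files in the backup folder are never selected for deletion.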

Complete production configuration

# docker-compose.prod.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
    read_only: true
    tmpfs:
      - /tmp

  qdrant:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"

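5 Common Pitfalls and Their Fixes

Pitfall 1: Ollama model pulls time out

Symptom: after docker compose up, Ollama appears stuck pulling models. Downloading a large model (such as Llama 4 70B) can take more than an hour.

Solution: pull models asynchronously with the model-init service. The Ollama service itself does not need to wait for models to be ready.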

  model-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
    deploy:
      resources:
        limits:
          memory: 256M

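Pitfall 2: Qdrant container OOM

Symptom: as the amount of vector data grows, the Qdrant container gets OOM-killed.

Solution: give the container a memory limit sized for the dataset and reduce in-memory pressure. The snippet below caps search threads and WAL size; memory-mapped storage is governed by memmap_threshold in qdrant/config.yaml (see Pattern 2).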

  qdrant:
    image: qdrant/qdrant:latest
    deploy:
      resources:
        limits:
          memory: 8G
    environment:
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: "4"
      QDRANT__STORAGE__WAL__WAL_CAPACITY_MB: "64"

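Pitfall 3: GPU driver version mismatch

Symptom: docker compose up fails with CUDA driver version is insufficient.

Solution: make sure the host NVIDIA driver is version 535 or newer (it must be recent enough for the CUDA version the image uses), and install nvidia-container-toolkit.

# Check the driver version
nvidia-smi | head -3

# Install nvidia-container-toolkit (Ubuntu)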
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

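Pitfall 4: the embedding service and the LLM compete for the GPU

Symptom: when TEI and Ollama run at the same time, GPU memory runs out and model loading fails.

Solution: assign GPUs precisely with device_ids, or run the embedding service on CPU.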

  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  tei:
    # Run the embedding service in CPU mode
    environment:
      MODEL_ID: "BAAI/bge-m3"
      # No GPU reservation, so the service uses the CPU

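Pitfall 5: DNS resolution between containers fails

Symptom: OpenWebUI reports ollama: Name or service not known.

Solution: make sure all services are on the same network, and use the container_name or the service name as the hostname.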

networks:
  ai-network:
    driver: bridge

services:
  ollama:
    container_name: ollama
    networks:
      - ai-network

  open-webui:
    container_name: open-webui
    networks:
      - ai-network
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"

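Troubleshooting 10 Common Errors

1. `could not select device driver`: NVIDIA runtime not installed

# Install nvidia-container-toolkit, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtime
# The nvidia runtime should be listed

2. `OOM Killed`: the container ran out of memory

# OOM Killed comes from the kernel when the container exceeds its RAM limit, not from the GPU
# Check container memory and GPU memory
docker stats --no-stream ollama
nvidia-smi
# Use a smaller model or a quantized version
docker exec ollama ollama run qwen3:4b

3. `Connection refused` to Ollama: service not ready

# Check Ollama's health status
docker compose ps
docker compose logs ollama
# Wait for the healthcheck to pass before connecting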

4. `permission denied` on Docker socket

sudo usermod -aG docker $USER
newgrp docker

5. `Qdrant collection not found`: collection not created

curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'

6. `model not found`: Ollama model not pulled

docker exec ollama ollama pull qwen3:8b

7. `CUDA out of memory`: GPU memory exhausted during inference

# Reduce the number of parallel requests
# Set this in docker-compose.yml
environment:
  OLLAMA_NUM_PARALLEL: "1"
  OLLAMA_MAX_LOADED_MODELS: "1"

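8. `TLS handshake error`: Traefik certificate problem

# Check the certificate file permissions
chmod 600 traefik/acme.json
# Inspect the Traefik logs
docker compose logs traefik

9. `too many open files`: file descriptor limit

# Raise the limit temporarily (affects the current host shell only)
ulimit -n 65536
# Set it permanently (/etc/security/limits.conf)
# * soft nofile 65536
# * hard nofile 65536
# For a container, set ulimits.nofile on the service in docker-compose.yml instead

10. `vector dimension mismatch`: embedding dimensions do not match

# Make sure the Qdrant collection dimension matches the embedding model's output dimension
# bge-m3: 1024 dimensions
# nomic-embed-text: 768 dimensions
# An existing collection with the wrong size has to be deleted and recreated
curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'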
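A dimension mismatch is cheaper to catch on the client than to debug from a failed upsert. The sketch below validates vectors against the collection's configured size before anything is sent. The value 1024 matches the collection created above, and check_dimensions is a hypothetical helper.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> int:
    """Raise ValueError on the first vector whose length differs from expected_dim.

    Returns the number of vectors checked when all of them match.
    """
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, collection expects {expected_dim}"
            )
    return len(vectors)
```

Calling check_dimensions(embeddings, 1024) right after the embedding call makes a model swap (for example from bge-m3 to nomic-embed-text) fail fast with a clear message.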

Advanced Optimization Tips

Multi-stage model warm-up

  model-warmer:
    image: curlimages/curl:latest
    container_name: model-warmer
    depends_on:
      ollama:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c "
        echo 'Warming up models...' &&
        curl -s http://ollama:11434/api/generate -d '{\"model\":\"qwen3:8b\",\"prompt\":\"hi\",\"stream\":false}' > /dev/null &&
        curl -s http://ollama:11434/api/embed -d '{\"model\":\"nomic-embed-text\",\"input\":\"test\"}' > /dev/null &&
        echo 'Models warmed up!'
      "
    restart: "no"

Smart model unloading

  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "5m"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"

OLLAMA_KEEP_ALIVE: "5m" unloads a model automatically after 5 minutes without requests, which frees GPU memory.

Chained health-check dependencies

  tei:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 120s

  qdrant:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3

  rag-app:
    depends_on:
      ollama:
        condition: service_healthy
      tei:
        condition: service_healthy
      qdrant:
        condition: service_healthy

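Automatic dev reloads with Docker Compose Watch

# docker-compose.yml
services:
  rag-app:
    build: .
    develop:
      watch:
        # Rebuild the image when application code changes (rebuild does not use target)
        - action: rebuild
          path: ./app
          ignore:
            - static/
        # Copy static assets into the running container without a rebuild
        - action: sync
          path: ./app/static
          target: /app/static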

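Comparison: Docker Compose vs K8s vs Docker Swarm

Dimension	Docker Compose	Kubernetes	Docker Swarm
AI full-stack deployment complexity	⭐ Very low	⭐⭐⭐⭐⭐ Very high	⭐⭐ Low
GPU scheduling	✅ Native support	✅ Device Plugin	⚠️ Needs configuration
Autoscaling	❌	✅ HPA	⚠️ Manual
Service discovery	✅ DNS	✅ CoreDNS	✅ DNS
Rolling updates	⚠️ Needs scripting	✅ Native	✅ Native
Configuration management	✅ .env	✅ ConfigMap	⚠️ Config
Secret management	⚠️ File-based secrets	✅ K8s Secret	✅ Native secrets
Monitoring ecosystem	✅ Prometheus	✅ Complete	⚠️ Limited
Multi-node orchestration	❌ Single host	✅ Core capability	✅ Native
Learning curve	Low	High	Low
Community activity	✅ Active	✅ Very active	❌ Declining
Suitable AI project size	1-5 GPUs	10+ GPUs	2-5 GPUs

Recommendation: a Docker Compose AI full-stack deployment fits single-host setups with 1-5 GPUs and is the best choice for AI development and small-scale production. Beyond 5 GPUs, or when multiple nodes are required, consider Kubernetes + KServe/vLLM. Docker Swarm is not recommended for AI deployments.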

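Recommended Online Tools

JSON formatter - format Docker Compose output and the JSON returned by APIs
Base64 encoder - encode sensitive configuration such as Secrets and API keys
cURL to code - convert Qdrant/Ollama cURL commands into Python/JS code

Summary

A Docker Compose AI full-stack deployment turns an AI development environment from "three days and still not running" into "one command to bring it up". Ollama + OpenWebUI handle LLM serving, Qdrant handles vector storage, TEI handles embeddings, Traefik handles the gateway, Prometheus + Grafana handle monitoring, and GPU passthrough makes inference fast. The 7 patterns cover the whole path from development to production, and the 5 common pitfalls and 10 error fixes help you avoid detours. For AI projects with 1-5 GPUs, Docker Compose is the most practical AI container orchestration option in 2026.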
