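Docker Compose AI Full-Stack Deployment: One-Command Orchestration from LLM to Vector Database
Three days in and your AI dev environment still isn't running?
It is 2026, and setting up an AI development environment is still painful. You install Ollama to run an LLM, configure Qdrant to store vectors, stand up an embedding service, put an API gateway in front for authentication, and then deal with GPU drivers, CUDA versions, and model downloads. Three days later you have not written a line of application code.
A Docker Compose AI full-stack deployment puts all of this into one file, so that docker compose up -d brings up the whole stack. This article is a hands-on guide to orchestrating AI containers with Docker Compose. It covers 7 core patterns, 5 common pitfalls, 10 common errors and how to troubleshoot them, and a production hardening plan.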
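Key Takeaways
- Docker Compose AI full-stack deployment = LLM + vector database + embedding service + API gateway + monitoring, all in one file
- Ollama + OpenWebUI is the most mature option for serving LLMs locally
- Qdrant/Milvus are the first choices for a vector database, and both are simple to deploy with Docker
- GPU passthrough is the key to AI deployment; it is configured through deploy.resources.reservations.devices
- Production environments need the full hardening set: authentication, rate limiting, monitoring, and backups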
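Table of Contents
- AI Full-Stack Architecture Overview
- Pattern 1: Ollama + OpenWebUI LLM Service
- Pattern 2: Qdrant/Milvus Vector Databases
- Pattern 3: Embedding Services and Model Management
- Pattern 4: API Gateway and Authentication
- Pattern 5: GPU Passthrough and Resource Limits
- Pattern 6: Monitoring and Observability
- Pattern 7: Production Hardening and Security
- 5 Common Pitfalls and Fixes
- Troubleshooting 10 Common Errors
- Advanced Optimization Tips
- Comparison: Docker Compose vs K8s vs Docker Swarm
- Recommended Online Tools
- Summary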
AI Full-Stack Architecture Overview
The core of a Docker Compose AI full-stack deployment is a 7-layer architecture. From the gateway at the top down to the GPU runtime at the bottom, every layer maps to one or more container services:
┌─────────────────────────────────────────────────────┐
│                     API Gateway                     │
│                  (Traefik / Nginx)                  │
│        Auth · Rate limiting · Routing · TLS         │
├──────────┬──────────┬──────────┬────────────────────┤
│ OpenWebUI│ RAG App  │  Agent   │    Admin Panel     │
│  (Chat)  │ (Search) │ (Agents) │    (Management)    │
├──────────┴──────────┴──────────┴────────────────────┤
│                  Embedding Service                  │
│            (TEI / Infinity / FastEmbed)             │
├──────────────────┬──────────────────────────────────┤
│    Ollama LLM    │            vLLM / TGI            │
│ (Model serving)  │   (High-performance inference)   │
├──────────────────┴──────────────────────────────────┤
│                   Vector Database                   │
│            (Qdrant / Milvus / Weaviate)             │
├─────────────────────────────────────────────────────┤
│                   Infrastructure                    │
│       Redis · PostgreSQL · MinIO · Prometheus       │
├─────────────────────────────────────────────────────┤
│                  GPU / CPU Runtime                  │
│          NVIDIA CUDA · ROCm · CPU Fallback          │
└─────────────────────────────────────────────────────┘
Directory layout for the complete Docker Compose AI full-stack deployment:
ai-stack/
├── docker-compose.yml
├── docker-compose.gpu.yml
├── docker-compose.prod.yml
├── .env
├── ollama/
│   └── Modelfile
├── qdrant/
│   └── config.yaml
├── traefik/
│   ├── traefik.yml
│   └── acme.json
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── init-models.sh
    └── backup-vectors.sh
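Pattern 1: Ollama + OpenWebUI LLM Service
As of 2026, Ollama is among the most mature options for serving LLMs locally, with support for mainstream model families such as Llama 4, Qwen 3, and DeepSeek V3. OpenWebUI adds a ChatGPT-style web interface on top.
Basic configuration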
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "3"
    healthcheck:
      # The ollama/ollama image may not ship curl, so probe with the bundled CLI.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      WEBUI_SECRET_KEY: "${WEBUI_SECRET_KEY}"
      ENABLE_SIGNUP: "false"
      DEFAULT_USER_ROLE: "user"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
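Automatic model pulls
Models normally have to be pulled by hand after Ollama starts, but an init script can automate it. The script below is written for POSIX sh and reaches Ollama by its service name, so that it also works inside the model-init container:
#!/bin/sh
# scripts/init-models.sh
# POSIX sh (no arrays): the curlimages/curl image provides /bin/sh, not bash.
# Check every tag against the Ollama model library before relying on it.
OLLAMA_URL="${OLLAMA_URL:-http://ollama:11434}"
MODELS="qwen3:8b llama4:8b deepseek-v3:8b nomic-embed-text"

for model in $MODELS; do
  echo "Pulling model: $model"
  # /api/pull streams JSON status lines; the last one reports success.
  until curl -s "$OLLAMA_URL/api/pull" -d "{\"name\":\"$model\"}" | grep -q "success"; do
    echo "  Retrying $model..."
    sleep 5
  done
  echo "  ✓ $model ready"
done
echo "All models pulled successfully!"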
Add the init service to Docker Compose:
  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
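The retry loop in the init script decides success by grepping for the word "success" anywhere in the response. A stricter check, sketched here in Python, parses the streamed lines as JSON and looks only at the final status. The function name pull_succeeded is hypothetical, and the assumption that the stream consists of objects with a "status" field (or an "error" field on failure) should be checked against the Ollama API documentation.

```python
import json
from typing import Iterable


def pull_succeeded(stream_lines: Iterable[str]) -> bool:
    """Return True only if the last status seen in the stream is "success".

    Assumes each line is one JSON object such as {"status": "..."},
    and that a failure is reported as {"error": "..."}.
    """
    last_status = None
    for raw in stream_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if "error" in event:
            return False
        last_status = event.get("status", last_status)
    return last_status == "success"
```

Fed with the lines of a pull response, this returns False for an interrupted download instead of treating any mention of "success" as a pass.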
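Custom Modelfile
# ollama/Modelfile
FROM qwen3:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """
You are a professional AI assistant. When answering:
1. Give a concise conclusion first
2. Then provide a detailed explanation
3. If you are not sure, say so explicitly
"""
Build the custom model. The compose file above stores /root/.ollama in a named volume and does not mount ./ollama/Modelfile, so copy the file into the container first:
docker cp ollama/Modelfile ollama:/root/.ollama/Modelfile
docker exec ollama ollama create my-assistant -f /root/.ollama/Modelfile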
Pattern 2: Qdrant/Milvus Vector Databases
The vector database is the core of a RAG architecture. In a Docker Compose AI stack the usual choice is Qdrant (lightweight) or Milvus (large scale).
Qdrant configuration (recommended for small and medium projects)
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
      - ./qdrant/config.yaml:/qdrant/config/production.yaml:ro
    environment:
      QDRANT__SERVICE__GRPC_PORT: "6334"
      QDRANT__LOG_LEVEL: "INFO"
    healthcheck:
      # Assumes curl is present in the image; if it is not, use another probe.
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Qdrant configuration file:
# qdrant/config.yaml
storage:
  performance:
    max_search_threads: 4
  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0
  optimizers:
    indexing_threshold: 20000
    memmap_threshold: 50000
service:
  max_request_size_mb: 64
  enable_cors: true
telemetry_disabled: true
Milvus configuration (recommended for large projects)
  etcd:
    image: quay.io/coreos/etcd:v3.5.16
    container_name: milvus-etcd
    environment:
      ETCD_AUTO_COMPACTION_MODE: "revision"
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd
    # --data-dir points etcd at the mounted volume so its data survives restarts.
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    container_name: milvus-minio
    environment:
      MINIO_ACCESS_KEY: "${MINIO_ACCESS_KEY}"
      MINIO_SECRET_KEY: "${MINIO_SECRET_KEY}"
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - minio_data:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: unless-stopped

  milvus:
    image: milvusdb/milvus:v2.5-latest
    container_name: milvus
    # Run Milvus in standalone mode, as in the upstream standalone compose file.
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    environment:
      ETCD_ENDPOINTS: "etcd:2379"
      MINIO_ADDRESS: "minio:9000"
    depends_on:
      - etcd
      - minio
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 90s
    restart: unless-stopped
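Vector database comparison
| Feature | Qdrant | Milvus | Weaviate | ChromaDB |
|---|---|---|---|---|
| Deployment complexity | Very low (single container) | High (3+ containers) | Low (single container) | Very low (single container) |
| Performance (millions of vectors) | Excellent | Excellent | Good | Fair |
| Performance (hundreds of millions) | Good | Excellent | Fair | Not applicable |
| Filtered search | ✅ Strong | ✅ Strong | ✅ Good | ⚠️ Basic |
| Persistence | ✅ | ✅ | ✅ | ⚠️ In-memory by default |
| Replication | ✅ | ✅ | ✅ | ❌ |
| gRPC support | ✅ | ✅ | ❌ | ❌ |
| Fit for Docker Compose | ✅ Best | ⚠️ Heavy | ✅ Good | ✅ Development use |
| Production ready | ✅ | ✅ | ✅ | ❌ Development only |
Recommendation: Qdrant is the first choice for a Docker Compose AI stack because it is simple to deploy and performs well. Consider Milvus once the data set grows past 100 million vectors. ChromaDB is only suitable for prototyping.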
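To make the sizing advice concrete, here is a rough memory estimate in Python. It assumes float32 vectors held fully in RAM and uses the commonly cited rule of thumb of vectors × dimensions × 4 bytes × 1.5 for overhead; the 1.5 factor and the function name are assumptions for illustration, not measurements from this stack.

```python
def estimate_vector_memory_bytes(
    num_vectors: int,
    dim: int,
    bytes_per_value: int = 4,   # float32
    overhead: float = 1.5,      # assumed allowance for index and metadata
) -> int:
    """Rough RAM estimate for keeping all vectors in memory."""
    return int(num_vectors * dim * bytes_per_value * overhead)


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for count in (1_000_000, 100_000_000):
        size = estimate_vector_memory_bytes(count, dim=1024)
        print(f"{count:>11,} vectors x 1024 dims: about {to_gib(size):.1f} GiB")
```

Under these assumptions, one million 1024-dimensional vectors fit comfortably on a single host, while 100 million need hundreds of GiB unless vectors are memory-mapped or quantized, which is consistent with moving to a heavier setup at that scale.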
Pattern 3: Embedding Services and Model Management
The embedding service turns text into vectors and is a key step in the RAG pipeline. There are three mainstream options when orchestrating AI containers with Docker Compose.
Hugging Face TEI (recommended)
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    container_name: tei
    ports:
      - "8080:80"
    volumes:
      - tei_cache:/data
    environment:
      MODEL_ID: "BAAI/bge-m3"
      REVISION: "main"
      MAX_BATCH_TOKENS: "16384"
      MAX_CLIENT_BATCH_SIZE: "32"
      HF_TOKEN: "${HF_TOKEN}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s
    restart: unless-stopped
Infinity embedding service
  infinity:
    image: michaelf34/infinity:latest
    container_name: infinity
    ports:
      - "7997:7997"
    volumes:
      - infinity_cache:/app/.cache
    environment:
      MODEL_ID: "BAAI/bge-m3"
      ENGINE: "optimum"
      BATCH_SIZE: "32"
    # The image's CLI expects the v2 subcommand before the flags.
    command: >
      v2
      --model-id BAAI/bge-m3
      --engine optimum
      --port 7997
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7997/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Calling the embedding service
import httpx


async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with TEI; returns one vector per input."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://tei:80/embed",
            json={"inputs": texts},
        )
        response.raise_for_status()
        return response.json()


async def search_similar(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query, then search the "documents" collection in Qdrant."""
    query_embedding = await get_embeddings([query])
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://qdrant:6333/collections/documents/points/search",
            json={
                "vector": query_embedding[0],
                "limit": top_k,
                "with_payload": True,
            },
        )
        response.raise_for_status()
        return response.json()["result"]
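Embedding service comparison
| Feature | TEI | Infinity | FastEmbed |
|---|---|---|---|
| GPU acceleration | ✅ Native | ✅ Native | ❌ CPU |
| Batch inference | ✅ Efficient | ✅ Efficient | ⚠️ Fair |
| Multiple models | ✅ | ✅ | ✅ |
| Docker image size | ~2GB | ~4GB | ~500MB |
| Production ready | ✅ | ✅ | ⚠️ Development use |
| OpenAI-compatible API | ✅ | ✅ | ❌ |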
Pattern 4: API Gateway and Authentication
A production Docker Compose AI stack needs an API gateway that centralizes authentication, rate limiting, and routing.
Traefik configuration
  traefik:
    image: traefik:v3.2
    container_name: traefik
    ports:
      - "80:80"
      - "443:443"
      # Host port 8080 is not published: TEI already uses it above,
      # and the dashboard is served through the router below.
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - traefik_certs:/etc/traefik/certs
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
    # Traefik reads static configuration from a single source. If traefik.yml
    # is mounted, put these options in that file instead of passing flags.
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.directory=/etc/traefik/dynamic"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    labels:
      traefik.enable: "true"
      traefik.http.routers.traefik.rule: "Host(`traefik.ai-stack.local`)"
      traefik.http.routers.traefik.entrypoints: "websecure"
      traefik.http.routers.traefik.tls: "true"
      # The dashboard is exposed through Traefik's built-in api@internal service.
      traefik.http.routers.traefik.service: "api@internal"
    restart: unless-stopped
Routing OpenWebUI through Traefik
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
    labels:
      traefik.enable: "true"
      traefik.http.routers.webui.rule: "Host(`chat.ai-stack.local`)"
      traefik.http.routers.webui.entrypoints: "websecure"
      traefik.http.routers.webui.tls: "true"
      traefik.http.services.webui.loadbalancer.server.port: "8080"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
Authentication middleware
# traefik/dynamic/auth.yml
http:
  middlewares:
    auth-middleware:
      forwardAuth:
        address: "http://auth-service:8000/verify"
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Id"
          - "X-User-Role"
    rate-limit:
      rateLimit:
        average: 30
        burst: 60
        period: 1m
  routers:
    api-router:
      rule: "Host(`api.ai-stack.local`)"
      entryPoints:
        - "websecure"
      tls: {}
      middlewares:
        - "auth-middleware"
        - "rate-limit"
      service: "ollama-api"
  # The router needs a matching service definition in the file provider.
  services:
    ollama-api:
      loadBalancer:
        servers:
          - url: "http://ollama:11434"
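The rate-limit settings read as "30 requests per minute on average, with bursts of up to 60". A generic token bucket, sketched below in Python, illustrates how those three numbers interact. This is an illustration of the general technique, not Traefik's actual implementation, and the class and parameter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Generic token bucket: refills at average/period, capped at burst."""
    average: int          # tokens added per period
    period_s: float       # period length in seconds
    burst: int            # bucket capacity
    last: float = 0.0     # timestamp of the previous call
    tokens: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        refill = elapsed * self.average / self.period_s
        self.tokens = min(float(self.burst), self.tokens + refill)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(average=30, period_s=60.0, burst=60)
    passed = sum(bucket.allow(now=0.0) for _ in range(100))
    print(f"{passed} of 100 simultaneous requests passed")
```

With average 30 over 60 seconds the bucket refills one token every two seconds, so a client that exhausts the burst settles to one request every two seconds.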
Pattern 5: GPU Passthrough and Resource Limits
The GPU is central to AI deployment. In Docker Compose, GPU passthrough is configured through deploy.resources.reservations.devices.
NVIDIA GPU passthrough
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      OLLAMA_KEEP_ALIVE: "24h"
    healthcheck:
      # The ollama/ollama image may not ship curl, so probe with the bundled CLI.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
Assigning multiple GPUs
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    # CUDA_VISIBLE_DEVICES is left unset: each container sees only its assigned
    # GPU, numbered from 0, so setting it to "1" here would hide the device.
CPU fallback configuration
  # Both variants share container_name "ollama", so http://ollama:11434 keeps
  # resolving. Services that use depends_on must reference the variant's
  # service name (ollama-cpu or ollama-gpu).
  ollama-cpu:
    image: ollama/ollama:latest
    profiles: ["cpu-only"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "1"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4.0"

  ollama-gpu:
    image: ollama/ollama:latest
    profiles: ["gpu"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 16G
How to start:
# GPU mode
docker compose --profile gpu up -d
# CPU mode
docker compose --profile cpu-only up -d
GPU resource monitoring script
import subprocess
import time


def monitor_gpu_usage(interval: int = 60):
    """Print per-GPU memory and utilization every `interval` seconds."""
    while True:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,  # raise if nvidia-smi fails
        )
        for line in result.stdout.strip().split("\n"):
            idx, name, mem_used, mem_total, util = line.split(", ")
            print(f"GPU {idx} ({name}): {mem_used}/{mem_total}MB, Util: {util}%")
        time.sleep(interval)


if __name__ == "__main__":
    monitor_gpu_usage()
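For alerting rather than printing, the same nvidia-smi CSV output can be parsed into typed records. The following is a minimal sketch; the GpuSample class and parse_gpu_line function are hypothetical names, and the parser assumes the five-column query used above and a GPU name without commas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    index: int
    name: str
    mem_used_mib: int
    mem_total_mib: int
    util_pct: int

    @property
    def mem_ratio(self) -> float:
        """Fraction of GPU memory in use, between 0 and 1."""
        return self.mem_used_mib / self.mem_total_mib


def parse_gpu_line(line: str) -> GpuSample:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    idx, name, used, total, util = parts
    return GpuSample(int(idx), name, int(used), int(total), int(util))


if __name__ == "__main__":
    sample = parse_gpu_line("0, NVIDIA GeForce RTX 4090, 20480, 24564, 87")
    # Same 90% threshold as the GPUMemoryHigh alert in the monitoring section.
    if sample.mem_ratio > 0.9:
        print(f"GPU {sample.index} memory above 90%")
```

Keeping parsing separate from the polling loop makes the threshold logic easy to test without a GPU present.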
Pattern 6: Monitoring and Observability
Monitoring for a Docker Compose AI stack has to cover AI-specific metrics such as GPU utilization, inference latency, and vector database performance.
Prometheus + Grafana
  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=10GB"
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      GF_SECURITY_ADMIN_USER: "${GRAFANA_USER}"
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped
Prometheus configuration
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Ollama may not expose a Prometheus /metrics endpoint on its own;
  # confirm that this target returns metrics or scrape an exporter instead.
  - job_name: "ollama"
    static_configs:
      - targets: ["ollama:11434"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "dcgm"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 10s
  # node-exporter is not defined in the compose files shown here; add it first.
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8080"]
Key alerting rules
# monitoring/alerts.yml
# Metric names for Ollama and Qdrant are illustrative; check each service's
# /metrics output and adjust the expressions to the names it actually exports.
groups:
  - name: ai-stack
    rules:
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency is too high"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage above 90%"
      - alert: QdrantHighLatency
        expr: histogram_quantile(0.95, rate(qdrant_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant query latency is too high"
      - alert: OllamaContainerDown
        expr: up{job="ollama"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama service is unavailable"
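Pattern 7: Production Hardening and Security
When a Docker Compose AI stack goes to production, security is the baseline.
Secrets management
Compose mounts each secret as a file under /run/secrets. Whether a *_FILE environment variable is honored depends on the image, so confirm support for the variables below in each image's documentation.
services:
  ollama:
    image: ollama/ollama:latest
    secrets:
      - hf_token
    environment:
      HF_TOKEN_FILE: /run/secrets/hf_token
  qdrant:
    image: qdrant/qdrant:latest
    secrets:
      - qdrant_api_key
    environment:
      QDRANT__SERVICE__API_KEY_FILE: /run/secrets/qdrant_api_key

secrets:
  hf_token:
    file: ./secrets/hf_token.txt
  qdrant_api_key:
    file: ./secrets/qdrant_api_key.txt
  db_password:
    file: ./secrets/db_password.txt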
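Network isolation
# Networks marked internal have no outbound internet access. Services that
# only join "backend" (ollama, tei) cannot download models while isolated,
# so pull models beforehand or attach a network with egress during setup.
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
  monitoring:
    driver: bridge
    internal: true

services:
  traefik:
    networks:
      - frontend
      - backend
  open-webui:
    networks:
      - frontend
      - backend
  ollama:
    networks:
      - backend
  qdrant:
    networks:
      - backend
  tei:
    networks:
      - backend
  prometheus:
    networks:
      - monitoring
      - backend
  grafana:
    networks:
      - frontend
      - monitoring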
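Backup strategy
#!/bin/bash
# scripts/backup-vectors.sh
set -euo pipefail
BACKUP_DIR="/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Backing up Qdrant..."
# Create a full snapshot, then download it so the backup lives outside the container.
# Add an api-key header to both calls if Qdrant authentication is enabled.
SNAPSHOT=$(curl -s -X POST "http://localhost:6333/snapshots" | jq -r '.result.name')
curl -s -o "$BACKUP_DIR/$SNAPSHOT" "http://localhost:6333/snapshots/$SNAPSHOT"
echo "Backing up Ollama models list..."
curl -s "http://localhost:11434/api/tags" | jq . > "$BACKUP_DIR/ollama_models.json"
echo "Backing up environment config..."
# .env holds credentials: restrict access to the backup directory.
cp .env "$BACKUP_DIR/.env.backup"
cp docker-compose.yml "$BACKUP_DIR/docker-compose.yml.backup"
echo "Backup completed: $BACKUP_DIR"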
Full production configuration
# docker-compose.prod.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
    read_only: true
    tmpfs:
      - /tmp

  qdrant:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"
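5 Common Pitfalls and Fixes
Pitfall 1: Ollama model pulls time out
Symptom: after docker compose up, the stack stalls while Ollama pulls models. Large models (for example a 70B-class model) can take more than an hour to download.
Fix: pull models asynchronously from a separate model-init service. The Ollama service itself does not need to wait for models to be ready.
  model-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
    deploy:
      resources:
        limits:
          memory: 256M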
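Pitfall 2: Qdrant container runs out of memory
Symptom: as the vector data grows, the Qdrant container is OOM-killed.
Fix: set an explicit memory limit and keep Qdrant's memory use below it. Memory mapping is controlled by the memmap threshold under storage.optimizers in qdrant/config.yaml; the environment variables below tune search threads and WAL size.
  qdrant:
    image: qdrant/qdrant:latest
    deploy:
      resources:
        limits:
          memory: 8G
    environment:
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: "4"
      QDRANT__STORAGE__WAL__WAL_CAPACITY_MB: "64"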
Pitfall 3: GPU driver version mismatch
Symptom: docker compose up fails with CUDA driver version is insufficient.
Fix: make sure the host NVIDIA driver is new enough for the CUDA version used by the image (this guide recommends 535 or later), and install nvidia-container-toolkit.
# Check the driver version
nvidia-smi | head -3
# Install nvidia-container-toolkit (Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
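Pitfall 4: The embedding service and the LLM compete for the GPU
Symptom: when TEI and Ollama run at the same time, GPU memory runs out and model loading fails.
Fix: use device_ids to assign GPUs explicitly, or run the embedding service on CPU.
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  tei:
    # Run the embedding service in CPU mode
    environment:
      MODEL_ID: "BAAI/bge-m3"
    # No GPU reservation, so the service runs on CPU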
Pitfall 5: DNS resolution fails between containers
Symptom: OpenWebUI reports ollama: Name or service not known.
Fix: put all services on the same network and use the container_name or the service name as the hostname.
networks:
  ai-network:
    driver: bridge

services:
  ollama:
    container_name: ollama
    networks:
      - ai-network
  open-webui:
    container_name: open-webui
    networks:
      - ai-network
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
Troubleshooting 10 Common Errors
1. could not select device driver: the NVIDIA runtime is not installed
# After installing nvidia-container-toolkit, restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtime
# The nvidia runtime should be listed
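2. OOM Killed: the container exceeded its memory limit (host RAM, not GPU memory; for GPU memory see item 7)
# Confirm that the kernel OOM killer stopped the container
docker inspect ollama --format '{{.State.OOMKilled}}'
# Check GPU memory as well, since a model that does not fit spills into host RAM
nvidia-smi
# Use a smaller model or a quantized variant
docker exec ollama ollama run qwen3:4b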
3. Connection refused to Ollama: the service is not ready yet
# Check Ollama's health status
docker compose ps
docker compose logs ollama
# Connect only after the healthcheck passes
4. permission denied on Docker socket
sudo usermod -aG docker $USER
newgrp docker
5. Qdrant collection not found: the collection has not been created
curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'
6. model not found: the Ollama model has not been pulled
docker exec ollama ollama pull qwen3:8b
7. CUDA out of memory: GPU memory overflows during inference
# Reduce the number of parallel requests
# Set this in docker-compose.yml
    environment:
      OLLAMA_NUM_PARALLEL: "1"
      OLLAMA_MAX_LOADED_MODELS: "1"
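8. TLS handshake error: Traefik certificate problem
# Check permissions on the certificate file
chmod 600 traefik/acme.json
# Inspect the Traefik logs
docker compose logs traefik
9. too many open files: file descriptor limit
# Raise the limit temporarily (affects the current host shell only)
ulimit -n 65536
# Set it permanently on the host (/etc/security/limits.conf)
# * soft nofile 65536
# * hard nofile 65536
# For a container, set it on the service in docker-compose.yml:
#   ulimits:
#     nofile:
#       soft: 65536
#       hard: 65536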
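10. vector dimension mismatch: embedding dimensions do not match
# The Qdrant collection size must equal the embedding model's output dimension
# bge-m3: 1024 dimensions
# nomic-embed-text: 768 dimensions
# An existing collection cannot be resized; delete it and recreate it with the right size
curl -X PUT "http://localhost:6333/collections/documents" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 1024, "distance": "Cosine"}}'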
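A dimension mismatch is cheaper to catch on the client before the upsert than to debug from a server error. The sketch below checks a batch of embeddings against the collection size; the model-to-dimension table repeats the two values listed above, and the function name check_dimensions is hypothetical.

```python
# Output dimensions quoted in this article; extend for other models as needed.
EXPECTED_DIMS = {
    "bge-m3": 1024,
    "nomic-embed-text": 768,
}


def check_dimensions(embeddings: list[list[float]], collection_size: int) -> None:
    """Raise ValueError if any vector's length differs from the collection size."""
    bad = [
        (position, len(vector))
        for position, vector in enumerate(embeddings)
        if len(vector) != collection_size
    ]
    if bad:
        position, length = bad[0]
        raise ValueError(
            f"{len(bad)} vector(s) have the wrong dimension; "
            f"first at position {position}: got {length}, expected {collection_size}"
        )


if __name__ == "__main__":
    vectors = [[0.0] * EXPECTED_DIMS["nomic-embed-text"]]
    try:
        check_dimensions(vectors, collection_size=EXPECTED_DIMS["bge-m3"])
    except ValueError as err:
        print(err)
```

Calling this right after the embedding request makes it obvious when the embedding model was swapped without recreating the collection.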
Advanced Optimization Tips
Staged model warm-up
  model-warmer:
    image: curlimages/curl:latest
    container_name: model-warmer
    depends_on:
      ollama:
        condition: service_healthy
    # The chat model is warmed through /api/generate; the embedding model
    # does not support generation, so it is warmed through /api/embed.
    entrypoint: >
      /bin/sh -c "
      echo 'Warming up models...' &&
      curl -s http://ollama:11434/api/generate -d '{\"model\":\"qwen3:8b\",\"prompt\":\"hi\",\"stream\":false}' > /dev/null &&
      curl -s http://ollama:11434/api/embed -d '{\"model\":\"nomic-embed-text\",\"input\":\"test\"}' > /dev/null &&
      echo 'Models warmed up!'
      "
    restart: "no"
Automatic model unloading
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "5m"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
OLLAMA_KEEP_ALIVE: "5m" unloads a model after 5 minutes without requests, which frees GPU memory.
Chained health-check dependencies
  tei:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 120s

  qdrant:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3

  rag-app:
    depends_on:
      ollama:
        condition: service_healthy
      tei:
        condition: service_healthy
      qdrant:
        condition: service_healthy
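The depends_on entries form a dependency graph, and a valid startup order is any topological ordering of it. The Python sketch below derives one with the standard library's graphlib and rejects cycles. The dictionary mirrors the services named in this article; it illustrates the idea and makes no claim about how Compose schedules containers internally.

```python
from graphlib import CycleError, TopologicalSorter

# Each service maps to the set of services it depends on.
DEPENDS_ON = {
    "ollama": set(),
    "tei": set(),
    "qdrant": set(),
    "open-webui": {"ollama"},
    "rag-app": {"ollama", "tei", "qdrant"},
}


def startup_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return the services ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


def has_cycle(depends_on: dict[str, set[str]]) -> bool:
    """Return True if the dependency graph contains a cycle."""
    try:
        startup_order(depends_on)
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print(" -> ".join(startup_order(DEPENDS_ON)))
```

A check like has_cycle is a cheap guard when a stack grows to many services, because a circular depends_on chain can never reach a healthy state.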
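Docker Compose Watch for development
Run docker compose watch to rebuild or sync automatically when files change.
# docker-compose.yml
services:
  rag-app:
    build: .
    develop:
      watch:
        # Rebuild the image when application code changes. Static files are
        # excluded here because the sync rule below handles them.
        - action: rebuild
          path: ./app
          ignore:
            - static/
        - action: sync
          path: ./app/static
          target: /app/static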
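Comparison: Docker Compose vs K8s vs Docker Swarm
| Dimension | Docker Compose | Kubernetes | Docker Swarm |
|---|---|---|---|
| AI stack deployment complexity | ⭐ Very low | ⭐⭐⭐⭐⭐ Very high | ⭐⭐ Low |
| GPU scheduling | ✅ Native support | ✅ Device Plugin | ⚠️ Needs configuration |
| Autoscaling | ❌ | ✅ HPA | ⚠️ Manual |
| Service discovery | ✅ DNS | ✅ CoreDNS | ✅ DNS |
| Rolling updates | ⚠️ Needs scripting | ✅ Native | ✅ Native |
| Configuration management | ✅ .env | ✅ ConfigMap | ⚠️ Config |
| Secret management | ⚠️ File-based secrets | ✅ K8s Secret | ✅ Docker Secrets |
| Monitoring ecosystem | ✅ Prometheus | ✅ Complete | ⚠️ Limited |
| Multi-node orchestration | ❌ Single host | ✅ Core capability | ✅ Native |
| Learning curve | Low | High | Low |
| Community activity | ✅ Active | ✅ Very active | ❌ Declining |
| Suitable AI project size | 1-5 GPUs | 10+ GPUs | 2-5 GPUs |
Recommendation: Docker Compose suits single-host setups with 1-5 GPUs and is a strong choice for AI development and small-scale production. Beyond 5 GPUs, or when multiple nodes are required, consider Kubernetes with KServe/vLLM. Docker Swarm is not recommended for AI deployment.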
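Recommended Online Tools
- JSON formatter: format Docker Compose output and JSON returned by APIs
- Base64 encoder: encode Secrets, API keys, and other sensitive configuration values
- cURL to code: convert Qdrant/Ollama cURL commands into Python/JS code
Summary
Docker Compose turns an AI development environment from "three days and still not running" into "one command to bring it up". Ollama + OpenWebUI handle LLM serving, Qdrant handles vector storage, TEI handles embeddings, Traefik handles the gateway, Prometheus + Grafana handle monitoring, and GPU passthrough accelerates inference. The 7 patterns cover the path from development to production, and the 5 pitfalls and 10 error walkthroughs help you avoid detours. For AI projects with 1-5 GPUs, Docker Compose is a practical container orchestration choice in 2026.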