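Docker Compose for a Full AI Stack: One-Command Orchestration from LLM to Vector Database
Three days into setting up an AI dev environment and it still won't run?
It is 2026, and setting up an AI development environment is still a nightmare for developers. You install Ollama to run LLMs, configure Qdrant to store vectors, stand up an embedding service, add an API gateway for authentication, and then deal with GPU drivers, CUDA versions, and model downloads. Three days later you still have not written a line of code.
A Docker Compose full-stack AI deployment puts all of this into one file, and docker compose up -d brings up the whole AI stack with a single command. This article is a hands-on guide to orchestrating AI containers with Docker Compose. It covers 7 core patterns, 5 common pitfalls, 10 error troubleshooting recipes, and a production hardening plan.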
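Key Takeaways
- Docker Compose full-stack AI deployment = LLM + vector database + embedding service + API gateway + monitoring, all in one file
- Ollama + OpenWebUI is the most mature option for serving LLMs locally
- Qdrant and Milvus are the go-to vector databases, and Qdrant in particular is very simple to deploy with Docker
- GPU passthrough is the key to AI deployment and is configured through deploy.resources.reservations.devices
- Production environments need the full hardening set: authentication, rate limiting, monitoring, and backups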
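Table of Contents
- AI Full-Stack Architecture Overview
- Pattern 1: Ollama + OpenWebUI LLM Service
- Pattern 2: Qdrant/Milvus Vector Database
- Pattern 3: Embedding Service and Model Management
- Pattern 4: API Gateway and Authentication
- Pattern 5: GPU Passthrough and Resource Limits
- Pattern 6: Monitoring and Observability
- Pattern 7: Production Hardening and Security
- 5 Common Pitfalls and Their Fixes
- 10 Common Errors and How to Troubleshoot Them
- Advanced Optimization Tips
- Comparison: Docker Compose vs K8s vs Docker Swarm
- Recommended Online Tools
- Summary
AI Full-Stack Architecture Overview
At the core of a Docker Compose full-stack AI deployment is a 7-layer architecture. From the GPU runtime at the bottom to the gateway at the top, every layer maps to one or more container services: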
┌─────────────────────────────────────────────────────┐
│                     API Gateway                     │
│                  (Traefik / Nginx)                  │
│          Auth · Rate limit · Routing · TLS          │
├──────────┬──────────┬──────────┬────────────────────┤
│ OpenWebUI│ RAG App  │  Agent   │    Admin Panel     │
│  (Chat)  │ (Search) │ (Agents) │    (Management)    │
├──────────┴──────────┴──────────┴────────────────────┤
│                  Embedding Service                  │
│            (TEI / Infinity / FastEmbed)             │
├──────────────────┬──────────────────────────────────┤
│    Ollama LLM    │            vLLM / TGI            │
│ (Model serving)  │   (High-performance inference)   │
├──────────────────┴──────────────────────────────────┤
│                   Vector Database                   │
│            (Qdrant / Milvus / Weaviate)             │
├─────────────────────────────────────────────────────┤
│                   Infrastructure                    │
│       Redis · PostgreSQL · MinIO · Prometheus       │
├─────────────────────────────────────────────────────┤
│                  GPU / CPU Runtime                  │
│          NVIDIA CUDA · ROCm · CPU Fallback          │
└─────────────────────────────────────────────────────┘
Directory layout of the complete Docker Compose AI stack:
ai-stack/
├── docker-compose.yml
├── docker-compose.gpu.yml
├── docker-compose.prod.yml
├── .env
├── ollama/
│   └── Modelfile
├── qdrant/
│   └── config.yaml
├── traefik/
│   ├── traefik.yml
│   └── acme.json
├── monitoring/
│   ├── prometheus.yml
│   └── grafana/
│       └── dashboards/
└── scripts/
    ├── init-models.sh
    └── backup-vectors.sh
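Pattern 1: Ollama + OpenWebUI LLM Service
Ollama is the most mature option for serving LLMs locally in 2026 and supports mainstream model families such as Llama 4, Qwen 3, and DeepSeek V3. OpenWebUI provides a ChatGPT-style web interface on top of it.
Basic configuration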
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_KEEP_ALIVE: "24h"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "3"
    healthcheck:
      # The ollama image may not ship curl; `ollama list` also queries the local server.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      WEBUI_SECRET_KEY: "${WEBUI_SECRET_KEY}"
      ENABLE_SIGNUP: "false"
      DEFAULT_USER_ROLE: "user"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
volumes:
  ollama_data:
  open_webui_data:
Automatic model pulls
After Ollama starts, models normally have to be pulled by hand, but an init script can automate this:
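#!/bin/sh
# scripts/init-models.sh
# Runs inside the model-init container (busybox sh, no bash arrays),
# so it reaches Ollama by service name rather than localhost.
OLLAMA_URL="${OLLAMA_URL:-http://ollama:11434}"
MAX_RETRIES=5
# Tags are examples; confirm each one exists in the Ollama library before use.
MODELS="qwen3:8b llama4:8b deepseek-v3:8b nomic-embed-text"

for model in $MODELS; do
  echo "Pulling model: $model"
  attempt=1
  until curl -s "$OLLAMA_URL/api/pull" -d "{\"name\":\"$model\"}" | grep -q "success"; do
    # Stop after MAX_RETRIES so an unknown tag cannot loop forever.
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "  ✗ giving up on $model" >&2
      exit 1
    fi
    attempt=$((attempt + 1))
    echo "  Retrying $model..."
    sleep 5
  done
  echo "  ✓ $model ready"
done
echo "All models pulled successfully!"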
Add an init service to Docker Compose:
  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
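The init script decides whether a pull finished by grepping the response for "success". A slightly stricter check can parse the streamed response line by line. The sketch below assumes the pull endpoint streams one JSON object per line and finishes with an object whose status is "success"; the function name pull_succeeded and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def pull_succeeded(lines: Iterable[str]) -> bool:
    """Return True only if the streamed pull response ended in success.

    Assumes one JSON object per line; an object carrying an "error" key
    is treated as a failed pull.
    """
    last_status = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between events
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            return False  # truncated or non-JSON output
        if not isinstance(event, dict) or "error" in event:
            return False
        last_status = event.get("status", last_status)
    return last_status == "success"


# Hypothetical sample of a streamed response, for illustration only.
sample = [
    '{"status": "pulling manifest"}',
    '{"status": "verifying sha256 digest"}',
    '{"status": "success"}',
]
print(pull_succeeded(sample))
```

Unlike a plain substring match, this version does not report success when the word appears inside an error message or when the stream is cut off halfway.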
Custom Modelfile
# ollama/Modelfile
FROM qwen3:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """
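You are a professional AI assistant. When answering questions:
1. Give a concise conclusion first
2. Then provide a detailed explanation
3. If you are unsure, say so explicitly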
"""
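Build the custom model. The Modelfile lives on the host and /root/.ollama is a named volume, so copy the file into the container first:
docker cp ollama/Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create my-assistant -f /tmp/Modelfile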
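Pattern 2: Qdrant/Milvus Vector Database
The vector database is the core of a RAG architecture. In a Docker Compose AI stack the usual choice is Qdrant (lightweight) or Milvus (large scale).
Qdrant configuration (recommended for small and medium projects)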
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
      - ./qdrant/config.yaml:/qdrant/config/production.yaml:ro
    environment:
      QDRANT__SERVICE__GRPC_PORT: "6334"
      QDRANT__LOG_LEVEL: "INFO"
    healthcheck:
      # The qdrant image may not include curl, so probe the HTTP port with bash instead.
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Qdrant configuration file:
# qdrant/config.yaml
storage:
  performance:
    max_search_threads: 4
  wal:
    wal_capacity_mb: 32
    wal_segments_ahead: 0
  optimizers:
    # Thresholds are expressed in kilobytes
    indexing_threshold_kb: 20000
    memmap_threshold_kb: 50000
service:
  max_request_size_mb: 64
  enable_cors: true
telemetry_disabled: true
Milvus configuration (recommended for large-scale projects)
  etcd:
    image: quay.io/coreos/etcd:v3.5.16
    container_name: milvus-etcd
    environment:
      ETCD_AUTO_COMPACTION_MODE: "revision"
      ETCD_AUTO_COMPACTION_RETENTION: "1000"
      ETCD_QUOTA_BACKEND_BYTES: "4294967296"
    volumes:
      - etcd_data:/etcd
    # --data-dir points etcd at the mounted volume so its state survives restarts
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: unless-stopped
  minio:
    image: minio/minio:latest
    container_name: milvus-minio
    environment:
      # Current MinIO releases read MINIO_ROOT_USER / MINIO_ROOT_PASSWORD;
      # Milvus must be configured with the same credentials.
      MINIO_ROOT_USER: "${MINIO_ACCESS_KEY}"
      MINIO_ROOT_PASSWORD: "${MINIO_SECRET_KEY}"
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - minio_data:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: unless-stopped
  milvus:
    image: milvusdb/milvus:v2.5-latest
    container_name: milvus
    # Standalone mode has to be requested explicitly
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus_data:/var/lib/milvus
    environment:
      ETCD_ENDPOINTS: "etcd:2379"
      MINIO_ADDRESS: "minio:9000"
    depends_on:
      - etcd
      - minio
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
      start_period: 90s
    restart: unless-stopped
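Vector database comparison
| Feature | Qdrant | Milvus | Weaviate | ChromaDB |
|---|---|---|---|---|
| Deployment complexity | Very low (single container) | High (3+ containers) | Low (single container) | Very low (single container) |
| Performance (millions of vectors) | Excellent | Excellent | Good | Fair |
| Performance (hundreds of millions) | Good | Excellent | Fair | Not suitable |
| Filtered search | ✅ Strong | ✅ Strong | ✅ Good | ⚠️ Basic |
| Persistence | ✅ | ✅ | ✅ | ⚠️ In-memory by default |
| Replication | ✅ | ✅ | ✅ | ❌ |
| gRPC support | ✅ | ✅ | ✅ | ❌ |
| Docker Compose fit | ✅ Best | ⚠️ Heavy | ✅ Good | ✅ Development use |
| Production-ready | ✅ | ✅ | ✅ | ❌ Development only |
Recommendation: Qdrant is the first choice for a Docker Compose AI stack because it is simple to deploy and performs well. Consider Milvus once the dataset exceeds 100 million vectors. ChromaDB is only suitable for prototyping.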
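Pattern 3: Embedding Service and Model Management
The embedding service turns text into vectors and is a key step in the RAG pipeline. There are three mainstream options when orchestrating AI containers with Docker Compose: TEI, Infinity, and FastEmbed.
Hugging Face TEI (recommended)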
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    container_name: tei
    ports:
      - "8080:80"
    volumes:
      - tei_cache:/data
    environment:
      MODEL_ID: "BAAI/bge-m3"
      REVISION: "main"
      MAX_BATCH_TOKENS: "16384"
      MAX_CLIENT_BATCH_SIZE: "32"
      HF_TOKEN: "${HF_TOKEN}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s
    restart: unless-stopped
Infinity embedding service
  infinity:
    image: michaelf34/infinity:latest
    container_name: infinity
    ports:
      - "7997:7997"
    volumes:
      - infinity_cache:/app/.cache
    # Settings are passed as CLI flags to the v2 subcommand
    command: >
      v2
      --model-id BAAI/bge-m3
      --engine optimum
      --batch-size 32
      --port 7997
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:7997/health"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped
Embedding service usage example
import httpx


async def get_embeddings(texts: list[str]) -> list[list[float]]:
    # TEI's /embed endpoint returns one vector per input text.
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://tei:80/embed",
            json={"inputs": texts},
        )
        response.raise_for_status()
        return response.json()


async def search_similar(query: str, top_k: int = 5) -> list[dict]:
    # Embed the query, then run a nearest-neighbour search in Qdrant.
    query_embedding = await get_embeddings([query])
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "http://qdrant:6333/collections/documents/points/search",
            json={
                "vector": query_embedding[0],
                "limit": top_k,
                "with_payload": True,
            },
        )
        response.raise_for_status()
        return response.json()["result"]
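Embedding service comparison
| Feature | TEI | Infinity | FastEmbed |
|---|---|---|---|
| GPU acceleration | ✅ Native | ✅ Native | ❌ CPU |
| Batch inference | ✅ Efficient | ✅ Efficient | ⚠️ Fair |
| Multiple models | ✅ | ✅ | ✅ |
| Docker image size | ~2GB | ~4GB | ~500MB |
| Production-ready | ✅ | ✅ | ⚠️ Development use |
| OpenAI-compatible API | ✅ | ✅ | ❌ |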
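Pattern 4: API Gateway and Authentication
A production Docker Compose AI stack needs an API gateway that centralizes authentication, rate limiting, and routing.
Traefik configuration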
  traefik:
    image: traefik:v3.2
    container_name: traefik
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - traefik_certs:/etc/traefik/certs
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.directory=/etc/traefik/dynamic"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    labels:
      traefik.enable: "true"
      traefik.http.routers.traefik.rule: "Host(`traefik.ai-stack.local`)"
      traefik.http.routers.traefik.entrypoints: "websecure"
      traefik.http.routers.traefik.tls: "true"
      # Without api.insecure the dashboard is served by the internal API service,
      # so port 8080 does not need to be published.
      traefik.http.routers.traefik.service: "api@internal"
    restart: unless-stopped
Putting OpenWebUI behind Traefik
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
    labels:
      traefik.enable: "true"
      traefik.http.routers.webui.rule: "Host(`chat.ai-stack.local`)"
      traefik.http.routers.webui.entrypoints: "websecure"
      traefik.http.routers.webui.tls: "true"
      traefik.http.services.webui.loadbalancer.server.port: "8080"
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped
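Authentication middleware
# traefik/dynamic/auth.yml
# auth-service is assumed to be a separate verification service on the same network.
http:
  middlewares:
    auth-middleware:
      forwardAuth:
        address: "http://auth-service:8000/verify"
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Id"
          - "X-User-Role"
    rate-limit:
      rateLimit:
        average: 30
        burst: 60
        period: 1m
  routers:
    api-router:
      rule: "Host(`api.ai-stack.local`)"
      entryPoints:
        - "websecure"
      tls: {}
      middlewares:
        - "auth-middleware"
        - "rate-limit"
      service: "ollama-api"
  # The router above needs a matching service definition in the file provider.
  services:
    ollama-api:
      loadBalancer:
        servers:
          - url: "http://ollama:11434"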
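To make the rate-limit numbers concrete: average: 30 with period: 1m refills at 0.5 requests per second, and burst: 60 is the largest batch that can pass at once. The sketch below is a minimal token bucket under those settings. It illustrates the general idea only and is not Traefik's implementation; the class name TokenBucket and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token bucket: refills at average/period, holds at most burst tokens."""

    def __init__(self, average: int, period_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = average / period_s   # tokens added per second
        self.capacity = float(burst)     # maximum stored tokens
        self.tokens = float(burst)       # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available and report whether the request passes."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Same numbers as the middleware: 30 requests per 60 seconds, burst of 60.
bucket = TokenBucket(average=30, period_s=60.0, burst=60)
allowed = sum(bucket.allow() for _ in range(100))
print(f"{allowed} of 100 back-to-back requests allowed")
```

With these values a client can spend the 60-token burst immediately, after which it is held to roughly one request every two seconds.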
Pattern 5: GPU Passthrough and Resource Limits
The GPU is the heart of an AI deployment. Docker Compose passes GPUs through to containers with deploy.resources.reservations.devices.
NVIDIA GPU passthrough
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
    environment:
      NVIDIA_VISIBLE_DEVICES: "all"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
      OLLAMA_KEEP_ALIVE: "24h"
    healthcheck:
      # The ollama image may not ship curl; `ollama list` also queries the local server.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
Multi-GPU allocation
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    # No CUDA_VISIBLE_DEVICES here: inside the container the single assigned GPU
    # is enumerated as device 0, so setting it to "1" would hide the GPU.
CPU fallback configuration
  ollama-cpu:
    image: ollama/ollama:latest
    profiles: ["cpu-only"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
      OLLAMA_MAX_LOADED_MODELS: "1"
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: "4.0"
  ollama-gpu:
    image: ollama/ollama:latest
    profiles: ["gpu"]
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
        limits:
          memory: 16G
How to start:
# GPU mode
docker compose --profile gpu up -d
# CPU mode
docker compose --profile cpu-only up -d
GPU resource monitoring script
import subprocess
import time


def monitor_gpu_usage(interval: int = 60):
    # Poll nvidia-smi and print one line per GPU every `interval` seconds.
    while True:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        for line in result.stdout.strip().split("\n"):
            idx, name, mem_used, mem_total, util = line.split(", ")
            print(f"GPU {idx} ({name}): {mem_used}/{mem_total}MB, Util: {util}%")
        time.sleep(interval)


if __name__ == "__main__":
    monitor_gpu_usage()
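The monitoring script prints raw strings. If the values feed alerting or logging, it can help to parse each CSV line into typed fields first. The sketch below does that for the same five-column query. The GpuSample type and parse_gpu_line are illustrative names, and the sample line is made up rather than captured from a real machine.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    index: int
    name: str
    memory_used_mb: int
    memory_total_mb: int
    utilization_pct: int

    @property
    def memory_ratio(self) -> float:
        """Fraction of GPU memory in use, between 0.0 and 1.0."""
        return self.memory_used_mb / self.memory_total_mb


def parse_gpu_line(line: str) -> GpuSample:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    idx, name, used, total, util = parts
    return GpuSample(int(idx), name, int(used), int(total), int(util))


# Hypothetical sample line, for illustration only.
sample = parse_gpu_line("0, NVIDIA A100, 36864, 40960, 87")
if sample.memory_ratio > 0.9:
    print(f"GPU {sample.index} memory above 90%")
else:
    print(f"GPU {sample.index} memory at {sample.memory_ratio:.0%}")
```

The 0.9 comparison mirrors the 90% GPU memory threshold used elsewhere in this article.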
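Pattern 6: Monitoring and Observability
Monitoring a Docker Compose AI stack has to cover AI-specific metrics such as GPU utilization, inference latency, and vector database performance.
Prometheus + Grafana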
  prometheus:
    image: prom/prometheus:v3.2.0
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=10GB"
    ports:
      - "9090:9090"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    environment:
      GF_SECURITY_ADMIN_USER: "${GRAFANA_USER}"
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASSWORD}"
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
    restart: unless-stopped
  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped
Prometheus configuration
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Assumes a Prometheus-format endpoint at /metrics; verify that your Ollama
  # version exposes one, or place an exporter in front of it.
  - job_name: "ollama"
    static_configs:
      - targets: ["ollama:11434"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "qdrant"
    static_configs:
      - targets: ["qdrant:6333"]
    metrics_path: "/metrics"
    scrape_interval: 30s
  - job_name: "dcgm"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 10s
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8080"]
Key alerting rules
# monitoring/alerts.yml
groups:
  - name: ai-stack
    rules:
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency is too high"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage is above 90%"
      - alert: QdrantHighLatency
        expr: histogram_quantile(0.95, rate(qdrant_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant query latency is too high"
      - alert: OllamaContainerDown
        expr: up{job="ollama"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Ollama service is unavailable"
Pattern 7: Production Hardening and Security
When a Docker Compose AI stack goes to production, security is the baseline.
Secrets management
services:
  ollama:
    image: ollama/ollama:latest
    secrets:
      - hf_token
    environment:
      HF_TOKEN_FILE: /run/secrets/hf_token
  qdrant:
    image: qdrant/qdrant:latest
    secrets:
      - qdrant_api_key
    environment:
      QDRANT__SERVICE__API_KEY_FILE: /run/secrets/qdrant_api_key
secrets:
  hf_token:
    file: ./secrets/hf_token.txt
  qdrant_api_key:
    file: ./secrets/qdrant_api_key.txt
  db_password:
    file: ./secrets/db_password.txt
Network isolation
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    # internal networks have no route to the outside world, so services attached
    # only to them cannot download models at runtime
    internal: true
  monitoring:
    driver: bridge
    internal: true
services:
  traefik:
    networks:
      - frontend
      - backend
  open-webui:
    networks:
      - frontend
      - backend
  ollama:
    networks:
      - backend
  qdrant:
    networks:
      - backend
  tei:
    networks:
      - backend
  prometheus:
    networks:
      - monitoring
      - backend
  grafana:
    networks:
      - frontend
      - monitoring
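A quick way to sanity-check a layout like this is to model network membership as sets: two services can talk only if they share a network, and a service attached solely to internal networks has no outbound route. The sketch below encodes the layout from this section. The helper names can_reach and has_outbound are made up for the example, and the model deliberately ignores published ports and firewall rules.

```python
# Network membership copied from the compose fragment in this section.
NETWORKS = {
    "traefik": {"frontend", "backend"},
    "open-webui": {"frontend", "backend"},
    "ollama": {"backend"},
    "qdrant": {"backend"},
    "tei": {"backend"},
    "prometheus": {"monitoring", "backend"},
    "grafana": {"frontend", "monitoring"},
}

# Networks declared with `internal: true`.
INTERNAL = {"backend", "monitoring"}


def can_reach(a: str, b: str) -> bool:
    """Two services can talk directly only if they share at least one network."""
    return bool(NETWORKS[a] & NETWORKS[b])


def has_outbound(service: str) -> bool:
    """A service has an outbound route only if it sits on a non-internal network."""
    return bool(NETWORKS[service] - INTERNAL)


for name in sorted(NETWORKS):
    print(f"{name:12s} outbound={has_outbound(name)}")
```

In this model ollama and tei come out without outbound access, which is worth keeping in mind because both normally download models on first start; pre-populating their volumes or pulling models before the networks are locked down are two possible ways around that.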
Backup strategy
#!/bin/bash
# scripts/backup-vectors.sh
BACKUP_DIR="/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Backing up Qdrant..."
curl -s -X POST "http://localhost:6333/snapshots" | jq .
echo "Backing up Ollama models list..."
curl -s "http://localhost:11434/api/tags" | jq . > "$BACKUP_DIR/ollama_models.json"
echo "Backing up environment config..."
cp .env "$BACKUP_DIR/.env.backup"
cp docker-compose.yml "$BACKUP_DIR/docker-compose.yml.backup"
echo "Backup completed: $BACKUP_DIR"
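The backup script creates one timestamped directory per run, so old backups accumulate. One possible retention rule is sketched below: it parses directory names in the same %Y%m%d_%H%M%S format and returns those older than a cutoff. The 7-day default and the function name expired_backups are assumptions for illustration; the function only selects names and leaves deletion to the caller.

```python
from datetime import datetime, timedelta
from typing import Iterable

# Matches the directory names produced by `date +%Y%m%d_%H%M%S`.
STAMP_FORMAT = "%Y%m%d_%H%M%S"


def expired_backups(names: Iterable[str], now: datetime, keep_days: int = 7) -> list[str]:
    """Return backup directory names older than keep_days, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in names:
        try:
            stamp = datetime.strptime(name, STAMP_FORMAT)
        except ValueError:
            continue  # not a backup directory, leave it alone
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)


# Hypothetical directory listing, for illustration only.
listing = ["20260101_000000", "20260109_120000", "lost+found"]
print(expired_backups(listing, now=datetime(2026, 1, 10, 12, 0, 0)))
```

Names that do not match the timestamp format are skipped rather than deleted, which keeps the rule safe to run against a directory that contains other files.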
Complete production configuration
# docker-compose.prod.yml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
          cpus: "8.0"
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
    read_only: true
    tmpfs:
      - /tmp
  qdrant:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"
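5 Common Pitfalls and Their Fixes
Pitfall 1: Ollama model pulls time out
Symptom: after docker compose up, Ollama appears stuck pulling models. Downloading a large model (for example one in the 70B class) can take more than an hour.
Solution: pull models asynchronously with the model-init service. The Ollama service itself does not need to wait for models to be ready.
  model-init:
    image: curlimages/curl:latest
    depends_on:
      ollama:
        condition: service_healthy
    volumes:
      - ./scripts/init-models.sh:/init-models.sh:ro
    entrypoint: ["/bin/sh", "/init-models.sh"]
    restart: "no"
    deploy:
      resources:
        limits:
          memory: 256M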
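Pitfall 2: Qdrant container OOM
Symptom: as the amount of vector data grows, the Qdrant container gets OOM killed.
Solution: set a memory limit and enable memory mapping so that larger segments are served from disk.
  qdrant:
    image: qdrant/qdrant:latest
    deploy:
      resources:
        limits:
          memory: 8G
    environment:
      QDRANT__STORAGE__PERFORMANCE__MAX_SEARCH_THREADS: "4"
      QDRANT__STORAGE__WAL__WAL_CAPACITY_MB: "64"
      # Segments above this size (in KB) are memory-mapped instead of held in RAM
      QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB: "20000"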
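Before picking a memory limit, it helps to estimate how much RAM the vectors need when held fully in memory. A common rule of thumb is vectors × dimensions × 4 bytes for float32 values, multiplied by an overhead factor for indexes and metadata. The sketch below uses 1.5 for that factor, which is an assumed rough multiplier rather than a measured value, and both function names are made up for illustration.

```python
def estimate_vector_memory_bytes(num_vectors: int, dim: int,
                                 overhead: float = 1.5,
                                 bytes_per_value: int = 4) -> int:
    """Rough RAM estimate for float32 vectors kept fully in memory.

    overhead is a rule-of-thumb multiplier for index and metadata, not a
    measured value.
    """
    return int(num_vectors * dim * bytes_per_value * overhead)


def fits_in_limit(num_vectors: int, dim: int, limit_gib: float) -> bool:
    """Check the estimate against a container memory limit given in GiB."""
    return estimate_vector_memory_bytes(num_vectors, dim) <= limit_gib * 1024 ** 3


# One million 1024-dimension vectors against the 8G limit used above.
needed = estimate_vector_memory_bytes(1_000_000, 1024)
print(f"~{needed / 1024 ** 3:.1f} GiB needed, fits in 8 GiB: {fits_in_limit(1_000_000, 1024, 8)}")
```

If the estimate lands close to or above the limit, that is the point at which memory mapping or a larger limit becomes worth the trade-off.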
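Pitfall 3: GPU driver version mismatch
Symptom: docker compose up fails with CUDA driver version is insufficient.
Solution: make sure the host NVIDIA driver is version 535 or newer, and install nvidia-container-toolkit.
# Check the driver version
nvidia-smi | head -3
# Install nvidia-container-toolkit (Ubuntu)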
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
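Pitfall 4: the embedding service and the LLM compete for the GPU
Symptom: when TEI and Ollama run at the same time, GPU memory runs out and model loading fails.
Solution: use device_ids to assign GPUs explicitly, or run the embedding service on CPU.
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tei:
    # Run the embedding service in CPU mode
    environment:
      MODEL_ID: "BAAI/bge-m3"
    # No GPU reservation, so the service uses the CPU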
Pitfall 5: DNS resolution between containers fails
Symptom: OpenWebUI reports ollama: Name or service not known.
Solution: make sure all services are on the same network, and use the container_name or the service name as the hostname.
networks:
  ai-network:
    driver: bridge
services:
  ollama:
    container_name: ollama
    networks:
      - ai-network
  open-webui:
    container_name: open-webui
    networks:
      - ai-network
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
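10 Common Errors and How to Troubleshoot Them
1. could not select device driver: the NVIDIA runtime is not installed
# After installing nvidia-container-toolkit, configure the runtime and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker info | grep -i runtime
# The nvidia runtime should appear in the output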
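2. OOM Killed: the container ran out of memory
# OOM Killed means the container hit its host RAM limit; check per-container usage
docker stats --no-stream
# Check GPU memory as well
nvidia-smi
# Switch to a smaller or quantized model
docker exec ollama ollama run qwen3:4b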
3. Connection refused to Ollama: the service is not ready yet
# Check Ollama's health status
docker compose ps
docker compose logs ollama
# Wait for the healthcheck to pass before connecting
4. permission denied on Docker socket
sudo usermod -aG docker $USER
newgrp docker
5. Qdrant collection not found: the collection has not been created
curl -X PUT "http://localhost:6333/collections/documents" \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 1024, "distance": "Cosine"}}'
6. model not found: the Ollama model has not been pulled
docker exec ollama ollama pull qwen3:8b
7. CUDA out of memory: GPU memory overflows during inference
# Reduce the number of parallel requests
# Set the following in docker-compose.yml
    environment:
      OLLAMA_NUM_PARALLEL: "1"
      OLLAMA_MAX_LOADED_MODELS: "1"
8. TLS handshake error: Traefik certificate problem
# Check the permissions on the certificate file
chmod 600 traefik/acme.json
# Inspect the Traefik logs
docker compose logs traefik
9. too many open files: file descriptor limit reached
# Raise the limit temporarily
ulimit -n 65536
# Set it permanently (/etc/security/limits.conf)
# * soft nofile 65536
# * hard nofile 65536
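10. vector dimension mismatch: embedding dimensions do not match
# Make sure the Qdrant collection dimension matches the embedding model's output dimension
# bge-m3: 1024 dimensions
# nomic-embed-text: 768 dimensions
curl -X PUT "http://localhost:6333/collections/documents" \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 1024, "distance": "Cosine"}}'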
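A dimension mismatch is cheaper to catch in application code before any vectors are sent to Qdrant. The sketch below checks each vector's length against the collection's configured size and fails early with a clear message. The function name check_dimensions and the MODEL_DIMS lookup are illustrative; the two dimensions are the ones quoted above.

```python
from typing import Sequence

# Output dimensions quoted in the text above.
MODEL_DIMS = {"bge-m3": 1024, "nomic-embed-text": 768}


def check_dimensions(vectors: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector does not have expected_dim entries."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, collection expects {expected_dim}"
            )


# A bge-m3 sized vector passes against a 1024-dimension collection.
check_dimensions([[0.0] * MODEL_DIMS["bge-m3"]], expected_dim=1024)

# A nomic-embed-text sized vector is rejected by the same collection.
try:
    check_dimensions([[0.0] * MODEL_DIMS["nomic-embed-text"]], expected_dim=1024)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Running a check like this right after the embedding call makes it obvious when the embedding model was swapped without recreating the collection.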
Advanced Optimization Tips
Multi-stage model warm-up
  model-warmer:
    image: curlimages/curl:latest
    container_name: model-warmer
    depends_on:
      ollama:
        condition: service_healthy
    # The chat model is warmed through /api/generate, the embedding model through /api/embed
    entrypoint: >
      /bin/sh -c "
      echo 'Warming up models...' &&
      curl -s http://ollama:11434/api/generate -d '{\"model\":\"qwen3:8b\",\"prompt\":\"hi\",\"stream\":false}' > /dev/null &&
      curl -s http://ollama:11434/api/embed -d '{\"model\":\"nomic-embed-text\",\"input\":\"test\"}' > /dev/null &&
      echo 'Models warmed up!'
      "
    restart: "no"
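Smart model unloading
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "5m"
      OLLAMA_NUM_PARALLEL: "4"
      OLLAMA_MAX_LOADED_MODELS: "2"
OLLAMA_KEEP_ALIVE: "5m" unloads any model that has received no requests for 5 minutes, which frees GPU memory.
Chained healthcheck dependencies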
  tei:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 120s
  qdrant:
    healthcheck:
      # The qdrant image may not include curl, so probe the HTTP port with bash instead.
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3
  rag-app:
    depends_on:
      ollama:
        condition: service_healthy
      tei:
        condition: service_healthy
      qdrant:
        condition: service_healthy
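Chained depends_on conditions form a dependency graph, and a valid start order is any topological order of that graph. The sketch below computes one with the standard library's graphlib (Python 3.9+). The DEPENDS_ON mapping mirrors the fragment above, and start_order is a hypothetical helper name, not something Compose exposes.

```python
from graphlib import CycleError, TopologicalSorter

# service -> services it waits for, mirroring the depends_on entries above.
DEPENDS_ON = {
    "rag-app": {"ollama", "tei", "qdrant"},
    "ollama": set(),
    "tei": set(),
    "qdrant": set(),
}


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid start order; dependencies always come first."""
    return list(TopologicalSorter(depends_on).static_order())


try:
    print(" -> ".join(start_order(DEPENDS_ON)))
except CycleError as exc:
    # A cycle means two services wait on each other and neither can start.
    print(f"circular dependency: {exc.args[1]}")
```

The same check is a quick way to spot circular dependencies before they show up as a stack that never becomes healthy.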
Automated development with Docker Compose Watch
# docker-compose.yml
services:
  rag-app:
    build: .
    develop:
      watch:
        - action: rebuild
          path: ./app
          target: /app
        - action: sync
          path: ./app/static
          target: /app/static
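Comparison: Docker Compose vs K8s vs Docker Swarm
| Dimension | Docker Compose | Kubernetes | Docker Swarm |
|---|---|---|---|
| AI stack deployment complexity | ⭐ Very low | ⭐⭐⭐⭐⭐ Very high | ⭐⭐ Low |
| GPU scheduling | ✅ Native support | ✅ Device Plugin | ⚠️ Needs configuration |
| Autoscaling | ❌ | ✅ HPA | ⚠️ Manual |
| Service discovery | ✅ DNS | ✅ CoreDNS | ✅ DNS |
| Rolling updates | ⚠️ Needs scripting | ✅ Native | ✅ Native |
| Configuration management | ✅ .env | ✅ ConfigMap | ⚠️ Config |
| Secret management | ✅ Docker Secret | ✅ K8s Secret | ⚠️ Basic |
| Monitoring ecosystem | ✅ Prometheus | ✅ Complete | ⚠️ Limited |
| Multi-node orchestration | ❌ Single host | ✅ Core capability | ✅ Native |
| Learning curve | Low | High | Low |
| Community activity | ✅ Active | ✅ Very active | ❌ Declining |
| Best suited for | AI development / small-scale production | Large-scale production | Simple multi-node setups |
| Recommended AI project size | 1-5 GPUs | 10+ GPUs | 2-5 GPUs |
Recommendation: Docker Compose fits single-host setups with 1-5 GPUs and is the best choice for AI development and small-scale production. Once you need more than 5 GPUs or multiple nodes, consider Kubernetes + KServe/vLLM. Docker Swarm is not recommended for AI deployments.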
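Recommended Online Tools
- JSON formatter - format Docker Compose output and the JSON returned by APIs
- Base64 encoder - encode Secrets, API keys, and other sensitive configuration values
- cURL-to-code converter - turn Qdrant/Ollama cURL commands into Python/JS code
Summary
A Docker Compose full-stack AI deployment turns an AI development environment from "three days and still not running" into "one command to bring it up". Ollama + OpenWebUI handles LLM serving, Qdrant handles vector storage, TEI handles embeddings, Traefik handles the gateway, Prometheus + Grafana handles monitoring, and GPU passthrough keeps inference fast. The 7 patterns cover the full path from development to production, and the 5 common pitfalls and 10 error troubleshooting recipes help you avoid detours. For AI projects with 1-5 GPUs, Docker Compose is the most practical AI container orchestration option in 2026.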