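Docker Compose in Production: 7 Key Strategies, from Health Checks to Zero-Downtime Updates
"It works on my machine": the production container graveyard
Every developer has said it: "It works on my machine." The real nightmare starts once the containers reach production:
- A container is silently OOM-killed, and the only trace in the logs is a single line: Out of memory
- The database has not finished starting, and the application container is already flooding the logs with Connection refused
- A container dies at 3 a.m. with no restart policy, and the service stays down until morning
- Log files fill the disk, and docker logs prints tens of gigabytes of unstructured text
- Production configuration sits in plain text in docker-compose.yml, database password included
If you are still running production with a bare docker compose up -d, this article is for you.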
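Core concepts at a glance
| Concept | Purpose | Key production settings |
|---|---|---|
| Health Check | Detects whether a container is actually usable | healthcheck + depends_on.condition |
| Resource Limits | Caps CPU and memory so one service cannot starve the others | deploy.resources.limits |
| Restart Policy | Restarts a container automatically after an abnormal exit | deploy.restart_policy (or restart) |
| Secrets | Keeps sensitive values out of the Compose file and mounts them as files under /run/secrets (encrypted at rest only in Swarm mode) | secrets |
| Logging Driver | Structured logs plus log rotation | logging.driver + logging.options |
| Profiles | Starts services selectively per environment | profiles |
| Watch | Syncs file changes into the container automatically | develop.watch (Compose Watch) |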
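5 production challenges
Challenge 1: Uncontrolled startup order
The database is still initializing while the application container is already trying to connect, so startup fails. A plain depends_on only guarantees start order, not readiness.
Challenge 2: Unbounded resource growth
A container without resource limits is like a car without brakes. A single container with a memory leak can consume all of the host's memory and take every other service down with it.
Challenge 3: The log black hole
The default json-file logging driver does not rotate logs unless you configure it to. After a few months of uptime, /var/lib/docker can fill the disk and bring every service down.
Challenge 4: Leaked secrets
Database passwords written in plain text under environment, a .env file committed to Git, an API key hard-coded in the image: each of these is a production incident waiting to happen.
Challenge 5: Every update is an outage
When a service's configuration or image changes, docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update. For a 24/7 service that is not acceptable.
7 practical patterns
Pattern 1: Health checks and dependency ordering
Problem: depends_on only controls start order; it does not guarantee that the dependency is ready.
Solution: combine healthcheck with depends_on.condition: service_healthy.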
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
  app:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "3000:3000"
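Key parameters:
- interval: time between checks; 5-10 seconds is a reasonable production setting
- timeout: time limit for a single check; 3-5 seconds is typical
- retries: number of consecutive failures before the container is marked unhealthy
- start_period: grace period after container start; failures during it do not count toward retries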
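To get a feel for how these four parameters interact, the arithmetic can be sketched in a few lines of JavaScript. This is a rough upper bound, not Docker's exact scheduling: it assumes every failing probe runs for its full timeout and that the next probe starts one interval after the previous one ends. The function names are made up for this illustration.

```javascript
// Rough upper bound (in seconds) on how long a container that has stopped
// responding can remain "healthy" before Docker marks it "unhealthy".
// Assumption: each failing probe uses its full timeout, and probes are
// spaced one interval apart.
function secondsUntilUnhealthy({ interval, timeout, retries }) {
  return retries * (interval + timeout);
}

// Rough upper bound on the startup time the healthcheck tolerates:
// failures inside start_period are ignored, then `retries` more may fail.
function startupBudgetSeconds({ interval, timeout, retries, startPeriod = 0 }) {
  return startPeriod + secondsUntilUnhealthy({ interval, timeout, retries });
}

// Values from the postgres service above.
const postgres = { interval: 5, timeout: 3, retries: 5, startPeriod: 10 };
console.log(secondsUntilUnhealthy(postgres)); // 40
console.log(startupBudgetSeconds(postgres));  // 50
```

If a database regularly needs longer than this budget to initialize, raising start_period is usually a better lever than raising retries, because retries also delays failure detection later in the container's life.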
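Pattern 2: Resource limits and OOM protection
Problem: an unconstrained container competes for host resources, and a single runaway container can bring down the whole machine.
Solution: use deploy.resources to set CPU and memory limits and reservations.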
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
  worker:
    image: myworker:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          memory: 512M
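limits vs reservations:
- limits: hard ceiling; a container that exceeds its memory limit is OOM-killed, and one that exceeds its CPU limit is throttled
- reservations: soft guarantee; the scheduler tries to honor it but does not enforce it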
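The app healthcheck in this pattern probes http://localhost:3000/health with curl -f, which fails on any HTTP status of 400 or above. The article does not show that endpoint, so here is a minimal sketch of what it might look like using only Node's built-in http module. The ready flags and the startServer helper are hypothetical; a real service would flip the flags as its database and cache connections come and go.

```javascript
const http = require("node:http");

// Hypothetical readiness flags for this sketch.
const ready = { postgres: false, redis: false };

// Computes the status code and body for a /health probe:
// 200 only when every dependency is ready, otherwise 503.
function healthResponse(state) {
  const ok = Object.values(state).every(Boolean);
  return { status: ok ? 200 : 503, body: JSON.stringify({ ok, ...state }) };
}

// Request handler: answers /health and returns 404 for everything else.
function handler(req, res) {
  if (req.url === "/health") {
    const { status, body } = healthResponse(ready);
    res.writeHead(status, { "Content-Type": "application/json" });
    res.end(body);
    return;
  }
  res.writeHead(404);
  res.end();
}

// Call startServer() from the application's entry point.
function startServer(port = 3000) {
  return http.createServer(handler).listen(port);
}
```

Returning 503 while a dependency is down lets Docker mark the container unhealthy, which in turn keeps services that wait on service_healthy from starting against it. Note that the image must actually contain curl for the healthcheck command to run.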
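OOM protection:
services:
  critical-service:
    image: critical-app:latest
    deploy:
      resources:
        limits:
          memory: 1G
    # Lower the OOM score so the kernel is less likely to pick this container
    # first when the host runs out of memory (range: -1000 to 1000)
    oom_score_adj: -500
At the host level, you can tune vm.overcommit_memory and adjust the OOM score of individual processes:
# Check whether a container was OOM-killed
docker inspect --format='{{.State.OOMKilled}}' <container_id>
# Exempt a critical process from the host OOM killer (run as root)
echo -1000 > /proc/<pid>/oom_score_adj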
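Pattern 3: Structured logs and log rotation
Problem: the default json-file logging driver does not rotate, so logs eventually fill the disk.
Solution: configure a logging driver with a rotation policy; the local driver is the recommended choice for production.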
services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        # The local driver accepts only max-size, max-file, and compress;
        # a tag option is rejected as unknown
        max-size: "10m"
        max-file: "5"
  nginx:
    image: nginx:1.27-alpine
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "10"
        tag: "nginx/{{.Name}}"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
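Logging driver comparison:
| Driver | Best for | Pros | Cons |
|---|---|---|---|
| local | Default recommendation for production | Automatic rotation, compressed storage | Local storage only |
| json-file | When you need the raw JSON log files | Native Docker support | Rotation must be configured by hand |
| syslog | Centralized log collection | Can ship to a remote host | More complex to configure |
| fluentd | Integration with an EFK stack | Flexible log routing | Requires a separate Fluentd deployment |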
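The rotation options translate directly into a disk budget: each container keeps at most max-file files of at most max-size each. The sketch below works that out for the two services configured above. It assumes the k/m/g suffixes are powers of 1024 and ignores the compression the local driver applies to rotated files, so treat the result as an upper bound. parseSize and maxLogBytes are names invented for this example.

```javascript
// Multipliers for Docker-style size suffixes.
// Assumption: binary units (1m = 1024 * 1024 bytes).
const UNITS = { k: 1024, m: 1024 ** 2, g: 1024 ** 3 };

// Parses strings such as "10m" or "1g" into a byte count.
function parseSize(text) {
  const match = /^(\d+)([kmg])?$/i.exec(String(text).trim());
  if (!match) throw new Error(`unrecognized size: ${text}`);
  const unit = match[2] ? UNITS[match[2].toLowerCase()] : 1;
  return Number(match[1]) * unit;
}

// Worst-case bytes of log data retained per container: max-size x max-file.
function maxLogBytes(maxSize, maxFile) {
  return parseSize(maxSize) * Number(maxFile);
}

console.log(maxLogBytes("10m", "5") / 1024 ** 2);  // 50  (MiB, app)
console.log(maxLogBytes("50m", "10") / 1024 ** 2); // 500 (MiB, nginx)
```

Multiplying the per-container figure by the number of containers on the host gives a quick sanity check against the free space under /var/lib/docker.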
Application-level structured logging:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
# Set the time zone
ENV TZ=Asia/Shanghai
# Emit logs as JSON
ENV LOG_FORMAT=json
CMD ["node", "server.js"]
// Structured logging in the application: one JSON object per line
const logger = {
  info: (msg, meta = {}) => {
    console.log(JSON.stringify({ level: 'info', msg, ts: new Date().toISOString(), ...meta }));
  },
  error: (msg, meta = {}) => {
    console.error(JSON.stringify({ level: 'error', msg, ts: new Date().toISOString(), ...meta }));
  }
};
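Pattern 4: Secrets management
Problem: sensitive values are written in plain text in the Compose file or in environment variables.
Solution: Docker Secrets combined with environment variables that use the _FILE suffix.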
secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
  app:
    image: myapp:latest
    environment:
      # Interpolating ${DB_PASSWORD} into the URL would put the password back
      # into the environment, so the app reads it from a file instead
      DATABASE_URL: postgresql://appuser@postgres:5432/appdb
      DB_PASSWORD_FILE: /run/secrets/db_password
      API_KEY_FILE: /run/secrets/api_key
      JWT_SECRET_FILE: /run/secrets/jwt_secret
    secrets:
      - db_password
      - api_key
      - jwt_secret
Managing the secret files:
# Create the secrets directory
mkdir -p secrets
chmod 700 secrets
# Write the secret files (printf avoids the trailing newline that echo adds)
printf '%s' "my-super-secret-password-2026" > secrets/db_password.txt
printf '%s' "ak-live-xxxx-yyyy-zzzz" > secrets/api_key.txt
printf '%s' "jwt-hs256-secret-key-here" > secrets/jwt_secret.txt
# Restrict permissions: readable and writable by the file owner only
chmod 600 secrets/*.txt
.gitignore must include:
secrets/
*.secret
.env.prod
.env.staging
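Docker Secrets vs .env:
| Feature | Docker Secrets | .env file |
|---|---|---|
| Encrypted at rest | Yes (Swarm mode only) | No |
| File permissions | Restricted (mounted under /run/secrets/) | Depends on filesystem permissions |
| Audit trail | Only in Swarm mode | No |
| Cross-node sync | Automatic in Swarm | Manual distribution |
| Best for | Swarm and single-host production | Development |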
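The _FILE convention only helps if the application actually reads the file. Official images such as postgres do this for you, but your own app has to implement it. Below is a minimal sketch of such a helper; the fallback from NAME_FILE to NAME and the idea of assembling the database URL in-process from a DB_PASSWORD_FILE variable are assumptions made for this example, not something Compose prescribes.

```javascript
const fs = require("node:fs");

// Resolves a setting from NAME_FILE (path to a mounted secret) or, as a
// fallback for local development, from NAME itself.
function readSecret(name, env = process.env) {
  const file = env[`${name}_FILE`];
  if (file) {
    // Strip a single trailing newline, which echo-created files contain.
    return fs.readFileSync(file, "utf8").replace(/\r?\n$/, "");
  }
  if (env[name]) return env[name];
  throw new Error(`missing secret: set ${name}_FILE or ${name}`);
}

// Builds the connection string inside the process so the password never
// appears in the Compose file or in the container's environment.
function databaseUrl(env = process.env) {
  const password = encodeURIComponent(readSecret("DB_PASSWORD", env));
  return `postgresql://appuser:${password}@postgres:5432/appdb`;
}
```

With this in place, the app can call readSecret("API_KEY") and readSecret("JWT_SECRET") at startup and fail immediately if a secret was not mounted.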
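Pattern 5: Zero-downtime rolling updates
Problem: docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update.
Solution: run old and new containers side by side with docker compose up --no-recreate --scale, gated by health checks and fronted by a reverse proxy.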
services:
  app:
    image: myapp:${APP_VERSION:-latest}
    deploy:
      replicas: 2
      # update_config and rollback_config take effect only under Swarm
      # (docker stack deploy); docker compose up ignores them
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 15s
    labels:
      - "com.toolsku.app=true"
    ports:
      # Room for 2 replicas plus 2 temporary containers during an update
      - "3000-3003:3000"
  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 3s
      retries: 3
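Nginx reverse proxy configuration (upstream resolved through Docker DNS):
upstream app_backend {
    server app:3000;
    # Docker DNS returns one address per replica and Nginx balances across them.
    # Nginx resolves the name when it loads the config, so reload it after scaling.
}
server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Retry the next upstream on errors, timeouts, and 502/503 responses
        proxy_next_upstream error timeout http_502 http_503;
    }
    location /health {
        access_log off;
        return 200 'ok';
        add_header Content-Type text/plain;
    }
}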
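Zero-downtime update script:
#!/bin/bash
set -euo pipefail
# The Compose file reads the image tag from APP_VERSION
export APP_VERSION="v2.0.0"
echo "🚀 Starting zero-downtime update to myapp:${APP_VERSION}"
# 1. Pull the new image
docker compose pull app
# 2. Remember the running containers, then start the same number of new ones
#    next to them. --no-recreate leaves the old containers untouched; the app
#    service needs enough free host ports for the temporary extra containers.
OLD_IDS=$(docker compose ps -q app)
if [ -z "$OLD_IDS" ]; then
  echo "No running app containers; run docker compose up -d first"
  exit 1
fi
COUNT=$(echo "$OLD_IDS" | wc -l)
docker compose up -d --no-deps --no-recreate --scale app=$((COUNT * 2)) app
NEW_IDS=$(docker compose ps -q app | grep -vxF "$OLD_IDS")
# 3. Wait until every new container reports healthy (up to 60s each)
echo "⏳ Waiting for new containers to be healthy..."
for id in $NEW_IDS; do
  for i in $(seq 1 30); do
    status=$(docker inspect --format='{{.State.Health.Status}}' "$id")
    if [ "$status" = "healthy" ]; then
      echo "✅ New container ${id:0:12} is healthy"
      break
    fi
    if [ "$i" -eq 30 ]; then
      echo "❌ Health check failed, rolling back..."
      docker rm -f $NEW_IDS
      exit 1
    fi
    sleep 2
  done
done
# 4. Let Nginx pick up the new containers, then retire the old ones
docker compose exec -T nginx nginx -s reload
docker stop $OLD_IDS
docker rm $OLD_IDS
docker compose exec -T nginx nginx -s reload
echo "🎉 Update completed successfully"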
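Retiring the old container is only "zero downtime" if the application finishes its in-flight requests when Docker asks it to stop. Docker sends SIGTERM first and, by default, SIGKILL about 10 seconds later. The article does not cover this step, so the following is a hedged sketch of one way to handle it in Node. installGracefulShutdown is a hypothetical helper, and the 8-second default is an assumption chosen to stay under Docker's default grace period.

```javascript
// Wires SIGTERM to a graceful stop: stop accepting new connections, let
// in-flight requests finish, then exit. `server` is any object with a
// Node-style close(callback) method, such as an http.Server.
// `exit` and `proc` are injectable so the logic can be tested.
function installGracefulShutdown(
  server,
  { timeoutMs = 8000, exit = process.exit, proc = process } = {}
) {
  let shuttingDown = false;
  const onSignal = () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    // Force a non-zero exit if draining takes too long.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref();
    server.close(() => {
      clearTimeout(timer);
      exit(0);
    });
  };
  proc.once("SIGTERM", onSignal);
  return onSignal;
}
```

In the application this would be called once after the HTTP server is created, for example installGracefulShutdown(server). Combined with proxy_next_upstream in Nginx, requests that arrive while an old container is draining are retried against the new ones.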
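Pattern 6: Prometheus + Grafana monitoring stack
Problem: running production without monitoring is flying blind; you only learn about problems when users report them.
Solution: deploy a complete Prometheus + Grafana monitoring stack.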
services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # rule_files in prometheus.yml is resolved relative to the config file
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
    deploy:
      resources:
        limits:
          memory: 512M
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      # Grafana reads a value from a file when the variable ends in __FILE
      # (double underscore)
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
    secrets:
      - grafana_password
    volumes:
      - grafana_data:/var/lib/grafana
    deploy:
      resources:
        limits:
          memory: 256M
    depends_on:
      - prometheus
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring
secrets:
  grafana_password:
    file: ./secrets/grafana_password.txt
volumes:
  prometheus_data:
  grafana_data:
networks:
  monitoring:
    driver: bridge
Prometheus configuration:
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # The app service must share a network with Prometheus to be reachable
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: /metrics
Alert rules:
# monitoring/alert_rules.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.instance }} is down"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 400MB on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
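The app scrape job expects the application to serve /metrics, and the HighMemoryUsage rule relies on the process_resident_memory_bytes gauge. The article does not show the application side, so here is a minimal, dependency-free sketch that emits just that one gauge in the Prometheus text exposition format. A real service would more likely use a Prometheus client library; renderMetrics and metricsHandler are names invented for this example.

```javascript
// Renders a minimal payload in the Prometheus text exposition format.
// Only the gauge referenced by the HighMemoryUsage rule is emitted.
function renderMetrics(memoryUsage = process.memoryUsage()) {
  return [
    "# HELP process_resident_memory_bytes Resident memory size in bytes.",
    "# TYPE process_resident_memory_bytes gauge",
    `process_resident_memory_bytes ${memoryUsage.rss}`,
    "", // the payload must end with a newline
  ].join("\n");
}

// Request handler for GET /metrics, to be mounted on the app's HTTP server.
function metricsHandler(req, res) {
  res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4" });
  res.end(renderMetrics());
}
```

Once the endpoint is live, the target should show as UP on the Prometheus targets page, and the alert expression can be tested in the expression browser before relying on it.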
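Pattern 7: Multi-environment configuration (dev/staging/prod)
Problem: development, staging, and production settings are mixed together, so changing one environment risks breaking another.
Solution: use docker-compose.override.yml together with a multi-file override strategy.
Directory layout:
project/
├── docker-compose.yml            # Base configuration
├── docker-compose.override.yml   # Development overrides (loaded automatically)
├── docker-compose.staging.yml    # Staging overrides
├── docker-compose.prod.yml       # Production overrides
├── .env                          # Default environment variables
├── .env.staging                  # Staging environment variables
├── .env.prod                     # Production environment variables
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── alertmanager.yml
└── secrets/
    ├── db_password.txt
    ├── api_key.txt
    └── grafana_password.txt
Base configuration docker-compose.yml: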
services:
  app:
    image: myapp:${APP_VERSION:-latest}
    environment:
      NODE_ENV: ${NODE_ENV:-development}
      # The password is supplied as a secret file rather than interpolated
      # into the URL from ${DB_PASSWORD}
      DATABASE_URL: postgresql://appuser@postgres:5432/appdb
      DB_PASSWORD_FILE: /run/secrets/db_password
      REDIS_URL: redis://redis:6379
    secrets:
      - db_password
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network
secrets:
  db_password:
    file: ./secrets/db_password.txt
volumes:
  postgres_data:
networks:
  app-network:
    driver: bridge
Development overrides docker-compose.override.yml (loaded automatically):
services:
  app:
    build: .
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
      - "9229:9229"
    environment:
      NODE_ENV: development
      LOG_LEVEL: debug
    deploy:
      resources:
        limits:
          memory: 512M
    command: node --inspect=0.0.0.0:9229 server.js
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    networks:
      - app-network
Production overrides docker-compose.prod.yml:
services:
  app:
    image: myapp:${APP_VERSION}
    # No fixed host port: with 2 replicas a "3000:3000" mapping would collide.
    # Nginx reaches the app over the Compose network instead.
    expose:
      - "3000"
    environment:
      NODE_ENV: production
      LOG_LEVEL: info
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s
  postgres:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          memory: 1G
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"
  redis:
    deploy:
      resources:
        limits:
          memory: 512M
    command: redis-server --appendonly yes --maxmemory 400mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "3"
  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.prod.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    # Must share a network with app to resolve and reach it
    networks:
      - app-network
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"
volumes:
  redis_data:
Startup commands:
# Development (the override file is loaded automatically)
docker compose up -d
# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml --env-file .env.staging up -d
# Production
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
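Because each environment injects a different set of variables, it helps if the application validates them once at startup rather than failing later on first use. A minimal sketch follows, using the variables that appear in the Compose files above. Which ones count as required, and the default log level per environment, are assumptions made for this example; loadConfig is a hypothetical helper.

```javascript
// Reads the environment variables set by the Compose files and fails fast
// when a required one is missing.
function loadConfig(env = process.env) {
  const required = ["REDIS_URL"]; // assumption: adjust to your service
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`missing environment variables: ${missing.join(", ")}`);
  }
  const nodeEnv = env.NODE_ENV || "development";
  return {
    nodeEnv,
    // Default to quieter logs in production unless LOG_LEVEL is set.
    logLevel: env.LOG_LEVEL || (nodeEnv === "production" ? "info" : "debug"),
    redisUrl: env.REDIS_URL,
  };
}
```

A misconfigured environment then shows up as an immediate, clearly worded crash in docker compose logs instead of a connection error minutes later.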
5 common pitfalls
Pitfall 1: depends_on does not mean the service is ready
❌ Wrong:
services:
  app:
    depends_on:
      - postgres
    # The postgres container has started, but the database may not have finished initializing
✅ Right:
services:
  app:
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
Pitfall 2: No resource limits
❌ Wrong:
services:
  app:
    image: myapp:latest
    # No resource limits at all; one memory leak can consume the whole host
✅ Right:
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 128M
Pitfall 3: No log rotation
❌ Wrong:
services:
  app:
    image: myapp:latest
    # Default json-file driver; logs grow without bound
✅ Right:
services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
Pitfall 4: Secrets stored in plain text
❌ Wrong:
services:
  postgres:
    environment:
      POSTGRES_PASSWORD: "my-secret-password-123"
      # The password is in plain text in the Compose file; once it is committed to Git, it is compromised
✅ Right:
services:
  postgres:
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
secrets:
  db_password:
    file: ./secrets/db_password.txt
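Pitfall 5: Using the latest tag
❌ Wrong:
services:
  app:
    image: myapp:latest
    # Each pull may fetch a different image, so deployments are not reproducible
✅ Right:
services:
  app:
    image: myapp:2.1.0
    # Or control the version with a variable (use one image line, not both):
    # image: myapp:${APP_VERSION:-2.1.0}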
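Troubleshooting cheat sheet
| Error message | Cause | Fix |
|---|---|---|
| OOMKilled | Container exceeded its memory limit | Raise the memory limit or reduce the application's memory use |
| Connection refused | A dependency is not ready yet | Add healthcheck + depends_on.condition |
| no space left on device | Logs or images have filled the disk | Configure logging rotation and run docker system prune |
| restarting loop | The application crashes on startup | Run docker logs <id> to read the logs and check the configuration |
| permission denied | File or directory permission problem | Check the user directive and volume permissions |
| port is already allocated | Port conflict | Change the port mapping or stop the process holding the port |
| unhealthy status | Health check is failing | Verify that the healthcheck command is correct |
| secret not found | Secret file does not exist | Make sure the corresponding file exists under secrets/ |
| Cannot connect to the Docker daemon | Docker is not running | systemctl start docker |
| image pulling failed | Image pull failed | Check the network, registry authentication, and image name spelling |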
Advanced optimizations
Faster development with Docker Compose Watch
Compose Watch syncs file changes into the running container, so most edits do not require an image rebuild:
services:
  app:
    build: .
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
        - action: rebuild
          path: ./package.json
        - action: sync+restart
          path: ./config
          target: /app/config
# Start watch mode
docker compose watch
Network isolation and security
services:
  app:
    networks:
      - frontend
      - backend
  postgres:
    networks:
      - backend
    # postgres is not on the frontend network, so it cannot be reached directly from outside
  nginx:
    networks:
      - frontend
    ports:
      - "80:80"
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
    # internal: true cuts the network off from external access
Image optimization and multi-stage builds
# ---- Build stage ----
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# ---- Runtime stage ----
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
# node:20-alpine does not ship curl, so use the BusyBox wget that it does include
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -q --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
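Orchestration tool comparison
| Feature | Docker Compose | Kubernetes | Nomad | Docker Swarm |
|---|---|---|---|---|
| Complexity | Low | High | Medium | Low |
| Single-host deployment | ✅ Excellent | ❌ Too heavy | ⚠️ Usable | ✅ Excellent |
| Multi-host orchestration | ❌ Not supported | ✅ Core capability | ✅ Core capability | ⚠️ Basic |
| Autoscaling | ❌ | ✅ HPA/VPA | ✅ | ⚠️ Manual |
| Rolling updates | ⚠️ Needs scripting | ✅ Native | ✅ Native | ✅ Native |
| Service discovery | ⚠️ DNS | ✅ CoreDNS | ✅ Consul | ✅ DNS |
| Storage orchestration | ❌ | ✅ CSI | ✅ CSI | ⚠️ Basic |
| Learning curve | Low | High | Medium | Low |
| Suitable scale | 1-10 services | 100+ services | 50+ services | 10-50 services |
| Production readiness | ✅ Single host | ✅ Large scale | ✅ Medium to large scale | ⚠️ Low community activity |
Recommendation: use Docker Compose for single-host or small production environments, Kubernetes for medium to large deployments, and Nomad if you are already in the HashiCorp ecosystem. Docker Swarm is increasingly marginal and is not recommended for new projects.
Summary
Deploying Docker Compose to production is more than running docker compose up -d. Health checks make sure services are actually ready, resource limits prevent cascading OOM failures, log rotation keeps the disk from filling up, Secrets protect sensitive values, zero-downtime updates keep the service available 24/7, a monitoring stack ends the blind flying, and multi-environment configuration gives dev, staging, and prod each its own settings. With these 7 strategies in place, Docker Compose is fully capable of handling small to medium production deployments.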