Docker Compose in Production: 7 Key Strategies, from Health Checks to Zero-Downtime Updates

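"It works on my machine": the production container graveyard

Every developer's favorite line is "it works on my machine." The nightmare only begins once the containers reach production:

  • A container is silently OOM-killed, leaving a single Out of memory line in the logs
  • The database has not finished starting, yet the application container is already spamming Connection refused
  • A container dies at 3 a.m. with no restart policy, and the service stays down until morning
  • Log files fill the disk, and docker logs dumps tens of gigabytes of unstructured text
  • Production configuration sits in plain text in docker-compose.yml, database password included

If you still deploy to production with a bare docker compose up -d, this article is for you.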

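Core concepts at a glance

| Concept | Purpose | Key production setting |
| --- | --- | --- |
| Health Check | Detects whether a container is actually usable | healthcheck + depends_on.condition |
| Resource Limits | Caps CPU and memory so one service cannot starve the others | deploy.resources.limits |
| Restart Policy | Restarts a container automatically after an abnormal exit | restart or deploy.restart_policy |
| Secrets | Keeps sensitive values out of the Compose file and image (encrypted at rest only in Swarm mode) | secrets + Docker Secret |
| Logging Driver | Log rotation, plus structured logs from the application | logging.driver + logging.options |
| Profiles | Starts services selectively per environment | profiles |
| Watch | Syncs file changes into running containers | develop.watch (Compose Watch) |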

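The 5 big production challenges

Challenge 1: Uncontrolled startup order

The database is still initializing while the application container already tries to connect, so startup fails. In its short form, depends_on only guarantees start order, not readiness.

Challenge 2: Unbounded resource growth

A container without resource limits is like a car without brakes. One container with a memory leak can consume all of the host's memory and drag every other service down with it.

Challenge 3: The log black hole

The default json-file logging driver does not rotate logs unless max-size is set. After three months of uptime, /var/lib/docker can fill the disk and bring every service down.

Challenge 4: Leaked secrets

Plaintext database passwords under environment, a .env file committed to the Git repository, an API key hard-coded into the image: each one is a production incident waiting to happen.

Challenge 5: Every update means downtime

By default, docker compose up -d stops the old container before starting its replacement, so the service is unavailable during the update. For a 24/7 service, that is unacceptable.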

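7 practical patterns

Pattern 1: Health checks and dependency ordering

Problem: depends_on alone only controls start order; it does not guarantee that the dependency is actually ready.

Solution: combine healthcheck with depends_on.condition: service_healthy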

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  app:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "3000:3000"

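Key parameters

  • interval: time between checks; 5-10 seconds is a reasonable production value
  • timeout: how long a single check may run; 3-5 seconds is typical
  • retries: number of consecutive failures before the container is marked unhealthy
  • start_period: grace period after container start; failures during it do not count toward retries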
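
The application healthchecks used later in this article probe http://localhost:3000/health, so the app needs an endpoint that reports real readiness rather than a fixed 200. A minimal sketch in Node.js, assuming hypothetical check functions (a database ping, for example) that return a promise; runChecks and createHealthServer are illustrative names, not part of any library:

```javascript
const http = require('node:http');

// Run every named check with a time budget; a check passes if its promise resolves.
async function runChecks(checks, timeoutMs = 2000) {
  const results = {};
  await Promise.all(Object.entries(checks).map(async ([name, check]) => {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
    });
    try {
      await Promise.race([check(), timeout]);
      results[name] = 'ok';
    } catch (err) {
      results[name] = `failed: ${err.message}`;
    } finally {
      clearTimeout(timer);
    }
  }));
  const healthy = Object.values(results).every((result) => result === 'ok');
  return {
    statusCode: healthy ? 200 : 503, // curl -f treats 503 as a failure
    body: { status: healthy ? 'ok' : 'unhealthy', checks: results },
  };
}

// Build (but do not start) an HTTP server that answers GET /health.
function createHealthServer(checks) {
  return http.createServer(async (req, res) => {
    if (req.url !== '/health') {
      res.writeHead(404);
      res.end();
      return;
    }
    const { statusCode, body } = await runChecks(checks);
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(body));
  });
}
```

Keeping the per-check budget below the healthcheck timeout lets a slow dependency surface as a 503 rather than as a probe timeout.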

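Pattern 2: Resource limits and OOM protection

Problem: an unconstrained container competes for host resources, and a single runaway container can take down the whole machine.

Solution: use deploy.resources to set limits and reservations for CPU and memory.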

services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  worker:
    image: myworker:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          memory: 512M

limits vs reservations

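  • limits: hard ceiling; a container that exceeds its memory limit is OOM-killed, and one that exceeds its CPU limit is throttled
  • reservations: soft guarantee; honored on a best-effort basis, not enforced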
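
Because the memory limit is a hard ceiling, it can help for the app to know its own limit and warn before the kernel steps in. A rough sketch, assuming the standard cgroup file locations inside the container; the function names are hypothetical:

```javascript
const fs = require('node:fs');

// cgroup v2 exposes the limit in memory.max; cgroup v1 uses memory.limit_in_bytes.
const CGROUP_FILES = [
  '/sys/fs/cgroup/memory.max',
  '/sys/fs/cgroup/memory/memory.limit_in_bytes',
];

// Parse the file content; cgroup v2 writes the literal "max" when no limit is set.
function parseLimit(raw) {
  const text = raw.trim();
  if (text === 'max') return Infinity;
  const bytes = Number(text);
  return Number.isFinite(bytes) && bytes > 0 ? bytes : Infinity;
}

// Return the first readable limit, or Infinity when none of the files exists.
function readMemoryLimit(files = CGROUP_FILES) {
  for (const file of files) {
    try {
      return parseLimit(fs.readFileSync(file, 'utf8'));
    } catch {
      // file missing on this cgroup version; try the next candidate
    }
  }
  return Infinity;
}

// Fraction of the limit currently used by this process (0 when unlimited).
function memoryUsageRatio(limitBytes, rssBytes = process.memoryUsage().rss) {
  return limitBytes === Infinity ? 0 : rssBytes / limitBytes;
}
```

An app could log a warning once memoryUsageRatio(readMemoryLimit()) passes a threshold such as 0.8; the threshold is a judgment call, not a Docker setting.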

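OOM protection strategy

services:
  critical-service:
    image: critical-app:latest
    deploy:
      resources:
        limits:
          memory: 1G
    # Lower the OOM score so the kernel prefers other processes when the host runs short of memory.
    # Valid range is -1000 (never kill) to 1000 (kill first).
    oom_score_adj: -500

On the host, you can also tune vm.overcommit_memory and adjust the OOM score of individual processes:

# Check whether a container was OOM-killed
docker inspect --format='{{.State.OOMKilled}}' <container_id>

# Exempt a critical host process from the OOM killer (run as root)
echo -1000 > /proc/<pid>/oom_score_adj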

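Pattern 3: Structured logs and log rotation

Problem: the default json-file logging driver does not rotate, so logs eventually fill the disk.

Solution: configure a logging driver together with a rotation policy; the local driver is the recommended choice for production.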

services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
        # the local driver accepts max-size, max-file and compress; tag is a json-file option

  nginx:
    image: nginx:1.27-alpine
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "10"
        tag: "nginx/{{.Name}}"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro

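Logging driver comparison

| Driver | Best for | Pros | Cons |
| --- | --- | --- | --- |
| local | Recommended production default | Rotates automatically, compressed storage | Local storage only |
| json-file | When you need JSON-wrapped log files on disk | Supported natively by Docker | Rotation must be configured manually |
| syslog | Centralized log collection | Can ship logs to a remote host | More complex to configure |
| fluentd | Integration with an EFK stack | Flexible log routing | Requires a separate Fluentd deployment |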

Application-level structured logging

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Set the time zone
ENV TZ=Asia/Shanghai

# Emit logs as JSON
ENV LOG_FORMAT=json

CMD ["node", "server.js"]

// Structured logging in the application
const logger = {
  info: (msg, meta = {}) => {
    console.log(JSON.stringify({ level: 'info', msg, ts: new Date().toISOString(), ...meta }));
  },
  error: (msg, meta = {}) => {
    console.error(JSON.stringify({ level: 'error', msg, ts: new Date().toISOString(), ...meta }));
  }
};
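
The environment overrides later in this article set LOG_LEVEL to debug or info, which the logger above does not yet honor. A possible extension, offered as a sketch: createLogger and the numeric ranks are assumptions, and unlike the original it writes every level to stdout unless a different write function is passed in.

```javascript
// Numeric ranks make "is this level enabled?" a simple comparison.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

function createLogger({
  level = process.env.LOG_LEVEL || 'info',
  write = (line) => process.stdout.write(`${line}\n`),
} = {}) {
  const threshold = LEVELS[level] ?? LEVELS.info; // unknown level names fall back to info
  const logger = {};
  for (const [name, rank] of Object.entries(LEVELS)) {
    logger[name] = (msg, meta = {}) => {
      if (rank < threshold) return; // below the configured level: drop the entry
      write(JSON.stringify({ level: name, msg, ts: new Date().toISOString(), ...meta }));
    };
  }
  return logger;
}

// Usage: one JSON object per line, which log collectors can parse without regexes.
const log = createLogger({ level: 'info' });
log.info('server started', { port: 3000 });
log.debug('not printed at info level');
```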

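Pattern 4: Secrets management

Problem: sensitive values are written in plain text in the Compose file or in environment variables.

Solution: Docker Secrets plus environment variables with the _FILE suffix.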

secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

  app:
    image: myapp:latest
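    environment:
      # Interpolating ${DB_PASSWORD} into DATABASE_URL would expose the password in the
      # container environment. DB_* names are illustrative; the app builds its own URL.
      DB_HOST: postgres
      DB_NAME: appdb
      DB_USER: appuser
      DB_PASSWORD_FILE: /run/secrets/db_password
      API_KEY_FILE: /run/secrets/api_key
      JWT_SECRET_FILE: /run/secrets/jwt_secret
    secrets:
      - db_password
      - api_key
      - jwt_secret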
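
The _FILE convention only works if the application reads the file itself; the official postgres image does this, but a custom app has to implement it. A minimal sketch, assuming the app falls back to the plain variable during local development; readSecret is a hypothetical helper name:

```javascript
const fs = require('node:fs');

// Resolve NAME from NAME_FILE (path to a mounted secret) or fall back to NAME itself.
function readSecret(name, env = process.env) {
  const filePath = env[`${name}_FILE`];
  if (filePath) {
    // Secret files often end with a newline; strip it so it is not part of the value.
    return fs.readFileSync(filePath, 'utf8').replace(/\r?\n$/, '');
  }
  if (env[name] !== undefined) return env[name];
  throw new Error(`Missing secret: set ${name}_FILE or ${name}`);
}
```

With the Compose file above, readSecret('API_KEY') would return the content of /run/secrets/api_key inside the container. Throwing on a missing secret makes a misconfigured deployment fail at startup rather than on the first authenticated request.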

Managing secret files

# Create the secrets directory
mkdir -p secrets
chmod 700 secrets

# Write the secret files (printf avoids a trailing newline in the value)
printf '%s' "my-super-secret-password-2026" > secrets/db_password.txt
printf '%s' "ak-live-xxxx-yyyy-zzzz" > secrets/api_key.txt
printf '%s' "jwt-hs256-secret-key-here" > secrets/jwt_secret.txt

# Restrict permissions: readable and writable by the owner only
chmod 600 secrets/*.txt

.gitignore must include

secrets/
*.secret
.env.prod
.env.staging

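Docker Secrets vs .env

| Feature | Docker Secrets | .env file |
| --- | --- | --- |
| Encrypted storage | Yes (in Swarm mode) | No |
| File permissions | Restricted (/run/secrets/) | Depends on filesystem permissions |
| Audit trail | | |
| Cross-node sync | Synced automatically by Swarm | Must be distributed manually |
| Use case | Swarm / single-host production | Development |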

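Pattern 5: Zero-downtime rolling updates

Problem: by default, docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update.

Solution: start the new container alongside the old one (docker compose up -d --no-deps --no-recreate --scale app=2), then rely on health checks and a reverse proxy to switch traffic.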

services:
  app:
    image: myapp:${APP_VERSION:-latest}
    deploy:
      replicas: 2
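      # update_config and rollback_config take effect only under Swarm (docker stack deploy);
      # plain docker compose up ignores them, hence the update script further down.
      update_config: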
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 15s
    labels:
      - "com.toolsku.app=true"
    ports:
      - "3000-3001:3000"

  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 3s
      retries: 3

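Nginx reverse proxy configuration (these blocks belong inside the http context; upstream names are resolved when Nginx starts or reloads):

upstream app_backend {
    server app:3000;
    # Docker DNS returns one address per replica; Nginx balances across the ones it resolved
}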

server {
    listen 80;
    server_name app.example.com;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Retry the next upstream on connection errors, timeouts, or 502/503 responses
        proxy_next_upstream error timeout http_502 http_503;
    }

    location /health {
        access_log off;
        default_type text/plain;
        return 200 'ok';
    }
}

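Zero-downtime update script

#!/bin/bash
set -euo pipefail

# Compose reads APP_VERSION for image: myapp:${APP_VERSION:-latest}
export APP_VERSION="v2.0.0"
echo "🚀 Starting zero-downtime update to myapp:${APP_VERSION}"

# 1. Pull the new image
docker compose pull app

# 2. Start a second container next to the old one.
#    Assumes one app container is running; --no-recreate leaves it untouched.
OLD_ID=$(docker compose ps -q app | head -n 1)
docker compose up -d --no-deps --no-recreate --scale app=2 app
NEW_ID=$(docker compose ps -q app | grep -v "^${OLD_ID}$" | head -n 1)

# 3. Wait until the new container's own health check passes
echo "⏳ Waiting for the new container to become healthy..."
for i in $(seq 1 30); do
  STATUS=$(docker inspect --format '{{.State.Health.Status}}' "$NEW_ID")
  if [ "$STATUS" = "healthy" ]; then
    echo "✅ New container is healthy"
    break
  fi
  if [ "$i" -eq 30 ]; then
    echo "❌ Health check failed, rolling back..."
    docker rm -f "$NEW_ID"
    exit 1
  fi
  sleep 2
done

# 4. Let Nginx pick up the new container, then retire the old one
docker compose exec -T nginx nginx -s reload
docker stop "$OLD_ID" && docker rm "$OLD_ID"
docker compose exec -T nginx nginx -s reload

echo "🎉 Update completed successfully"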
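
Retiring the old container is only seamless if the app finishes its in-flight requests before exiting. docker stop sends SIGTERM and follows up with SIGKILL after a grace period (10 seconds by default). A sketch of a drain step, assuming a plain node:http server and Node 18.2 or newer for closeIdleConnections and closeAllConnections; drain and installShutdownHandler are hypothetical names:

```javascript
const http = require('node:http');

// Stop accepting new connections and wait for in-flight requests to finish.
// Resolves with 'drained', or with 'forced' if draining exceeds timeoutMs.
function drain(server, timeoutMs = 8000) {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      server.closeAllConnections?.(); // drop whatever is still open
      resolve('forced');
    }, timeoutMs);
    server.close(() => {
      clearTimeout(timer);
      resolve('drained');
    });
    server.closeIdleConnections?.(); // idle keep-alive sockets would otherwise delay close()
  });
}

// Keep timeoutMs below the stop grace period so the process exits before SIGKILL arrives.
function installShutdownHandler(server, timeoutMs = 8000) {
  process.once('SIGTERM', async () => {
    await drain(server, timeoutMs);
    process.exit(0);
  });
}
```

Calling installShutdownHandler(server) right after server.listen(3000) would be enough for the app service in this pattern.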

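Pattern 6: Prometheus + Grafana monitoring stack

Problem: running production without monitoring is flying blind; you only learn about failures from user complaints.

Solution: deploy a complete Prometheus + Grafana monitoring stack.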

services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
    deploy:
      resources:
        limits:
          memory: 512M
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      # Grafana's file-based variant uses a double underscore before FILE
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
    secrets:
      - grafana_password
    volumes:
      - grafana_data:/var/lib/grafana
    deploy:
      resources:
        limits:
          memory: 256M
    depends_on:
      - prometheus
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring

secrets:
  grafana_password:
    file: ./secrets/grafana_password.txt

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Prometheus configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: /metrics

Alert rules

# monitoring/alert_rules.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 400MB on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
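
The app scrape job and the HighMemoryUsage rule assume the application serves process_resident_memory_bytes at /metrics. In practice a client library usually handles this; the dependency-free sketch below only illustrates the Prometheus text exposition format. renderMetrics, collectProcessMetrics, and the process_uptime_seconds metric name are illustrative assumptions.

```javascript
// Render metrics in the Prometheus text exposition format:
// a HELP line, a TYPE line, then "name value".
function renderMetrics(metrics) {
  return metrics
    .map(({ name, help, type, value }) =>
      `# HELP ${name} ${help}\n# TYPE ${name} ${type}\n${name} ${value}\n`)
    .join('');
}

// Sample values taken from the running Node.js process.
function collectProcessMetrics() {
  return [
    {
      name: 'process_resident_memory_bytes',
      help: 'Resident memory size in bytes.',
      type: 'gauge',
      value: process.memoryUsage().rss,
    },
    {
      name: 'process_uptime_seconds', // hypothetical metric name
      help: 'Process uptime in seconds.',
      type: 'gauge',
      value: process.uptime(),
    },
  ];
}

// In a request handler for GET /metrics:
//   res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4' });
//   res.end(renderMetrics(collectProcessMetrics()));
```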

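Pattern 7: Multi-environment configuration (dev/staging/prod)

Problem: development, staging, and production settings are mixed together, so changing one environment risks breaking another.

Solution: use docker-compose.override.yml together with a multi-file override strategy.

Directory layout

project/
├── docker-compose.yml              # Base configuration
├── docker-compose.override.yml     # Development overrides (loaded automatically)
├── docker-compose.staging.yml      # Staging overrides
├── docker-compose.prod.yml         # Production overrides
├── .env                            # Default environment variables
├── .env.staging                    # Staging environment variables
├── .env.prod                       # Production environment variables
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── alertmanager.yml
└── secrets/
    ├── db_password.txt
    ├── api_key.txt
    └── grafana_password.txt

Base configuration: docker-compose.yml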

services:
  app:
    image: myapp:${APP_VERSION:-latest}
    environment:
      NODE_ENV: ${NODE_ENV:-development}
      DATABASE_URL: postgresql://appuser:${DB_PASSWORD}@postgres:5432/appdb
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network

secrets:
  db_password:
    file: ./secrets/db_password.txt

volumes:
  postgres_data:

networks:
  app-network:
    driver: bridge

Development overrides: docker-compose.override.yml (loaded automatically when no -f flag is given):

services:
  app:
    build: .
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
      - "9229:9229"
    environment:
      NODE_ENV: development
      LOG_LEVEL: debug
    deploy:
      resources:
        limits:
          memory: 512M
    command: node --inspect=0.0.0.0:9229 server.js

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    networks:
      - app-network

Production overrides: docker-compose.prod.yml

services:
  app:
    image: myapp:${APP_VERSION}
    # No host port here: two replicas cannot both bind 3000, and Nginx is the entry point
    expose:
      - "3000"
    environment:
      NODE_ENV: production
      LOG_LEVEL: info
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s

  postgres:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          memory: 1G
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"

  redis:
    deploy:
      resources:
        limits:
          memory: 512M
    command: redis-server --appendonly yes --maxmemory 400mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "3"

  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.prod.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"

volumes:
  redis_data:

Startup commands

# Development (the override file is loaded automatically)
docker compose up -d

# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml --env-file .env.staging up -d

# Production
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
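
Each environment injects its own NODE_ENV, LOG_LEVEL, and REDIS_URL, so it can pay off to validate them once at startup and fail fast on a misconfigured file. A minimal sketch; loadConfig and its defaults are assumptions for illustration:

```javascript
// Read and validate the settings the Compose files inject into the container.
function loadConfig(env = process.env) {
  const nodeEnv = env.NODE_ENV || 'development';
  const config = {
    nodeEnv,
    // Default to verbose logs outside production, matching the override files
    logLevel: env.LOG_LEVEL || (nodeEnv === 'production' ? 'info' : 'debug'),
    redisUrl: env.REDIS_URL,
  };
  const missing = Object.entries(config)
    .filter(([, value]) => value === undefined || value === '')
    .map(([key]) => key);
  if (missing.length > 0) {
    throw new Error(`Missing configuration: ${missing.join(', ')}`);
  }
  return Object.freeze(config); // freeze so later code cannot mutate settings by accident
}
```

Running docker compose -f docker-compose.yml -f docker-compose.prod.yml config before deploying prints the merged result, which is a useful complement to this in-app check.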

5 common pitfalls

Pitfall 1: depends_on does not mean ready

Wrong

services:
  app:
    depends_on:
      - postgres
    # The postgres container has started, but the database may still be initializing

Right

services:
  app:
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
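
Health-gated startup only covers the first boot; if the database restarts later, the app still has to reconnect on its own. A hedged sketch of retry with exponential backoff, where connect stands for whatever connection function the app already uses (an assumption here) and connectWithRetry is a hypothetical helper:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call connect() until it succeeds, doubling the wait after each failure.
async function connectWithRetry(connect, { attempts = 5, baseDelayMs = 500, maxDelayMs = 10000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt === attempts) break; // no point sleeping after the final attempt
      const delay = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
      await sleep(delay);
    }
  }
  throw lastError;
}
```

With the defaults, the waits are 500 ms, 1 s, 2 s, and 4 s before the fifth and final attempt.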

Pitfall 2: No resource limits

Wrong

services:
  app:
    image: myapp:latest
    # No limits at all: a single memory leak can consume the whole host

Right

services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 128M

Pitfall 3: No log rotation

Wrong

services:
  app:
    image: myapp:latest
    # Default json-file driver: logs grow without bound

Right

services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"

Pitfall 4: Secrets stored in plain text

Wrong

services:
  postgres:
    environment:
      POSTGRES_PASSWORD: "my-secret-password-123"
      # Plaintext password in the Compose file: one commit to Git and it is leaked

Right

services:
  postgres:
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt

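Pitfall 5: Using the latest tag

Wrong

services:
  app:
    image: myapp:latest
    # Each pull may fetch a different image, so deployments are not reproducible

Right

services:
  app:
    image: myapp:2.1.0
    # Or control the version through a variable:
    # image: myapp:${APP_VERSION:-2.1.0}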

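Troubleshooting cheat sheet

| Error | Cause | Fix |
| --- | --- | --- |
| OOMKilled | Container exceeded its memory limit | Raise the memory limit or reduce the application's memory usage |
| Connection refused | Dependency not ready | Add healthcheck + depends_on.condition |
| no space left on device | Logs or images filled the disk | Configure logging rotation + docker system prune |
| restarting loop | Application crashes on startup | Inspect docker logs <id> and check the configuration |
| permission denied | File or directory permission problem | Check the user directive and volume permissions |
| port is already allocated | Port conflict | Change the port mapping or stop the process holding the port |
| unhealthy status | Health check failing | Verify that the healthcheck command is correct |
| secret not found | Secret file missing | Make sure the matching file exists under secrets/ |
| Cannot connect to the Docker daemon | Docker is not running | systemctl start docker |
| image pulling failed | Image pull failed | Check the network, registry credentials, and image name spelling |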

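Advanced optimizations

Speeding up development with Docker Compose Watch

Compose Watch syncs file changes into the container automatically, so most edits need no manual image rebuild: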

services:
  app:
    build: .
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
        - action: rebuild
          path: ./package.json
        - action: sync+restart
          path: ./config
          target: /app/config

# Start watch mode
docker compose watch

Network isolation and security

services:
  app:
    networks:
      - frontend
      - backend

  postgres:
    networks:
      - backend
    # postgres is not on the frontend network and publishes no ports, so it cannot be reached from outside

  nginx:
    networks:
      - frontend
    ports:
      - "80:80"

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
    # internal: true cuts this network off from the outside world

Image optimization and multi-stage builds

# ---- Build stage ----
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# ---- Runtime stage ----
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup

COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER appuser
EXPOSE 3000

# node:20-alpine ships BusyBox wget but not curl
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -q --spider http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]
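
Slim images often ship without an HTTP client, so a probe written in Node itself avoids depending on one. A small sketch, assuming the app answers GET /health on port 3000 as elsewhere in this article; probe and the healthcheck.js file name are hypothetical:

```javascript
const http = require('node:http');

// Resolve true when the endpoint answers 2xx within the time budget, false otherwise.
function probe(url, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body so the socket is released
      resolve(res.statusCode >= 200 && res.statusCode < 300);
    });
    req.on('timeout', () => req.destroy()); // destroy() surfaces as an 'error' event
    req.on('error', () => resolve(false));
  });
}

// Entry point for a healthcheck.js script (exit code 0 = healthy, 1 = unhealthy):
//   probe('http://localhost:3000/health').then((ok) => process.exit(ok ? 0 : 1));
```

The Dockerfile line would then read HEALTHCHECK CMD ["node", "healthcheck.js"], at the cost of starting a Node process on every check.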

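Orchestration tool comparison

| Feature | Docker Compose | Kubernetes | Nomad | Docker Swarm |
| --- | --- | --- | --- | --- |
| Complexity | | | | |
| Single-host deployment | ✅ Excellent | ❌ Too heavy | ⚠️ Usable | ✅ Excellent |
| Multi-host orchestration | ❌ Not supported | ✅ Core capability | ✅ Core capability | ⚠️ Basic |
| Autoscaling | | ✅ HPA/VPA | | ⚠️ Manual |
| Rolling updates | ⚠️ Needs scripting | ✅ Native | ✅ Native | ✅ Native |
| Service discovery | ⚠️ DNS | ✅ CoreDNS | ✅ Consul | ✅ DNS |
| Storage orchestration | | ✅ CSI | ✅ CSI | ⚠️ Basic |
| Learning curve | | | | |
| Typical scale | 1-10 services | 100+ services | 50+ services | 10-50 services |
| Production readiness | ✅ Single host | ✅ Large scale | ✅ Medium to large scale | ⚠️ Less active community |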

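Recommendation: use Docker Compose for single-host or small production environments, Kubernetes for medium to large scale, and Nomad if you are invested in the HashiCorp ecosystem. Docker Swarm has been gradually sidelined and is not recommended for new projects.

Summary

Production deployment with Docker Compose is more than docker compose up -d. Health checks make sure services are truly ready, resource limits stop an OOM from cascading, log rotation keeps disks from filling up, Secrets protect sensitive values, zero-downtime updates keep a 24/7 service available, a monitoring stack ends the blind flying, and multi-environment configuration gives dev, staging, and prod each their own settings. With these 7 strategies in place, Docker Compose is fully capable of handling small and medium-sized production deployments.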
