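Docker Compose in Production: 7 Key Strategies, from Health Checks to Zero-Downtime Updates

"It works on my machine": the production container graveyard

"It works on my machine" is every developer's refrain. But once the containers are deployed to production, the real nightmare begins:

A container is silently OOM-killed, and the only trace in the logs is a single Out of memory line
The database has not finished starting, and the application container is already flooding the logs with Connection refused
A container dies at 3 a.m. with no restart policy, and the service stays down until morning
Log files fill the disk, and docker logs dumps tens of gigabytes of unstructured text
Production configuration sits in plain text in docker-compose.yml, with the database password fully exposed

If you are still running production with nothing more than docker compose up -d, this article is for you.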

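Core concepts at a glance

Concept	Purpose	Key production configuration
Health Check	Detects whether a container is actually able to serve	`healthcheck` + `depends_on.condition`
Resource Limits	Caps CPU and memory so one service cannot starve the others	`deploy.resources.limits`
Restart Policy	Restarts a container automatically after an abnormal exit	`restart` or `deploy.restart_policy`
Secrets	Keeps sensitive values out of the compose file and environment variables	`secrets` + Docker Secret
Logging Driver	Structured logs plus log rotation	`logging.driver` + `logging.options`
Profiles	Starts services selectively per environment	`profiles`
Watch	Syncs file changes into running containers	`develop.watch` (Compose Watch)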

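The 5 big production challenges

Challenge 1: Uncontrolled startup order

The database is still initializing while the application container is already trying to connect, so startup fails. In its short form, depends_on only guarantees start order, not that the service is ready.

Challenge 2: Unbounded resource growth

A container without resource limits is like a car without brakes. One container with a memory leak can consume all of the host's memory and drag every other service down with it.

Challenge 3: The log black hole

The default json-file logging driver does not rotate logs unless max-size is set. After three months of uptime, /var/lib/docker fills the disk and every service crashes.

Challenge 4: Leaked secrets

Database passwords in plain text under environment, .env files committed to the Git repository, API keys hard-coded into images: each one is a production incident waiting to happen.

Challenge 5: Every update is an outage

When a service's image or configuration changes, docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update. For a service that runs 24/7, that is unacceptable.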

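7 practical patterns

Pattern 1: Health checks and dependency ordering

Problem: depends_on only controls start order; it does not guarantee that the service is actually ready.

Solution: use healthcheck together with depends_on.condition: service_healthy.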

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  app:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "3000:3000"

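Key parameters:

interval: time between checks; 5 to 10 seconds is a reasonable production value
timeout: maximum duration of a single check; 3 to 5 seconds is typical
retries: number of consecutive failures before the container is marked unhealthy
start_period: grace period after container start, during which failures do not count toward retries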
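
These parameters together determine how long a broken container can keep its healthy status. The sketch below is a rough estimate, not Docker's own logic: it assumes every failing probe runs for the full timeout and that the next probe starts interval seconds after the previous one finishes. The function name is made up for illustration.

```javascript
// Rough upper bound (in seconds) on how long a container that has stopped
// working can still be reported as healthy. Assumption: each failing probe
// uses its whole timeout, and probes are spaced `interval` seconds apart.
function worstCaseDetectionSeconds({ interval, timeout, retries }) {
  return retries * (interval + timeout);
}

// Values taken from the postgres healthcheck above.
const postgresCheck = { interval: 5, timeout: 3, retries: 5 };
console.log(worstCaseDetectionSeconds(postgresCheck)); // 40
```

If 40 seconds is too slow for a dependency that other services wait on, lowering retries or interval shortens the window at the cost of more frequent probing.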

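Pattern 2: Resource limits and OOM protection

Problem: an unrestricted container competes for host resources, and one runaway container can take down the whole machine.

Solution: use deploy.resources to set limits and reservations for CPU and memory.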

services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  worker:
    image: myworker:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          memory: 512M

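limits vs reservations:

limits: hard cap; exceeding the memory limit gets the container OOM-killed, and exceeding the CPU limit gets it throttled
reservations: soft guarantee; the scheduler tries to honor it but does not enforce it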
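
Limits only protect the host if they add up to less than the memory it actually has. The following sketch totals the memory limits of a set of services so the sum can be compared against host capacity. The helper names and the services array are illustrative assumptions; the parser handles only the common k/m/g suffixes and treats them as binary units.

```javascript
// Multipliers for the suffixes handled here (binary units).
const UNIT = { '': 1, k: 1024, m: 1024 ** 2, g: 1024 ** 3 };

// Convert a Compose-style memory string such as "512M" or "1G" to bytes.
function parseMemory(value) {
  const match = /^(\d+(?:\.\d+)?)([kmg]?)b?$/i.exec(String(value).trim());
  if (!match) throw new Error(`unrecognized memory value: ${value}`);
  return Math.round(Number(match[1]) * UNIT[match[2].toLowerCase()]);
}

// Sum the limits, counting each replica separately (default: 1 replica).
function totalLimitBytes(services) {
  return services.reduce(
    (sum, service) => sum + parseMemory(service.memory) * (service.replicas ?? 1),
    0,
  );
}

// Limits from the app and worker services above.
const services = [
  { name: 'app', memory: '512M' },
  { name: 'worker', memory: '1G' },
];
console.log(totalLimitBytes(services) / 1024 ** 2); // 1536 (MiB)
```

Leaving headroom for the Docker daemon and the operating system is a sensible default, since the kernel OOM killer acts on the host as a whole.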

OOM protection strategy:

services:
  critical-service:
    image: critical-app:latest
    deploy:
      resources:
        limits:
          memory: 1G
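    # Lower the OOM score so the kernel is less likely to pick this
    # container first when the host itself runs out of memory
    oom_score_adj: -500

At the host level, you can tune vm.overcommit_memory and adjust the OOM score of individual processes:

# Check whether a container was OOM-killed
docker inspect --format='{{.State.OOMKilled}}' <container_id>

# Exempt a critical host process from the OOM killer (requires root)
echo -1000 > /proc/<pid>/oom_score_adj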

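Pattern 3: Structured logging and log rotation

Problem: the default json-file logging driver does not rotate, so logs eventually fill the disk.

Solution: configure a logging driver with a rotation policy. The local driver is the recommended choice for production.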

services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
        compress: "true"

  nginx:
    image: nginx:1.27-alpine
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "10"
        tag: "nginx/{{.Name}}"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro

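Logging driver comparison:

Driver	Use case	Pros	Cons
local	Recommended default for production	Rotates automatically, compressed storage	Local storage only
json-file	When JSON-formatted log files are needed	Docker's default driver, supported natively	Rotation must be configured by hand
syslog	Centralized log collection	Can ship to a remote host	More complex to configure
fluentd	Integration with an EFK stack	Flexible log routing	Requires a separate Fluentd deployment

Application-level structured logging: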

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Set the time zone
ENV TZ=Asia/Shanghai

# Emit logs as JSON
ENV LOG_FORMAT=json

CMD ["node", "server.js"]

// Structured logging in the application
const logger = {
  info: (msg, meta = {}) => {
    console.log(JSON.stringify({ level: 'info', msg, ts: new Date().toISOString(), ...meta }));
  },
  error: (msg, meta = {}) => {
    console.error(JSON.stringify({ level: 'error', msg, ts: new Date().toISOString(), ...meta }));
  }
};

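Pattern 4: Secrets management

Problem: sensitive values are written in plain text in the compose file or in environment variables.

Solution: Docker Secrets plus environment variables with the _FILE suffix.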

secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

  app:
    image: myapp:latest
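    environment:
      # Illustrative variable names: the app assembles its connection string
      # at runtime from the mounted secret. Interpolating DB_PASSWORD into
      # DATABASE_URL would expose the password via docker inspect.
      DATABASE_HOST: postgres
      DATABASE_PASSWORD_FILE: /run/secrets/db_password
      API_KEY_FILE: /run/secrets/api_key
      JWT_SECRET_FILE: /run/secrets/jwt_secret
    secrets:
      - db_password
      - api_key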
      - jwt_secret
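
The _FILE convention only works if the application reads the file that the variable points to. Images such as postgres do this themselves; for a custom app like myapp it has to be written. A minimal sketch follows, assuming the app is a Node.js service; readSecret is a hypothetical helper, not part of any library.

```javascript
const fs = require('node:fs');

// Resolve a setting that may arrive as NAME_FILE (path to a mounted secret)
// or as plain NAME. The _FILE variant wins when both are present.
function readSecret(name, env = process.env) {
  const filePath = env[`${name}_FILE`];
  if (filePath) {
    // Secret files often end with a newline; strip trailing whitespace.
    return fs.readFileSync(filePath, 'utf8').trimEnd();
  }
  if (env[name] !== undefined) {
    return env[name];
  }
  throw new Error(`missing secret: set ${name}_FILE or ${name}`);
}
```

With the compose file above, readSecret('API_KEY') would return the contents of /run/secrets/api_key, and the same call keeps working in development where a plain API_KEY variable may be set instead.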

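Managing the secret files:

# Create the secrets directory
mkdir -p secrets
chmod 700 secrets

# Write the secret files (printf avoids the trailing newline that echo adds)
printf '%s' "my-super-secret-password-2026" > secrets/db_password.txt
printf '%s' "ak-live-xxxx-yyyy-zzzz" > secrets/api_key.txt
printf '%s' "jwt-hs256-secret-key-here" > secrets/jwt_secret.txt

# Restrict permissions: readable and writable by the owner only
# (a container running as a different user cannot read a bind-mounted 600 file)
chmod 600 secrets/*.txt

.gitignore must include:

secrets/
*.secret
.env.prod
.env.staging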

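Docker Secrets vs .env:

Feature	Docker Secrets	.env file
Encrypted at rest	Yes (in Swarm mode only)	No
File permissions	Restricted (mounted under /run/secrets/)	Depends on filesystem permissions
Audit trail	Yes	No
Cross-node sync	Automatic in Swarm	Manual distribution
Typical use	Swarm or single-host production	Development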

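Pattern 5: Zero-downtime rolling updates

Problem: docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update.

Solution: start the new container next to the old one with docker compose up --scale and --no-recreate, gate the switch on health checks, and keep a reverse proxy in front.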

services:
  app:
    image: myapp:${APP_VERSION:-latest}
    deploy:
      replicas: 2
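      # update_config and rollback_config take effect under Swarm
      # (docker stack deploy); docker compose up does not act on them
      update_config: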
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 15s
    labels:
      - "com.toolsku.app=true"
    ports:
      - "3000-3001:3000"

  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
    volumes:
      # The file holds only upstream/server blocks, so load it from conf.d
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      app:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 3s
      retries: 3

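Nginx reverse proxy configuration (the upstream is resolved through Compose DNS):

upstream app_backend {
    server app:3000;
    # Nginx resolves "app" when it loads the config; with several replicas it
    # balances across the addresses returned then, so reload after scaling
}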

server {
    listen 80;
    server_name app.example.com;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Retry the next upstream on errors, timeouts, and 502/503 responses
        proxy_next_upstream error timeout http_502 http_503;
    }

    location /health {
        access_log off;
        return 200 'ok';
        add_header Content-Type text/plain;
    }
}

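Zero-downtime update script:

#!/bin/bash
set -euo pipefail

# The compose file picks the image tag from APP_VERSION.
# Assumes exactly one app container is running before the update.
export APP_VERSION="v2.0.0"
echo "🚀 Starting zero-downtime update to myapp:${APP_VERSION}"

# 1. Pull the new image
docker compose pull app

# 2. Record the running container, then start a new one next to it
#    (--no-recreate leaves the old container untouched)
OLD_ID=$(docker compose ps -q app | head -n 1)
docker compose up -d --no-deps --no-recreate --scale app=2 app
NEW_ID=$(docker compose ps -q app | grep -v "$OLD_ID" | head -n 1)

# 3. Wait until Docker reports the new container as healthy
echo "⏳ Waiting for the new container to become healthy..."
for i in $(seq 1 30); do
  STATUS=$(docker inspect --format '{{.State.Health.Status}}' "$NEW_ID")
  if [ "$STATUS" = "healthy" ]; then
    echo "✅ New container is healthy"
    break
  fi
  if [ "$i" -eq 30 ]; then
    echo "❌ Health check failed, rolling back..."
    docker stop "$NEW_ID" && docker rm "$NEW_ID"
    exit 1
  fi
  sleep 2
done

# 4. Remove the old container and have nginx re-resolve the upstream
docker stop "$OLD_ID" && docker rm "$OLD_ID"
docker compose exec -T nginx nginx -s reload

echo "🎉 Update completed successfully"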

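Pattern 6: Prometheus + Grafana monitoring stack

Problem: running production without monitoring is flying blind; when something breaks, you find out from user reports.

Solution: deploy a full Prometheus + Grafana monitoring stack.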

services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
    deploy:
      resources:
        limits:
          memory: 512M
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
    secrets:
      - grafana_password
    volumes:
      - grafana_data:/var/lib/grafana
    deploy:
      resources:
        limits:
          memory: 256M
    depends_on:
      - prometheus
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring

secrets:
  grafana_password:
    file: ./secrets/grafana_password.txt

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

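Prometheus configuration (the app target resolves only if the app service is also attached to the monitoring network):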

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: /metrics

Alert rules:

# monitoring/alert_rules.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 400MB on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
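
The app scrape job and the HighMemoryUsage rule assume the application serves process_resident_memory_bytes at /metrics. In practice a Prometheus client library usually provides this; the sketch below only illustrates the text exposition format such an endpoint returns. renderMetrics and processMetrics are hypothetical helper names.

```javascript
// Render gauges in the Prometheus text exposition format:
// a HELP line, a TYPE line, then "name value" for each metric.
function renderMetrics(metrics) {
  return metrics
    .map(({ name, help, value }) =>
      `# HELP ${name} ${help}\n# TYPE ${name} gauge\n${name} ${value}\n`)
    .join('');
}

// Body for GET /metrics, exposing the resident set size of this process.
function processMetrics() {
  return renderMetrics([
    {
      name: 'process_resident_memory_bytes',
      help: 'Resident memory size in bytes.',
      value: process.memoryUsage().rss,
    },
  ]);
}

console.log(processMetrics());
```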

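Pattern 7: Multi-environment configuration (dev/staging/prod)

Problem: development, staging, and production settings are tangled together, and changing one environment risks breaking another.

Solution: use docker-compose.override.yml plus a multi-file override strategy.

Directory layout: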

project/
├── docker-compose.yml              # Base configuration
├── docker-compose.override.yml     # Development overrides (loaded automatically)
├── docker-compose.staging.yml      # Staging overrides
├── docker-compose.prod.yml         # Production overrides
├── .env                            # Default environment variables
├── .env.staging                    # Staging environment variables
├── .env.prod                       # Production environment variables
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── alertmanager.yml
└── secrets/
    ├── db_password.txt
    ├── api_key.txt
    └── grafana_password.txt

Base configuration, docker-compose.yml:

services:
  app:
    image: myapp:${APP_VERSION:-latest}
    environment:
      NODE_ENV: ${NODE_ENV:-development}
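      # DB_PASSWORD comes from the env file and is visible via docker inspect;
      # Pattern 4 describes the file-based alternative
      DATABASE_URL: postgresql://appuser:${DB_PASSWORD}@postgres:5432/appdb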
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network

secrets:
  db_password:
    file: ./secrets/db_password.txt

volumes:
  postgres_data:

networks:
  app-network:
    driver: bridge

Development overrides, docker-compose.override.yml (loaded automatically):

services:
  app:
    build: .
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
      - "9229:9229"
    environment:
      NODE_ENV: development
      LOG_LEVEL: debug
    deploy:
      resources:
        limits:
          memory: 512M
    command: node --inspect=0.0.0.0:9229 server.js

  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    networks:
      - app-network

Production overrides, docker-compose.prod.yml:

services:
  app:
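    # Fail fast if APP_VERSION is not set
    image: myapp:${APP_VERSION:?APP_VERSION must be set}
    # No host port here: two replicas cannot both bind 3000, and nginx fronts the app
    expose:
      - "3000"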
    environment:
      NODE_ENV: production
      LOG_LEVEL: info
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s

  postgres:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          memory: 1G
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"

  redis:
    deploy:
      resources:
        limits:
          memory: 512M
    command: redis-server --appendonly yes --maxmemory 400mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "3"

  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.prod.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"

volumes:
  redis_data:

Startup commands:

# Development (the override file is loaded automatically)
docker compose up -d

# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml --env-file .env.staging up -d

# Production
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
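
When several -f files are passed, later files are layered on top of earlier ones. The sketch below is a deliberately simplified model of that layering, written to build intuition rather than to mirror Compose exactly: mappings merge key by key, lists are appended, and scalars are replaced. Real Compose has additional rules (command, for instance, is replaced rather than appended), so docker compose -f ... config remains the way to see the actual merged result. mergeConfig is a hypothetical name.

```javascript
// Simplified model of layering an override file on top of a base file.
function mergeConfig(base, override) {
  if (Array.isArray(base) && Array.isArray(override)) {
    return [...base, ...override]; // lists: override entries are appended
  }
  if (isPlainObject(base) && isPlainObject(override)) {
    const merged = { ...base };
    for (const [key, value] of Object.entries(override)) {
      merged[key] = key in base ? mergeConfig(base[key], value) : value;
    }
    return merged; // mappings: merged recursively
  }
  return override; // scalars (and mismatched types): override wins
}

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

const baseApp = {
  image: 'myapp:latest',
  environment: { NODE_ENV: 'development' },
  networks: ['app-network'],
};
const prodApp = {
  image: 'myapp:2.1.0',
  environment: { NODE_ENV: 'production', LOG_LEVEL: 'info' },
};
console.log(mergeConfig(baseApp, prodApp));
```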

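5 common pitfalls

Pitfall 1: depends_on does not mean the service is ready

❌ Wrong:

services:
  app:
    depends_on:
      - postgres
    # The postgres container has started, but the database may not have finished initializing

✅ Right: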

services:
  app:
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s

Pitfall 2: No resource limits

❌ Wrong:

services:
  app:
    image: myapp:latest
    # No resource limits at all: one memory leak can consume the entire host

✅ Right:

services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 128M

Pitfall 3: No log rotation

❌ Wrong:

services:
  app:
    image: myapp:latest
    # Default json-file driver: logs grow without bound

✅ Right:

services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"

Pitfall 4: Secrets stored in plain text

❌ Wrong:

services:
  postgres:
    environment:
      POSTGRES_PASSWORD: "my-secret-password-123"
      # The password sits in the compose file in plain text; once committed to Git, it is compromised

✅ Right:

services:
  postgres:
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt

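Pitfall 5: Using the latest tag

❌ Wrong:

services:
  app:
    image: myapp:latest
    # Each pull may return a different image, so deployments are not reproducible

✅ Right:

services:
  app:
    image: myapp:2.1.0
    # Or control the version through a variable:
    # image: myapp:${APP_VERSION:-2.1.0}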
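
A check like this can be automated, for example in CI. The sketch below flags image references that are not pinned; it assumes references are already fully resolved (no ${...} placeholders left), and isPinned is a hypothetical helper.

```javascript
// Return true when an image reference carries a digest or an explicit tag
// other than "latest". A reference without a tag defaults to latest.
function isPinned(imageRef) {
  if (imageRef.includes('@')) return true; // digest, e.g. name@sha256:...
  // Look only after the last "/", so a registry port (host:5000/name)
  // is not mistaken for a tag.
  const lastSegment = imageRef.slice(imageRef.lastIndexOf('/') + 1);
  const colon = lastSegment.indexOf(':');
  if (colon === -1) return false;
  return lastSegment.slice(colon + 1) !== 'latest';
}

for (const ref of ['myapp:latest', 'myapp:2.1.0', 'postgres:16-alpine']) {
  console.log(ref, isPinned(ref));
}
```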

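Troubleshooting quick reference

Error message	Cause	Fix
`OOMKilled`	Container exceeded its memory limit	Raise the `memory` limit or reduce the application's memory use
`Connection refused`	Dependency not ready	Add `healthcheck` + `depends_on.condition`
`no space left on device`	Logs or images filled the disk	Configure `logging` rotation + `docker system prune`
`restarting` loop	Application crashes on startup	Inspect the logs with `docker logs <id>` and check the configuration
`permission denied`	File or directory permission problem	Check the `user` directive and volume permissions
`port is already allocated`	Port conflict	Change the port mapping or stop the process holding the port
`unhealthy` status	Health check failing	Verify that the `healthcheck` command is correct
`secret not found`	Secret file missing	Make sure the matching file exists under `secrets/`
`Cannot connect to the Docker daemon`	Docker is not running	`systemctl start docker`
`image pulling failed`	Image pull failed	Check the network, registry credentials, and image name spelling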

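Advanced optimizations

Speeding up development with Docker Compose Watch

Compose Watch syncs file changes into the running container automatically, so synced paths need no image rebuild: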

services:
  app:
    build: .
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
        - action: rebuild
          path: ./package.json
        - action: sync+restart
          path: ./config
          target: /app/config

# Start watch mode
docker compose watch

Network isolation and security

services:
  app:
    networks:
      - frontend
      - backend

  postgres:
    networks:
      - backend
    # postgres is not on the frontend network, so it cannot be reached from outside

  nginx:
    networks:
      - frontend
    ports:
      - "80:80"

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
    # internal: true cuts this network off from external traffic in both directions

Image optimization and multi-stage builds

# ---- Build stage ----
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# ---- Runtime stage ----
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup

COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER appuser
EXPOSE 3000

# node:20-alpine ships without curl, so probe with BusyBox wget
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget -qO /dev/null http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]

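Orchestration tool comparison

Feature	Docker Compose	Kubernetes	Nomad	Docker Swarm
Complexity	Low	High	Medium	Low
Single-host deployment	✅ Excellent	❌ Too heavy	⚠️ Workable	✅ Excellent
Multi-host orchestration	❌ Not supported	✅ Core capability	✅ Core capability	⚠️ Basic
Autoscaling	❌	✅ HPA/VPA	✅	⚠️ Manual
Rolling updates	⚠️ Needs scripting	✅ Native	✅ Native	✅ Native
Service discovery	⚠️ DNS	✅ CoreDNS	✅ Consul	✅ DNS
Storage orchestration	❌	✅ CSI	✅ CSI	⚠️ Basic
Learning curve	Low	High	Medium	Low
Suitable scale	1-10 services	100+ services	50+ services	10-50 services
Production readiness	✅ Single host	✅ Large scale	✅ Medium to large scale	⚠️ Community not very active

Recommendation: use Docker Compose for single-host or small-scale production, Kubernetes for medium to large scale, and Nomad if you are invested in the HashiCorp ecosystem. Docker Swarm has been gradually sidelined and is not recommended for new projects.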

Summary

Deploying Docker Compose to production takes more than docker compose up -d. Health checks make sure services are actually ready, resource limits prevent OOM cascades, log rotation keeps the disk from filling up, Secrets protect sensitive values, zero-downtime updates keep the service available 24/7, a monitoring stack ends the blind flying, and multi-environment configuration gives dev, staging, and prod each their own settings. With these 7 strategies in place, Docker Compose is fully capable of handling small to medium production deployments.