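Docker Compose in Production: 7 Key Strategies, from Health Checks to Zero-Downtime Updates
"It works on my machine": the production container graveyard
Every developer has said it: "It works on my machine." The real nightmare starts once the containers reach production:
- A container is silently OOM-killed, and the only trace in the logs is a single line: Out of memory
- The database has not finished starting, yet the application container is already flooding the logs with Connection refused
- A container dies at 3 a.m. with no restart policy, and the service stays down until morning
- Log files fill the disk, and docker logs prints tens of gigabytes of unstructured text
- Production configuration sits in plain text in docker-compose.yml, database password included
If you are still running production with a bare docker compose up -d, this article is for you.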
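Core concepts at a glance
| Concept | Purpose | Key production setting |
|---|---|---|
| Health Check | Detects whether a container is actually usable | healthcheck + depends_on.condition |
| Resource Limits | Caps CPU and memory so one service cannot starve the others | deploy.resources.limits |
| Restart Policy | Restarts a container automatically after an abnormal exit | deploy.restart_policy (or the service-level restart attribute) |
| Secrets | Keeps sensitive values out of the compose file and environment variables (encrypted at rest only in Swarm mode) | secrets + files under /run/secrets/ |
| Logging Driver | Structured logs plus log rotation | logging.driver + logging.options |
| Profiles | Starts services selectively per environment | profiles |
| Watch | Syncs file changes into the container automatically | develop.watch (Compose Watch) |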
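5 production challenges
Challenge 1: Uncontrolled startup order
The database is still initializing when the application container tries to connect, so startup fails. The short form of depends_on only guarantees start order, not readiness.
Challenge 2: Unbounded resource growth
A container without resource limits is like a car without brakes. A single container with a memory leak can consume all of the host's memory and take every other service down with it.
Challenge 3: The log black hole
The default json-file logging driver does not rotate logs unless max-size is configured. After three months of running, /var/lib/docker can fill the disk and bring down every service on the host.
Challenge 4: Leaked secrets
Database passwords in plain text under environment, .env files committed to Git, API keys hard-coded into images: each of these is a production incident waiting to happen.
Challenge 5: Updates mean downtime
When a service's image or configuration changes, docker compose up -d stops the old container before starting its replacement, so the service is unavailable during the update. For a 24/7 service that is not acceptable.
7 practical patterns
Pattern 1: Health checks and dependency ordering
Problem: the short form of depends_on only controls start order and does not guarantee that the service is actually ready.
Solution: use healthcheck together with depends_on.condition: service_healthy.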
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    volumes:
      - postgres_data:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
  app:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    ports:
      - "3000:3000"
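Key parameters:
- interval: time between checks; 5-10 seconds is a reasonable production value
- timeout: time limit for a single check; 3-5 seconds is typical
- retries: number of consecutive failures before the container is marked unhealthy
- start_period: grace period after the container starts, during which failures do not count toward retries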
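To get a feel for how these four values interact, here is a back-of-the-envelope sketch in JavaScript. It assumes every probe runs for its full timeout and that the next probe starts one interval later. This is a simplification for estimating, not Docker's exact scheduling, and the function name is made up for this example.

```javascript
// Rough upper bound, in seconds, on how long a container that never becomes
// healthy takes to be marked unhealthy. Assumption: each probe uses its full
// timeout and the next probe starts one interval later.
function worstCaseUnhealthySeconds({ interval, timeout, retries, startPeriod = 0 }) {
  return startPeriod + retries * (interval + timeout);
}

// Values from the postgres service above.
const postgresCheck = { interval: 5, timeout: 3, retries: 5, startPeriod: 10 };
console.log(worstCaseUnhealthySeconds(postgresCheck)); // 50
```

With the postgres settings, a dependent service waiting on service_healthy could wait roughly 50 seconds before Compose gives up, which is worth keeping in mind when tuning retries.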
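Pattern 2: Resource limits and OOM protection
Problem: an unconstrained container competes for host resources, and one runaway container can take down the whole machine.
Solution: use deploy.resources to set CPU and memory limits and reservations.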
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
  worker:
    image: myworker:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          memory: 512M
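limits vs reservations:
- limits: hard ceiling; a container that exceeds its memory limit is OOM-killed, and one that exceeds its CPU limit is throttled
- reservations: soft guarantee; the scheduler tries to honor it but does not enforce it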
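An application can also check its own memory limit and back off before the kernel steps in. The sketch below assumes a cgroup v2 host, where the limit is exposed in /sys/fs/cgroup/memory.max. The helper names are illustrative, and on cgroup v1 hosts or outside a container the read simply returns null.

```javascript
const fs = require('node:fs');

// Parse the contents of cgroup v2's memory.max file.
// Returns the limit in bytes, or null when the file says "max" (no limit).
function parseMemoryMax(text) {
  const value = text.trim();
  if (value === 'max' || value === '') return null;
  const bytes = Number(value);
  return Number.isFinite(bytes) ? bytes : null;
}

// Read the limit that applies to this container, or null if it cannot be
// read (for example on cgroup v1 hosts or outside a container).
function readMemoryLimit(path = '/sys/fs/cgroup/memory.max') {
  try {
    return parseMemoryMax(fs.readFileSync(path, 'utf8'));
  } catch {
    return null;
  }
}

// Fraction of the limit currently used by this process (RSS), or null when
// no limit is known.
function memoryPressure(limitBytes, rssBytes = process.memoryUsage().rss) {
  return limitBytes ? rssBytes / limitBytes : null;
}
```

A service could log a warning when memoryPressure(readMemoryLimit()) passes a threshold such as 0.8, which gives an earlier signal than an OOMKilled container.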
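OOM protection:
services:
  critical-service:
    image: critical-app:latest
    # Make the kernel OOM killer less likely to pick this container
    # (range -1000 to 1000; lower means less likely to be killed)
    oom_score_adj: -500
    deploy:
      resources:
        limits:
          memory: 1G
At the host level, you can also tune vm.overcommit_memory and adjust the OOM score of individual processes:
# Check whether a container was OOM-killed
docker inspect --format='{{.State.OOMKilled}}' <container_id>
# Exempt a critical host process from the OOM killer (run as root)
echo -1000 > /proc/<pid>/oom_score_adj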
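Pattern 3: Structured logging and log rotation
Problem: the default json-file logging driver does not rotate unless configured, so logs can fill the disk.
Solution: configure a logging driver together with a rotation policy; the local driver is a good default for production.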
services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        # The local driver accepts max-size, max-file and compress;
        # the tag option belongs to drivers such as json-file
        max-size: "10m"
        max-file: "5"
  nginx:
    image: nginx:1.27-alpine
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "10"
        tag: "nginx/{{.Name}}"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
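Logging driver comparison:
| Driver | Best for | Pros | Cons |
|---|---|---|---|
| local | Recommended default for production | Rotates automatically, compressed storage | Local storage only |
| json-file | When you need the logs as JSON files on disk | Supported natively by Docker | Rotation must be configured explicitly |
| syslog | Centralized log collection | Can ship logs to a remote host | More complex to configure |
| fluentd | Integration with the EFK stack | Flexible log routing | Requires a separate Fluentd deployment |
Structured logging in the application: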
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
ENV TZ=Asia/Taipei
ENV LOG_FORMAT=json
CMD ["node", "server.js"]
// Minimal JSON logger: writes one JSON object per line to stdout/stderr
const logger = {
  info: (msg, meta = {}) => {
    console.log(JSON.stringify({ level: 'info', msg, ts: new Date().toISOString(), ...meta }));
  },
  error: (msg, meta = {}) => {
    console.error(JSON.stringify({ level: 'error', msg, ts: new Date().toISOString(), ...meta }));
  }
};
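Pattern 4: Secrets management
Problem: sensitive values are written in plain text in the compose file or in environment variables.
Solution: Docker secrets plus environment variables with the _FILE suffix.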
secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
  app:
    image: myapp:latest
    environment:
      # Caution: ${DB_PASSWORD} is interpolated from the shell or .env and ends up
      # in the container environment in plain text; where the app allows it,
      # read the password from a mounted secret file instead
      DATABASE_URL: postgresql://appuser:${DB_PASSWORD}@postgres:5432/appdb
      API_KEY_FILE: /run/secrets/api_key
      JWT_SECRET_FILE: /run/secrets/jwt_secret
    secrets:
      - api_key
      - jwt_secret
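Managing the secret files:
# Create the secrets directory
mkdir -p secrets
chmod 700 secrets
# Write the secret files (printf avoids the trailing newline that echo adds)
printf '%s' "my-super-secret-password-2026" > secrets/db_password.txt
printf '%s' "ak-live-xxxx-yyyy-zzzz" > secrets/api_key.txt
printf '%s' "jwt-hs256-secret-key-here" > secrets/jwt_secret.txt
# Restrict permissions: readable and writable by the owner only
chmod 600 secrets/*.txt
.gitignore must include:
secrets/
*.secret
.env.production
.env.prod
.env.staging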
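Docker secrets vs .env:
| Feature | Docker secrets | .env file |
|---|---|---|
| Encrypted at rest | Yes (Swarm mode only) | No |
| File permissions | Restricted (mounted under /run/secrets/) | Depend on filesystem permissions |
| Audit trail | Limited (no built-in audit log for file-based secrets) | None |
| Cross-node sync | Automatic in Swarm | Manual distribution |
| Best for | Swarm and single-host production | Development |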
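On the application side, the _FILE convention needs a few lines of glue. The helper below is a minimal sketch: readSecret is a name chosen for this example, and the fallback to a plain environment variable is an assumption meant for local development.

```javascript
const fs = require('node:fs');

// Resolve a secret by name: prefer the file named by <NAME>_FILE (the
// convention used in the compose file above), then fall back to <NAME>.
// Throws if neither is set, so a missing secret fails fast at startup.
function readSecret(name, env = process.env) {
  const filePath = env[`${name}_FILE`];
  if (filePath) {
    // Trim the trailing newline that editors and echo usually add.
    return fs.readFileSync(filePath, 'utf8').trim();
  }
  if (env[name] !== undefined) {
    return env[name];
  }
  throw new Error(`Secret ${name} is not configured`);
}
```

With API_KEY_FILE set to /run/secrets/api_key, calling readSecret('API_KEY') returns the file's contents, so the key never has to appear in the environment.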
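Pattern 5: Zero-downtime rolling updates
Problem: docker compose up -d stops the old container before starting the new one, so the service is unavailable during the update.
Solution: temporarily scale the service up with docker compose up --scale so that old and new containers overlap, gate the cutover on health checks, and put a reverse proxy in front.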
services:
  app:
    image: myapp:${APP_VERSION:-latest}
    deploy:
      replicas: 2
      # update_config and rollback_config are applied by Swarm (docker stack deploy);
      # plain docker compose up does not perform rolling updates from them
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 15s
    labels:
      - "com.toolsku.app=true"
    ports:
      - "3000-3001:3000"
  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 3s
      retries: 3
Nginx reverse proxy configuration (nginx resolves the app hostname through Docker's DNS when it starts or reloads, so reload nginx after scaling to pick up new replicas):
# Mounted as /etc/nginx/nginx.conf, so the events and http blocks are required
events {}
http {
    upstream app_backend {
        server app:3000;
    }
    server {
        listen 80;
        server_name app.example.com;
        location / {
            proxy_pass http://app_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_next_upstream error timeout http_502 http_503;
        }
        location /health {
            access_log off;
            default_type text/plain;
            return 200 'ok';
        }
    }
}
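Zero-downtime update script:
#!/bin/bash
set -euo pipefail
# The compose file reads the image tag from APP_VERSION
export APP_VERSION="v2.0.0"
echo "🚀 Starting zero-downtime update to myapp:${APP_VERSION}"
# 1. Pull the new image
docker compose pull app
# 2. Remember the running container, then start a second one from the new image.
#    --no-recreate leaves the old container running. Assumes a single app
#    replica is running before the update.
OLD_ID=$(docker compose ps -q app | head -n 1)
docker compose up -d --no-deps --no-recreate --scale app=2 app
NEW_ID=$(docker compose ps -q app | grep -vx "$OLD_ID" | head -n 1)
# 3. Wait until Docker reports the new container as healthy
echo "⏳ Waiting for the new container to become healthy..."
for i in $(seq 1 30); do
  STATUS=$(docker inspect --format '{{.State.Health.Status}}' "$NEW_ID")
  if [ "$STATUS" = "healthy" ]; then
    echo "✅ New container is healthy"
    break
  fi
  if [ "$i" -eq 30 ]; then
    echo "❌ Health check failed, rolling back..."
    docker rm -f "$NEW_ID"
    exit 1
  fi
  sleep 2
done
# 4. Let nginx re-resolve the app hostname so it sees the new container
docker compose exec -T nginx nginx -s reload
# 5. Stop and remove the old container, leaving one replica on the new image
docker stop "$OLD_ID" && docker rm "$OLD_ID"
docker compose exec -T nginx nginx -s reload
echo "🎉 Update completed successfully"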
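The update only stays invisible to users if the application cooperates: /health should start failing as soon as shutdown begins, and in-flight requests should finish before the process exits. Below is a minimal Node sketch of that behavior. The function names and the response bodies are assumptions made for this example, not part of any framework.

```javascript
const http = require('node:http');

// Shared readiness flag: flipped to false when shutdown begins so that the
// health check starts failing before the process exits.
const state = { ready: true };

// Pure helper: map readiness to an HTTP status and JSON body.
function healthResponse(ready) {
  return ready
    ? { status: 200, body: JSON.stringify({ status: 'ok' }) }
    : { status: 503, body: JSON.stringify({ status: 'shutting_down' }) };
}

// Build (but do not start) the HTTP server.
function createServer() {
  return http.createServer((req, res) => {
    if (req.url === '/health') {
      const { status, body } = healthResponse(state.ready);
      res.writeHead(status, { 'Content-Type': 'application/json' });
      res.end(body);
      return;
    }
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('hello\n');
  });
}

// On SIGTERM (the signal Docker sends when it stops a container), stop
// accepting new connections and exit once in-flight requests have finished.
function installGracefulShutdown(server) {
  process.on('SIGTERM', () => {
    state.ready = false;
    server.close(() => process.exit(0));
  });
}
```

To run it, call createServer(), pass the result to installGracefulShutdown, and then listen on port 3000, the port the compose health check probes.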
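Pattern 6: Prometheus + Grafana monitoring stack
Problem: running production without monitoring is flying blind; you only learn about problems when users report them.
Solution: deploy a complete Prometheus + Grafana monitoring stack.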
services:
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # rule_files in prometheus.yml points at this file, so it must be mounted too
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
    deploy:
      resources:
        limits:
          memory: 512M
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      # Grafana reads file-based values from variables ending in __FILE (double underscore)
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_password
    secrets:
      - grafana_password
    volumes:
      - grafana_data:/var/lib/grafana
    deploy:
      resources:
        limits:
          memory: 256M
    depends_on:
      - prometheus
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    networks:
      - monitoring
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring
secrets:
  grafana_password:
    file: ./secrets/grafana_password.txt
volumes:
  prometheus_data:
  grafana_data:
networks:
  monitoring:
    driver: bridge
Prometheus configuration:
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # The app service must share a network with Prometheus for this target to resolve
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: /metrics
Alert rules:
# monitoring/alert_rules.yml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.instance }} is down"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 400MB on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
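The app scrape job and the HighMemoryUsage rule both assume the application serves process_resident_memory_bytes at /metrics. In practice a client library would normally generate this output. The hand-rolled sketch below only illustrates what the Prometheus text exposition format looks like, and its function names are made up for this example.

```javascript
// Render a single gauge in the Prometheus text exposition format.
function renderGauge(name, help, value) {
  return `# HELP ${name} ${help}\n# TYPE ${name} gauge\n${name} ${value}\n`;
}

// Body for GET /metrics. rssBytes defaults to this process's resident memory.
function renderMetrics(rssBytes = process.memoryUsage().rss) {
  return renderGauge(
    'process_resident_memory_bytes',
    'Resident memory size in bytes.',
    rssBytes
  );
}
```

An HTTP handler for /metrics would send renderMetrics() with the content type text/plain; version=0.0.4.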
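Pattern 7: Multi-environment configuration (dev/staging/prod)
Problem: development, staging, and production settings are mixed together, so changing one environment risks breaking another.
Solution: use docker-compose.override.yml together with a multi-file override strategy.
Directory layout:
project/
├── docker-compose.yml           # Base configuration
├── docker-compose.override.yml  # Development overrides (loaded automatically)
├── docker-compose.staging.yml   # Staging overrides
├── docker-compose.prod.yml      # Production overrides
├── .env                         # Default environment variables
├── .env.staging                 # Staging environment variables
├── .env.prod                    # Production environment variables
├── monitoring/
│   ├── prometheus.yml
│   ├── alert_rules.yml
│   └── alertmanager.yml
└── secrets/
    ├── db_password.txt
    ├── api_key.txt
    └── grafana_password.txt
Base configuration, docker-compose.yml: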
services:
  app:
    image: myapp:${APP_VERSION:-latest}
    environment:
      NODE_ENV: ${NODE_ENV:-development}
      DATABASE_URL: postgresql://appuser:${DB_PASSWORD}@postgres:5432/appdb
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - app-network
secrets:
  db_password:
    file: ./secrets/db_password.txt
volumes:
  postgres_data:
networks:
  app-network:
    driver: bridge
Development overrides, docker-compose.override.yml (loaded automatically):
services:
  app:
    build: .
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
      - "9229:9229"
    environment:
      NODE_ENV: development
      LOG_LEVEL: debug
    deploy:
      resources:
        limits:
          memory: 512M
    command: node --inspect=0.0.0.0:9229 server.js
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080"
    networks:
      - app-network
Production overrides, docker-compose.prod.yml:
services:
  app:
    image: myapp:${APP_VERSION}
    # No host port mapping here: two replicas cannot both bind host port 3000,
    # and nginx reaches the app over the internal network
    expose:
      - "3000"
    environment:
      NODE_ENV: production
      LOG_LEVEL: info
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s
  postgres:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          memory: 1G
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"
  redis:
    deploy:
      resources:
        limits:
          memory: 512M
    command: redis-server --appendonly yes --maxmemory 400mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "3"
  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.prod.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      app:
        condition: service_healthy
    # nginx must share a network with app to reach it by service name
    networks:
      - app-network
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: local
      options:
        max-size: "50m"
        max-file: "10"
volumes:
  redis_data:
Start commands:
# Development (the override file is loaded automatically)
docker compose up -d
# Staging
docker compose -f docker-compose.yml -f docker-compose.staging.yml --env-file .env.staging up -d
# Production
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
5 common pitfalls
Pitfall 1: depends_on does not mean the service is ready
❌ Wrong:
services:
  app:
    depends_on:
      - postgres
    # The postgres container has started, but the database may still be initializing
✅ Right:
services:
  app:
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
Pitfall 2: No resource limits
❌ Wrong:
services:
  app:
    image: myapp:latest
    # No resource limits at all: one memory leak can consume the whole host
✅ Right:
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 128M
Pitfall 3: No log rotation
❌ Wrong:
services:
  app:
    image: myapp:latest
    # Default json-file driver without max-size: logs grow without bound
✅ Right:
services:
  app:
    image: myapp:latest
    logging:
      driver: local
      options:
        max-size: "10m"
        max-file: "5"
Pitfall 4: Secrets stored in plain text
❌ Wrong:
services:
  postgres:
    environment:
      POSTGRES_PASSWORD: "my-secret-password-123"
      # The password is in the compose file in plain text; once committed to Git it is exposed
✅ Right:
services:
  postgres:
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
secrets:
  db_password:
    file: ./secrets/db_password.txt
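Pitfall 5: Using the latest tag
❌ Wrong:
services:
  app:
    image: myapp:latest
    # Each pull may fetch a different image, so deployments are not reproducible
✅ Right:
services:
  app:
    image: myapp:2.1.0
    # Or control the version with a variable (use one image key, not both):
    # image: myapp:${APP_VERSION:-2.1.0}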
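A check like this is easy to automate in CI. The sketch below is a hypothetical helper, not a Compose feature. It treats an image reference as pinned when it carries a digest or a tag other than latest, and assumes the reference has already been read from the compose file with variables resolved.

```javascript
// Return true when an image reference is pinned to a specific version:
// either a digest (name@sha256:...) or a tag other than "latest".
function isPinnedImage(image) {
  if (image.includes('@sha256:')) return true;
  const lastSlash = image.lastIndexOf('/');
  const lastColon = image.lastIndexOf(':');
  // A colon before the last slash is a registry port, not a tag.
  if (lastColon <= lastSlash) return false;
  const tag = image.slice(lastColon + 1);
  return tag !== '' && tag !== 'latest';
}

console.log(isPinnedImage('myapp:latest')); // false
console.log(isPinnedImage('myapp:2.1.0'));  // true
```

An untagged reference such as myapp also counts as unpinned, because Docker resolves it to latest.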
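Troubleshooting cheat sheet
| Error message | Cause | Fix |
|---|---|---|
| OOMKilled | Container exceeded its memory limit | Raise the memory limit or reduce the application's memory use |
| Connection refused | A dependency is not ready yet | Add healthcheck + depends_on.condition |
| no space left on device | Logs or images have filled the disk | Configure logging rotation and run docker system prune |
| restarting loop | Application crashes on startup | Inspect the logs with docker logs <id> and check the configuration |
| permission denied | File or directory permission problem | Check the user directive and volume permissions |
| port is already allocated | Port conflict | Change the port mapping or stop the process using the port |
| unhealthy status | Health check is failing | Verify that the healthcheck command is correct |
| secret not found | Secret file does not exist | Make sure the matching file exists under secrets/ |
| Cannot connect to the Docker daemon | Docker is not running | systemctl start docker |
| image pulling failed | Image pull failed | Check the network, registry credentials, and the image name spelling |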
Advanced optimization
Faster development with Docker Compose Watch
Compose Watch syncs file changes into the container automatically, without rebuilding the image for every edit:
services:
  app:
    build: .
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
        - action: rebuild
          path: ./package.json
        - action: sync+restart
          path: ./config
          target: /app/config
# Start watch mode
docker compose watch
Network isolation and security
services:
  app:
    networks:
      - frontend
      - backend
  postgres:
    networks:
      - backend
    # postgres is not on the frontend network, so it cannot be reached directly from outside
  nginx:
    networks:
      - frontend
    ports:
      - "80:80"
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
    # internal: true cuts this network off from external traffic
Image optimization and multi-stage builds
# ---- Build stage ----
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages are copied to the final image
RUN npm prune --omit=dev
# ---- Runtime stage ----
FROM node:20-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
# node:20-alpine does not ship curl; BusyBox wget is available instead
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
    CMD wget -q -O /dev/null http://127.0.0.1:3000/health || exit 1
CMD ["node", "dist/server.js"]
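Orchestration tool comparison
| Feature | Docker Compose | Kubernetes | Nomad | Docker Swarm |
|---|---|---|---|---|
| Complexity | Low | High | Medium | Low |
| Single-host deployment | ✅ Excellent | ❌ Too heavy | ⚠️ Usable | ✅ Excellent |
| Multi-host orchestration | ❌ Not supported | ✅ Core capability | ✅ Core capability | ⚠️ Basic |
| Autoscaling | ❌ | ✅ HPA/VPA | ✅ | ⚠️ Manual |
| Rolling updates | ⚠️ Needs scripting | ✅ Native | ✅ Native | ✅ Native |
| Service discovery | ⚠️ DNS | ✅ CoreDNS | ✅ Consul | ✅ DNS |
| Storage orchestration | ❌ | ✅ CSI | ✅ CSI | ⚠️ Basic |
| Learning curve | Low | High | Medium | Low |
| Typical scale | 1-10 services | 100+ services | 50+ services | 10-50 services |
| Production readiness | ✅ Single host | ✅ Large scale | ✅ Medium to large scale | ⚠️ Community is not very active |
Recommendation: use Docker Compose for single-host or small-scale production, Kubernetes for medium to large scale, and Nomad if you are already in the HashiCorp ecosystem. Docker Swarm has become increasingly marginal and is not recommended for new projects.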
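Summary
Deploying Docker Compose to production takes more than docker compose up -d. Health checks make sure services are actually ready, resource limits prevent OOM cascades, log rotation keeps the disk from filling up, secrets protect sensitive values, zero-downtime updates keep the service available around the clock, a monitoring stack ends the flying blind, and multi-environment configuration gives dev, staging, and prod each their own settings. With these 7 strategies in place, Docker Compose is fully capable of handling small to medium-scale production deployments.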