K8s Cloud Costs Spiking? FinOps in Practice for 2026: 5 Ways to Cut Kubernetes Spend by 60%
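You open the cloud bill at month end and K8s spend is 30% over budget again? CPU utilization sits at 15% while you pay for 100%? A reclaimed Spot instance took a service down? These scenarios are all too common among cloud-native teams in 2026. FinOps is not simply about spending less; it is about making every cloud dollar produce business value. This article walks through 5 hands-on strategies for cutting K8s spend by as much as 60%.
Background: the FinOps framework
FinOps (Financial Operations) is a practice framework that brings financial accountability to cloud consumption: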
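| Phase | Goal | Key actions | Owner |
|---|---|---|---|
| Inform | Cost visibility | Label governance, cost allocation, bill analysis | FinOps team |
| Optimize | Reduce waste | Right-sizing, Spot instances, reserved instances | Engineering teams |
| Operate | Continuous control | Budget alerts, autoscaling, policy enforcement | Platform team |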
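Problem analysis: where does K8s cost waste come from?
| Source of waste | Share of waste | Typical scenario |
|---|---|---|
| Over-provisioning (requests far above actual usage) | 40% | CPU request of 2 cores, actual usage of 0.3 cores |
| Idle resources (running with no traffic) | 25% | Dev environments run 24h a day with no traffic after hours |
| No Spot/Preemptible instances | 20% | Every Pod runs on On-Demand instances |
| Missing autoscaling | 10% | No HPA configured, so resources are not released off-peak |
| Storage waste | 5% | Oversized PVCs, uncleaned logs and images |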
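Strategy 1: Right-sizing resources
Step 1: Install metrics-server and kube-resource-report
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Assumes a Helm repo aliased "helm" that provides this chart; adjust to your chart source.
helm install kube-resource-report helm/kube-resource-report \
  --set prometheus.url=http://prometheus:9090
Step 2: Analyze resource utilization
# Show resource usage for all Pods, sorted by CPU
kubectl top pods -A --sort-by=cpu
# Compare requests with actual usage using kubectl-resource-capacity (--util adds live utilization)
kubectl resource-capacity --sort cpu.request --pods --util
Step 3: Automated right-sizing recommendations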
# Illustrative shape of a right-sizing recommendation for one Deployment.
# Treat the kind and fields as a sketch, not a manifest to apply as-is.
apiVersion: apps.kubecost.com/v1beta1
kind: RightSizingRecommendation
metadata:
  name: all-deployments
spec:
  targetRef:
    kind: Deployment
    name: order-service
  current:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  recommended:
    requests:
      cpu: "250m"
      memory: "512Mi"
    limits:
      cpu: "500m"
      memory: "1Gi"
  savingsPercent: 85
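The savings figure in a recommendation like this one can be sanity-checked by hand. The sketch below uses a hypothetical helper, `reduction_percent`, and assumes CPU is expressed in millicores and memory in MiB; it only measures how much the requests shrink, not the resulting bill.

```python
def reduction_percent(current: float, recommended: float) -> float:
    """Percentage by which a resource request shrinks."""
    if current <= 0:
        raise ValueError("current must be positive")
    return (current - recommended) / current * 100


# Values from the order-service example: 2 cores -> 250m, 4Gi -> 512Mi.
cpu_cut = reduction_percent(2000, 250)   # millicores
mem_cut = reduction_percent(4096, 512)   # MiB
print(f"CPU request cut: {cpu_cut:.1f}%, memory request cut: {mem_cut:.1f}%")
```

Both requests shrink by 87.5%. The bill itself only falls once the freed capacity lets nodes be removed, so the realized saving is usually somewhat lower.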
Step 4: A usage-based right-sizing script
# right_sizing.py
# Reads a point-in-time snapshot from metrics-server. A true P99 needs
# historical samples (for example from Prometheus) collected over several days.
import json
import subprocess


def get_pod_metrics(namespace: str) -> dict:
    """Return current CPU (millicores) and memory (MiB) usage per Pod."""
    result = subprocess.run(
        ['kubectl', 'get', '--raw',
         f'/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods'],
        capture_output=True, text=True, check=True,
    )
    pods = json.loads(result.stdout)
    metrics = {}
    for item in pods.get('items', []):
        pod_name = item['metadata']['name']
        total_cpu = 0
        total_mem = 0
        for c in item.get('containers', []):
            usage = c.get('usage', {})
            total_cpu += parse_cpu(usage.get('cpu', '0m'))
            total_mem += parse_memory(usage.get('memory', '0Ki'))
        metrics[pod_name] = {'cpu_millicores': total_cpu, 'memory_mib': total_mem}
    return metrics


def parse_cpu(s: str) -> int:
    """Convert a Kubernetes CPU quantity to millicores."""
    # metrics-server usually reports nanocores ("n"); also handle "u" and "m".
    if s.endswith('n'):
        return int(s[:-1]) // 1_000_000
    if s.endswith('u'):
        return int(s[:-1]) // 1_000
    if s.endswith('m'):
        return int(s[:-1])
    return int(float(s) * 1000)


def parse_memory(s: str) -> int:
    """Convert a Kubernetes memory quantity to MiB."""
    if s.endswith('Ki'):
        return int(s[:-2]) // 1024
    if s.endswith('Mi'):
        return int(s[:-2])
    if s.endswith('Gi'):
        return int(s[:-2]) * 1024
    return int(s) // (1024 * 1024)  # plain bytes


def generate_recommendations(metrics: dict, buffer_percent: int = 20) -> list:
    """Build right-sizing recommendations: observed usage plus buffer_percent."""
    recommendations = []
    for pod, usage in metrics.items():
        # Never recommend less than 1m / 1Mi for an idle Pod.
        cpu_rec = max(1, int(usage['cpu_millicores'] * (1 + buffer_percent / 100)))
        mem_rec = max(1, int(usage['memory_mib'] * (1 + buffer_percent / 100)))
        recommendations.append({
            'pod': pod,
            'recommended_cpu': f'{cpu_rec}m',
            'recommended_memory': f'{mem_rec}Mi',
            'current_cpu': usage['cpu_millicores'],      # observed usage, not the request
            'current_memory': usage['memory_mib'],
        })
    return recommendations


if __name__ == '__main__':
    metrics = get_pod_metrics('production')
    recs = generate_recommendations(metrics, buffer_percent=20)
    for r in recs:
        print(f"{r['pod']}: CPU {r['recommended_cpu']}, Memory {r['recommended_memory']}")
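A percentile-based recommendation needs a series of samples rather than a single reading. The following is a minimal sketch of that calculation, assuming the samples have already been collected (for instance exported from a monitoring system); the helper names `percentile` and `recommend_request` and the sample values are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def recommend_request(samples_millicores: list[float], buffer_percent: int = 20) -> int:
    """P99 usage plus a safety buffer, rounded up to whole millicores."""
    p99 = percentile(samples_millicores, 99)
    return math.ceil(p99 * (100 + buffer_percent) / 100)


# Hypothetical CPU samples: mostly 300m, with two short spikes.
samples = [300.0] * 98 + [450.0, 900.0]
print(recommend_request(samples))  # prints 540
```

Using P99 instead of the maximum ignores the single 900m outlier while still covering the 450m spike, which is the trade-off a percentile is meant to make.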
Strategy 2: Spot/Preemptible instances
# spot-node-pool.yaml
# Node objects are normally registered by the kubelet once the cloud provider
# creates a Spot node pool; the Node below only illustrates the labels that the
# Deployment selects on. "node-type: spot" is a custom label.
apiVersion: v1
kind: Node
metadata:
  name: spot-pool
  labels:
    cloud.google.com/gke-provisioning: spot
    node-type: spot
  annotations:
    cloud.google.com/spot: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        node-type: spot
      # The toleration only matters if the Spot nodes carry a matching taint.
      tolerations:
        - key: "cloud.google.com/gke-provisioning"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      terminationGracePeriodSeconds: 60
      containers:
        - name: processor
          image: myapp/batch-processor:latest
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
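Handling Spot instance interruptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-aware-service  # object names must be lowercase
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spot-aware-service
  template:
    metadata:
      labels:
        app: spot-aware-service
    spec:
      containers:
        - name: app
          image: myapp:latest
          lifecycle:
            preStop:
              exec:
                # Leave time for load balancers to stop routing traffic before SIGTERM.
                command: ["/bin/sh", "-c", "sleep 15"]
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spot-aware-service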
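The preStop hook only buys time; the application still has to react to the SIGTERM that follows it. Below is a minimal sketch of that application side, assuming a simple worker loop. The class `GracefulShutdown` and the function `run_worker` are hypothetical names, not part of any library.

```python
import signal
import threading


class GracefulShutdown:
    """Flips a flag on SIGTERM so the main loop can finish in-flight work."""

    def __init__(self) -> None:
        self.stop_requested = threading.Event()

    def install(self) -> None:
        # Kubernetes sends SIGTERM once the preStop hook has completed.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.stop_requested.set()


def run_worker(shutdown: GracefulShutdown, jobs: list[str]) -> list[str]:
    """Process jobs until the list is exhausted or shutdown is requested."""
    done = []
    for job in jobs:
        if shutdown.stop_requested.is_set():
            break  # stop taking new work; remaining jobs stay queued
        done.append(job)
    return done


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    shutdown.install()
    print(run_worker(shutdown, ["job-1", "job-2"]))
```

The work done after SIGTERM has to fit inside terminationGracePeriodSeconds, and on Spot capacity inside the provider's interruption notice, whichever is shorter.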
Strategy 3: Cluster autoscaling
# cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=5m
            - --scale-down-utilization-threshold=0.5
            - --max-nodes-total=50
            # Per-group bounds use --nodes=min:max:name; NODE_GROUP_NAME is a placeholder.
            - --nodes=3:50:NODE_GROUP_NAME
            - --balance-similar-node-groups
            # The priority expander reads its priorities from a ConfigMap.
            - --expander=priority
            - --skip-nodes-with-local-storage=false
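To make the effect of --scale-down-utilization-threshold concrete, here is a simplified sketch of how a node becomes a scale-down candidate. It assumes utilization is the larger of the CPU and memory request ratios against allocatable capacity. The real autoscaler also checks whether the Pods can be rescheduled elsewhere and how long the node has been unneeded, which this sketch leaves out. The function names and node data are hypothetical.

```python
def node_utilization(requested: dict[str, float], allocatable: dict[str, float]) -> float:
    """Highest requested/allocatable ratio across CPU and memory."""
    return max(requested[r] / allocatable[r] for r in ("cpu", "memory"))


def scale_down_candidates(nodes: dict[str, dict], threshold: float = 0.5) -> list[str]:
    """Names of nodes whose utilization is below the threshold."""
    return sorted(
        name for name, n in nodes.items()
        if node_utilization(n["requested"], n["allocatable"]) < threshold
    )


# Hypothetical nodes: CPU in cores, memory in GiB.
nodes = {
    "node-a": {"requested": {"cpu": 1.0, "memory": 4.0},
               "allocatable": {"cpu": 4.0, "memory": 16.0}},
    "node-b": {"requested": {"cpu": 3.0, "memory": 6.0},
               "allocatable": {"cpu": 4.0, "memory": 16.0}},
}
print(scale_down_candidates(nodes))  # ['node-a']
```

Because the calculation is based on requests rather than live usage, over-sized requests keep nodes above the threshold and block scale-down, which is one more reason to right-size first.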
Combining HPA and VPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    # Recommendation-only. "Auto" would resize Pods on the same CPU/memory
    # signals that the HPA above scales on, and the two would fight.
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
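The HPA's core decision follows a documented formula: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (10% by default) and clamped to the min/max bounds. The sketch below reproduces just that formula with the bounds from the api-gateway example; it does not model the stabilization windows or the scaleUp/scaleDown policies, and the function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 3,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA formula would aim for, before behavior policies."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(5, 90, 60))   # 8: CPU at 90% against a 60% target
print(desired_replicas(5, 20, 60))   # 3: would be 2, clamped to minReplicas
```

With several metrics configured, the HPA computes a count per metric and uses the largest one, so the memory target can hold replicas up even when CPU is idle.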
Strategy 4: Cost monitoring and alerting
# kubecost-config.yaml
# Key names are illustrative; check them against the schema of the Kubecost version you run.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubecost-cost-model
  namespace: kubecost
data:
  cost-model.json: |
    {
      "clusterName": "production",
      "defaultCPUPrice": "0.031611",
      "defaultRAMPrice": "0.004237",
      "spotLabel": "cloud.google.com/gke-provisioning",
      "spotLabelValue": "spot",
      "spotDiscount": 0.6
    }
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-alerts
  namespace: kubecost
data:
  alerts.json: |
    [
      {
        "name": "daily-budget-alert",
        "type": "budget",
        "threshold": 500,
        "window": "1d",
        "aggregation": "cluster",
        "notification": {
          "type": "slack",
          "channel": "#finops-alerts"
        }
      },
      {
        "name": "namespace-spike-alert",
        "type": "spendChange",
        "threshold": 0.3,
        "window": "7d",
        "baselineWindow": "7d",
        "aggregation": "namespace",
        "notification": {
          "type": "email",
          "email": "finops-team@company.com"
        }
      }
    ]
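The two alert types above boil down to simple arithmetic: a budget alert compares spend in a window against a fixed amount, and a spend-change alert compares it against a baseline window. Here is a minimal sketch of both checks, using the thresholds from the example (500 per day, 30% change). The function names and the spend figures are hypothetical, and this is not how any particular tool implements its alerts.

```python
def over_budget(window_spend: float, budget: float = 500.0) -> bool:
    """True when spend in the window exceeds the fixed budget."""
    return window_spend > budget


def spend_change(current: float, baseline: float) -> float:
    """Relative change of current spend against the baseline window."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def spike_detected(current: float, baseline: float, threshold: float = 0.3) -> bool:
    """True when spend grew by more than the threshold fraction."""
    return spend_change(current, baseline) > threshold


# Hypothetical namespace spend: 1400 this week against 1000 in the baseline week.
print(over_budget(520.0))            # True
print(spike_detected(1400, 1000))    # True: +40% exceeds the 30% threshold
```

A relative threshold catches small namespaces that double overnight, while the absolute budget catches slow growth across the whole cluster; the two complement each other.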
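Strategy 5: Scheduled scaling for development environments
# dev-namespace-schedule.yaml
# Assumes a controller providing the ScheduleSwitch resource is installed;
# confirm the CRD exists in your cluster (kubectl get crd) before applying.
apiVersion: zalando.org/v1
kind: ScheduleSwitch
metadata:
  name: dev-environment-schedule
  namespace: dev
spec:
  switches:
    - startTime: "0 8 * * 1-5"
      endTime: "0 20 * * 1-5"
      replicas: 1
      description: "Running on weekdays 08:00-20:00"
    - startTime: "0 20 * * 1-5"
      endTime: "0 8 * * 1-5"
      replicas: 0
      description: "Scaled to 0 outside working hours"
    - startTime: "0 0 * * 0,6"
      endTime: "0 0 * * 1"
      replicas: 0
      description: "Scaled to 0 on weekends"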
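The schedule above encodes one rule: one replica on weekdays between 08:00 and 20:00, zero otherwise. The sketch below expresses that rule directly and estimates how many compute hours it removes; the function names are hypothetical, and time zones are ignored for brevity.

```python
from datetime import datetime


def desired_dev_replicas(now: datetime) -> int:
    """1 replica on weekdays 08:00-20:00, otherwise 0."""
    is_weekday = now.weekday() < 5      # Monday=0 ... Friday=4
    in_hours = 8 <= now.hour < 20
    return 1 if is_weekday and in_hours else 0


def weekly_uptime_fraction(hours_per_day: int = 12, days_per_week: int = 5) -> float:
    """Share of the week during which the environment is running."""
    return (hours_per_day * days_per_week) / (24 * 7)


print(desired_dev_replicas(datetime(2026, 1, 5, 9, 0)))    # Monday 09:00 -> 1
print(desired_dev_replicas(datetime(2026, 1, 3, 12, 0)))   # Saturday noon -> 0
print(f"uptime: {weekly_uptime_fraction():.0%}")           # uptime: 36%
```

Running 60 of 168 hours cuts roughly 64% of the dev environment's compute hours, provided the cluster autoscaler actually removes the emptied nodes.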
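Pitfalls to avoid
| No. | Pitfall | Symptom | Solution |
|---|---|---|---|
| 1 | VPA in Auto mode restarts Pods frequently | Service availability drops | Start in Off mode to observe the recommendations, then switch to Auto once they look right |
| 2 | Spot reclamation interrupts many Pods at once | Widespread 503s | Spread Pods across AZs with topologySpreadConstraints and add a preStop hook for graceful exit |
| 3 | HPA and VPA act on the same workload and conflict | Replica count and resource sizes oscillate | Let HPA manage replica count and VPA manage resource sizes; never drive both from the same metric |
| 4 | Cluster Autoscaler scales down critical nodes | Stateful Pods get evicted | Annotate critical Pods with cluster-autoscaler.kubernetes.io/safe-to-evict: "false", or critical nodes with cluster-autoscaler.kubernetes.io/scale-down-disabled: "true" |
| 5 | Inconsistent cost-allocation labels | Costs cannot be attributed by team/project | Define a label policy and enforce it with Kyverno/OPA |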
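Before enforcing a label policy with an admission controller, it can help to audit what is already deployed. The following is a minimal sketch of such an audit over label dictionaries; the required keys in `REQUIRED_LABELS` and the workload data are hypothetical and stand in for whatever your policy defines.

```python
REQUIRED_LABELS = ("team", "project", "env")  # hypothetical policy keys


def missing_labels(
    labels: dict[str, str], required: tuple[str, ...] = REQUIRED_LABELS
) -> list[str]:
    """Required label keys that are absent or empty."""
    return [key for key in required if not labels.get(key)]


def audit(workloads: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant workload name to its missing label keys."""
    report = {}
    for name, labels in workloads.items():
        gaps = missing_labels(labels)
        if gaps:
            report[name] = gaps
    return report


workloads = {
    "order-service": {"team": "payments", "project": "shop", "env": "prod"},
    "batch-processor": {"team": "data"},
}
print(audit(workloads))  # {'batch-processor': ['project', 'env']}
```

Workloads that appear in the report are the ones whose cost would land in an unallocated bucket.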
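Troubleshooting
| Error message | Cause | Fix |
|---|---|---|
| metrics-server: no metrics available | metrics-server is not installed or not ready | Install metrics-server and check that kubectl top nodes works |
| VPA recommender: OOMKilled | The VPA recommender is short of memory | Increase the memory request of the VPA recommender |
| HPA: unable to get metric | The custom metric is not registered | Check Prometheus Adapter and the custom metrics API |
| ClusterAutoscaler: node group not found | Node group is misconfigured | Check the --nodes flag format: min:max:node-group-name |
| Spot node: preempted | The cloud provider reclaimed the Spot instance | Expected behavior; a PodDisruptionBudget does not prevent preemption, so keep enough replicas spread across nodes and zones |
| PodDisruptionBudget: not enough replicas | An overly strict PDB blocks scale-down | Adjust the PDB's minAvailable or maxUnavailable |
| Kubecost: pricing data not available | The cloud pricing API is unreachable | Configure custom pricing or use defaultCPUPrice/defaultRAMPrice |
| Scale to 0: jobs still running | An unfinished Job blocks scale-down | Wait for the Job to finish or set activeDeadlineSeconds |
| ResourceQuota: exceeded quota | Tenant quota exceeded after right-sizing | Adjust the ResourceQuota or coordinate with the platform team |
| Node: NotReady after scale-up | New node failed to initialize | Check the node startup script and init containers |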
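Advanced optimization
1. Reserved Instances / Savings Plans
| Option | Typical discount | Flexibility | Best for |
|---|---|---|---|
| On-Demand | 0% | Highest | Temporary/bursty workloads |
| Spot/Preemptible | 60-90% | Low (reclaimable) | Stateless/interruptible workloads |
| 1-year Reserved | 30-40% | Medium | Stable baseline load |
| 3-year Reserved | 50-60% | Low | Long-lived core services |
| Savings Plans | 20-40% | High (across instance types) | Mixed workloads |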
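In practice these purchase options are mixed, and the useful number is the blended rate. The sketch below computes it for a hypothetical capacity split; the shares and the discounts (picked from within the ranges in the table) are assumptions for illustration, not provider pricing.

```python
def blended_hourly_cost(
    on_demand_rate: float, mix: dict[str, tuple[float, float]]
) -> float:
    """Blended hourly cost; mix maps an option to (capacity share, discount)."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("capacity shares must sum to 1")
    return sum(
        on_demand_rate * share * (1 - discount)
        for share, discount in mix.values()
    )


# Hypothetical mix: half reserved baseline, 30% Spot, 20% On-Demand for bursts.
mix = {
    "reserved_1y": (0.5, 0.35),
    "spot": (0.3, 0.70),
    "on_demand": (0.2, 0.00),
}
cost = blended_hourly_cost(1.0, mix)
print(f"blended rate: {cost:.3f} per on-demand dollar")  # 0.615
```

Under these assumptions the mix costs about 61.5% of an all On-Demand fleet; shifting more of the interruptible load to Spot moves the figure further.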
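2. Multi-cluster cost optimization
# Illustrative configuration; the kind and fields sketch the aggregation intent
# and should be checked against the tooling you actually run.
apiVersion: kubecost.com/v1
kind: MultiClusterCost
spec:
  clusters:
    - name: us-east-prod
      apiEndpoint: https://k8s-us-east.example.com
      costWeight: 1.0
    - name: eu-west-prod
      apiEndpoint: https://k8s-eu-west.example.com
      costWeight: 0.8
  aggregation:
    byTeam: true
    byService: true
    byNamespace: true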
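3. Cost-aware scheduling
# Bin-pack Pods onto fewer nodes so the autoscaler can remove the empty ones.
# In kubescheduler.config.k8s.io/v1 this is the MostAllocated scoring strategy of
# the NodeResourcesFit plugin. A custom plugin such as CostAwarePriority would
# have to be built into the scheduler binary before it could be enabled.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: cost-aware-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1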
Comparison
| FinOps tool | Cost visibility | Right-sizing recommendations | Spot support | Open source | Pricing |
|---|---|---|---|---|---|
| Kubecost | ★★★★★ | ★★★★★ | ★★★★ | Yes | Free / paid enterprise tier |
| CloudHealth | ★★★★★ | ★★★★ | ★★★ | No | % of cloud spend |
| AWS Cost Explorer | ★★★★ | ★★★ | ★★★ | No | Free |
| Prometheus + Grafana | ★★★ | ★★ | ★ | Yes | Free |
| Vantage | ★★★★ | ★★★★ | ★★★ | No | Per seat |
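Summary: K8s cost optimization is not a one-off cut but a systematic effort across the three FinOps phases. Inform makes costs transparent, Optimize removes waste through right-sizing, Spot and autoscaling, and Operate keeps spend under control with alerts and policies. The 5 strategies work together: right-sizing saves about 40%, Spot instances 20%, autoscaling 15%, scheduled dev environments 10%, and cost monitoring holds back another 5% of regression. These figures overlap, so they compound rather than add (a straight sum would be 90%): if each applies to the spend left after the previous one, 0.60 × 0.80 × 0.85 × 0.90 × 0.95 ≈ 0.35 of the original bill remains, which is where a reduction of roughly 60% comes from. In 2026, running K8s without FinOps is like being a shopaholic who never reads the bill.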
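The per-strategy figures in the summary do not simply add up (40 + 20 + 15 + 10 + 5 = 90). One hedged way to combine them is to assume each saving applies to whatever spend is left after the previous one. The sketch below computes the result under that assumption; the function name is hypothetical and the percentages are the ones quoted above.

```python
import math


def combined_savings(savings: list[float]) -> float:
    """Overall saving when each fraction applies to the remaining spend."""
    remaining = math.prod(1 - s for s in savings)
    return 1 - remaining


# Right-sizing, Spot, autoscaling, dev scheduling, monitoring.
steps = [0.40, 0.20, 0.15, 0.10, 0.05]
print(f"combined saving: {combined_savings(steps):.1%}")  # combined saving: 65.1%
```

Under this compounding assumption the total lands at about 65%, in the same range as the headline 60%, whereas a plain sum would overstate it.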