K8s HPA Autoscaling: 7 Key Tuning Strategies from Metrics Configuration to Production Stability
Your K8s Cluster Is a Paper Tiger Under Peak Traffic
Traffic spikes at 3 AM and Pods get OOM-killed. During a promotion, HPA scales out so aggressively that the new Pods overwhelm the database connection pool. Scale-down is too eager, and freshly created Pods are killed before they finish warming up. Kubernetes HPA (Horizontal Pod Autoscaler) is more than a CPU threshold: the default metric windows, scaling policies, and stabilization settings are generic starting points, and taking them to production untuned invites disaster.
This article starts from HPA basics and works through metrics configuration → custom metrics → scaling behavior tuning → production stability, covering 7 key tuning strategies for moving from development to production.
HPA Core Concepts
| Concept | Description |
|---|---|
| Horizontal Pod Autoscaler | Automatically adjusts Pod replica count based on metrics |
| Metrics Server | Resource metrics collector, provides CPU/memory and other basic metrics |
| Custom Metrics | Custom metrics like QPS, queue depth, connection count |
| External Metrics | External metrics like message queue length, cloud service metrics |
| Target Utilization | Target utilization, HPA maintains metrics near target value |
| Scale Target Ref | Scaling target reference, pointing to Deployment/StatefulSet etc. |
| Behavior | Scaling behavior configuration, controlling scaling speed and policy |
| Stabilization Window | Stability window, prevents frequent scaling from metric fluctuation |
| Cooldown/Delay | Scaling cooldown, minimum interval between two scaling operations |
| VPA | Vertical Pod Autoscaler, adjusts Pod resource requests |
HPA Workflow
1. HPA controller fetches metrics from Metrics Server every 15s (default)
2. Calculates ratio of current metric value to target value
3. Calculates desired replicas: desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)]
4. Applies Behavior policies to limit scaling speed
5. Updates Scale Target's replicas field
6. Deployment controller creates/deletes Pods
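The calculation in steps 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration rather than the controller's actual code: the function name `desired_replicas` is hypothetical, and the 10% tolerance band reflects the controller's default setting (`--horizontal-pod-autoscaler-tolerance`), inside which no scaling happens.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # CPU at 90% against a 70% target -> 6
print(desired_replicas(4, 74, 70))  # inside the tolerance band -> stays at 4
```

The result is then passed through the Behavior policies (step 4) and clamped to minReplicas/maxReplicas before the replicas field is updated.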
Problem Analysis: 5 Major HPA Production Challenges
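- Metric latency: Metrics Server scrapes kubelets on a fixed interval (set by --metric-resolution, commonly 15s or longer) and the HPA controller syncs every 15s, so metrics lag during traffic bursts and scaling is delayed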
- Scaling oscillation: Metrics fluctuate around threshold, Pods frequently created/destroyed, affecting service stability
- Missing custom metrics: CPU/memory can't truly reflect business load, need QPS, queue depth and other business metrics
- Scale-down avalanche: Scaling down too fast causes newly established connections to be interrupted, request failure rate spikes
- Inaccurate resource requests: Pod resources.requests set unreasonably, HPA's percentage-based calculation becomes distorted
Step-by-Step: 7 Key Tuning Strategies
Strategy 1: Basic HPA Configuration — CPU/Memory Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
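To see how the two scaleUp policies above interact under selectPolicy: Max, here is a minimal Python sketch. It assumes no scale-up has already happened inside the current 60-second period, and the helper name `scale_up_limit` is hypothetical. Each policy proposes an upper bound, and Max picks the larger one.

```python
import math

def scale_up_limit(current_replicas: int, policies: list, select_policy: str = "Max") -> int:
    """Upper bound on replicas after one period, assuming no earlier scale-up in that period."""
    proposals = []
    for policy in policies:
        if policy["type"] == "Percent":
            proposals.append(math.ceil(current_replicas * (1 + policy["value"] / 100)))
        elif policy["type"] == "Pods":
            proposals.append(current_replicas + policy["value"])
    return max(proposals) if select_policy == "Max" else min(proposals)

policies = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 4}]
for replicas in (3, 4, 10):
    print(replicas, "->", scale_up_limit(replicas, policies))
```

With fewer than 4 replicas the Pods policy is the more generous one (3 can grow to 7); above 4 the Percent policy takes over (10 can grow to 20). That is the point of combining them: small deployments are not throttled by a percentage of a small number.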
Strategy 2: Custom Metrics — Prometheus Adapter
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  service:
    name: prometheus-adapter
    namespace: monitoring
  group: custom.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
      - seriesQuery: 'grpc_server_handled{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_handled"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-custom-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
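The name rules in the adapter config above decide which metric name the HPA has to reference. The following Python sketch approximates that renaming with a plain regex substitution. It is only an illustration: the function `exposed_metric_name` is hypothetical, and the adapter itself uses Go regular expressions, so edge cases may differ.

```python
import re

def exposed_metric_name(series: str, matches: str, as_template: str) -> str:
    """Approximate the adapter's name rule: match the series name, then substitute ${1}."""
    # The adapter config uses Go-style ${1}; Python's re.sub expects \g<1>.
    python_template = as_template.replace("${1}", r"\g<1>")
    return re.sub(matches, python_template, series)

print(exposed_metric_name("http_requests_total", "^(.*)_total", "${1}_per_second"))
print(exposed_metric_name("grpc_server_handled", "^(.*)_handled", "${1}_per_second"))
```

The first rule yields http_requests_per_second, which is the name the HPA in this strategy references. Under the same logic the second rule would expose grpc_server_per_second, so an HPA targeting gRPC traffic would need to use that name.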
Strategy 3: Fine-Grained Scaling Behavior Tuning
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 5
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 200
          periodSeconds: 60
        - type: Pods
          value: 10
          periodSeconds: 60
        - type: Percent
          value: 50
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 5
          periodSeconds: 120
        - type: Pods
          value: 1
          periodSeconds: 120
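The effect of the 600-second scaleDown stabilization window above can be sketched as follows. For scale-down, the controller considers the replica recommendations made during the window and acts on the highest one, so a brief dip in load does not remove Pods. The class name `ScaleDownStabilizer` and the sample timeline are hypothetical, and the sketch ignores the rate-limiting policies.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keep recent replica recommendations and only scale down to the highest one in the window."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp, recommended replicas)

    def stabilize(self, now: float, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Drop recommendations older than the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)

stabilizer = ScaleDownStabilizer(window_seconds=600)
# Load drops, briefly recovers, then stays low.
timeline = [(0, 40), (150, 20), (300, 35), (450, 20), (600, 20), (750, 20), (900, 20), (1050, 20)]
for t, rec in timeline:
    print(t, rec, "->", stabilizer.stabilize(t, rec))
```

In this timeline the stabilized value stays at 40 until the first recommendation ages out, steps to 35, and only reaches 20 once every recommendation in the window agrees.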
Strategy 4: External Metrics — Message Queue Depth Driven Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages
          selector:
            matchLabels:
              queue: order-processing
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      selectPolicy: Max
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
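With an External metric and an AverageValue target, the total metric value is divided by the per-Pod target. A rough Python sketch of what the manifest above asks for, assuming a hypothetical helper `workers_for_queue` and ignoring the tolerance band and behavior policies:

```python
import math

def workers_for_queue(queue_messages: int, target_per_worker: int,
                      min_replicas: int, max_replicas: int) -> int:
    """Total queue depth divided by the per-Pod target, clamped to min/max replicas."""
    wanted = math.ceil(queue_messages / target_per_worker)
    return max(min_replicas, min(max_replicas, wanted))

for backlog in (10, 450, 3000):
    print(backlog, "->", workers_for_queue(backlog, 30, 2, 50))
```

A backlog of 450 messages asks for 15 workers. A backlog of 3000 would ask for 100 but is capped at maxReplicas (50), which is the signal to either raise the cap or speed up each worker.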
Strategy 5: Multi-Metric Combination Strategy
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "2000"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: External
      external:
        metric:
          name: redis_connected_clients
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
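When an HPA lists several metrics, as above, each metric produces its own replica proposal and the largest proposal wins. The sketch below illustrates this with made-up observations; the function names and the numbers are hypothetical, and the tolerance band is left out for brevity.

```python
import math

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    return math.ceil(current_replicas * current / target)

def desired_across_metrics(current_replicas: int, metrics: dict) -> int:
    """Each metric proposes a replica count; the largest proposal wins."""
    return max(desired_for_metric(current_replicas, cur, tgt) for cur, tgt in metrics.values())

observed = {  # hypothetical per-Pod averages: (current, target)
    "http_requests_per_second": (2600, 2000),
    "cpu_utilization": (52, 65),
    "memory_utilization": (60, 75),
    "redis_connected_clients": (90, 100),
}
print(desired_across_metrics(10, observed))
```

Here CPU and memory would each be satisfied with 8 replicas, but request rate asks for 13, so 13 is the result. One consequence is that a single noisy metric can keep the whole deployment scaled out, so each metric added to the list deserves its own sanity check.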
Strategy 6: VPA and HPA Coordination
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: web-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 4Gi
        controlledResources:
          - cpu
          - memory
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
Strategy 7: Production Readiness Checklist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-ready-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production-ready-app
  template:
    metadata:
      labels:
        app: production-ready-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: app:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: production-ready-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-ready-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
Pitfall Guide
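Pitfall 1: Missing resources.requests Breaks or Distorts HPA Utilization
# ❌ Wrong: no explicit requests. With limits only, Kubernetes copies the limits into
# requests, so utilization is measured against the full 1 CPU and scale-up triggers late.
# With no resources block at all, HPA cannot calculate utilization.
resources:
  limits:
    cpu: "1"
    memory: 1Gi
# ✅ Correct: set requests explicitly, sized to typical usage
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 1Gi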
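Utilization targets are percentages of the container's request, which is why the request value matters so much. A small worked example, assuming a hypothetical average usage of 300m per Pod:

```python
def cpu_utilization_percent(usage_millicores: int, request_millicores: int) -> float:
    """HPA utilization is usage divided by the container's request, not its limit."""
    return 100 * usage_millicores / request_millicores

usage = 300  # hypothetical average usage per Pod, in millicores
print(cpu_utilization_percent(usage, 200))   # requests.cpu: 200m  -> 150.0
print(cpu_utilization_percent(usage, 1000))  # request equal to the 1 CPU limit -> 30.0
```

The same workload reads as 150% utilization against a 200m request and as 30% against a 1 CPU request. With a 70% target, the first case scales out and the second does nothing.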
Pitfall 2: Scale-Down Stabilization Window Too Short Causes Oscillation
# ❌ Wrong: stabilization window overridden to 0 seconds (the scale-down default is 300), so every metric dip triggers a scale-down
behavior:
  scaleDown:
    stabilizationWindowSeconds: 0
# ✅ Correct: at least 300 seconds in production
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600
    selectPolicy: Min
    policies:
      - type: Percent
        value: 10
        periodSeconds: 120
Pitfall 3: maxReplicas Too High Causes Resource Exhaustion
# ❌ Wrong: no upper limit protection
spec:
  maxReplicas: 1000
# ✅ Correct: set reasonable upper limit based on cluster capacity, with LimitRange and ResourceQuota
spec:
  maxReplicas: 50
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    pods: "200"
Pitfall 4: No readinessProbe Causes Traffic to Hit Unready Pods
# ❌ Wrong: no readinessProbe, new Pods receive traffic immediately
spec:
  containers:
    - name: app
      image: app:1.0.0
# ✅ Correct: configure readinessProbe to ensure Pod is ready before receiving traffic
spec:
  containers:
    - name: app
      image: app:1.0.0
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
Pitfall 5: VPA and HPA Using Same CPU Metric Causes Conflicts
# ❌ Wrong: VPA and HPA both based on CPU metrics, interfering with each other
# VPA adjusts CPU requests → HPA recalculates utilization → triggers scaling again
# ✅ Correct: HPA uses custom metrics, VPA manages resource requests
# HPA: based on business metrics like QPS
# VPA: based on CPU/memory resource metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
Error Troubleshooting
| # | Error Message | Cause | Solution |
|---|---|---|---|
| 1 | the HPA was unable to compute the replica count | Metrics Server not installed or unavailable | Install Metrics Server, check kubectl top pods |
| 2 | missing request for cpu | Pod missing resources.requests | Add resources.requests.cpu to container |
| 3 | failed to get cpu utilization | Metric collection delay | Wait 1-2 minutes, check Metrics Server logs |
| 4 | the desired replica count is below the minimum | Load below minReplicas | Normal behavior, HPA won't scale below minReplicas |
| 5 | the desired replica count is above the maximum | Load exceeds maxReplicas | Increase maxReplicas or optimize service performance |
| 6 | invalid metrics source | Custom metrics API not registered | Install Prometheus Adapter, check APIService status |
| 7 | could not resolve external metric | External metric query failed | Check metric name and selector, confirm Prometheus has data |
| 8 | scaling limited because of pod disruption budget | PDB blocking scale-down | Adjust PDB's minAvailable or maxUnavailable |
| 9 | back-off period: scaling is rate limited | Within scaling cooldown | Wait for cooldown, or adjust behavior.policies.periodSeconds |
| 10 | insufficient quota to scale | Namespace resource quota insufficient | Increase ResourceQuota or decrease maxReplicas |
Advanced Optimization
1. Event-Driven Autoscaling (KEDA)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: rabbitmq
      metadata:
        queueName: order-processing
        host: amqp://rabbitmq.production.svc:5672
        queueLength: "30"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_request_duration_seconds_p99
        threshold: "0.5"
        query: "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace='production'}[5m])) by (le))"
2. Pod Priority and Preemption for Critical Services
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "Critical service priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-job
value: 100
globalDefault: false
description: "Batch job priority"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-service
  containers:
    - name: app
      image: app:1.0.0
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
3. HPA Monitoring and Alerting
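apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: monitoring
spec:
  groups:
    - name: hpa.rules
      rules:
        # Metric and label names follow kube-state-metrics v2 (kube_horizontalpodautoscaler_*).
        # kube-state-metrics v1.x exported the same data as kube_hpa_* with an "hpa" label.
        - alert: HPAAtMaxReplicas
          expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} has reached max replicas"
            description: "HPA {{ $labels.horizontalpodautoscaler }} in namespace {{ $labels.namespace }} is at max replicas ({{ $value }}) for 10 minutes"
        - alert: HPAUnstableScaling
          # changes() counts replica-count changes; count_over_time() would only count scraped samples
          expr: |
            changes(kube_horizontalpodautoscaler_status_current_replicas[30m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} is scaling frequently"
        - alert: HPAMetricsUnavailable
          # ScalingActive=false means the HPA cannot compute a replica count from its metrics
          expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} metrics unavailable"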
Comparison Analysis
| Dimension | HPA | VPA | KEDA | Cluster Autoscaler | Knative |
|---|---|---|---|---|---|
| Scaling Dimension | Horizontal (replicas) | Vertical (resource size) | Horizontal + event-driven | Node count | Horizontal + scale to zero |
| Metric Types | CPU/Memory/Custom | CPU/Memory | 50+ event sources | Node resources | Concurrent requests |
| Scale to Zero | ❌ | ❌ | ✅ | ❌ | ✅ |
| Real-time | 15s-60s | Minutes | Seconds | Minutes | Seconds |
| Production Maturity | ✅ GA | ✅ GA | ✅ CNCF Graduated | ✅ GA | ✅ GA |
| Complexity | Low | Medium | Medium | High | High |
| Use Case | Stateless services | Resource tuning | Event-driven | Cluster capacity | Serverless |
Summary: HPA is not "set a CPU threshold and you're done"; it is a systems engineering effort that spans metric selection, scaling behavior, and cluster capacity. Core principles: (1) drive scaling with business metrics (QPS, queue depth) rather than resource metrics alone, because high CPU is a result, not a cause; (2) scale down conservatively, with stabilizationWindowSeconds of at least 300 seconds and a scale-down rate of no more than 10% per 2 minutes; (3) let VPA manage resource requests and HPA manage replica count, on different metrics, to avoid conflicts; (4) back production namespaces with a ResourceQuota as a safety net so that runaway scaling cannot exhaust cluster resources.