K8s HPA Autoscaling: 7 Key Tuning Strategies from…

Your K8s Cluster Is a Paper Tiger Under Peak Traffic

Traffic spikes at 3 AM and Pods get OOM-killed. During a promotion, HPA scales out so fast that the new Pods exhaust the database connection pool. Scale-down is so aggressive that newly created Pods are terminated before they finish warming up. Kubernetes HPA (Horizontal Pod Autoscaler) is more than a CPU threshold: the default metric windows, scaling policies, and stabilization settings are generic starting points, and shipping them to production untuned invites exactly these failures.

This article starts from HPA basics and walks through metrics configuration → custom metrics → scaling behavior tuning → production stability, covering 7 key tuning strategies on the way from development to production.

HPA Core Concepts

Concept	Description
Horizontal Pod Autoscaler	Automatically adjusts Pod replica count based on metrics
Metrics Server	Resource metrics collector, provides CPU/memory and other basic metrics
Custom Metrics	Application metrics attached to Kubernetes objects, such as QPS, queue depth, or connection count
External Metrics	Metrics from outside the cluster, such as message queue length or cloud service metrics
Target Utilization	The utilization the HPA tries to hold; replicas are added or removed to keep the metric near this value
Scale Target Ref	Reference to the workload being scaled (Deployment, StatefulSet, etc.)
Behavior	Per-direction configuration that controls scaling speed and policy
Stabilization Window	Look-back window that prevents frequent scaling caused by metric fluctuation
Cooldown/Delay	Minimum interval between two scaling operations; in autoscaling/v2 this is expressed through stabilizationWindowSeconds and each policy's periodSeconds
VPA	Vertical Pod Autoscaler, adjusts Pod resource requests

HPA Workflow

1. HPA controller queries the metrics APIs (Metrics Server for CPU/memory, adapters for custom and external metrics) every 15s (default sync period)
2. Calculates ratio of current metric value to target value
3. Calculates desired replicas: desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)]
4. Applies Behavior policies to limit scaling speed
5. Updates Scale Target's replicas field
6. Deployment controller creates/deletes Pods
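
Steps 2 and 3 can be checked with a few lines of Python. This is a minimal sketch, not the controller's code: the function name is hypothetical, and the 10% tolerance band (the controller's documented default, inside which no scaling happens) is included as an assumption about your cluster's configuration.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply desiredReplicas = ceil[currentReplicas * (currentMetric / targetMetric)]."""
    ratio = current_metric / target_metric
    # Assumed default: ratios within +/-10% of 1.0 do not trigger scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))   # CPU at 90% against a 70% target
print(desired_replicas(10, 72, 70))  # inside the tolerance band
```

With 4 replicas at 90% CPU against a 70% target the result is 6 replicas; at 72% the ratio stays inside the tolerance band and the count remains 10.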

Problem Analysis: 5 Major HPA Production Challenges

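Metric latency: Metrics Server scrapes kubelets at a fixed interval (--metric-resolution, 15s in the official manifests) and the HPA controller syncs every 15s on top of that, so during traffic bursts the metrics lag and scaling is delayed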
Scaling oscillation: Metrics fluctuate around threshold, Pods frequently created/destroyed, affecting service stability
Missing custom metrics: CPU/memory can't truly reflect business load, need QPS, queue depth and other business metrics
Scale-down avalanche: Scaling down too fast causes newly established connections to be interrupted, request failure rate spikes
Inaccurate resource requests: Pod resources.requests set unreasonably, HPA's percentage-based calculation becomes distorted
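
The last challenge is easy to see numerically, because Utilization targets are measured against resources.requests rather than against the node or the limit. A minimal sketch (the function name and the figures are illustrative assumptions):

```python
def utilization_percent(usage_millicores: int, request_millicores: int) -> float:
    """CPU utilization as the HPA sees it: usage relative to the container's request."""
    return 100 * usage_millicores / request_millicores

# The same 300m of real CPU usage, judged against two different requests.
print(utilization_percent(300, 200))  # request set too low
print(utilization_percent(300, 500))  # request set too high
```

Against a 70% target, the first Pod reads as 150% and triggers scale-up, while the second reads as 60% and never does, although the workload is identical.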

Step-by-Step: 7 Key Tuning Strategies

Strategy 1: Basic HPA Configuration — CPU/Memory Metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
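
To see what the scaleUp block above permits, the sketch below evaluates both policies for a given replica count and applies selectPolicy. It is a simplified model under stated assumptions: it ignores scaling already performed earlier in the same period, it assumes fractional results of a Percent policy round up, and the function name is hypothetical.

```python
import math

def scale_up_ceiling(current: int, policies: list[dict], select_policy: str = "Max") -> int:
    """Highest replica count the scaleUp policies allow within one period."""
    proposals = []
    for p in policies:
        if p["type"] == "Pods":
            proposals.append(current + p["value"])
        else:  # Percent; rounding up is an assumption of this sketch
            proposals.append(math.ceil(current * (1 + p["value"] / 100)))
    return max(proposals) if select_policy == "Max" else min(proposals)

POLICIES = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 4}]
print(scale_up_ceiling(3, POLICIES))   # Pods policy wins: 3 + 4
print(scale_up_ceiling(10, POLICIES))  # Percent policy wins: 10 * 2
```

With selectPolicy: Max, the fixed 4-Pod policy dominates below 4 replicas and the 100% policy dominates above it, which is what lets a small Deployment react quickly.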

Strategy 2: Custom Metrics — Prometheus Adapter

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  service:
    name: prometheus-adapter
    namespace: monitoring
  group: custom.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
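      # Counter name assumed to be grpc_server_handled_total (the usual exporter naming); rate() needs a counter
      - seriesQuery: 'grpc_server_handled_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_handled_total"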
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-custom-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
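
The adapter rule and the Pods metric above combine as follows: rate() turns each Pod's http_requests_total counter into requests per second, the HPA averages that across Pods, and the average is compared with averageValue. A rough sketch with made-up counter deltas (all names and numbers are illustrative, and the tolerance band is ignored):

```python
import math

WINDOW_SECONDS = 120  # matches the [2m] range in the adapter's metricsQuery

def per_second_rates(counter_deltas: list[int], window: int = WINDOW_SECONDS) -> list[float]:
    """Approximate rate(): counter increase over the window divided by its length."""
    return [delta / window for delta in counter_deltas]

def replicas_for_average(rates: list[float], target_average: float) -> int:
    """Desired replicas for a Pods metric with an AverageValue target."""
    average = sum(rates) / len(rates)
    return math.ceil(len(rates) * average / target_average)

rates = per_second_rates([216_000, 240_000, 192_000])  # three Pods
print(rates)
print(replicas_for_average(rates, 1000))
```

Three Pods averaging 1800 requests per second against a 1000 target produce a desired count of 6.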

Strategy 3: Fine-Grained Scaling Behavior Tuning

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 5
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 200
          periodSeconds: 60
        - type: Pods
          value: 10
          periodSeconds: 60
        - type: Percent
          value: 50
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 5
          periodSeconds: 120
        - type: Pods
          value: 1
          periodSeconds: 120
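
It is worth estimating how long the scaleDown block above takes to drain a fully scaled-out Deployment. The sketch below is a simplified model: it assumes both policies share the same periodSeconds (120 here), that a Percent policy always permits removing at least one Pod (fractions rounded up), and that load stays low for the whole drain. Function names are hypothetical.

```python
import math

def removable(replicas: int, policy: dict) -> int:
    """Pods one policy lets the HPA remove in a single period."""
    if policy["type"] == "Pods":
        return policy["value"]
    # Percent: rounding the removable count up is an assumption of this sketch.
    return math.ceil(replicas * policy["value"] / 100)

def drain_seconds(start: int, floor: int, policies: list[dict],
                  period: int, select_policy: str = "Min") -> int:
    """Seconds needed to shrink from start to floor replicas, one period per step."""
    pick = min if select_policy == "Min" else max
    replicas, elapsed = start, 0
    while replicas > floor:
        step = pick(removable(replicas, p) for p in policies)
        replicas = max(floor, replicas - step)
        elapsed += period
    return elapsed

POLICIES = [{"type": "Percent", "value": 5}, {"type": "Pods", "value": 1}]
print(drain_seconds(200, 5, POLICIES, period=120) / 3600)  # hours from maxReplicas to minReplicas
```

Under selectPolicy: Min the 1-Pod policy always wins, so going from 200 to 5 replicas takes 195 periods, about 6.5 hours, on top of the 600-second stabilization window. That is deliberately conservative; if it is too slow for cost reasons, raising the Pods value is the simplest lever.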

Strategy 4: External Metrics — Message Queue Depth Driven Scaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages
          selector:
            matchLabels:
              queue: order-processing
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      selectPolicy: Max
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
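
For an External metric with an AverageValue target, the total metric value is divided across replicas, so queue depth translates almost directly into a worker count. A minimal sketch, with a hypothetical function name and the min/max bounds taken from the manifest above:

```python
import math

def workers_for_queue(queue_depth: int, per_worker_target: int = 30,
                      min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Replicas needed so that each worker holds about per_worker_target messages."""
    wanted = math.ceil(queue_depth / per_worker_target)
    return max(min_replicas, min(max_replicas, wanted))

for depth in (0, 450, 900, 6000):
    print(depth, workers_for_queue(depth))
```

A backlog of 900 messages asks for 30 workers, and 6000 messages is capped at maxReplicas. The scaleUp policy still limits how fast that count is reached: at 5 Pods per 60 seconds, growing from 2 to 30 workers takes roughly six periods.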

Strategy 5: Multi-Metric Combination Strategy

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "2000"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: External
      external:
        metric:
          name: redis_connected_clients
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
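
When several metrics are configured, the HPA computes a desired replica count for each one and uses the largest. The sketch below illustrates this with invented readings for the four metrics above (the readings and the helper name are assumptions; tolerance and behavior limits are ignored):

```python
import math

def desired_for(current_replicas: int, current_value: float, target_value: float) -> int:
    """Per-metric proposal: ceil(currentReplicas * current / target)."""
    return math.ceil(current_replicas * current_value / target_value)

CURRENT_REPLICAS = 10
# (current reading, target) per metric: per-Pod averages or utilization percent
READINGS = {
    "http_requests_per_second": (2600, 2000),
    "cpu": (52, 65),
    "memory": (60, 75),
    "redis_connected_clients": (90, 100),
}

proposals = {name: desired_for(CURRENT_REPLICAS, cur, tgt)
             for name, (cur, tgt) in READINGS.items()}
print(proposals)
print(max(proposals.values()))  # the HPA scales to the largest proposal
```

Here request rate alone asks for 13 replicas even though CPU and memory would allow shrinking to 8, so the Deployment grows to 13. The flip side is that scale-down only happens once every metric agrees.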

Strategy 6: VPA and HPA Coordination

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: web-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 4Gi
        controlledResources:
          - cpu
          - memory
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

Strategy 7: Production Readiness Checklist

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-ready-app
  namespace: production
spec:
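  # Starting count only; the HPA owns spec.replicas afterwards. Consider omitting this field so that re-applying the manifest does not reset the replica count.
  replicas: 3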
  selector:
    matchLabels:
      app: production-ready-app
  template:
    metadata:
      labels:
        app: production-ready-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: app:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: production-ready-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: production-ready-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      selectPolicy: Min
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
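
Two timing budgets in the Deployment above are worth re-checking whenever the values change: how much of terminationGracePeriodSeconds remains for in-flight requests after the preStop sleep, and how long a failing Pod can keep receiving traffic before the readiness probe removes it. A small sketch of that arithmetic (function names are hypothetical, and probe timeouts are ignored):

```python
def shutdown_budget(grace_period: int, pre_stop_sleep: int) -> int:
    """Seconds left for the app to finish in-flight work after the preStop hook."""
    return grace_period - pre_stop_sleep

def unready_detection_time(period: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive probe failures before the Pod is marked unready."""
    return period * failure_threshold

print(shutdown_budget(60, 15))         # terminationGracePeriodSeconds=60, preStop sleep 15
print(unready_detection_time(10, 3))   # readinessProbe: periodSeconds=10, failureThreshold=3
```

The preStop hook runs inside the grace period, so the application has about 45 seconds to drain, and a Pod that starts failing its readiness check keeps receiving traffic for roughly 30 seconds.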

Pitfall Guide

Pitfall 1: Missing resources.requests Prevents HPA from Working

# ❌ Wrong: no requests set, HPA cannot calculate utilization
# (with limits only, Kubernetes copies the limits into requests, so utilization is measured against the limit)
resources: {}

# ✅ Correct: must set requests
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 1Gi

Pitfall 2: Scale-Down Stabilization Window Too Short Causes Oscillation

# ❌ Wrong: stabilization window forced to 0 seconds (the scaleDown default is 300), scales down on every metric dip
behavior:
  scaleDown:
    stabilizationWindowSeconds: 0

# ✅ Correct: at least 300 seconds in production
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600
    selectPolicy: Min
    policies:
      - type: Percent
        value: 10
        periodSeconds: 120

Pitfall 3: maxReplicas Too High Causes Resource Exhaustion

# ❌ Wrong: no upper limit protection
spec:
  maxReplicas: 1000

# ✅ Correct: set reasonable upper limit based on cluster capacity, with LimitRange and ResourceQuota
spec:
  maxReplicas: 50
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    pods: "200"
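
A quick way to sanity-check maxReplicas against the quota is to compute how many Pods each quota dimension admits and take the smallest. The sketch below uses the ResourceQuota above and, as an assumption, the per-Pod requests and limits from the Strategy 7 Deployment (200m/256Mi requests, 1 CPU/1Gi limits); it also assumes this workload is the only one in the namespace.

```python
# Quota from the manifest above, in millicores and Mi
QUOTA = {"requests.cpu": 100_000, "requests.memory": 200 * 1024,
         "limits.cpu": 200_000, "limits.memory": 400 * 1024, "pods": 200}

# Per-Pod footprint (assumed: values from the Strategy 7 Deployment)
POD = {"requests.cpu": 200, "requests.memory": 256,
       "limits.cpu": 1_000, "limits.memory": 1024, "pods": 1}

def max_pods_within_quota(quota: dict, pod: dict) -> tuple[int, str]:
    """Return the Pod count the quota admits and the dimension that binds first."""
    per_dimension = {key: quota[key] // pod[key] for key in quota}
    binding = min(per_dimension, key=per_dimension.get)
    return per_dimension[binding], binding

print(max_pods_within_quota(QUOTA, POD))
```

With these figures the quota admits 200 such Pods, bounded by limits.cpu and the Pod count, so maxReplicas: 50 leaves room for other workloads while maxReplicas: 1000 could never be satisfied.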

Pitfall 4: No readinessProbe Causes Traffic to Hit Unready Pods

# ❌ Wrong: no readinessProbe, new Pods receive traffic immediately
spec:
  containers:
    - name: app
      image: app:1.0.0

# ✅ Correct: configure readinessProbe to ensure Pod is ready before receiving traffic
spec:
  containers:
    - name: app
      image: app:1.0.0
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10

Pitfall 5: VPA and HPA Using Same CPU Metric Causes Conflicts

# ❌ Wrong: VPA and HPA both based on CPU metrics, interfering with each other
# VPA adjusts CPU requests → HPA recalculates utilization → triggers scaling again

# ✅ Correct: HPA uses custom metrics, VPA manages resource requests
# HPA: based on business metrics like QPS
# VPA: based on CPU/memory resource metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

Error Troubleshooting

#	Error Message	Cause	Solution
1	`the HPA was unable to compute the replica count`	Metrics Server not installed or unavailable	Install Metrics Server, check kubectl top pods
2	`missing request for cpu`	Pod missing resources.requests	Add resources.requests.cpu to container
3	`failed to get cpu utilization`	Metric collection delay	Wait 1-2 minutes, check Metrics Server logs
4	`the desired replica count is less than the minimum replica count`	Computed replica count is below minReplicas	Normal behavior, HPA won't scale below minReplicas
5	`the desired replica count is more than the maximum replica count`	Computed replica count exceeds maxReplicas	Increase maxReplicas or optimize service performance
6	`invalid metrics source`	Custom metrics API not registered	Install Prometheus Adapter, check APIService status
7	`could not resolve external metric`	External metric query failed	Check metric name and selector, confirm Prometheus has data
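8	`scaling limited because of pod disruption budget`	PDBs only restrict voluntary evictions such as node drains and do not cap HPA scale-down; a stalled scale-down usually comes from the stabilization window or a scaleDown policy	Check conditions with kubectl describe hpa, then review behavior.scaleDown and the PDB separately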
9	`back-off period: scaling is rate limited`	Scaling rate capped by a behavior policy	Wait for the period to elapse, or adjust periodSeconds under behavior.scaleUp.policies or behavior.scaleDown.policies
10	`insufficient quota to scale`	Namespace resource quota insufficient	Increase ResourceQuota or decrease maxReplicas

Advanced Optimization

1. Event-Driven Autoscaling (KEDA)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: rabbitmq
      metadata:
        queueName: order-processing
        host: amqp://rabbitmq.production.svc:5672
        mode: QueueLength  # current KEDA releases use mode/value in place of the older queueLength field
        value: "30"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_request_duration_seconds_p99
        threshold: "0.5"
        query: "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace='production'}[5m])) by (le))"

2. Pod Priority and Preemption for Critical Services

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
description: "Critical service priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-job
value: 100
globalDefault: false
description: "Batch job priority"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: critical-service
  containers:
    - name: app
      image: app:1.0.0
      resources:
        requests:
          cpu: 500m
          memory: 512Mi

3. HPA Monitoring and Alerting

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: monitoring
spec:
  groups:
    - name: hpa.rules
      rules:
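        # Metric and label names follow kube-state-metrics v2; v1.x exposed them as kube_hpa_* with an hpa label
        - alert: HPAAtMaxReplicas
          expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} has reached max replicas"
            description: "HPA {{ $labels.horizontalpodautoscaler }} in namespace {{ $labels.namespace }} is at max replicas ({{ $value }}) for 10 minutes"
        - alert: HPAUnstableScaling
          # changes() counts replica-count changes; count_over_time() would only count scraped samples
          expr: |
            changes(kube_horizontalpodautoscaler_status_current_replicas[30m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} is scaling frequently"
        - alert: HPAMetricsUnavailable
          # ScalingActive=false means the HPA cannot compute a replica count, typically because metrics are missing
          expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.horizontalpodautoscaler }} metrics unavailable"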
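
For the frequent-scaling alert, what matters is how often the replica count changed inside the window, not how many samples were scraped. The sketch below mimics the two PromQL functions count_over_time() and changes() on a list of scraped values to show the difference (the sample data is invented, and a 30-second scrape interval is assumed).

```python
def count_over_time(samples: list[int]) -> int:
    """Number of samples in the window, regardless of their values."""
    return len(samples)

def changes(samples: list[int]) -> int:
    """Number of times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

# 30 minutes of replica counts scraped every 30s: a stable HPA and a flapping one
stable = [5] * 60
flapping = [5, 6] * 30

print(count_over_time(stable), count_over_time(flapping))
print(changes(stable), changes(flapping))
```

Both series contain 60 samples, so a threshold on the sample count cannot tell them apart, whereas the change count (0 versus 59) separates them clearly.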

Comparison Analysis

Dimension	HPA	VPA	KEDA	Cluster Autoscaler	Knative
Scaling Dimension	Horizontal (replicas)	Vertical (resource size)	Horizontal + event-driven	Node count	Horizontal + scale to zero
Metric Types	CPU/Memory/Custom	CPU/Memory	50+ event sources	Node resources	Concurrent requests
Scale to Zero	❌	❌	✅	❌	✅
Reaction Time	15s-60s	Minutes	Seconds	Minutes	Seconds
Production Maturity	✅ GA	✅ GA	✅ CNCF Graduated	✅ GA	✅ GA
Complexity	Low	Medium	Medium	High	High
Use Case	Stateless services	Resource tuning	Event-driven	Cluster capacity	Serverless

Summary: HPA is not "set a CPU threshold and you're done"; it is a systems engineering effort that spans metric selection, scaling behavior, and cluster capacity. Core principles: (1) drive scaling with business metrics (QPS, queue depth) rather than resource metrics, because high CPU is a result, not a cause; (2) scale down conservatively, with stabilizationWindowSeconds of at least 300 seconds and a scale-down rate of no more than 10% per 2 minutes; (3) let VPA manage resource requests and HPA manage replica count, on different metrics, to avoid conflicts; (4) back production namespaces with a ResourceQuota so that an HPA scaling toward a high maxReplicas cannot exhaust cluster resources.
