Kubernetes Troubleshooting in Practice: From Diagnosis to Fix

Kubernetes Troubleshooting Methodology

Kubernetes cluster troubleshooting is not guesswork; it is a systematic, layer-by-layer diagnostic process. Working from the application layer down to the infrastructure layer lets you narrow down the root cause efficiently.

Layer-by-Layer Diagnostic Model

Layer	Scope	Key Commands
L1 Application	Pod status, container logs	`kubectl logs`, `kubectl describe pod`
L2 Service	Service, Endpoint, DNS	`kubectl get endpoints`, `nslookup`
L3 Scheduling	Deployment, ReplicaSet, HPA	`kubectl rollout status`, `kubectl get hpa`
L4 Storage	PVC, PV, StorageClass	`kubectl describe pvc`, `kubectl get pv`
L5 Node	Node status, resource pressure	`kubectl describe node`, `kubectl top nodes`
L6 Infrastructure	Network plugin, API Server, etcd	`systemctl status kubelet`, `crictl ps`

Golden Diagnostic Workflow

# Step 1: Check overall cluster health
kubectl get nodes
# componentstatuses is deprecated since v1.19; query the API server health endpoint instead
kubectl get --raw='/readyz?verbose'

# Step 2: Identify the problematic namespace
kubectl get all -n <namespace>

# Step 3: Focus on the abnormal resource
kubectl describe <resource> <name> -n <namespace>

# Step 4: Review events and logs
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl logs <pod-name> -n <namespace> --previous

# Step 5: Enter the container for investigation
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Pod Troubleshooting

Pods are the smallest schedulable unit in Kubernetes and the place where most failures surface. Knowing how to diagnose each abnormal state is essential.

CrashLoopBackOff

The container exits shortly after starting, and the kubelet restarts it with an exponentially increasing back-off delay (capped at five minutes), which shows up as the CrashLoopBackOff status.
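To make the restart timing concrete, here is a minimal sketch of the back-off sequence, assuming the documented kubelet defaults (10 s initial delay, doubling on each restart, capped at 300 s). The function name `restart_backoff_delays` is hypothetical and only illustrates the arithmetic.

```python
from itertools import islice


def restart_backoff_delays(initial=10, cap=300):
    """Yield successive restart delays in seconds: doubling, capped at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


# First seven delays between restarts of a crashing container
first_delays = list(islice(restart_backoff_delays(), 7))
print(first_delays)  # [10, 20, 40, 80, 160, 300, 300]
```

A Pod that has been crash looping for a while may therefore wait up to five minutes before its next attempt, so a fix can take that long to show an effect unless the Pod is deleted and recreated.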

# View current logs
kubectl logs <pod-name> -n <namespace>

# View logs from the last crash (critical!)
kubectl logs <pod-name> -n <namespace> --previous

# Check container exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"

Common Causes and Fixes:

Exit Code	Meaning	Fix
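0	Process exited normally; nothing keeps running in the foreground	Check the entry command and make sure the main process runs in the foreground
1	Application error (uncaught exception)	Check application logs, fix code logic
137	SIGKILL (128 + 9): usually OOMKilled, sometimes a forced kill after the termination grace period	Confirm `Reason: OOMKilled`, then increase `resources.limits.memory` or reduce memory usage
139	Segmentation fault (SIGSEGV, 128 + 11)	Check native dependency and library compatibility
143	SIGTERM (128 + 15), graceful termination	Check preStop hook and termination signal handling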
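Exit codes above 128 in the table follow the common convention of 128 plus the signal number. The sketch below decodes them, assuming a Linux host for the signal numbering; `describe_exit_code` is a hypothetical helper, not part of kubectl.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable reason."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application error"


for code in (0, 1, 137, 139, 143):
    print(code, describe_exit_code(code))
```

Note that 137 only says the process received SIGKILL. Whether the OOM killer sent it has to be confirmed from `Reason: OOMKilled` in the Pod description.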

# Fix example: Add startup probe to prevent slow starts from being killed
spec:
  containers:
    - name: app
      image: my-app:v1
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        failureThreshold: 3
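As a rough check of what these probe settings allow, the startup window is approximately failureThreshold × periodSeconds (plus any initialDelaySeconds). The helper below is a hypothetical sketch of that arithmetic and ignores per-probe timeouts.

```python
def probe_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container gets before the probe gives up on it."""
    return initial_delay_seconds + failure_threshold * period_seconds


# startupProbe from the example: 30 failures x 10 s
print(probe_budget_seconds(30, 10))  # 300
# livenessProbe from the example: 3 failures x 15 s
print(probe_budget_seconds(3, 15))   # 45
```

With these values the application has about five minutes to start, after which the liveness probe takes over and restarts the container after roughly 45 seconds of consecutive failures.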

ImagePullBackOff

The image pull failed, typically because of a wrong image reference, missing registry credentials, or a network problem.

# View detailed error information
kubectl describe pod <pod-name> | grep -A10 "Events"

# Common error messages
# ErrImagePull: registry.k8s.io/pause:3.9 - connection timeout
# ImagePullBackOff: unauthorized: authentication required

Troubleshooting Checklist:

Wrong image reference: Check the image field for typos in the registry, repository, or tag
Private registry auth: Confirm imagePullSecrets is configured and the Secret exists in the Pod's namespace
Network unreachable: Node cannot access image registry (firewall/proxy)
Image not found: Confirm the tag exists and the image was actually pushed

# Fix: Configure private registry authentication
spec:
  imagePullSecrets:
    - name: registry-credentials
  containers:
    - name: app
      image: harbor.company.com/project/app:v1.2.3

# Create imagePullSecret (use placeholders; never commit real credentials)
kubectl create secret docker-registry registry-credentials \
  --docker-server=harbor.company.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

OOMKilled

The container's memory usage exceeded its limit, so the kernel OOM killer terminated it.

# Confirm OOMKilled
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Last State:     Terminated  Reason: OOMKilled  Exit Code: 137

# View current container memory usage (a point-in-time snapshot; requires metrics-server)
kubectl top pod <pod-name> -n <namespace>

# Check the kernel log on the node for OOM killer entries
dmesg | grep -i oom

Fix Strategies:

# Strategy 1: Increase memory limit
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"      # increased from 512Mi to 1Gi
    cpu: "500m"

# Strategy 2: Optimize application memory usage (JVM example)
env:
  - name: JAVA_OPTS
    value: "-Xms256m -Xmx512m -XX:+UseG1GC"

# Strategy 3: Set QoS to Guaranteed (requests == limits); this lowers eviction priority but does not raise the memory limit
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Pod Pending

Pod cannot be scheduled to any node, usually due to insufficient resources or unmet constraints.

# View scheduling failure reason
kubectl describe pod <pod-name> | grep -A20 "Events"
# Common messages:
# 0/3 nodes are available: 3 Insufficient cpu.
# 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

Common Causes and Solutions:

Cause	Event Message	Solution
Insufficient CPU	Insufficient cpu	Lower requests or add nodes
Insufficient memory	Insufficient memory	Lower requests or add nodes
Unbound PVC	pod has unbound immediate PersistentVolumeClaims	Troubleshoot PVC status first
Node selector mismatch	node(s) didn't match node selector	Check nodeSelector / nodeAffinity
Taint intolerance	node(s) had taints that the pod didn't tolerate	Add tolerations
PodDisruptionBudget	Cannot evict pod	Affects eviction rather than scheduling: check the PDB configuration when a drain hangs and Pods are never moved

Service and Networking Troubleshooting

DNS Resolution Failure

# Deploy DNS debug Pod (--command makes "sleep infinity" replace the image entrypoint)
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/agnhost:2.39 \
  --restart=Never --command -- sleep infinity

# Test in-cluster DNS resolution
kubectl exec -it dnsutils -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it dnsutils -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Common CoreDNS Issues:

CoreDNS Pod not running: Check coredns status in kube-system
ConfigMap misconfiguration: Check coredns ConfigMap plugin configuration
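ndots causing timeout: with ndots=5 in /etc/resolv.conf, any name with fewer than five dots is tried against every search domain before being queried as-is, which multiplies DNS queries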

# Optimize Pod DNS configuration
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
      - name: timeout
        value: "3"
  dnsPolicy: ClusterFirst
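The effect of ndots can be sketched by listing the names a resolver would try, in order. This is a simplified model of glibc-style search-list behaviour; `candidate_queries` and the `SEARCH` list (typical for a Pod in the `default` namespace) are illustrative assumptions.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the fully qualified names tried for `name`, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list applied
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try the name as-is first
    return searched + [absolute]      # too few dots: walk the search list first


SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

# With ndots=5 an external name goes through three failing lookups first
print(candidate_queries("api.example.com", SEARCH, ndots=5))
# With ndots=2 the same name is tried as-is first
print(candidate_queries("api.example.com", SEARCH, ndots=2)[0])  # api.example.com.
```

Lowering ndots speeds up external lookups, but short in-cluster names such as `<service-name>.<namespace>` (one dot) still resolve through the search list. Appending a trailing dot to external hostnames is another way to skip the search list.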

Empty Endpoints

The Service's Endpoints list is empty, so traffic cannot be routed to any Pod.

# Check Endpoints
kubectl get endpoints <service-name> -n <namespace>

# Common cause troubleshooting
# 1. Label selector mismatch
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector

# 2. Pod not Ready
kubectl get pods -n <namespace>  # Check that READY shows all containers ready (e.g. 1/1)

# 3. Port mismatch
kubectl describe svc <service-name> -n <namespace>

# Ensure Service selector matches Pod labels
# Service
spec:
  selector:
    app: my-app        # Must match Pod label
  ports:
    - port: 80
      targetPort: 8080  # Must match container port

# Pod
metadata:
  labels:
    app: my-app        # Corresponds to Service selector
spec:
  containers:
    - name: app
      ports:
        - containerPort: 8080

Connection Refused

# Test connectivity from debug Pod
kubectl run debug --image=busybox --rm -it --restart=Never -- sh
# Inside debug Pod (busybox ships wget and nc, but not curl)
wget -qO- http://<service-name>.<namespace>:<port>/healthz
nc -zv <service-name>.<namespace> <port>

# Check if NetworkPolicy is blocking
kubectl get networkpolicy -n <namespace>

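# Check kube-proxy logs (they report the proxy mode), then inspect the rules on the node
kubectl logs -n kube-system -l k8s-app=kube-proxy
iptables-save | grep <service-cluster-ip>    # iptables mode, run on the node
ipvsadm -Ln | grep -A3 <service-cluster-ip>  # ipvs mode, run on the node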

Deployment Troubleshooting

Rollback Operations

# View rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2

# Pause and resume rollout (changes made while paused are applied as one rollout on resume)
kubectl rollout pause deployment/<name> -n <namespace>
kubectl set image deployment/<name> app=my-app:v2 -n <namespace>
kubectl rollout resume deployment/<name> -n <namespace>

Scaling Failures

# Manual scale
kubectl scale deployment/<name> --replicas=5 -n <namespace>

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>

# Common HPA failure causes
# 1. metrics-server not deployed
kubectl get pods -n kube-system -l k8s-app=metrics-server

# 2. Resource requests not set (HPA cannot calculate utilization)
kubectl describe pod <pod-name> | grep -A5 "Requests"

# 3. Scaling reached max replica limit
kubectl get hpa <hpa-name> -o yaml | grep maxReplicas

# Correct HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
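To reason about why an HPA did or did not scale, it helps to replay its documented core formula: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to minReplicas/maxReplicas. The sketch below covers only that core formula; it ignores stabilization windows, scaling policies, and not-ready Pods, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Core HPA calculation for a single metric (simplified)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Using the example HPA: target CPU 70%, minReplicas 2, maxReplicas 10
print(desired_replicas(4, 90, 70, 2, 10))    # 6  (scale up)
print(desired_replicas(4, 72, 70, 2, 10))    # 4  (within tolerance)
print(desired_replicas(4, 30, 70, 2, 10))    # 2  (scale down)
print(desired_replicas(10, 140, 70, 2, 10))  # 10 (capped at maxReplicas)
```

Utilization is computed relative to the Pods' resource requests, which is why an HPA with a Utilization target cannot work when requests are missing.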

PVC/PV Storage Troubleshooting

PVC Pending

# View PVC status and events
kubectl describe pvc <pvc-name> -n <namespace>

# Common causes
# 1. No matching PV (static provisioning)
kubectl get pv | grep <storage-class>

# 2. StorageClass doesn't exist (dynamic provisioning)
kubectl get storageclass

# 3. Dynamic provisioner not running
kubectl get pods -n <namespace> | grep provisioner

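PV Released or PVC Lost

# View PV status
kubectl get pv
# NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM
# pv-data    50Gi       RWO            Retain           Released   default/pvc-data

# Note: Lost is a PVC phase (the PV it was bound to no longer exists), not a PV phase.
# A PV whose claim was deleted shows Released (or Failed if reclamation failed).
kubectl get pvc -n <namespace>

# Fix: First confirm the backend storage is healthy and the data is intact.
# Then clear the stale claimRef so the Released PV becomes Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'
# Recreate or rebind the PVC (set spec.volumeName to pin it to this PV)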

AccessMode Mismatch

# PVC requests ReadWriteMany but PV only supports ReadWriteOnce
# PVC
spec:
  accessModes:
    - ReadWriteMany    # Needs RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs

# A StorageClass does not list access modes; find its provisioner, then check which modes that driver supports
kubectl describe storageclass nfs

AccessMode	Abbreviation	Description	Typical Backend
ReadWriteOnce	RWO	Single node read-write	AWS EBS, Ceph RBD
ReadOnlyMany	ROX	Multi-node read-only	NFS
ReadWriteMany	RWX	Multi-node read-write	NFS, CephFS
ReadWriteOncePod	RWOP	Single Pod exclusive read-write	CSI volumes only
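For static provisioning, a claim can only bind to a volume that offers every requested access mode, has at least the requested capacity, and uses the same storage class. The sketch below models those three checks on simplified dictionaries; the dictionary shape and the names `parse_quantity` and `pv_can_satisfy` are assumptions for illustration, not the real API objects.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '10Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def pv_can_satisfy(pv: dict, pvc: dict) -> bool:
    """Check access modes, capacity, and storage class for a PV/PVC pair."""
    modes_ok = set(pvc["accessModes"]) <= set(pv["accessModes"])
    size_ok = parse_quantity(pv["capacity"]) >= parse_quantity(pvc["request"])
    class_ok = pv.get("storageClassName", "") == pvc.get("storageClassName", "")
    return modes_ok and size_ok and class_ok


claim = {"accessModes": ["ReadWriteMany"], "request": "10Gi", "storageClassName": "nfs"}
rwo_volume = {"accessModes": ["ReadWriteOnce"], "capacity": "50Gi", "storageClassName": "nfs"}
rwx_volume = {"accessModes": ["ReadWriteOnce", "ReadWriteMany"], "capacity": "10Gi",
              "storageClassName": "nfs"}

print(pv_can_satisfy(rwo_volume, claim))  # False: RWX requested, only RWO offered
print(pv_can_satisfy(rwx_volume, claim))  # True
```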

Node Troubleshooting

NotReady Status

# View node status
kubectl get nodes
kubectl describe node <node-name>

# Check kubelet service
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check container runtime
crictl ps
systemctl status containerd

# Common causes
# 1. Kubelet certificate expired
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# 2. Insufficient disk space
df -h /var/lib/kubelet

# 3. Network plugin abnormal
kubectl get pods -n kube-system -o wide | grep <node-name>

Resource Pressure Conditions

# View node conditions
kubectl describe node <node-name> | grep -A20 "Conditions"

# DiskPressure: Node disk usage exceeds threshold
# MemoryPressure: Node available memory is insufficient
# PIDPressure: Too many processes

# Adjust kubelet eviction thresholds (not recommended to lower in production)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
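Eviction thresholds are either absolute quantities or percentages of capacity, and a signal is considered under pressure once the available amount drops below the threshold. A minimal sketch of that comparison follows; the helper names and the sample node sizes are hypothetical, and only Ki/Mi/Gi suffixes are handled.

```python
def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    """Resolve a threshold such as '100Mi' or '10%' to a byte count."""
    if threshold.endswith("%"):
        return int(capacity_bytes * float(threshold[:-1]) / 100)
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    return int(threshold[:-2]) * units[threshold[-2:]]


def is_below_threshold(available_bytes: int, threshold: str, capacity_bytes: int) -> bool:
    """True when the available amount has fallen below the eviction threshold."""
    return available_bytes < threshold_bytes(threshold, capacity_bytes)


GI, MI = 2**30, 2**20

# memory.available: "100Mi" on a node with 8Gi of memory and 80Mi free
print(is_below_threshold(80 * MI, "100Mi", 8 * GI))  # True: hard eviction triggers
# nodefs.available: "10%" on a 100Gi disk with 12Gi free
print(is_below_threshold(12 * GI, "10%", 100 * GI))  # False: still above 10Gi
```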

kubectl Debugging Cheat Sheet

Resource Viewing

Command	Purpose
`kubectl get all -n <ns>`	View all resources in namespace
`kubectl get pods -o wide`	View Pods with node info
`kubectl get pods --show-labels`	View Pod labels
`kubectl top pods --sort-by=memory`	Sort by memory
`kubectl api-resources`	View all API resource types

Debugging and Troubleshooting

Command	Purpose
`kubectl logs <pod> -c <container> --previous`	Previous container logs
`kubectl logs <pod> --since=1h`	Logs from last 1 hour
`kubectl logs <pod> -f --tail=100`	Live tail last 100 lines
`kubectl exec -it <pod> -- /bin/sh`	Enter container terminal
`kubectl debug <pod> -it --image=busybox`	Temporary debug container
`kubectl port-forward svc/<svc> 8080:80`	Port forward for local debugging

Cluster Management

Command	Purpose
`kubectl cordon <node>`	Mark node unschedulable
`kubectl drain <node> --ignore-daemonsets`	Evict all Pods from node
`kubectl uncordon <node>`	Restore node scheduling
`kubectl taint nodes <node> key=value:NoSchedule`	Add taint
`kubectl label node <node> key=value`	Add label

Logs and Events Analysis

Centralized Log Collection

# Aggregate logs from multi-container Pod
kubectl logs <pod-name> --all-containers=true

# Batch view using label selector
kubectl logs -l app=my-app -n <namespace> --since=5m

# Export logs to file for analysis
kubectl logs <pod-name> -n <namespace> > pod-debug.log

Event Analysis

# View all namespace events (sorted by time)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Filter by type
kubectl get events -n <namespace> --field-selector type=Warning

# Continuously monitor events
kubectl get events -n <namespace> --watch

# View events for specific resource
kubectl events --for pod/<pod-name> -n <namespace>

Log Levels and Formats

# API Server audit logs
# The path depends on the API server's --audit-log-path setting
jq 'select(.responseStatus.code >= 400)' /var/log/kubernetes/audit.log

# Kubelet logs
journalctl -u kubelet -f

# etcd logs
kubectl logs -n kube-system etcd-<node-name> --since=10m
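When jq is not available on the control plane node, the same audit-log filter can be sketched in a few lines of Python. This assumes the audit log is written as one JSON event per line; the sample events are made up for illustration and `failed_requests` is a hypothetical helper.

```python
import json


def failed_requests(lines, min_status=400):
    """Return (code, verb, requestURI) for audit events with an error status."""
    failures = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        code = event.get("responseStatus", {}).get("code", 0)
        if code >= min_status:
            failures.append((code, event.get("verb"), event.get("requestURI")))
    return failures


# Made-up sample events in the audit log's JSON-lines shape
sample = [
    '{"verb": "get", "requestURI": "/api/v1/pods", "responseStatus": {"code": 200}}',
    '{"verb": "delete", "requestURI": "/api/v1/namespaces/prod", "responseStatus": {"code": 403}}',
    '{"verb": "get", "requestURI": "/api/v1/nodes/missing", "responseStatus": {"code": 404}}',
]
for failure in failed_requests(sample):
    print(failure)
```

On a real node, the list could be replaced by an open file handle for the audit log, since the function only iterates over lines.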

Resource Limits and Requests Best Practices

QoS Classes

QoS Class	Condition	Eviction Priority	Use Case
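Guaranteed	Every container sets requests == limits for both CPU and memory	Lowest (killed last)	Core services
Burstable	At least one request or limit is set, but the Guaranteed criteria are not met	Medium	General services
BestEffort	No requests or limits set on any container	Highest (killed first)	Batch jobs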
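The classification rules in the table can be written down as a small function. This is a simplified sketch: it takes containers as plain dictionaries (an assumed shape, not the real Pod spec), compares quantity strings literally, and treats an omitted request as equal to the limit, which matches how the API server defaults it.

```python
def qos_class(containers):
    """Classify a Pod from its containers' 'requests' and 'limits' maps."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            request = requests.get(resource, limit)  # omitted request defaults to limit
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


equal = {"requests": {"cpu": "500m", "memory": "512Mi"},
         "limits": {"cpu": "500m", "memory": "512Mi"}}
burst = {"requests": {"cpu": "250m", "memory": "256Mi"},
         "limits": {"cpu": "500m", "memory": "512Mi"}}

print(qos_class([equal]))         # Guaranteed
print(qos_class([burst]))         # Burstable
print(qos_class([equal, burst]))  # Burstable: one container breaks the rule
print(qos_class([{}]))            # BestEffort
```

The third call shows why a sidecar without matching requests and limits silently downgrades an otherwise Guaranteed Pod.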

# Production recommended configuration template
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Use LimitRange to set namespace defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"

Common Resource Issues

# View node resource usage
kubectl top nodes
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# View namespace resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

NetworkPolicy Debugging

Policy Blocking Troubleshooting

# View all NetworkPolicies in namespace
kubectl get networkpolicy -n <namespace>

# View policy details
kubectl describe networkpolicy <policy-name> -n <namespace>

# Common NetworkPolicy configuration error: an Egress policy that forgets DNS traffic
# Fix: explicitly allow DNS (port 53, UDP and TCP) to CoreDNS, as in this policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

Network Connectivity Testing

# Deploy network debugging tool
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never

# Test inside debug Pod
# Same namespace
curl http://<service-name>:<port>/healthz

# Cross namespace
curl http://<service-name>.<namespace>.svc.cluster.local:<port>/healthz

# Test external connectivity
curl -I https://external-api.example.com

Common Helm Chart Issues

Release Installation Failure

# View Release status
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>

# View rendered YAML (without installing)
helm template <chart> --debug

# View failed Release values
helm get values <release-name> -n <namespace>

# Rollback Helm Release
helm rollback <release-name> <revision> -n <namespace>

Common Helm Errors

Error	Cause	Fix
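`rendered manifests contain a resource that already exists`	Resource conflict	Delete the conflicting resource, or adopt it by adding the Helm ownership label and annotations (`app.kubernetes.io/managed-by=Helm`, `meta.helm.sh/release-name`, `meta.helm.sh/release-namespace`)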
`Release "xxx" has not been installed yet`	Release doesn't exist	Check namespace and release name
`timed out waiting for the condition`	Pod not Ready	Check `--timeout` value and Pod status
`YAML parse error`	Syntax error in values.yaml or a template	Run `helm lint <chart>` or `helm template <chart> --debug` to locate the offending line

# Clean up failed Release
helm uninstall <release-name> -n <namespace>

# Reuse the name of a deleted release that is still kept in history (dev environment only)
helm install <release-name> <chart> --replace -n <namespace>

Prometheus Monitoring Alerts

Key Alert Rules

groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"

      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is >85% full"

      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning

Common PromQL Queries

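# Container memory utilization in percent (containers without a memory limit are skipped)
container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) * 100

# Node CPU utilization in percent, per node
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))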

# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# PVC utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Emergency Response Playbook

P0: Cluster Unavailable

# 1. Check API Server
kubectl get --raw='/readyz?verbose'   # componentstatuses is deprecated since v1.19
curl -k https://<api-server>:6443/healthz

# 2. Check etcd
kubectl logs -n kube-system etcd-<node> --tail=50
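# etcd requires client certificates; kubeadm default paths shown, adjust for your cluster
ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key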

# 3. Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"

# 4. Emergency recovery: Restart the kubelet and container runtime (on kubeadm clusters the kubelet manages the control plane as static Pods)
systemctl restart kubelet
systemctl restart containerd

P1: Service Fully Unavailable

# 1. Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>

# 2. Emergency scale up
kubectl scale deployment/<name> --replicas=10 -n <namespace>

# 3. Emergency node maintenance
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

P2: Storage Failure

# 1. Check PV status
kubectl get pv | grep -v Bound

# 2. Check CSI Driver
kubectl get pods -n <csi-namespace>

# 3. Emergency: Force delete stuck Pod (last resort: the container may still be running on the node, which is risky for StatefulSets)
kubectl delete pod <pod-name> --force --grace-period=0 -n <namespace>

FAQ

Q: Pod stuck in ContainerCreating state, what to do?

A: Usually caused by a slow image pull, a volume mount failure, a missing ConfigMap or Secret, or a CNI problem while setting up the Pod sandbox. Use kubectl describe pod to check Events, focusing on image pull, volume mount, and sandbox creation events.

Q: How to view logs of a deleted Pod?

A: If log aggregation (EFK/PLG) is configured, query through the log platform. Otherwise, deleted Pod logs cannot be recovered, so always deploy a log collection solution in production.

Q: kubectl command times out, how to troubleshoot?

A: Usually API Server overload or network issues. Check the server address in ~/.kube/config, test connectivity with kubectl get --request-timeout=5s nodes.

Q: How to quickly identify which node has insufficient resources?

A: Use kubectl top nodes for real-time resources, or kubectl describe node | grep -A5 "Allocated resources" for allocated amounts.

Q: Service ClusterIP unreachable, how to troubleshoot?

A: Check in order: ① kube-proxy running normally ② iptables/ipvs rules correct ③ Endpoints not empty ④ NetworkPolicy not blocking ⑤ CNI plugin status.

Q: How to prevent Pods from being unexpectedly evicted?

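A: Set a PodDisruptionBudget to guarantee minimum available replicas for critical services during voluntary disruptions such as kubectl drain. A PDB does not stop node-pressure evictions; for those, set resource requests and limits (Guaranteed QoS Pods are evicted last).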

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Q: Helm upgrade failed but Release state is stuck, what to do?

A: Run helm history to find the last revision with status deployed, then run helm rollback <release> <last-good-revision>. This also recovers a Release that is stuck in a pending state after an interrupted upgrade.