Kubernetes Troubleshooting in Practice: From Diagnosis to Fix

Kubernetes Troubleshooting Methodology

Kubernetes cluster troubleshooting isn't "guesswork" — it's a systematic, layer-by-layer diagnostic approach. Only by progressing from the application layer to the infrastructure layer can you efficiently pinpoint root causes.

Layer-by-Layer Diagnostic Model

| Layer | Scope | Key Commands |
| --- | --- | --- |
| L1 Application | Pod status, container logs | kubectl logs, kubectl describe pod |
| L2 Service | Service, Endpoint, DNS | kubectl get endpoints, nslookup |
| L3 Scheduling | Deployment, ReplicaSet, HPA | kubectl rollout status, kubectl get hpa |
| L4 Storage | PVC, PV, StorageClass | kubectl describe pvc, kubectl get pv |
| L5 Node | Node status, resource pressure | kubectl describe node, kubectl top nodes |
| L6 Infrastructure | Network plugin, API Server, etcd | systemctl status kubelet, crictl ps |

Golden Diagnostic Workflow

# Step 1: Check overall cluster health
kubectl get nodes
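# componentstatuses is deprecated since v1.19; query the API server readiness endpoint instead
kubectl get --raw='/readyz?verbose'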

# Step 2: Identify the problematic namespace
kubectl get all -n <namespace>

# Step 3: Focus on the abnormal resource
kubectl describe <resource> <name> -n <namespace>

# Step 4: Review events and logs
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl logs <pod-name> -n <namespace> --previous

# Step 5: Enter the container for investigation
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Pod Troubleshooting

Pods are the smallest scheduling unit in Kubernetes and the resource with the highest failure density. Mastering the diagnostic method for each abnormal state is essential.

CrashLoopBackOff

The container exits shortly after starting, and the kubelet restarts it with an increasing back-off delay, producing a crash loop.

# View current logs
kubectl logs <pod-name> -n <namespace>

# View logs from the last crash (critical!)
kubectl logs <pod-name> -n <namespace> --previous

# Check container exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"

Common Causes and Fixes:

| Exit Code | Meaning | Fix |
| --- | --- | --- |
| 0 | Normal exit, but no long-running foreground process | Check the entry command; ensure the process runs in the foreground |
| 1 | Application error (uncaught exception) | Check application logs, fix code logic |
| 137 | SIGKILL (128 + 9), most often OOMKilled | If Reason is OOMKilled, increase resources.limits.memory |
| 139 | SIGSEGV (128 + 11), segmentation fault | Check dependency library compatibility |
| 143 | SIGTERM (128 + 15), graceful termination requested | Check preStop hook or termination signal handling |
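
The signal-related codes in the table follow one rule: a code above 128 is 128 plus the number of the signal that ended the process. A minimal shell sketch of that decoding follows; explain_exit_code is a hypothetical helper name, and only the codes listed in the table are given descriptions.

```shell
# Decode a container exit code as shown under "Last State" in
# `kubectl describe pod`. explain_exit_code is a hypothetical helper.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited normally (no foreground process?)" ;;
    1)   echo "application error" ;;
    137) echo "SIGKILL (signal 9), often OOMKilled" ;;
    139) echo "SIGSEGV (signal 11), segmentation fault" ;;
    143) echo "SIGTERM (signal 15), termination requested" ;;
    *)
      # Any other code above 128 is 128 + signal number.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi
      ;;
  esac
}
```

For example, explain_exit_code 130 reports signal 2 (SIGINT), a code the table does not list.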
# Fix example: Add startup probe to prevent slow starts from being killed
spec:
  containers:
    - name: app
      image: my-app:v1
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        failureThreshold: 3
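
With the values above, the container gets up to failureThreshold × periodSeconds = 30 × 10 = 300 seconds to pass the startup probe; once it has started, the liveness probe triggers a restart after about 3 × 15 = 45 seconds of consecutive failures. The arithmetic as a small shell sketch (probe_budget_seconds is a hypothetical helper, not a kubectl feature):

```shell
# Worst-case time a probe tolerates before the kubelet acts:
# initialDelaySeconds + failureThreshold * periodSeconds.
# usage: probe_budget_seconds FAILURE_THRESHOLD PERIOD_SECONDS [INITIAL_DELAY]
probe_budget_seconds() {
  local failure_threshold="$1" period_seconds="$2" initial_delay="${3:-0}"
  echo $(( initial_delay + failure_threshold * period_seconds ))
}
```

If the application's slowest observed start is longer than the startup budget, raise failureThreshold rather than periodSeconds so that a fast start is still detected quickly.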

ImagePullBackOff

The image pull fails, typically because of a wrong image reference, missing registry credentials, or network issues.

# View detailed error information
kubectl describe pod <pod-name> | grep -A10 "Events"

# Common error messages
# ErrImagePull: registry.k8s.io/pause:3.9 - connection timeout
# ImagePullBackOff: unauthorized: authentication required

Troubleshooting Checklist:

  1. Wrong image reference: check the image field for typos in the registry, repository, or tag
  2. Private registry auth: confirm imagePullSecrets is configured
  3. Network unreachable: the node cannot reach the image registry (firewall/proxy)
  4. Image not found: the tag was never pushed or has since been deleted
# Fix: Configure private registry authentication
spec:
  imagePullSecrets:
    - name: registry-credentials
  containers:
    - name: app
      image: harbor.company.com/project/app:v1.2.3

# Create imagePullSecret (use a dedicated pull account; never commit real credentials)
kubectl create secret docker-registry registry-credentials \
  --docker-server=harbor.company.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

OOMKilled

The container's memory usage exceeds its limit, so the kernel OOM killer terminates it.

# Confirm OOMKilled
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Last State:     Terminated  Reason: OOMKilled  Exit Code: 137

# View current container memory usage (a point-in-time sample; use monitoring for trends)
kubectl top pod <pod-name> -n <namespace>

# On the node: check the kernel log for OOM killer activity
dmesg | grep -i oom

Fix Strategies:

# Strategy 1: Increase memory limit
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"      # increased from 512Mi to 1Gi
    cpu: "500m"

# Strategy 2: Optimize application memory usage (JVM example)
env:
  - name: JAVA_OPTS
    value: "-Xms256m -Xmx512m -XX:+UseG1GC"

# Strategy 3: Set QoS to Guaranteed (requests == limits) so the Pod is killed last when the node runs short of memory
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "500m"
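
Strategy 2 only helps if the JVM heap ceiling leaves room below the container limit, because the JVM also uses memory outside the heap (metaspace, thread stacks, direct buffers). The following shell sketch checks that relationship; mem_to_mib and heap_fits_limit are hypothetical helpers, only whole-number Mi and Gi quantities are handled, and the 75 percent default is an assumed rule of thumb rather than a Kubernetes or JVM requirement.

```shell
# Convert a Kubernetes memory quantity with a Mi or Gi suffix to MiB.
# Only whole numbers with these two suffixes are handled in this sketch.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the JVM max heap (-Xmx, given in MiB) is at most MAX_HEAP_PCT
# percent of the container memory limit. The default of 75 is an assumption.
# usage: heap_fits_limit XMX_MIB LIMIT [MAX_HEAP_PCT]
heap_fits_limit() {
  local xmx_mib="$1" limit="$2" max_heap_pct="${3:-75}"
  local limit_mib
  limit_mib="$(mem_to_mib "$limit")" || return 2
  [ $(( xmx_mib * 100 )) -le $(( limit_mib * max_heap_pct )) ]
}
```

With the values from the strategies above, heap_fits_limit 512 1Gi succeeds, while heap_fits_limit 512 512Mi fails: a 512 MiB heap inside a 512Mi limit leaves no headroom and is a likely OOMKilled candidate.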

Pod Pending

The Pod cannot be scheduled onto any node, usually because of insufficient resources or unmet scheduling constraints.

# View scheduling failure reason
kubectl describe pod <pod-name> | grep -A20 "Events"
# Common messages:
# 0/3 nodes are available: 3 Insufficient cpu.
# 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

Common Causes and Solutions:

| Cause | Event Message | Solution |
| --- | --- | --- |
| Insufficient CPU | Insufficient cpu | Lower requests or add nodes |
| Insufficient memory | Insufficient memory | Lower requests or add nodes |
| Unbound PVC | pod has unbound immediate PersistentVolumeClaims | Troubleshoot PVC status first |
| Node selector mismatch | node(s) didn't match node selector | Check nodeSelector / nodeAffinity |
| Taint intolerance | node(s) had taints that the pod didn't tolerate | Add tolerations |
| PodDisruptionBudget | Cannot evict pod as it would violate the pod's disruption budget | Seen during kubectl drain rather than scheduling; check PDB configuration |
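
The lookup in the table can be sketched as a small shell function that maps a scheduler event message to the suggested action. classify_pending_event is a hypothetical helper, and it matches only the message fragments quoted in the table; newer Kubernetes releases may word some of these messages differently.

```shell
# Map a FailedScheduling event message to the action from the table.
# classify_pending_event is a hypothetical helper for illustration.
classify_pending_event() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "lower requests or add nodes" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "troubleshoot the PVC first" ;;
    *"didn't match node selector"*)
      echo "check nodeSelector / nodeAffinity" ;;
    *"had taints that the pod didn't tolerate"*)
      echo "add a matching toleration" ;;
    *)
      echo "unrecognized message, read the full event" ;;
  esac
}
```

The message text would come from the Events section of kubectl describe pod.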

Service and Networking Troubleshooting

DNS Resolution Failure

# Deploy DNS debug Pod (the image's entrypoint is the agnhost binary, so run its pause subcommand)
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/agnhost:2.39 \
  --restart=Never -- pause

# Test in-cluster DNS resolution
kubectl exec -it dnsutils -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it dnsutils -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Common CoreDNS Issues:

  1. CoreDNS Pod not running: Check coredns status in kube-system
  2. ConfigMap misconfiguration: Check coredns ConfigMap plugin configuration
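  3. ndots causing timeouts: with the default ndots:5 in the Pod's /etc/resolv.conf, any name with fewer than five dots is first tried against every search domain, multiplying the number of queries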
# Optimize Pod DNS configuration
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
      - name: timeout
        value: "3"
  dnsPolicy: ClusterFirst
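
To see why lowering ndots helps, it is useful to list the names a resolver tries for one lookup. The sketch below models glibc-style behavior: a name with fewer than ndots dots goes through the search list first and is tried as-is last, while a name with at least ndots dots is tried as-is first. dns_candidates is a hypothetical helper, and other resolvers (musl, for example) differ in the details.

```shell
# Print the names a glibc-style resolver would try, in order.
# usage: dns_candidates NAME NDOTS SEARCH_DOMAIN...
dns_candidates() {
  local name="$1" ndots="$2"
  shift 2
  # A trailing dot marks the name as fully qualified: no search list is used.
  case "$name" in
    *.) echo "$name"; return 0 ;;
  esac
  # Count the dots in the name.
  local only_dots dots domain
  only_dots="$(printf '%s' "$name" | tr -cd '.')"
  dots="${#only_dots}"
  # With enough dots the name is tried as-is first, otherwise last.
  if [ "$dots" -ge "$ndots" ]; then echo "$name"; fi
  for domain in "$@"; do echo "$name.$domain"; done
  if [ "$dots" -lt "$ndots" ]; then echo "$name"; fi
}
```

For an external name such as api.example.com and the usual three cluster search domains, ndots=5 puts three doomed cluster-local lookups ahead of the real one, whereas ndots=2 tries the real name first. Writing the name with a trailing dot skips the search list entirely.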

Empty Endpoints

The Service's Endpoints list is empty, so traffic cannot be routed to any Pod.

# Check Endpoints
kubectl get endpoints <service-name> -n <namespace>

# Common cause troubleshooting
# 1. Label selector mismatch
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector

# 2. Pod not Ready
kubectl get pods -n <namespace>  # Check if READY column shows 1/1

# 3. Port mismatch
kubectl describe svc <service-name> -n <namespace>
# Ensure Service selector matches Pod labels
# Service
spec:
  selector:
    app: my-app        # Must match Pod label
  ports:
    - port: 80
      targetPort: 8080  # Must match container port

# Pod
metadata:
  labels:
    app: my-app        # Corresponds to Service selector
spec:
  containers:
    - name: app
      ports:
        - containerPort: 8080
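
A Service selects a Pod only when every key=value pair in its selector is present in the Pod's labels; extra Pod labels are fine. A minimal shell sketch of that rule, operating on the comma-separated form that kubectl get pods --show-labels prints (selector_matches is a hypothetical helper, and an empty selector is not treated specially here):

```shell
# Succeeds when every key=value pair in SELECTOR also appears in LABELS.
# Both arguments are comma-separated lists such as "app=my-app,tier=web".
# usage: selector_matches SELECTOR LABELS
selector_matches() {
  local selector="$1" labels="$2" pair
  local old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    # Wrap in commas so that "app=my" does not match "app=my-app".
    case ",$labels," in
      *",$pair,"*) ;;
      *) IFS="$old_ifs"; return 1 ;;
    esac
  done
  IFS="$old_ifs"
  return 0
}
```

For instance, selector_matches "app=my-app" "app=my-app,pod-template-hash=abc" succeeds, while a selector of "app=my-app,tier=web" against the same labels fails because tier=web is missing.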

Connection Refused

# Test connectivity from a debug Pod (netshoot ships both wget and curl; busybox has no curl)
kubectl run debug --image=nicolaka/netshoot --rm -it --restart=Never -- sh
# Inside debug Pod
wget -qO- http://<service-name>.<namespace>:<port>/healthz
curl -v telnet://<service-name>.<namespace>:<port>

# Check if NetworkPolicy is blocking
kubectl get networkpolicy -n <namespace>

# Check kube-proxy logs for the proxy mode, then inspect the rules on the node itself
kubectl logs -n kube-system -l k8s-app=kube-proxy
iptables-save | grep <service-cluster-ip>      # iptables mode
ipvsadm -Ln | grep -A3 <service-cluster-ip>    # ipvs mode

Deployment Troubleshooting

Rollback Operations

# View rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2

# Pause and resume rollout (for canary deployment)
kubectl rollout pause deployment/<name> -n <namespace>
kubectl set image deployment/<name> app=my-app:v2 -n <namespace>
kubectl rollout resume deployment/<name> -n <namespace>

Scaling Failures

# Manual scale
kubectl scale deployment/<name> --replicas=5 -n <namespace>

# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>

# Common HPA failure causes
# 1. metrics-server not deployed
kubectl get pods -n kube-system -l k8s-app=metrics-server

# 2. Resource requests not set (HPA cannot calculate utilization)
kubectl describe pod <pod-name> | grep -A5 "Requests"

# 3. Scaling reached max replica limit
kubectl get hpa <hpa-name> -o yaml | grep maxReplicas
# Correct HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
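
The HPA derives its target from the ratio of current to desired utilization: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to minReplicas and maxReplicas. A shell sketch of that calculation using integer percentages; hpa_desired_replicas is a hypothetical helper, and it deliberately omits the controller's tolerance band and stabilization window.

```shell
# Desired replica count per the HPA ratio formula, clamped to [min, max].
# usage: hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
hpa_desired_replicas() {
  local current_replicas="$1" current_util="$2" target_util="$3"
  local min_replicas="$4" max_replicas="$5" desired
  # Integer ceiling of current_replicas * current_util / target_util.
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt "$min_replicas" ]; then desired="$min_replicas"; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired="$max_replicas"; fi
  echo "$desired"
}
```

With the configuration above, 4 replicas at 90% CPU against a 70% target give ceil(360 / 70) = 6 replicas. When several metrics are configured, the HPA computes a count per metric and uses the largest.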

PVC/PV Storage Troubleshooting

PVC Pending

# View PVC status and events
kubectl describe pvc <pvc-name> -n <namespace>

# Common causes
# 1. No matching PV (static provisioning)
kubectl get pv | grep <storage-class>

# 2. StorageClass doesn't exist (dynamic provisioning)
kubectl get storageclass

# 3. Dynamic provisioner not running
kubectl get pods -n <namespace> | grep provisioner

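PVC Lost / PV Released

# View PV and PVC status
kubectl get pv
# NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM
# pv-data    50Gi       RWO            Retain           Released   default/pvc-data
kubectl get pvc -n <namespace>

# Lost is a PVC phase: the PV the claim was bound to no longer exists.
# A PV whose claim was deleted shows Released and cannot be claimed again until its claimRef is cleared.

# Fix: first confirm the backend storage has recovered and the data is intact,
# then clear the stale claimRef so the PV becomes Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'
# Recreate the PVC so it binds to the PV again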

AccessMode Mismatch

# PVC requests ReadWriteMany but PV only supports ReadWriteOnce
# PVC
spec:
  accessModes:
    - ReadWriteMany    # Needs RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs

# A StorageClass does not list access modes; check its provisioner and consult that driver's documentation
kubectl describe storageclass nfs

| AccessMode | Abbreviation | Description | Typical Backend |
| --- | --- | --- | --- |
| ReadWriteOnce | RWO | Single node read-write | AWS EBS, Ceph RBD |
| ReadOnlyMany | ROX | Multi-node read-only | NFS |
| ReadWriteMany | RWX | Multi-node read-write | NFS, CephFS |
| ReadWriteOncePod | RWOP | Single Pod exclusive read-write | CSI volumes only |

Node Troubleshooting

NotReady Status

# View node status
kubectl get nodes
kubectl describe node <node-name>

# Check kubelet service
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check container runtime
crictl ps
systemctl status containerd

# Common causes
# 1. Kubelet certificate expired
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# 2. Insufficient disk space
df -h /var/lib/kubelet

# 3. Network plugin abnormal
kubectl get pods -n kube-system -o wide | grep <node-name>

Resource Pressure Conditions

# View node conditions
kubectl describe node <node-name> | grep -A20 "Conditions"

# DiskPressure: Node disk usage exceeds threshold
# MemoryPressure: Node available memory is insufficient
# PIDPressure: Too many processes

# Adjust kubelet eviction thresholds (lowering them in production is not recommended)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"

kubectl Debugging Cheat Sheet

Resource Viewing

| Command | Purpose |
| --- | --- |
| kubectl get all -n <ns> | View all resources in namespace |
| kubectl get pods -o wide | View Pods with node info |
| kubectl get pods --show-labels | View Pod labels |
| kubectl top pods --sort-by=memory | Sort by memory |
| kubectl api-resources | View all API resource types |

Debugging and Troubleshooting

| Command | Purpose |
| --- | --- |
| kubectl logs <pod> -c <container> --previous | Previous container logs |
| kubectl logs <pod> --since=1h | Logs from last 1 hour |
| kubectl logs <pod> -f --tail=100 | Live tail last 100 lines |
| kubectl exec -it <pod> -- /bin/sh | Enter container terminal |
| kubectl debug <pod> -it --image=busybox | Temporary debug container |
| kubectl port-forward svc/<svc> 8080:80 | Port forward for local debugging |

Cluster Management

| Command | Purpose |
| --- | --- |
| kubectl cordon <node> | Mark node unschedulable |
| kubectl drain <node> --ignore-daemonsets | Evict all Pods from node |
| kubectl uncordon <node> | Restore node scheduling |
| kubectl taint nodes <node> key=value:NoSchedule | Add taint |
| kubectl label node <node> key=value | Add label |

Logs and Events Analysis

Centralized Log Collection

# Aggregate logs from multi-container Pod
kubectl logs <pod-name> --all-containers=true

# Batch view using label selector
kubectl logs -l app=my-app -n <namespace> --since=5m

# Export logs to file for analysis
kubectl logs <pod-name> -n <namespace> > pod-debug.log

Event Analysis

# View all namespace events (sorted by time)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Filter by type
kubectl get events -n <namespace> --field-selector type=Warning

# Continuously monitor events
kubectl get events -n <namespace> --watch

# View events for specific resource
kubectl events --for pod/<pod-name> -n <namespace>

Log Levels and Formats

# API Server audit logs (the path is whatever --audit-log-path is set to)
jq 'select(.responseStatus.code >= 400)' /var/log/kubernetes/audit.log

# Kubelet logs
journalctl -u kubelet -f

# etcd logs
kubectl logs -n kube-system etcd-<node-name> --since=10m

Resource Limits and Requests Best Practices

QoS Classes

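| QoS Class | Condition | Eviction Priority | Use Case |
| --- | --- | --- | --- |
| Guaranteed | Every container sets requests == limits for both CPU and memory | Lowest (killed last) | Core services |
| Burstable | At least one request or limit is set, but the Guaranteed condition is not met | Medium | General services |
| BestEffort | No requests/limits set | Highest (killed first) | Batch jobs |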
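
The classification can be sketched for a single-container Pod as follows. qos_class is a hypothetical helper; it compares quantities as plain strings (so "500m" and "0.5" would not be treated as equal), and it relies on the fact that a request left unset defaults to the limit.

```shell
# Classify a single-container Pod's QoS class from its four resource values.
# Pass an empty string for any value that is not set.
# usage: qos_class CPU_REQUEST CPU_LIMIT MEM_REQUEST MEM_LIMIT
qos_class() {
  local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
       && [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] \
       && [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
    # Both limits are set and each request equals (or defaults to) its limit.
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}
```

The class Kubernetes actually assigned can be read from the Pod with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'.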
# Production recommended configuration template
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

# Use LimitRange to set namespace defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"

Common Resource Issues

# View node resource usage
kubectl top nodes
kubectl describe node <node-name> | grep -A10 "Allocated resources"

# View namespace resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

NetworkPolicy Debugging

Policy Blocking Troubleshooting

# View all NetworkPolicies in namespace
kubectl get networkpolicy -n <namespace>

# View policy details
kubectl describe networkpolicy <policy-name> -n <namespace>
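
# Common NetworkPolicy mistake: an Egress policy that forgets to allow DNS, so name resolution fails.
# The policy below shows the fix: allow port 53 (UDP and TCP) to kube-dns in any namespace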
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

Network Connectivity Testing

# Deploy network debugging tool
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never

# Test inside debug Pod
# Same namespace
curl http://<service-name>:<port>/healthz

# Cross namespace
curl http://<service-name>.<namespace>.svc.cluster.local:<port>/healthz

# Test external connectivity
curl -I https://external-api.example.com

Common Helm Chart Issues

Release Installation Failure

# View Release status
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>

# View rendered YAML (without installing)
helm template <chart> --debug

# View failed Release values
helm get values <release-name> -n <namespace>

# Rollback Helm Release
helm rollback <release-name> <revision> -n <namespace>

Common Helm Errors

| Error | Cause | Fix |
| --- | --- | --- |
| rendered manifests contain a resource that already exists | Resource conflict | Delete the conflicting resource, or adopt it by adding Helm's ownership labels and annotations |
| Release "xxx" has not been installed yet | Release doesn't exist | Check namespace and release name |
| timed out waiting for the condition | Pod not Ready | Check --timeout value and Pod status |
| YAML parse error | values.yaml syntax error | Validate with helm lint |
# Clean up failed Release
helm uninstall <release-name> -n <namespace>

# Force install (dev environment)
helm install <release-name> <chart> --replace -n <namespace>

Prometheus Monitoring Alerts

Key Alert Rules

groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"

      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is >85% full"

      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning

Common PromQL Queries

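# Pod memory utilization (containers without a memory limit are filtered out to avoid dividing by zero)
container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) * 100

# Node CPU utilization (percent, averaged over all cores of each node)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))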

# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# PVC utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

Emergency Response Playbook

P0: Cluster Unavailable

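# 1. Check API Server (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
curl -k https://<api-server>:6443/healthz

# 2. Check etcd (run etcdctl on a control plane node; it needs the etcd client
#    certificates, shown here with kubeadm's default paths)
kubectl logs -n kube-system etcd-<node> --tail=50
ETCDCTL_API=3 etcdctl endpoint health --cluster \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key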

# 3. Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"

# 4. Emergency recovery: restart the node agents (on kubeadm clusters the control plane
#    runs as static Pods that the kubelet brings back up)
systemctl restart containerd
systemctl restart kubelet

P1: Service Fully Unavailable

# 1. Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>

# 2. Emergency scale up
kubectl scale deployment/<name> --replicas=10 -n <namespace>

# 3. Emergency node maintenance
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

P2: Storage Failure

# 1. Check PV status
kubectl get pv | grep -v Bound

# 2. Check CSI Driver
kubectl get pods -n <csi-namespace>

# 3. Emergency: Force delete stuck Pod
kubectl delete pod <pod-name> --force --grace-period=0 -n <namespace>

FAQ

Q: Pod stuck in ContainerCreating state, what to do?

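A: Usually caused by a slow image pull or a volume mount failure; a missing ConfigMap or Secret, or a CNI error while setting up the Pod network, produces the same state. Use kubectl describe pod to check Events, focusing on image pull, volume mount, and sandbox creation events.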

Q: How to view logs of a deleted Pod?

A: If log aggregation (EFK/PLG) is configured, query through the log platform. Otherwise, deleted Pod logs cannot be recovered — always deploy a log collection solution in production.

Q: kubectl command times out, how to troubleshoot?

A: Usually API Server overload or network issues. Check the server address in ~/.kube/config, test connectivity with kubectl get --request-timeout=5s nodes.

Q: How to quickly identify which node has insufficient resources?

A: Use kubectl top nodes for real-time resources, or kubectl describe node | grep -A5 "Allocated resources" for allocated amounts.

Q: Service ClusterIP unreachable, how to troubleshoot?

A: Check in order: ① kube-proxy running normally ② iptables/ipvs rules correct ③ Endpoints not empty ④ NetworkPolicy not blocking ⑤ CNI plugin status.

Q: How to prevent Pods from being unexpectedly evicted?

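A: Set a PodDisruptionBudget to guarantee a minimum number of available replicas for critical services. A PDB only limits voluntary disruptions such as kubectl drain; to reduce node-pressure evictions, also give the Pods Guaranteed QoS or a higher priority class.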

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
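
How many Pods a drain may evict at once follows directly from minAvailable: the number of currently healthy Pods minus minAvailable, never below zero. A tiny shell sketch of that arithmetic (pdb_disruptions_allowed is a hypothetical helper; the live value is reported in the ALLOWED DISRUPTIONS column of kubectl get pdb):

```shell
# Number of Pods that may be voluntarily evicted right now.
# usage: pdb_disruptions_allowed CURRENT_HEALTHY MIN_AVAILABLE
pdb_disruptions_allowed() {
  local current_healthy="$1" min_available="$2" allowed
  allowed=$(( current_healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}
```

With minAvailable: 2 and only two replicas running, the result is 0 and kubectl drain blocks until another replica becomes healthy, so run at least one replica more than minAvailable.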

Q: Helm upgrade failed but Release state is stuck, what to do?

A: Run helm history <release> to find the last revision with status deployed, then run helm rollback <release> <last-good-revision>. The same rollback also recovers a Release left in a pending state by an interrupted upgrade.
