Kubernetes Troubleshooting in Practice: From Diagnosis to Fix
Kubernetes Troubleshooting Methodology
Kubernetes cluster troubleshooting isn't guesswork; it's a systematic, layer-by-layer diagnostic process. Working from the application layer down to the infrastructure layer is the most efficient way to pinpoint root causes.
Layer-by-Layer Diagnostic Model
| Layer | Scope | Key Commands |
|---|---|---|
| L1 Application | Pod status, container logs | kubectl logs, kubectl describe pod |
| L2 Service | Service, Endpoint, DNS | kubectl get endpoints, nslookup |
| L3 Scheduling | Deployment, ReplicaSet, HPA | kubectl rollout status, kubectl get hpa |
| L4 Storage | PVC, PV, StorageClass | kubectl describe pvc, kubectl get pv |
| L5 Node | Node status, resource pressure | kubectl describe node, kubectl top nodes |
| L6 Infrastructure | Network plugin, API Server, etcd | systemctl status kubelet, crictl ps |
Golden Diagnostic Workflow
# Step 1: Check overall cluster health
kubectl get nodes
# componentstatuses is deprecated since v1.19; query the API server health endpoint instead
kubectl get --raw='/readyz?verbose'
# Step 2: Identify the problematic namespace
kubectl get all -n <namespace>
# Step 3: Focus on the abnormal resource
kubectl describe <resource> <name> -n <namespace>
# Step 4: Review events and logs
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl logs <pod-name> -n <namespace> --previous
# Step 5: Enter the container for investigation
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
Pod Troubleshooting
Pods are the smallest scheduling unit in Kubernetes and the resource with the highest failure density. Mastering the diagnostic method for each abnormal state is essential.
CrashLoopBackOff
The container exits shortly after starting, and the kubelet restarts it with an increasing back-off delay, producing a crash loop.
# View current logs
kubectl logs <pod-name> -n <namespace>
# View logs from the last crash (critical!)
kubectl logs <pod-name> -n <namespace> --previous
# Check container exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Last State"
Common Causes and Fixes:
| Exit Code | Meaning | Fix |
|---|---|---|
| 0 | Process exited normally; nothing keeps running in the foreground | Check entry command, ensure foreground execution |
| 1 | Application error (uncaught exception) | Check application logs, fix code logic |
| 137 | SIGKILL (128 + 9), most often OOMKilled | Confirm Reason: OOMKilled, then increase resources.limits.memory |
| 139 | SIGSEGV (128 + 11), segmentation fault | Check dependency library compatibility |
| 143 | SIGTERM (128 + 15), graceful termination | Check preStop hook or termination signal handling |
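Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137, 139, and 143 map to SIGKILL, SIGSEGV, and SIGTERM. The following is a minimal sketch of that decoding; the function name `describe_exit_code` is hypothetical and the wording of the returned strings is only illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a short, human-readable description of a container exit code."""
    if code == 0:
        return "exited normally"
    if 128 < code < 256:
        # Convention: 128 + N means the process was terminated by signal N
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"application error (exit status {code})"


for code in (0, 1, 137, 139, 143):
    print(code, "->", describe_exit_code(code))
```

A SIGKILL result alone does not prove an out-of-memory kill; the Reason field in kubectl describe pod is what confirms OOMKilled.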
# Fix example: Add startup probe to prevent slow starts from being killed
spec:
  containers:
    - name: app
      image: my-app:v1
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15
        failureThreshold: 3
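The probe settings translate into time budgets: a probe tolerates roughly failureThreshold × periodSeconds of consecutive failures before Kubernetes acts. The sketch below works through that arithmetic for the values used here; `probe_budget_seconds` is a hypothetical helper, and it ignores probe timeouts, so treat the result as an approximation.

```python
def probe_budget_seconds(period_seconds: int, failure_threshold: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a probe may keep failing before the kubelet acts."""
    return initial_delay_seconds + period_seconds * failure_threshold


# startupProbe: periodSeconds=10, failureThreshold=30
startup_budget = probe_budget_seconds(10, 30)
# livenessProbe: periodSeconds=15, failureThreshold=3
liveness_budget = probe_budget_seconds(15, 3)

print(f"startup budget:  {startup_budget}s")
print(f"liveness budget: {liveness_budget}s")
```

With these values the application has about 300 seconds to start, and once it is running, an unhealthy container is restarted after roughly 45 seconds of failed liveness checks.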
ImagePullBackOff
The image pull failed, typically because of a wrong image address, missing registry authentication, or network issues.
# View detailed error information
kubectl describe pod <pod-name> | grep -A10 "Events"
# Common error messages
# ErrImagePull: registry.k8s.io/pause:3.9 - connection timeout
# ImagePullBackOff: unauthorized: authentication required
Troubleshooting Checklist:
- Wrong image address: Check image field spelling, confirm tag exists
- Private registry auth: Confirm imagePullSecrets is configured
- Network unreachable: Node cannot access image registry (firewall/proxy)
- Image not found: Tag misspelled or image not pushed
# Fix: Configure private registry authentication
spec:
  imagePullSecrets:
    - name: registry-credentials
  containers:
    - name: app
      image: harbor.company.com/project/app:v1.2.3
# Create imagePullSecret
kubectl create secret docker-registry registry-credentials \
  --docker-server=harbor.company.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
OOMKilled
The container's memory usage exceeded its limit, so the kernel OOM killer terminated it.
# Confirm OOMKilled
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Last State: Terminated Reason: OOMKilled Exit Code: 137
# View current container memory usage (a point-in-time value; use monitoring for trends)
kubectl top pod <pod-name> -n <namespace>
# Check kernel OOM messages (run on the node)
dmesg | grep -i oom
Fix Strategies:
# Strategy 1: Increase memory limit
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"  # increased from 512Mi to 1Gi
    cpu: "500m"
# Strategy 2: Optimize application memory usage (JVM example)
env:
  - name: JAVA_OPTS
    value: "-Xms256m -Xmx512m -XX:+UseG1GC"
# Strategy 3: Set QoS to Guaranteed (requests == limits)
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Pod Pending
The Pod cannot be scheduled onto any node, usually because of insufficient resources or unmet scheduling constraints.
# View scheduling failure reason
kubectl describe pod <pod-name> | grep -A20 "Events"
# Common messages:
# 0/3 nodes are available: 3 Insufficient cpu.
# 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Common Causes and Solutions:
| Cause | Event Message | Solution |
|---|---|---|
| Insufficient CPU | Insufficient cpu | Lower requests or add nodes |
| Insufficient memory | Insufficient memory | Lower requests or add nodes |
| Unbound PVC | pod has unbound immediate PersistentVolumeClaims | Troubleshoot PVC status first |
| Node selector mismatch | node(s) didn't match node selector | Check nodeSelector / nodeAffinity |
| Taint intolerance | node(s) had taints that the pod didn't tolerate | Add tolerations |
| PodDisruptionBudget (blocks drain/eviction rather than scheduling) | Cannot evict pod as it would violate the pod's disruption budget | Check PDB configuration |
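When many Pods are Pending, it can help to break a FailedScheduling message into per-reason node counts. The sketch below does this with a regular expression; it assumes the message shape shown in the examples above ("N/M nodes are available: <count> <reason>, ..."), which can differ between Kubernetes versions, and `parse_scheduling_failure` is a hypothetical helper name.

```python
import re


def parse_scheduling_failure(message: str) -> dict:
    """Split a FailedScheduling message into {reason: node_count}."""
    match = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if not match:
        return {}
    details = match.group(3)
    # Newer schedulers append a "preemption: ..." sentence; drop it
    details = details.split(" preemption:")[0].rstrip(". ")
    reasons = {}
    for part in details.split(", "):
        item = re.match(r"(\d+) (.+)", part.strip())
        if item:
            reasons[item.group(2).rstrip(".")] = int(item.group(1))
    return reasons


print(parse_scheduling_failure("0/3 nodes are available: 3 Insufficient cpu."))
```

Reasons that themselves contain a comma followed by a space would be split incorrectly, so this is only suitable for quick triage.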
Service and Networking Troubleshooting
DNS Resolution Failure
# Deploy DNS debug Pod (the agnhost image's default command keeps the Pod running)
kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/agnhost:2.39 \
  --restart=Never
# Test in-cluster DNS resolution
kubectl exec -it dnsutils -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it dnsutils -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
Common CoreDNS Issues:
- CoreDNS Pod not running: Check coredns status in kube-system
- ConfigMap misconfiguration: Check the coredns ConfigMap plugin configuration
- ndots causing timeout: ndots=5 in /etc/resolv.conf causes multiple queries
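To see why ndots=5 multiplies queries, consider how a resolver expands a name: when the name has fewer dots than ndots, every search domain is tried before the name itself. The sketch below models that ordering; `dns_query_order` is a hypothetical helper, and the search list is an assumed example for a Pod in a namespace called prod.

```python
from __future__ import annotations


def dns_query_order(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion
        return [name]
    absolute = name + "."
    with_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") < ndots:
        # Too few dots: search domains first, the literal name last
        return with_search + [absolute]
    return [absolute] + with_search


# Assumed search list for a Pod in the "prod" namespace
search = ["prod.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in dns_query_order("api.example.com", search, ndots=5):
    print(candidate)
```

With ndots=5 the external name api.example.com is only tried as the fourth query; with ndots=2 it would be tried first, which is what lowering the option achieves.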
# Optimize Pod DNS configuration
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
      - name: timeout
        value: "3"
  dnsPolicy: ClusterFirst
Empty Endpoints
The Service's Endpoints list is empty, so traffic cannot be routed to any Pod.
# Check Endpoints
kubectl get endpoints <service-name> -n <namespace>
# Common cause troubleshooting
# 1. Label selector mismatch
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A5 selector
# 2. Pod not Ready
kubectl get pods -n <namespace> # Check if READY column shows 1/1
# 3. Port mismatch
kubectl describe svc <service-name> -n <namespace>
# Ensure Service selector matches Pod labels
# Service
spec:
  selector:
    app: my-app  # Must match Pod label
  ports:
    - port: 80
      targetPort: 8080  # Must match container port
# Pod
metadata:
  labels:
    app: my-app  # Corresponds to Service selector
spec:
  containers:
    - name: app
      ports:
        - containerPort: 8080
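A Service selects a Pod only when every key/value pair in the selector appears in the Pod's labels; extra Pod labels are fine, but a single mismatch leaves the Endpoints empty. The sketch below illustrates that subset rule; `selector_matches` and the sample Pod data are hypothetical.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector key/value pair is present in the Pod labels."""
    # A Service without a selector gets no automatically managed Endpoints
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


service_selector = {"app": "my-app"}
pods = {
    "my-app-7d9f": {"app": "my-app", "pod-template-hash": "7d9f"},
    "my-app-old": {"app": "myapp"},  # typo in the label: not selected
}

for pod_name, pod_labels in pods.items():
    print(pod_name, selector_matches(service_selector, pod_labels))
```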
Connection Refused
# Test connectivity from debug Pod
kubectl run debug --image=busybox --rm -it --restart=Never -- sh
# Inside debug Pod (busybox ships wget and nc, but not curl)
wget -qO- http://<service-name>.<namespace>:<port>/healthz
nc -zv -w 3 <service-name>.<namespace> <port>
# Check if NetworkPolicy is blocking
kubectl get networkpolicy -n <namespace>
# Check kube-proxy mode and iptables/ipvs rules
kubectl logs -n kube-system -l k8s-app=kube-proxy
iptables-save | grep <service-cluster-ip>
Deployment Troubleshooting
Rollback Operations
# View rollout history
kubectl rollout history deployment/<name> -n <namespace>
# Rollback to previous version
kubectl rollout undo deployment/<name> -n <namespace>
# Rollback to specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2
# Pause and resume rollout (for canary deployment)
kubectl rollout pause deployment/<name> -n <namespace>
kubectl set image deployment/<name> app=my-app:v2 -n <namespace>
kubectl rollout resume deployment/<name> -n <namespace>
Scaling Failures
# Manual scale
kubectl scale deployment/<name> --replicas=5 -n <namespace>
# Check HPA status
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
# Common HPA failure causes
# 1. metrics-server not deployed
kubectl get pods -n kube-system -l k8s-app=metrics-server
# 2. Resource requests not set (HPA cannot calculate utilization)
kubectl describe pod <pod-name> | grep -A5 "Requests"
# 3. Scaling reached max replica limit
kubectl get hpa <hpa-name> -o yaml | grep maxReplicas
# Correct HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
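When an HPA appears stuck, it helps to recompute what it should be asking for. The documented scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule for a single metric; `desired_replicas` is a hypothetical helper and it ignores stabilization windows and Pods that are not ready.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the replica count an HPA computes for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% average CPU against a 70% target
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=10))
```

With several metrics, as in the configuration above, each metric is evaluated separately and the largest result is used.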
PVC/PV Storage Troubleshooting
PVC Pending
# View PVC status and events
kubectl describe pvc <pvc-name> -n <namespace>
# Common causes
# 1. No matching PV (static provisioning)
kubectl get pv | grep <storage-class>
# 2. StorageClass doesn't exist (dynamic provisioning)
kubectl get storageclass
# 3. Dynamic provisioner not running
kubectl get pods -n <namespace> | grep provisioner
PV Lost
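# View PV status
kubectl get pv
# NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM
# pv-data   50Gi       RWO            Retain           Released   default/pvc-data
# Note: "Lost" is a PVC phase (its bound PV no longer exists); the PV itself reports Released or Failed
# Fix: First confirm that the backend storage has recovered
# If storage has recovered, clear the stale claimRef so the PV becomes Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'
# Rebind PVC: recreate the PVC so it binds to the PV again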
AccessMode Mismatch
# PVC requests ReadWriteMany but PV only supports ReadWriteOnce
# PVC
spec:
  accessModes:
    - ReadWriteMany  # Needs RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs
# Confirm StorageClass supported accessMode
kubectl describe storageclass nfs
| AccessMode | Abbreviation | Description | Typical Backend |
|---|---|---|---|
| ReadWriteOnce | RWO | Single node read-write | AWS EBS, Ceph RBD |
| ReadOnlyMany | ROX | Multi-node read-only | NFS |
| ReadWriteMany | RWX | Multi-node read-write | NFS, CephFS |
| ReadWriteOncePod | RWOP | Single Pod exclusive read-write | CSI supported |
Node Troubleshooting
NotReady Status
# View node status
kubectl get nodes
kubectl describe node <node-name>
# Check kubelet service
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
# Check container runtime
crictl ps
systemctl status containerd
# Common causes
# 1. Kubelet certificate expired
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# 2. Insufficient disk space
df -h /var/lib/kubelet
# 3. Network plugin abnormal
kubectl get pods -n kube-system -o wide | grep <node-name>
Resource Pressure Conditions
# View node conditions
kubectl describe node <node-name> | grep -A20 "Conditions"
# DiskPressure: Node disk usage exceeds threshold
# MemoryPressure: Node available memory is insufficient
# PIDPressure: Too many processes
# Adjust kubelet eviction thresholds (not recommended to lower in production)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
  nodefs.available: "20%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
kubectl Debugging Cheat Sheet
Resource Viewing
| Command | Purpose |
|---|---|
| kubectl get all -n <ns> | View all resources in namespace |
| kubectl get pods -o wide | View Pods with node info |
| kubectl get pods --show-labels | View Pod labels |
| kubectl top pods --sort-by=memory | Sort by memory |
| kubectl api-resources | View all API resource types |
Debugging and Troubleshooting
| Command | Purpose |
|---|---|
| kubectl logs <pod> -c <container> --previous | Previous container logs |
| kubectl logs <pod> --since=1h | Logs from last 1 hour |
| kubectl logs <pod> -f --tail=100 | Live tail last 100 lines |
| kubectl exec -it <pod> -- /bin/sh | Enter container terminal |
| kubectl debug <pod> -it --image=busybox | Temporary debug container |
| kubectl port-forward svc/<svc> 8080:80 | Port forward for local debugging |
Cluster Management
| Command | Purpose |
|---|---|
| kubectl cordon <node> | Mark node unschedulable |
| kubectl drain <node> --ignore-daemonsets | Evict all Pods from node |
| kubectl uncordon <node> | Restore node scheduling |
| kubectl taint nodes <node> key=value:NoSchedule | Add taint |
| kubectl label node <node> key=value | Add label |
Logs and Events Analysis
Centralized Log Collection
# Aggregate logs from multi-container Pod
kubectl logs <pod-name> --all-containers=true
# Batch view using label selector
kubectl logs -l app=my-app -n <namespace> --since=5m
# Export logs to file for analysis
kubectl logs <pod-name> -n <namespace> > pod-debug.log
Event Analysis
# View all namespace events (sorted by time)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Filter by type
kubectl get events -n <namespace> --field-selector type=Warning
# Continuously monitor events
kubectl get events -n <namespace> --watch
# View events for specific resource
kubectl events --for pod/<pod-name> -n <namespace>
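For repeated triage, the same filtering can be scripted against the JSON output of kubectl get events -o json. The sketch below keeps only Warning events and orders them by lastTimestamp; `warning_events` is a hypothetical helper, and it assumes the core/v1 Event fields type, reason, message, lastTimestamp, and involvedObject are present.

```python
import json


def warning_events(events_json: str) -> list:
    """Return one formatted line per Warning event, oldest first."""
    items = json.loads(events_json).get("items", [])
    warnings = [event for event in items if event.get("type") == "Warning"]
    # RFC 3339 UTC timestamps sort correctly as plain strings
    warnings.sort(key=lambda event: event.get("lastTimestamp") or "")
    lines = []
    for event in warnings:
        obj = event.get("involvedObject", {})
        lines.append(
            f'{event.get("lastTimestamp")} {obj.get("kind")}/{obj.get("name")} '
            f'{event.get("reason")}: {event.get("message")}'
        )
    return lines
```

To use it, save the output of kubectl get events -n <namespace> -o json to a file and pass the file contents to the function.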
Log Levels and Formats
# API Server audit logs
# (the path depends on the API server's --audit-log-path setting)
jq 'select(.responseStatus.code >= 400)' /var/log/kubernetes/audit.log
# Kubelet logs
journalctl -u kubelet -f
# etcd logs
kubectl logs -n kube-system etcd-<node-name> --since=10m
Resource Limits and Requests Best Practices
QoS Classes
| QoS Class | Condition | Eviction Priority | Use Case |
|---|---|---|---|
| Guaranteed | requests == limits (CPU+memory) | Lowest (killed last) | Core services |
| Burstable | At least one request or limit set, but not Guaranteed (e.g. requests < limits) | Medium | General services |
| BestEffort | No requests/limits set | Highest (killed first) | Batch jobs |
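The classification rules in the table can be expressed as a small function, which is handy for checking manifests before they are deployed. The sketch below is a simplified model: `qos_class` is a hypothetical helper, it compares quantity strings literally (so "500m" and "0.5" would be treated as different), and it only looks at cpu and memory.

```python
def qos_class(containers: list) -> str:
    """Classify a Pod from its containers' resources dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pod = [{"requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"}}]
print(qos_class(pod))
```

Every container in the Pod must satisfy the Guaranteed condition; a single container with requests below limits makes the whole Pod Burstable.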
# Production recommended configuration template
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
# Use LimitRange to set namespace defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
Common Resource Issues
# View node resource usage
kubectl top nodes
kubectl describe node <node-name> | grep -A10 "Allocated resources"
# View namespace resource quotas
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>
NetworkPolicy Debugging
Policy Blocking Troubleshooting
# View all NetworkPolicies in namespace
kubectl get networkpolicy -n <namespace>
# View policy details
kubectl describe networkpolicy <policy-name> -n <namespace>
# Common NetworkPolicy configuration error: an Egress policy that forgets to allow DNS traffic
# The policy below allows DNS (port 53, UDP and TCP) to kube-dns Pods in any namespace;
# add further egress rules for the destinations the application itself needs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
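A frequent source of confusion in rules like this is YAML list structure: namespaceSelector and podSelector inside one list element must both match (AND), while placing them in two separate list elements means either may match (OR). The sketch below models that difference; `peer_allows` and `labels_match` are hypothetical helpers that handle only matchLabels and ignore ipBlock and matchExpressions.

```python
def labels_match(match_labels: dict, labels: dict) -> bool:
    """An empty selector matches everything."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def peer_allows(peer: dict, pod_labels: dict, namespace_labels: dict,
                same_namespace: bool) -> bool:
    """Evaluate one element of a NetworkPolicy 'to' or 'from' list."""
    ns_sel = peer.get("namespaceSelector")
    pod_sel = peer.get("podSelector")
    if ns_sel is None and pod_sel is None:
        return False  # ipBlock peers are out of scope for this sketch
    if ns_sel is not None:
        if not labels_match(ns_sel.get("matchLabels", {}), namespace_labels):
            return False
    elif not same_namespace:
        # podSelector alone only selects Pods in the policy's own namespace
        return False
    if pod_sel is not None:
        return labels_match(pod_sel.get("matchLabels", {}), pod_labels)
    return True


dns_labels = {"k8s-app": "kube-dns"}
web_labels = {"app": "web"}
ns_labels = {"kubernetes.io/metadata.name": "kube-system"}

# One element: namespaceSelector AND podSelector
combined = [{"namespaceSelector": {}, "podSelector": {"matchLabels": dns_labels}}]
# Two elements: namespaceSelector OR podSelector
split = [{"namespaceSelector": {}}, {"podSelector": {"matchLabels": dns_labels}}]

for name, peers in (("combined", combined), ("split", split)):
    allowed = any(peer_allows(p, web_labels, ns_labels, False) for p in peers)
    print(name, "allows app=web Pod:", allowed)
```

The split form allows traffic to every Pod in every namespace, which is usually far broader than intended.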
Network Connectivity Testing
# Deploy network debugging tool
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never
# Test inside debug Pod
# Same namespace
curl http://<service-name>:<port>/healthz
# Cross namespace
curl http://<service-name>.<namespace>.svc.cluster.local:<port>/healthz
# Test external connectivity
curl -I https://external-api.example.com
Common Helm Chart Issues
Release Installation Failure
# View Release status
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>
# View rendered YAML (without installing)
helm template <chart> --debug
# View failed Release values
helm get values <release-name> -n <namespace>
# Rollback Helm Release
helm rollback <release-name> <revision> -n <namespace>
Common Helm Errors
| Error | Cause | Fix |
|---|---|---|
| rendered manifests contain a resource that already exists | Resource conflict | Use --replace or clean up old resources |
| Release "xxx" has not been installed yet | Release doesn't exist | Check namespace and release name |
| timed out waiting for the condition | Pod not Ready | Check --timeout value and Pod status |
| YAML parse error | values.yaml syntax error | Validate with helm lint |
# Clean up failed Release
helm uninstall <release-name> -n <namespace>
# Force install (dev environment)
helm install <release-name> <chart> --replace -n <namespace>
Prometheus Monitoring Alerts
Key Alert Rules
groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
      - alert: NodeNotReady
        # status="true" == 0 covers both the "false" and "unknown" states
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is >85% full"
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
Common PromQL Queries
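# Pod memory utilization (containers without a memory limit are excluded)
container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) * 100
# Node CPU utilization (percent, averaged across cores per node)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))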
# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# PVC utilization
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
Emergency Response Playbook
P0: Cluster Unavailable
# 1. Check API Server (componentstatuses is deprecated since v1.19)
kubectl get --raw='/readyz?verbose'
curl -k https://<api-server>:6443/healthz
# 2. Check etcd
kubectl logs -n kube-system etcd-<node> --tail=50
ETCDCTL_API=3 etcdctl endpoint health --cluster
# 3. Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago"
# 4. Emergency recovery (last resort): Restart the kubelet and container runtime on the affected node
systemctl restart kubelet
systemctl restart containerd
P1: Service Fully Unavailable
# 1. Quick rollback
kubectl rollout undo deployment/<name> -n <namespace>
# 2. Emergency scale up
kubectl scale deployment/<name> --replicas=10 -n <namespace>
# 3. Emergency node maintenance
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
P2: Storage Failure
# 1. Check PV status
kubectl get pv | grep -v Bound
# 2. Check CSI Driver
kubectl get pods -n <csi-namespace>
# 3. Emergency: Force delete stuck Pod
kubectl delete pod <pod-name> --force --grace-period=0 -n <namespace>
FAQ
Q: Pod stuck in ContainerCreating state, what to do?
A: Usually caused by a slow image pull or a volume mount failure (for example a missing ConfigMap or Secret, or a CSI attach error). Use kubectl describe pod to check Events, focusing on image pull and volume mount events.
Q: How to view logs of a deleted Pod?
A: If log aggregation (EFK/PLG) is configured, query through the log platform. Otherwise, deleted Pod logs cannot be recovered — always deploy a log collection solution in production.
Q: kubectl command times out, how to troubleshoot?
A: Usually API Server overload or network issues. Check the server address in ~/.kube/config, test connectivity with kubectl get --request-timeout=5s nodes.
Q: How to quickly identify which node has insufficient resources?
A: Use kubectl top nodes for real-time resources, or kubectl describe node | grep -A5 "Allocated resources" for allocated amounts.
Q: Service ClusterIP unreachable, how to troubleshoot?
A: Check in order: ① kube-proxy running normally ② iptables/ipvs rules correct ③ Endpoints not empty ④ NetworkPolicy not blocking ⑤ CNI plugin status.
Q: How to prevent Pods from being unexpectedly evicted?
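A: Set a PodDisruptionBudget to guarantee minimum available replicas for critical services during voluntary disruptions such as kubectl drain. A PDB does not protect against node-pressure eviction; for that, set resource requests and prefer the Guaranteed QoS class.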
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
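With an integer minAvailable, the number of Pods that may be evicted at once is the healthy Pod count minus minAvailable, never below zero. The sketch below shows the arithmetic; `allowed_disruptions` is a hypothetical helper and does not cover percentage values or maxUnavailable.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of Pods that can be evicted without violating the budget."""
    return max(0, healthy_pods - min_available)


for healthy in (3, 2, 1):
    print(f"{healthy} healthy, minAvailable=2 ->", allowed_disruptions(healthy, 2))
```

With minAvailable: 2, a Deployment running only 2 replicas allows zero disruptions, so kubectl drain will block on that node until another replica becomes healthy elsewhere.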
Q: Helm upgrade failed but Release state is stuck, what to do?
A: Use helm history to find the last successful revision, then run helm rollback <release> <last-good-revision>. This also recovers a Release that is stuck in a pending-install or pending-upgrade state.