5 Kubernetes Lessons I Learned the Hard Way
Production incidents taught me more than any tutorial. Here are the critical Kubernetes lessons that will save you from midnight debugging sessions.

After managing Kubernetes clusters in production for several years, I've accumulated some battle scars. Here are the lessons that cost me sleep but saved future headaches.
1. Resource Limits Are Not Optional
The Mistake: I deployed a service without memory limits. During a traffic spike, it consumed all available memory, taking down the entire node.
# DON'T DO THIS
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
The Fix: Always set resource requests and limits.
# DO THIS
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
2. Readiness vs Liveness Probes
The Mistake: Using the same endpoint for both probes caused cascading failures during deployment.
The Solution:
- Liveness: Is the container alive? (the kubelet restarts it if this fails)
- Readiness: Can it handle traffic? (the pod is removed from Service endpoints if this fails)
These should be different checks with different failure tolerances.
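Here's a minimal sketch of what that separation can look like in a container spec. The /healthz and /ready paths, the port, and the timing values are illustrative assumptions, not the actual config from the incident:
livenessProbe:
  httpGet:
    path: /healthz          # hypothetical: checks only that the process is responsive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3       # tolerate a few blips before restarting
readinessProbe:
  httpGet:
    path: /ready            # hypothetical: also checks dependencies (DB, cache)
    port: 8080
  periodSeconds: 5
  failureThreshold: 1       # pull the pod from the Service quickly when it can't serve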
3. PodDisruptionBudgets Save Lives
During cluster upgrades, all replicas went down simultaneously. PDBs ensure minimum availability:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
4. Network Policies Are Your Friend
The Wake-Up Call: A compromised service accessed our database directly. Network policies would have prevented this.
Start with a default deny, then explicitly allow the traffic you need. Here's the allow rule that lets only the API reach the database (the default-deny policy follows it):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-db
spec:
  podSelector:
    matchLabels:
      app: database
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
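And the default-deny half that the allow rule above builds on. This is the standard pattern, scoped to whatever namespace you apply it in:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}           # matches every pod in the namespace
  policyTypes:
  - Ingress                 # no ingress rules listed, so all inbound traffic is denied
Apply the same idea with Egress if you also want to control outbound traffic.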
5. Horizontal Pod Autoscaling Needs Metrics
HPA without proper metrics is useless. We learned this during Black Friday when manual scaling was too slow.
Required:
- Install metrics-server
- Define meaningful metrics, not just CPU (see the custom-metric sketch after the manifest below)
- Test autoscaling before you need it
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
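CPU utilization alone rarely reflects real load. As a sketch of a more meaningful signal, the same HPA can target a Pods metric instead; this assumes you expose something like http_requests_per_second through an adapter such as prometheus-adapter (the metric name and threshold are hypothetical):
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # hypothetical metric served by prometheus-adapter
    target:
      type: AverageValue
      averageValue: "100"              # scale out when pods average more than ~100 req/s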
Bonus: Logging and Monitoring
You can't debug what you can't see. Set up proper logging and monitoring before you need it:
- Centralized logging (ELK/Loki)
- Metrics (Prometheus)
- Tracing (Jaeger)
- Alerting (Alertmanager), wired to rules like the sketch below
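Alerting rules are just YAML too. Here's a sketch of a basic error-rate alert in Prometheus's rule format; the http_requests_total metric, labels, and thresholds are assumptions about your own instrumentation:
groups:
- name: availability
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 10m                           # sustained for 10 minutes before paging
    labels:
      severity: page
    annotations:
      summary: "More than 5% of requests are returning 5xx"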
Conclusion
Production is the best teacher, but it doesn't have to be painful. Learn from my mistakes:
- Set resource limits
- Understand your probes
- Protect with PDBs
- Secure with network policies
- Scale with HPA
Have your own K8s war stories? Let me know in the comments or reach out!
Want more infrastructure deep-dives? Follow for upcoming posts on service mesh, observability, and cost optimization.



