What is Chaos Engineering? How do you practice it safely in production?

What Chaos Engineering Is

Chaos engineering is the practice of deliberately injecting failures into a system to verify its resilience and fault tolerance, so that weaknesses are found before a real outage exposes them.

Netflix pioneered the practice around 2011. Its best-known tool is Chaos Monkey, which randomly terminates EC2 instances in production.

Core Principles

1. Define steady state: Identify measurable metrics that characterize normal system operation (for example, P99 latency and error rate)

2. Design experiments: Form a hypothesis about how the system should behave, then introduce failures that mirror real-world events (instance crashes, network faults, dependency outages)

3. Run in production (or as close to production as possible): Staging environments can't fully replicate production traffic and complexity

4. Minimize blast radius: Start small (for example, expose 1% of traffic first), then expand gradually

5. Automate continuous execution: Integrate chaos experiments into CI/CD so that resilience is re-verified after every change
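
Principles 1 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `SteadyState` thresholds, the function names, and the hash-based traffic selection are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that describe normal operation."""
    max_p99_latency_ms: float
    max_error_rate: float


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def within_steady_state(latencies_ms, errors, total, limits):
    """True if the observed window satisfies the steady-state hypothesis."""
    error_rate = errors / total if total else 0.0
    return (p99(latencies_ms) <= limits.max_p99_latency_ms
            and error_rate <= limits.max_error_rate)


def in_blast_radius(request_id, percent):
    """Deterministically select roughly `percent`% of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the bucket for a given request ID never changes, raising `percent` from 1 to 10 only adds requests to the experiment; every request that was already affected stays affected, which keeps a gradual ramp-up predictable.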

Common Fault Types

Fault Type	Examples
Resource exhaustion	High CPU/memory load
Network failure	Latency injection, packet loss, network partition
Service dependencies	Kill downstream services, simulate timeouts
Infrastructure	Randomly terminate Pods/VMs, availability zone failure
Data layer	Slow query injection into the database
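
As a rough illustration of the network and dependency rows in the table, the sketch below wraps a call to a downstream dependency so that it is delayed and fails with a chosen probability. The names `with_faults` and `InjectedFault` are hypothetical, and dedicated chaos tools usually inject such faults at the network or infrastructure layer rather than in application code.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_faults(call, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so it is delayed and fails with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)             # latency injection
        if rng.random() < failure_rate:  # simulated dependency failure
            raise InjectedFault("injected failure")
        return call(*args, **kwargs)

    return wrapped
```

For example, `with_faults(fetch_profile, latency_s=0.3, failure_rate=0.05)` would return a version of a hypothetical `fetch_profile` function that adds 300 ms to every call and fails about 5% of the time, which is enough to exercise the caller's timeout and retry handling.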

Practicing Safely in Production

Start in non-production environments and build confidence before moving to production.
Monitor continuously during experiments, and stop immediately if anomalies are detected.
Notify relevant teams in advance to avoid unnecessary alarm.
Always have a rollback plan.
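
The monitoring, notification, and rollback practices can be combined into one small experiment runner. The sketch below is only one possible shape: `run_experiment` and its callback parameters (`inject`, `rollback`, `is_healthy`, `notify`) are hypothetical names, and a real runner would wait between health checks instead of polling back to back.

```python
def run_experiment(inject, rollback, is_healthy, notify, checks=10):
    """Run one experiment; return True if the system stayed healthy throughout."""
    notify("chaos experiment starting")
    if not is_healthy():
        # Never inject faults into a system that is already degraded.
        notify("aborted: system unhealthy before injection")
        return False
    try:
        inject()
        for _ in range(checks):
            if not is_healthy():
                notify("aborted: anomaly detected")
                return False
        return True
    finally:
        # Runs on success, on abort, and on unexpected exceptions alike,
        # so `rollback` should be safe to call more than once.
        rollback()
        notify("chaos experiment finished, fault rolled back")
```

Putting the rollback in a `finally` clause means the fault is removed no matter how the experiment ends, so the rollback plan does not depend on someone remembering to run it.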