What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its capability to withstand turbulent conditions in production. It was pioneered by Netflix, which created Chaos Monkey to randomly terminate production instances and verify that the system remained available.
The core insight: rather than waiting for failures to happen unexpectedly, proactively inject failures and observe how the system responds. This is fundamentally different from traditional testing — you are testing the system’s resilience, not its functionality.
Principles of Chaos Engineering
- Build a hypothesis around steady state behavior. Define what “normal” looks like: error rate, latency, throughput.
- Vary real-world events. Inject failures that actually happen: server crashes, network partitions, disk full, high latency.
- Run experiments in production. The closer the environment is to production, the more meaningful the results. Start in staging, then graduate to production once experiments pass there.
- Automate experiments to run continuously. One-time experiments are useful; continuous experiments build ongoing confidence.
- Minimize blast radius. Start small — one instance, one region, a small percentage of traffic.
The Chaos Experiment Lifecycle
Step 1: Define Steady State
Measurable indicators of normal system behavior:
- Error rate < 0.1%
- P95 response time < 200ms
- All health checks passing
- Order completion rate > 99%
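One way to make these indicators checkable is to encode them as thresholds. The sketch below is illustrative only: the `Snapshot` type, its field names, and the `violations` helper are assumptions, and the numbers simply mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One observation of the system (hypothetical field names)."""
    error_rate: float             # fraction of failed requests, 0.001 == 0.1%
    p95_ms: float                 # 95th percentile response time in milliseconds
    health_checks_passing: bool
    order_completion_rate: float  # fraction of started orders that complete


def violations(s: Snapshot) -> list[str]:
    """Return the steady-state indicators that the snapshot violates."""
    failed = []
    if not s.error_rate < 0.001:
        failed.append("error rate")
    if not s.p95_ms < 200:
        failed.append("P95 response time")
    if not s.health_checks_passing:
        failed.append("health checks")
    if not s.order_completion_rate > 0.99:
        failed.append("order completion rate")
    return failed


healthy = Snapshot(0.0004, 150.0, True, 0.995)
degraded = Snapshot(0.0004, 340.0, True, 0.98)
print(violations(healthy))   # []
print(violations(degraded))  # ['P95 response time', 'order completion rate']
```

An empty list means the system is in steady state; anything else names the indicator to investigate before (or during) an experiment.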
Step 2: Hypothesize
“If we terminate one instance of the payment service, the system will continue processing payments with no increase in error rate because the load balancer routes to healthy instances.”
Step 3: Design the Experiment
- Target: Payment service, one instance
- Failure type: Process termination (kill -9)
- Duration: 5 minutes
- Blast radius: 1 of 4 instances (25% capacity reduction)
- Abort criteria: Error rate > 1% or P95 latency > 1 second
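The blast radius figure is worth checking against capacity before running the experiment. Assuming traffic is spread evenly and the surviving instances absorb the load of the terminated one, per-instance utilization rises by a factor of n/(n-1). The helper name and the 60% starting utilization below are illustrative assumptions, not values from a real system.

```python
def utilization_after_loss(current: float, total: int, lost: int) -> float:
    """Per-instance utilization once `lost` of `total` instances are gone.

    Assumes evenly balanced traffic and that total load stays constant.
    """
    if not 0 < lost < total:
        raise ValueError("lost must be between 1 and total - 1")
    return current * total / (total - lost)


# 4 instances at 60% utilization, 1 terminated: survivors run at about 80%.
after = utilization_after_loss(0.60, total=4, lost=1)
print(f"{after:.0%}")  # 80%
```

If the result approaches 100%, the experiment is likely to trip its abort criteria from overload alone, which is useful to know before injecting anything.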
Step 4: Execute
Run the experiment with monitoring active. Have a kill switch ready to stop the experiment immediately if abort criteria are met.
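A kill switch can be as simple as a loop that polls metrics and rolls back as soon as an abort criterion trips. The following is a minimal sketch, not a real tool's API: `run_experiment`, the `inject`/`rollback`/`read_metrics` callables, and the simulated metric values are all hypothetical. The thresholds follow the abort criteria from Step 3.

```python
import time
from typing import Callable

Metrics = dict[str, float]  # hypothetical shape: {"error_rate": ..., "p95_ms": ...}


def should_abort(m: Metrics) -> bool:
    """Abort criteria from Step 3: error rate > 1% or P95 latency > 1 second."""
    return m["error_rate"] > 0.01 or m["p95_ms"] > 1000


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_metrics: Callable[[], Metrics],
    duration_s: float,
    poll_s: float = 1.0,
) -> str:
    """Inject a fault, watch metrics, and always roll back at the end."""
    inject()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            if should_abort(read_metrics()):
                return "aborted"
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()  # runs on abort, completion, and unexpected errors


# Simulated run: the third reading breaches the error-rate threshold.
readings = iter([
    {"error_rate": 0.001, "p95_ms": 180.0},
    {"error_rate": 0.004, "p95_ms": 420.0},
    {"error_rate": 0.020, "p95_ms": 650.0},
])
events: list[str] = []
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_metrics=lambda: next(readings),
    duration_s=5.0,
    poll_s=0.01,
)
print(result, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in a `finally` block is the design choice that matters here: the fault is removed even if the monitoring call itself fails.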
Step 5: Analyze
Compare metrics during and after the experiment with the steady state:
- Did error rate increase? By how much?
- Did response time increase? For how long?
- Did the system recover automatically? How quickly?
- Did monitoring detect the issue? Did alerts fire?
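The first three questions can be answered from a time series of samples. The sketch below assumes the samples are `(seconds_since_injection, error_rate)` pairs exported from your monitoring; the function name and the sample data are made up for illustration.

```python
from typing import Optional


def analyze(
    samples: list[tuple[float, float]], threshold: float
) -> dict[str, Optional[float]]:
    """Summarize peak error rate and when the system breached and recovered.

    `samples` are (seconds since injection, error rate) pairs in time order.
    """
    peak = max(rate for _, rate in samples)
    breached = [t for t, rate in samples if rate > threshold]
    if not breached:
        return {"peak": peak, "first_breach_s": None, "recovered_s": None}
    # Recovery = first sample after the last breach that is back under threshold.
    after = [t for t, rate in samples if t > breached[-1] and rate <= threshold]
    return {
        "peak": peak,
        "first_breach_s": breached[0],
        "recovered_s": after[0] if after else None,
    }


samples = [(0.0, 0.0005), (10.0, 0.004), (20.0, 0.003), (30.0, 0.0008), (40.0, 0.0005)]
report = analyze(samples, threshold=0.001)
print(report)  # {'peak': 0.004, 'first_breach_s': 10.0, 'recovered_s': 30.0}
```

A `recovered_s` of `None` after a breach means the system had not returned to steady state by the end of the observation window.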
Step 6: Fix and Repeat
If the system did not behave as expected, fix the weakness and repeat the experiment to verify the fix.
Types of Chaos Experiments
| Experiment | What It Tests | Example |
|---|---|---|
| Instance termination | Auto-scaling, load balancing | Kill a random pod/VM |
| Network latency | Timeout handling, retries | Add 500ms latency between services |
| Network partition | Split-brain handling, consistency | Block traffic between two services |
| Disk full | Logging, data handling | Fill disk to 100% |
| CPU/Memory stress | Throttling, resource limits | Consume 90% CPU on a node |
| DNS failure | Fallback mechanisms | Block DNS resolution |
| Dependency failure | Circuit breakers, fallbacks | Make a third-party API unavailable |
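Before reaching for infrastructure-level tooling, the latency and dependency-failure rows can be approximated inside a single process. The decorator below is a hypothetical sketch for local experiments; `inject_fault`, `get_price`, and `check_payment_api` are illustrative names, and the 500 ms delay mirrors the table's example.

```python
import random
import time
from functools import wraps


def inject_fault(latency_s: float = 0.0, failure_rate: float = 0.0, rng=random):
    """Decorator that delays calls and fails a fraction of them."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # simulated network latency
            if rng.random() < failure_rate:
                raise ConnectionError("injected dependency failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(latency_s=0.5)
def get_price(sku: str) -> float:
    return 9.99  # stand-in for a real downstream call


@inject_fault(failure_rate=1.0)
def check_payment_api() -> str:
    return "ok"  # never reached while the failure rate is 100%


print(get_price("sku-1"))  # returns after roughly half a second
try:
    check_payment_api()
except ConnectionError as exc:
    print("caught:", exc)  # caught: injected dependency failure
```

This only exercises the caller's timeout and error handling; it says nothing about how the network or the platform behaves, which is what the infrastructure-level experiments in the table cover.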
Chaos Engineering Tools
| Tool | Type | Best For |
|---|---|---|
| Chaos Monkey | Open source (Netflix) | Random instance termination |
| Litmus | Open source (CNCF) | Kubernetes-native chaos experiments |
| Gremlin | SaaS | Enterprise chaos-as-a-service |
| Chaos Toolkit | Open source | Framework for defining experiments in JSON/YAML |
| Toxiproxy | Open source (Shopify) | Network condition simulation |
| AWS Fault Injection Service (formerly Fault Injection Simulator) | AWS service | AWS-specific fault injection |
QA’s Role in Chaos Engineering
QA engineers bring unique value to chaos engineering:
- Test design skills: QA knows how to design experiments that reveal weaknesses
- Monitoring knowledge: QA understands which metrics indicate real problems
- Risk assessment: QA can evaluate which experiments are safe to run and in what order
- Validation: QA verifies that fixes actually resolve the discovered weaknesses
Exercise: Design a Chaos Experiment
Your e-commerce application has: API gateway, product service, inventory service, cart service, payment service, notification service, PostgreSQL, Redis.
Design three chaos experiments in order of increasing risk.
Solution
Experiment 1: Redis Cache Failure (Low Risk)
Hypothesis: If Redis becomes unavailable, the application continues serving requests (with degraded performance) by falling back to database queries.
Steady state: P95 < 200ms, error rate < 0.1%, product pages load successfully
Injection: Stop Redis container for 5 minutes
Expected behavior: Response time increases to 500-800ms (database fallback), no errors, product pages still load
Abort criteria: Error rate > 1% or product pages return 500 errors
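The hypothesis only holds if the read path treats a cache outage as a miss rather than an error. The sketch below shows that pattern with stand-in classes; `FlakyCache`, `get_product`, and the dictionary-backed database are hypothetical and do not use a real Redis client.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when it cannot be reached."""


class FlakyCache:
    """In-memory stand-in for a cache that can be switched off."""

    def __init__(self) -> None:
        self.available = True
        self._data: dict[str, str] = {}

    def get(self, key: str):
        if not self.available:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable(key)
        self._data[key] = value


def get_product(product_id: str, cache: FlakyCache, db: dict[str, str]) -> str:
    """Read-through lookup that degrades to the database if the cache is down."""
    try:
        cached = cache.get(product_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return db[product_id]  # slower path, but no error for the user
    value = db[product_id]
    try:
        cache.set(product_id, value)
    except CacheUnavailable:
        pass  # cache went down mid-request; still serve the value
    return value


db = {"p1": "Blue mug"}
cache = FlakyCache()
print(get_product("p1", cache, db))  # Blue mug (miss, then cached)
cache.available = False              # the "injection"
print(get_product("p1", cache, db))  # Blue mug (served from the database)
```

If the real code path lets the cache exception propagate instead, the experiment will show 500 errors rather than higher latency, which is the weakness this experiment is designed to surface.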
Experiment 2: Payment Service Instance Failure (Medium Risk)
Hypothesis: If one of three payment service instances dies, the load balancer routes to healthy instances with no failed payments.
Steady state: Payment success rate > 99.5%, P95 < 500ms
Injection: Kill one payment service pod (kubectl delete pod)
Expected behavior: Kubernetes starts a replacement pod within 30 seconds, assuming the pod is managed by a Deployment or similar controller. During that time, the remaining instances handle traffic. No failed payments.
Abort criteria: Payment success rate < 98% or P95 > 2 seconds
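The "no failed payments" expectation depends on the load balancer skipping the dead instance. A toy round-robin balancer illustrates the behavior being tested; the class and instance names are hypothetical. A real balancer learns about failure through health checks with some delay, and that detection window is what this experiment probes.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy balancer that rotates over instances and skips unhealthy ones."""

    def __init__(self, instances: list[str]) -> None:
        self._instances = instances
        self._healthy = set(instances)
        self._ring = cycle(instances)

    def mark_down(self, instance: str) -> None:
        self._healthy.discard(instance)

    def pick(self) -> str:
        # At most one full rotation is needed to find a healthy instance.
        for _ in range(len(self._instances)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances")


lb = RoundRobinBalancer(["pay-1", "pay-2", "pay-3"])
lb.mark_down("pay-2")  # the killed pod
print([lb.pick() for _ in range(4)])  # ['pay-1', 'pay-3', 'pay-1', 'pay-3']
```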
Experiment 3: Network Partition Between Services (Higher Risk)
Hypothesis: If the product service cannot reach the inventory service, it serves cached inventory data and marks items as “check availability” instead of showing “in stock/out of stock.”
Steady state: Product pages show accurate inventory, error rate < 0.1%
Injection: Block network between product-service and inventory-service for 3 minutes
Expected behavior: Product pages load, show cached inventory or “check availability” message. No 500 errors.
Abort criteria: Error rate > 2% or product pages fail to load
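The degraded behavior in this hypothesis could look like the following sketch: use the last known stock level when the inventory call fails, and fall back to "check availability" when nothing is cached. The function name, the `fetch_stock` callable, and the dictionary cache are assumptions made for illustration.

```python
from typing import Callable, Optional


def availability_label(
    sku: str,
    fetch_stock: Callable[[str], int],
    cached_stock: dict[str, int],
) -> str:
    """Label for a product page that never fails because inventory is unreachable."""
    try:
        stock: Optional[int] = fetch_stock(sku)
    except ConnectionError:
        stock = cached_stock.get(sku)  # last known value, may be stale
        if stock is None:
            return "check availability"
    else:
        cached_stock[sku] = stock      # refresh the cache on success
    return "in stock" if stock > 0 else "out of stock"


def partitioned(sku: str) -> int:
    """Simulates the network partition: every inventory call fails."""
    raise ConnectionError("inventory-service unreachable")


stock_cache = {"mug": 3}
print(availability_label("mug", partitioned, stock_cache))   # in stock
print(availability_label("lamp", partitioned, stock_cache))  # check availability
```

Serving a stale "in stock" label is a product decision as much as a technical one; the experiment shows which behavior the system actually has.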
Key Takeaways
- Chaos engineering is proactive resilience testing — find weaknesses before they find you
- Always define steady state first — you cannot detect failure without knowing what normal looks like
- Start small and escalate — begin in staging, with small blast radius, then move to production
- Automate and repeat — one-time experiments are good; continuous experiments are better
- QA brings unique value — test design skills and monitoring knowledge are exactly what chaos engineering needs