Chaos Engineering

Apply chaos engineering principles to improve system resilience. Learn fault injection, chaos experiments, and tools like Chaos Monkey and Litmus.

What Is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. It was pioneered by Netflix, which created Chaos Monkey to randomly terminate production instances and verify that the system remained available.

The core insight: rather than waiting for failures to happen unexpectedly, inject failures deliberately and observe how the system responds. This differs from traditional testing: you are testing the system’s resilience, not its functionality.

Principles of Chaos Engineering

Build a hypothesis around steady state behavior. Define what “normal” looks like: error rate, latency, throughput.
Vary real-world events. Inject failures that actually happen: server crashes, network partitions, disk full, high latency.
Run experiments in production. The closer to production, the more meaningful the results. Start in staging, graduate to production.
Automate experiments to run continuously. One-time experiments are useful; continuous experiments build ongoing confidence.
Minimize blast radius. Start small — one instance, one region, a small percentage of traffic.
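
One way to make the blast-radius principle concrete is to cap how many instances an experiment is allowed to touch. The sketch below is illustrative only: the helper name pick_targets and the 25% default are assumptions, not part of any specific chaos tool.

```python
import math
import random


def pick_targets(instances, max_fraction=0.25, seed=None):
    """Pick a random subset of instances, capped at max_fraction of the fleet.

    At least one instance is always chosen so the experiment has a target.
    """
    if not instances:
        return []
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Round down so the cap is never exceeded, but keep at least one target.
    limit = max(1, math.floor(len(instances) * max_fraction))
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(list(instances), limit)


fleet = ["pay-1", "pay-2", "pay-3", "pay-4"]
print(pick_targets(fleet, max_fraction=0.25, seed=7))
```

Passing a seed is useful when an experiment needs to be repeated against the same targets after a fix.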

The Chaos Experiment Lifecycle

Step 1: Define Steady State

Measurable indicators of normal system behavior:

Error rate < 0.1%
P95 response time < 200ms
All health checks passing
Order completion rate > 99%
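
The four indicators above can be expressed as an automated check. This is a minimal sketch: the Metrics class and the function name are hypothetical, and in practice the values would come from your monitoring system rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float             # fraction of requests that failed (0.001 == 0.1%)
    p95_latency_ms: float         # 95th percentile response time in milliseconds
    health_checks_passing: bool   # True only if every health check passes
    order_completion_rate: float  # fraction of orders completed (0.99 == 99%)


def steady_state_violations(m: Metrics) -> list:
    """Return a description of every steady-state indicator that is out of bounds."""
    violations = []
    if not m.error_rate < 0.001:
        violations.append(f"error rate {m.error_rate:.2%} is not below 0.1%")
    if not m.p95_latency_ms < 200:
        violations.append(f"P95 {m.p95_latency_ms:.0f}ms is not below 200ms")
    if not m.health_checks_passing:
        violations.append("one or more health checks failing")
    if not m.order_completion_rate > 0.99:
        violations.append(
            f"order completion {m.order_completion_rate:.1%} is not above 99%"
        )
    return violations


print(steady_state_violations(Metrics(0.0005, 150.0, True, 0.995)))
```

An empty list means the system is in steady state and an experiment can start.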

Step 2: Hypothesize

“If we terminate one instance of the payment service, the system will continue processing payments with no increase in error rate because the load balancer routes to healthy instances.”

Step 3: Design the Experiment

Target: Payment service, one instance
Failure type: Process termination (kill -9)
Duration: 5 minutes
Blast radius: 1 of 4 instances (25% capacity reduction)
Abort criteria: Error rate > 1% or P95 latency > 1 second
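
Writing the design down as data makes it reviewable and lets the abort criteria be evaluated by code instead of by eye. The class and field names below are assumptions chosen for illustration; the values are the ones from the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    target: str
    failure_type: str
    duration_s: int
    instances_affected: int
    instances_total: int
    abort_error_rate: float  # abort when the error rate exceeds this fraction
    abort_p95_ms: float      # abort when P95 latency exceeds this many ms

    @property
    def capacity_reduction(self) -> float:
        """Fraction of capacity removed by the experiment."""
        return self.instances_affected / self.instances_total

    def should_abort(self, error_rate: float, p95_ms: float) -> bool:
        """True as soon as either abort criterion is met."""
        return error_rate > self.abort_error_rate or p95_ms > self.abort_p95_ms


payment_kill = Experiment(
    target="payment-service",
    failure_type="process termination (kill -9)",
    duration_s=300,
    instances_affected=1,
    instances_total=4,
    abort_error_rate=0.01,
    abort_p95_ms=1000,
)
print(f"capacity reduction: {payment_kill.capacity_reduction:.0%}")
```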

Step 4: Execute

Run the experiment with monitoring active. Have a kill switch ready to stop the experiment immediately if abort criteria are met.
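
A kill switch can be as simple as a polling loop that rolls back the fault as soon as the abort criteria are met. The following is a sketch under assumptions: inject, rollback, read_metrics, and should_abort are hypothetical callables you would supply, and the sleep and clock parameters exist only so the loop can be exercised without waiting.

```python
import time


def run_experiment(inject, rollback, read_metrics, should_abort,
                   duration_s, poll_interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Inject a fault, poll metrics until the duration elapses, then roll back.

    Returns "aborted" if should_abort(metrics) became true, else "completed".
    The rollback always runs, even if polling raises an exception.
    """
    inject()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            if should_abort(read_metrics()):
                return "aborted"
            sleep(poll_interval_s)
        return "completed"
    finally:
        rollback()  # the kill switch: the fault is removed on every exit path


if __name__ == "__main__":
    outcome = run_experiment(
        inject=lambda: print("fault injected"),
        rollback=lambda: print("fault removed"),
        read_metrics=lambda: {"error_rate": 0.002},
        should_abort=lambda m: m["error_rate"] > 0.01,
        duration_s=0.2,
        poll_interval_s=0.1,
    )
    print(outcome)
```

Putting the rollback in a finally block is the key design choice: an exception in the monitoring code should never leave the fault in place.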

Step 5: Analyze

Compare metrics during and after the experiment with the steady state:

Did error rate increase? By how much?
Did response time increase? For how long?
Did the system recover automatically? How quickly?
Did monitoring detect the issue? Did alerts fire?
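
The first three questions can be answered mechanically by comparing metric snapshots taken before, during, and after the experiment. This is a rough sketch: the function name, the dictionary layout, and the 10% recovery tolerance are all assumptions, and it says nothing about how quickly recovery happened or whether alerts fired.

```python
def summarize(baseline, during, after, tolerance=0.10):
    """Compare metric snapshots against the steady-state baseline.

    Each argument maps a metric name to a number. A metric counts as
    recovered when its post-experiment value is within `tolerance`
    (a fraction of the baseline) of the baseline value.
    """
    report = {}
    for name, base in baseline.items():
        change_pct = None
        if base:  # avoid dividing by zero for metrics that are normally 0
            change_pct = round((during[name] - base) / base * 100, 1)
        report[name] = {
            "baseline": base,
            "during": during[name],
            "change_pct": change_pct,
            "recovered": abs(after[name] - base) <= tolerance * base,
        }
    return report


report = summarize(
    baseline={"error_rate": 0.001, "p95_ms": 200},
    during={"error_rate": 0.004, "p95_ms": 350},
    after={"error_rate": 0.001, "p95_ms": 260},
)
for metric, row in report.items():
    print(metric, row)
```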

Step 6: Fix and Repeat

If the system did not behave as expected, fix the weakness and repeat the experiment to verify the fix.

Types of Chaos Experiments

Experiment	What It Tests	Example
Instance termination	Auto-scaling, load balancing	Kill a random pod/VM
Network latency	Timeout handling, retries	Add 500ms latency between services
Network partition	Split-brain handling, consistency	Block traffic between two services
Disk full	Logging, data handling	Fill disk to 100%
CPU/Memory stress	Throttling, resource limits	Consume 90% CPU on a node
DNS failure	Fallback mechanisms	Block DNS resolution
Dependency failure	Circuit breakers, fallbacks	Make a third-party API unavailable
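
Latency and dependency failures can also be injected at the application level by wrapping the call to a dependency. The wrapper below is a minimal sketch, not the mechanism used by any of the tools in this lesson; with_faults and InjectedFault are hypothetical names.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_faults(func, failure_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call is delayed and may fail with InjectedFault.

    failure_rate is the probability (0.0 to 1.0) that a call fails;
    latency_s is the extra delay added before every call.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulated network latency
        if rng.random() < failure_rate:
            name = getattr(func, "__name__", "call")
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(product_id):
    return {"product_id": product_id, "price": 19.99}


slow_fetch = with_faults(fetch_price, latency_s=0.05)
print(slow_fetch(42))
```

Wrapping a client this way tests timeout handling and fallbacks in the caller without touching the network, which makes it a low-risk first step before infrastructure-level injection.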

Chaos Engineering Tools

Tool	Type	Best For
Chaos Monkey	Open source (Netflix)	Random instance termination
Litmus	Open source (CNCF)	Kubernetes-native chaos experiments
Gremlin	SaaS	Enterprise chaos-as-a-service
Chaos Toolkit	Open source	Framework for defining experiments in JSON/YAML
Toxiproxy	Open source (Shopify)	Network condition simulation
AWS Fault Injection Service (formerly Fault Injection Simulator)	AWS service	AWS-specific fault injection

QA’s Role in Chaos Engineering

QA engineers bring unique value to chaos engineering:

Test design skills: QA knows how to design experiments that reveal weaknesses
Monitoring knowledge: QA understands which metrics indicate real problems
Risk assessment: QA can evaluate which experiments are safe to run and in what order
Validation: QA verifies that fixes actually resolve the discovered weaknesses

Exercise: Design a Chaos Experiment

Your e-commerce application has: API gateway, product service, cart service, inventory service, payment service, notification service, PostgreSQL, Redis.

Design three chaos experiments in order of increasing risk.

Solution

Experiment 1: Redis Cache Failure (Low Risk)

Hypothesis: If Redis becomes unavailable, the application continues serving requests (with degraded performance) by falling back to database queries.

Steady state: P95 < 200ms, error rate < 0.1%, product pages load successfully

Injection: Stop Redis container for 5 minutes

Expected behavior: Response time increases to 500-800ms (database fallback), no errors, product pages still load

Abort criteria: Error rate > 1% or product pages return 500 errors
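
The hypothesis only holds if the application treats the cache as optional. The sketch below shows what that fallback might look like; CacheUnavailable, DownCache, DictCache, and get_product are hypothetical stand-ins, and a real Redis client would raise its own connection error type instead.

```python
class CacheUnavailable(Exception):
    """Stand-in for the connection error a real cache client would raise."""


class DownCache:
    """Simulates a stopped Redis container: every call fails."""

    def get(self, key):
        raise CacheUnavailable("cache is down")

    def set(self, key, value):
        raise CacheUnavailable("cache is down")


class DictCache:
    """Simulates a healthy cache using an in-memory dict."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


def get_product(product_id, cache, load_from_db):
    """Return a product, using the cache when possible and the database otherwise."""
    try:
        cached = cache.get(product_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        pass  # degrade to the database instead of failing the request
    product = load_from_db(product_id)
    try:
        cache.set(product_id, product)
    except CacheUnavailable:
        pass  # a failed cache write must not fail the request either
    return product


def load_from_db(product_id):
    return {"id": product_id, "name": "Widget"}


print(get_product(42, DownCache(), load_from_db))
```

If the experiment shows 500 errors instead of slower responses, an unhandled cache exception like the one caught here is a likely cause.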

Experiment 2: Payment Service Instance Failure (Medium Risk)

Hypothesis: If one of three payment service instances dies, the load balancer routes to healthy instances with no failed payments.

Steady state: Payment success rate > 99.5%, P95 < 500ms

Injection: Kill one payment service pod (kubectl delete pod)

Expected behavior: Kubernetes replaces the pod within 30 seconds (a deleted pod is recreated by its controller, not restarted in place). During that time, the remaining instances handle traffic. No failed payments.

Abort criteria: Payment success rate < 98% or P95 > 2 seconds
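
The routing behavior this hypothesis relies on can be modeled in a few lines. This is a toy simulation, not how any particular load balancer is implemented; the LoadBalancer class and the instance names are made up for illustration.

```python
import itertools


class LoadBalancer:
    """Round-robin balancer that skips instances marked unhealthy."""

    def __init__(self, instances):
        self.healthy = {name: True for name in instances}
        self._cycle = itertools.cycle(list(instances))

    def mark(self, name, healthy):
        """Record the result of a health check for one instance."""
        self.healthy[name] = healthy

    def route(self):
        """Return the next healthy instance, or raise if none are left."""
        for _ in range(len(self.healthy)):
            candidate = next(self._cycle)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances")


lb = LoadBalancer(["pay-1", "pay-2", "pay-3"])
lb.mark("pay-2", False)  # the pod that was killed
print([lb.route() for _ in range(4)])
```

The model assumes the balancer learns about the failure instantly. In a real system there is a gap between the pod dying and the health check failing, and requests routed during that gap are where failed payments would show up.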

Experiment 3: Network Partition Between Services (Higher Risk)

Hypothesis: If the product service cannot reach the inventory service, it serves cached inventory data and marks items as “check availability” instead of showing “in stock/out of stock.”

Steady state: Product pages show accurate inventory, error rate < 0.1%

Injection: Block network between product-service and inventory-service for 3 minutes

Expected behavior: Product pages load, show cached inventory or “check availability” message. No 500 errors.

Abort criteria: Error rate > 2% or product pages fail to load
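
The degraded behavior in the hypothesis could be implemented along these lines. This is one possible reading, sketched with hypothetical names (InventoryUnreachable, availability, last_known): when the inventory service cannot be reached, the page gets the last known stock level, flagged as stale and labeled “check availability”.

```python
class InventoryUnreachable(Exception):
    """Stand-in for a timeout or connection error from the inventory service."""


def availability(product_id, fetch_stock, last_known):
    """Build the availability info shown on a product page.

    fetch_stock(product_id) returns the units in stock or raises
    InventoryUnreachable. last_known is a dict of previously seen stock
    levels and is updated on every successful lookup.
    """
    try:
        units = fetch_stock(product_id)
    except InventoryUnreachable:
        # Partition: serve cached data but do not claim it is current.
        return {
            "label": "check availability",
            "units": last_known.get(product_id),
            "stale": True,
        }
    last_known[product_id] = units
    label = "in stock" if units > 0 else "out of stock"
    return {"label": label, "units": units, "stale": False}


def partitioned(product_id):
    raise InventoryUnreachable("inventory-service unreachable")


cache = {}
print(availability("sku-1", lambda pid: 12, cache))
print(availability("sku-1", partitioned, cache))
```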

Key Takeaways

Chaos engineering is proactive resilience testing — find weaknesses before they find you
Always define steady state first — you cannot detect failure without knowing what normal looks like
Start small and escalate — begin in staging, with small blast radius, then move to production
Automate and repeat — one-time experiments are good; continuous experiments are better
QA brings unique value — test design skills and monitoring knowledge are exactly what chaos engineering needs
