Chaos Engineering represents a paradigm shift in how we approach system reliability. Rather than hoping our systems will remain stable under adverse conditions, Chaos Engineering proactively injects failures to discover weaknesses before they cause outages in production. Born at Netflix to handle the complexities of cloud-based microservices, Chaos Engineering has evolved into a discipline that combines rigorous experimentation with operational excellence to build truly resilient systems that can withstand the turbulence of real-world operation.
What is Chaos Engineering?
Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. It involves deliberately introducing failures—network latency, server crashes, resource exhaustion—to observe how the system responds and to identify weaknesses that could lead to outages.
The Core Philosophy
Traditional testing validates that systems work correctly under expected conditions:
- Unit tests verify individual components
- Integration tests validate interactions between components
- Performance tests ensure acceptable response times under load
Chaos Engineering asks a fundamentally different question: What happens when things go wrong?
In complex distributed systems, failures are not edge cases—they’re inevitable. Networks become partitioned, disks fill up, dependencies become unavailable, and unexpected load patterns emerge. Chaos Engineering embraces this reality by:
- Assuming failure is inevitable
- Proactively discovering failure modes before they impact users
- Building confidence in system resilience through empirical evidence
- Continuously validating system behavior as systems evolve
Chaos Engineering vs. Traditional Testing
Aspect | Traditional Testing | Chaos Engineering |
---|---|---|
Goal | Verify expected behavior | Discover unknown weaknesses |
Approach | Deterministic, scripted | Experimental, exploratory |
Scope | Component or feature level | System-wide, production-like |
Environment | Test/staging environments | Production (ideally) |
Mindset | “Does it work?” | “How does it fail?” |
Outcome | Pass/fail binary result | Insights and observations |
Principles of Chaos Engineering
The Chaos Engineering community has formalized foundational principles that guide effective chaos experiments. These principles, popularized by the Principles of Chaos Engineering manifesto, provide a framework for scientific experimentation on distributed systems.
1. Build a Hypothesis Around Steady-State Behavior
Before introducing chaos, you must understand what “normal” looks like. Steady state refers to the system’s measurable output that indicates normal operation—not internal metrics, but business-relevant indicators.
Examples of steady-state metrics:
- E-commerce platform: Orders per minute, checkout success rate
- Video streaming service: Stream starts per second, buffering ratio
- Payment processor: Transactions per second, successful payment rate
- API service: Requests per second, p99 latency, error rate
Hypothesis structure:
Given: [Normal steady-state behavior]
When: [Chaos experiment is introduced]
Then: [Steady state should be maintained OR specific acceptable degradation]
Example hypothesis:
Given: Our API maintains p99 latency < 200ms with 5,000 RPS
When: We terminate 25% of backend service instances
Then: API p99 latency remains < 500ms and error rate stays < 1%
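This hypothesis can also be encoded in code so that it is checked mechanically during an experiment rather than by eyeballing dashboards. Below is a minimal Python sketch; the thresholds mirror the example hypothesis above, and the measured values are purely illustrative:
from dataclasses import dataclass

@dataclass
class SteadyStateHypothesis:
    """Thresholds that define acceptable behavior during the experiment."""
    max_p99_latency_ms: float = 500.0  # Then: p99 latency remains < 500ms
    max_error_rate: float = 0.01       # Then: error rate stays < 1%

def hypothesis_holds(h: SteadyStateHypothesis, p99_ms: float, error_rate: float) -> bool:
    """Return True if the measured metrics stay within the hypothesized bounds."""
    return p99_ms < h.max_p99_latency_ms and error_rate < h.max_error_rate

# Illustrative values observed while 25% of backend instances are terminated
print(hypothesis_holds(SteadyStateHypothesis(), p99_ms=320.0, error_rate=0.004))  # True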
2. Vary Real-World Events
Chaos experiments should reflect actual failures that occur in production environments. Theoretical failures that never happen don’t build confidence in real-world resilience.
Common real-world events to simulate:
Infrastructure failures:
- EC2 instance termination
- Availability zone outage
- Network partition between services
- Disk/volume failure
- CPU/memory exhaustion
Network issues:
- Increased latency (slow networks)
- Packet loss
- DNS failures
- TLS certificate expiration
Dependency failures:
- Database unavailability
- Cache eviction/failure
- Third-party API degradation
- Message queue backlog
Resource constraints:
- File descriptor exhaustion
- Connection pool saturation
- Thread pool exhaustion
- Disk space exhaustion
Application-level issues:
- Memory leaks causing OOM
- Deadlocks and race conditions
- Configuration errors
- Time synchronization issues
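Most of these events call for dedicated tooling to simulate safely, but some are easy to reproduce locally for learning purposes. As one concrete illustration, here is a small Python sketch that creates bounded CPU exhaustion for a fixed duration; it is a teaching aid, not a production fault injector:
# Saturate some CPU cores for a bounded duration (local experimentation only)
import multiprocessing
import time

def burn(seconds: float) -> None:
    """Busy-loop on one core until the deadline passes."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass  # pure spin to keep the core busy

if __name__ == "__main__":
    duration = 30  # seconds of CPU pressure
    cores = max(1, multiprocessing.cpu_count() // 2)  # leave half the cores free
    workers = [multiprocessing.Process(target=burn, args=(duration,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(f"CPU burn finished on {cores} core(s)")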
3. Run Experiments in Production
The most valuable chaos experiments run in production because:
- Production has the actual traffic patterns and scale
- Production environments have real dependencies and configurations
- Issues often only surface under production conditions
- Building confidence requires testing the actual system users interact with
Objections to production chaos and how to address them:
“We’ll cause outages!”
- Start with small blast radius (e.g., 1% of traffic)
- Use feature flags to instantly disable experiments (see the sketch after this list)
- Run during low-traffic periods initially
- Have rollback mechanisms ready
“We don’t have the monitoring to detect issues”
- Build observability and monitoring first (prerequisite for chaos engineering)
- Start with non-production until monitoring is adequate
- Use canary deployments with chaos experiments
“Our architecture isn’t ready”
- Good! Chaos Engineering will reveal exactly what needs improvement
- Start small—even simple experiments provide value
- Use learnings to prioritize reliability improvements
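Two of the safeguards above, a small traffic percentage and an instant kill switch, are straightforward to implement in application code. A minimal Python sketch, assuming a hypothetical CHAOS_ENABLED environment variable as the kill switch and a 1% sampling rate; a real system would typically consult a feature-flag service instead:
import os
import random
import time

def chaos_enabled() -> bool:
    """Kill switch, re-checked on every request so chaos can be disabled instantly.
    CHAOS_ENABLED is a hypothetical flag; swap in your feature-flag service here."""
    return os.getenv("CHAOS_ENABLED", "false").lower() == "true"

def maybe_inject_latency(fraction: float = 0.01, delay_seconds: float = 0.2) -> None:
    """Add a 200ms delay to roughly 1% of requests while the kill switch is on."""
    if chaos_enabled() and random.random() < fraction:
        time.sleep(delay_seconds)

def handle_request(payload: dict) -> dict:
    maybe_inject_latency()  # a no-op unless chaos is explicitly enabled
    return {"status": "ok", "echo": payload}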
4. Automate Experiments to Run Continuously
Manual chaos experiments provide one-time insights. Automated continuous chaos provides ongoing confidence that:
- New code doesn’t introduce regressions in resilience
- System behavior remains resilient as dependencies change
- Failure handling mechanisms continue to function
Levels of automation:
Level 1: Manual execution, manual analysis
- Run chaos script manually
- Observe dashboards and logs manually
- Document findings
Level 2: Automated execution, manual analysis
- Schedule chaos experiments (e.g., daily)
- Automated triggering via CI/CD
- Manual review of results
Level 3: Automated execution, automated analysis
- Continuous chaos experiments
- Automated steady-state verification
- Automatic experiment halt on anomalies
- Alerts on unexpected behavior
Level 4: Autonomous chaos
- AI-driven experiment selection
- Dynamic blast radius adjustment
- Self-healing experiment design
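At Levels 2 and 3 the experiment runner typically lives in CI. The Python sketch below shows that glue under some assumptions: the chaos command (here the Chaos Toolkit CLI covered later in this article) is assumed to exit non-zero when an experiment fails, and a single health endpoint stands in for a proper steady-state check; both are placeholders to adapt to your tooling:
# Illustrative Level 2/3 automation glue for a CI pipeline
import subprocess
import sys
import urllib.request

CHAOS_COMMAND = ["chaos", "run", "experiment.json"]  # placeholder chaos runner invocation
HEALTH_URL = "https://api.example.com/health"        # placeholder steady-state probe

def steady_state_ok() -> bool:
    """Tiny steady-state check: the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    if not steady_state_ok():
        sys.exit("steady state not met before the experiment, aborting")
    result = subprocess.run(CHAOS_COMMAND)  # assumed to exit non-zero on failure
    if result.returncode != 0 or not steady_state_ok():
        sys.exit(1)  # fail the pipeline so the resilience regression is visible
    print("chaos experiment passed; steady state maintained")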
5. Minimize Blast Radius
Chaos experiments should be carefully scoped to limit potential impact while still providing meaningful insights.
Strategies to minimize blast radius:
Traffic segmentation:
- Route only a small percentage of traffic through chaos (e.g., 1-5%)
- Use canary deployments with chaos injected only in canary
- Test with synthetic traffic first, then real traffic
Geographic isolation:
- Run experiments in a single availability zone
- Limit to a single region initially
- Gradually expand scope as confidence grows
Time constraints:
- Run experiments for limited duration (e.g., 5 minutes)
- Schedule during low-traffic periods
- Avoid running multiple experiments simultaneously initially
Abort conditions:
- Define clear criteria for halting experiments (e.g., error rate > 5%)
- Implement automatic rollback mechanisms
- Have manual kill switch readily available
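The abort conditions above can be enforced by a small watchdog that runs alongside the experiment. A minimal Python sketch; how the error rate is fetched and how the experiment is halted are left as injectable callables, since both depend on your monitoring stack and chaos tooling:
import time
from typing import Callable

def watchdog(get_error_rate: Callable[[], float],
             halt_experiment: Callable[[], None],
             max_error_rate: float = 0.05,   # abort if error rate > 5%
             duration_seconds: int = 300,    # experiment time box (5 minutes)
             poll_seconds: int = 10) -> bool:
    """Poll a steady-state metric and halt the experiment if it breaches the threshold.
    Returns True if the experiment ran to completion, False if it was aborted."""
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        if get_error_rate() > max_error_rate:
            halt_experiment()  # hook for automatic rollback / kill switch
            return False
        time.sleep(poll_seconds)
    return True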
Chaos Engineering Tools
Chaos Monkey and the Simian Army
Chaos Monkey, created by Netflix in 2011, is the original chaos engineering tool. It randomly terminates EC2 instances in production to ensure that services are resilient to instance failures.
The Simian Army expanded beyond Chaos Monkey with specialized tools:
- Chaos Monkey: Randomly terminates virtual machine instances
- Latency Monkey: Introduces artificial delays in client-server communication
- Conformity Monkey: Finds instances that don’t adhere to best practices and shuts them down
- Doctor Monkey: Finds unhealthy instances and removes them from service
- Janitor Monkey: Searches for unused resources and cleans them up
- Security Monkey: Finds security violations and terminates offending instances
- 10-18 Monkey: Detects configuration and run-time problems in instances serving customers in multiple regions
Modern Chaos Monkey implementation:
Netflix open-sourced Chaos Monkey and it’s now available as part of the Spinnaker deployment platform.
Basic Chaos Monkey setup with Spinnaker:
# chaos-monkey-config.yaml
enabled: true
schedule:
  enabled: true
  frequency: 1  # Run every 1 day
terminationStrategy:
  grouping: CLUSTER
  probability: 0.5  # 50% chance a group is chosen for termination
  maxTerminationsPerDay: 1
exceptions:
  # Never terminate instances in these accounts
  accounts:
    - production-critical
  # Never terminate these instance groups
  instanceGroups:
    - auth-service-production
    - payment-processor-production
Running Chaos Monkey manually:
# Chaos Monkey CLI (example)
chaos-monkey \
--region us-east-1 \
--cluster api-backend \
--termination-probability 0.3 \
--dry-run # Test without actually terminating
# Remove dry-run to execute
chaos-monkey \
--region us-east-1 \
--cluster api-backend \
--termination-probability 0.3
What Chaos Monkey validates:
- Auto-scaling groups respond correctly to instance termination
- Load balancers detect unhealthy instances and route around them
- Monitoring and alerting detect the issue
- Applications gracefully handle missing instances
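These validations can also be scripted directly, without Chaos Monkey, against an Auto Scaling group. A rough Python sketch using boto3; the group name is a placeholder, and an experiment like this should start life in a non-critical environment:
# Terminate one random instance in an Auto Scaling group, then wait for recovery (sketch)
import random
import time
import boto3

ASG_NAME = "api-backend-asg"  # placeholder Auto Scaling group name

def group_state(autoscaling) -> dict:
    return autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]

if __name__ == "__main__":
    autoscaling = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")

    victim = random.choice(group_state(autoscaling)["Instances"])
    ec2.terminate_instances(InstanceIds=[victim["InstanceId"]])
    print(f"terminated {victim['InstanceId']}, waiting for the group to recover")

    # Hypothesis: the ASG replaces the instance and returns to desired capacity.
    for _ in range(60):  # poll for up to ~10 minutes
        time.sleep(10)
        group = group_state(autoscaling)
        healthy = [i for i in group["Instances"]
                   if i["HealthStatus"] == "Healthy" and i["LifecycleState"] == "InService"]
        if len(healthy) >= group["DesiredCapacity"]:
            print("auto scaling group recovered to desired capacity")
            break
    else:
        print("group did not recover in time, investigate before re-running")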
Gremlin: Enterprise Chaos Engineering Platform
Gremlin is a comprehensive, enterprise-grade chaos engineering platform that provides safe, scalable, and user-friendly chaos experiments.
Key features:
1. Wide range of failure types:
- Resource attacks: CPU, memory, disk, I/O exhaustion
- State attacks: Shutdown, process killer, time travel
- Network attacks: Latency, packet loss, DNS failures, blackhole
2. Safety controls:
- Magnitude control (e.g., consume exactly 50% CPU)
- Blast radius limiting
- Automatic rollback on anomalies
- Integration with monitoring systems
3. Scenario-based testing:
- Pre-built scenarios (e.g., “AZ outage”, “Database failure”)
- Custom scenario creation
- Scheduled recurring experiments
Getting started with Gremlin:
Installation (Kubernetes):
# Install Gremlin using Helm
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
--namespace gremlin \
--set gremlin.secret.managed=true \
--set gremlin.secret.teamID=<YOUR_TEAM_ID> \
--set gremlin.secret.clusterID=<YOUR_CLUSTER_ID>
Example: CPU attack via Gremlin CLI:
# Consume 50% CPU on specific containers
gremlin attack cpu \
--target container \
--labels app=api-backend \
--percent 50 \
--length 300 # 5 minutes
# Observe metrics and application behavior during CPU stress
Example: Network latency attack:
# Add 200ms latency to all outgoing traffic
gremlin attack latency \
--target container \
--labels app=payment-service \
--delay 200 \
--length 180
# Verify:
# - Circuit breakers trigger appropriately
# - Timeouts are properly configured
# - Fallback mechanisms activate
Example: Process kill attack:
# Kill the main application process
gremlin attack process-killer \
--target container \
--labels app=order-service \
--process java
# Validate:
# - Kubernetes restarts the container
# - Health checks detect failure quickly
# - Load balancer stops routing to unhealthy instance
# - No user-facing errors occur
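The Kubernetes side of that validation can be confirmed programmatically as well. A short Python sketch using the official kubernetes client; the label selector matches the example above, while the production namespace is an assumption to adjust for your cluster:
# Confirm that killed containers were restarted by Kubernetes (sketch)
from kubernetes import client, config

def pod_restart_summary(namespace: str = "production",
                        label_selector: str = "app=order-service") -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        for cs in pod.status.container_statuses or []:
            print(f"{pod.metadata.name}/{cs.name}: ready={cs.ready} restarts={cs.restart_count}")

if __name__ == "__main__":
    # Run before and after the process-kill attack: restart counts should increase
    # while readiness returns to True and no user-facing errors are observed.
    pod_restart_summary()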
Gremlin Scenarios - Multi-stage experiments:
# scenario.yaml - Simulate Availability Zone failure
name: "AZ Failure Simulation"
description: "Simulates complete failure of one availability zone"
hypothesis: "System maintains functionality with one AZ down"
steps:
  - name: "Shutdown AZ-A instances"
    attack: shutdown
    target:
      type: instance
      tags:
        availability-zone: us-east-1a
    magnitude:
      percent: 100
  - name: "Introduce network latency to AZ-B"
    attack: latency
    target:
      type: instance
      tags:
        availability-zone: us-east-1b
    magnitude:
      delay: 100ms
  - name: "Monitor for 10 minutes"
    duration: 600
validation:
  - metric: "http_requests_success_rate"
    threshold: "> 99%"
  - metric: "api_latency_p99"
    threshold: "< 500ms"
Other Chaos Engineering Tools
Chaos Toolkit - Open-source, extensible chaos engineering toolkit
# Install
pip install chaostoolkit
# Run experiment
chaos run experiment.json
Example experiment (experiment.json):
{
  "title": "System remains available when killing 1 pod",
  "description": "Verify that k8s reschedules pods and service remains available",
  "steady-state-hypothesis": {
    "title": "Service is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "service-is-available",
        "tolerance": true,
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "status": 200
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=api-backend",
          "ns": "production",
          "qty": 1
        }
      }
    }
  ],
  "rollbacks": []
}
Litmus - Cloud-native chaos engineering for Kubernetes
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-latest.yaml
# Run pod-delete experiment
kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/pod-delete/experiment.yaml
PowerfulSeal - Chaos testing tool for Kubernetes
# Install
pip install powerfulseal
# Interactive mode
powerfulseal interactive --use-pod-delete-instead-of-ssh-kill
Pumba - Chaos testing for Docker containers
# Kill random container
pumba kill --random "re2:^myapp"
# Add network delay
pumba netem --duration 3m delay --time 500 myapp-container
GameDays: Practicing Chaos at Scale
GameDays (also called “Chaos Days” or “Disaster Recovery Exercises”) are scheduled events where teams deliberately introduce failures into production or production-like environments to test system resilience and team response.
What is a GameDay?
A GameDay is a structured exercise where:
- Multiple teams participate (engineers, SREs, product, support)
- Real or realistic failures are injected
- Teams respond as they would to actual incidents
- Learning and improvement opportunities are identified
GameDay Objectives
Technical objectives:
- Validate that systems fail gracefully
- Test monitoring and alerting effectiveness
- Verify that runbooks and procedures are accurate
- Identify single points of failure
- Test recovery mechanisms
Organizational objectives:
- Build team confidence in incident response
- Improve cross-team communication
- Validate on-call procedures
- Train new team members
- Foster culture of resilience
Planning a GameDay
1. Define scope and objectives
Example GameDay Plan:
- Name: "Database Failover GameDay"
- Date: October 15, 2025, 10:00 AM - 2:00 PM
- Participants: Backend team, SRE team, Database team, Product manager
- Objective: Validate automated database failover procedures
- Scenario: Primary database instance failure during peak traffic
2. Build hypothesis
Hypothesis:
When the primary database instance fails:
- Automated failover to replica completes within 30 seconds
- Application error rate remains below 5%
- All transactions in-flight are preserved or properly retried
- Monitoring alerts appropriate teams within 1 minute
- Customer-facing functionality remains available throughout
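Parts of this hypothesis can be verified with a simple probe that polls a customer-facing health endpoint throughout the exercise and reports the actual outage window and error rate. A rough Python sketch; the URL and observation window are placeholders for whatever your GameDay defines:
# Measure downtime and error rate during a failover GameDay (sketch)
import time
import urllib.request

HEALTH_URL = "https://api.example.com/health"  # placeholder customer-facing probe

def probe_once(timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    duration, interval = 300, 1.0  # watch for 5 minutes, one probe per second
    total = int(duration / interval)
    failures, longest_outage, current_outage = 0, 0.0, 0.0
    for _ in range(total):
        if probe_once():
            current_outage = 0.0
        else:
            failures += 1
            current_outage += interval
            longest_outage = max(longest_outage, current_outage)
        time.sleep(interval)
    print(f"error rate: {failures / total:.1%}  longest outage: {longest_outage:.0f}s")
    # Compare against the hypothesis: failover completes within 30s, error rate stays below 5%.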
3. Prepare environment and tools
- Set up observability dashboards
- Prepare communication channels (Slack, Zoom)
- Document rollback procedures
- Schedule participants
- Notify stakeholders (this is a test)
4. Design failure scenarios
Progressive difficulty:
- Scenario 1: Single database replica failure (low impact)
- Scenario 2: Primary database failure (medium impact)
- Scenario 3: Primary failure + delayed replica promotion (high impact)
- Scenario 4: Primary failure + network partition (extreme)
5. Execute GameDay
GameDay timeline example:
10:00 AM - Kickoff
- Review objectives and scenarios
- Confirm steady-state metrics
- Assign roles (coordinator, observers, participants)
10:30 AM - Scenario 1: Replica failure
- Inject failure
- Observe system behavior
- Monitor metrics
- Take notes on observations
11:00 AM - Debrief Scenario 1
- What worked well?
- What didn't work as expected?
- What surprised us?
11:15 AM - Scenario 2: Primary failure
- Inject failure
- Team responds as if real incident
- Observe communication patterns
11:45 AM - Debrief Scenario 2
12:00 PM - Lunch break
1:00 PM - Scenario 3: Complex failure
- Multiple simultaneous issues
- Test escalation procedures
1:30 PM - Final debrief
- Summary of all learnings
- Action items identification
- Prioritization of improvements
2:00 PM - End
6. Post-GameDay activities
- Document detailed findings
- Create tickets for identified issues
- Update runbooks based on learnings
- Share outcomes with broader organization
- Schedule follow-up GameDay after improvements
GameDay Best Practices
Before:
- Get explicit approval from stakeholders
- Notify customer support teams
- Prepare “abort” procedures
- Test in staging first if possible
During:
- Assign a dedicated coordinator
- Document everything in real-time
- Take screenshots of monitoring dashboards
- Record team communications
- Don’t rush—allow time for observation
After:
- Conduct blameless post-mortem
- Focus on systems, not individuals
- Celebrate learnings (even from failures)
- Track remediation progress
- Schedule regular recurring GameDays
Monitoring and Observability: Prerequisites for Chaos Engineering
You cannot practice Chaos Engineering effectively without robust monitoring and observability. You need to:
- Detect when chaos experiments cause degradation
- Understand the blast radius of failures
- Verify that steady-state is maintained
- Abort experiments when anomalies occur
The Three Pillars of Observability
1. Metrics - Aggregated numerical data over time
- Request rate, error rate, duration (RED metrics)
- Utilization, saturation, and errors for CPU, memory, disk, network (USE method)
- Business metrics (orders/min, revenue, user signups)
Tools: Prometheus, Datadog, New Relic, CloudWatch
2. Logs - Discrete events with context
- Application logs
- Access logs
- Error logs
- Audit logs
Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Loki
3. Traces - Request flow through distributed systems
- End-to-end request visualization
- Latency breakdown by service
- Dependency mapping
Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM
Essential Metrics for Chaos Experiments
Application metrics:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Latency (p50, p95, p99)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Availability
up{job="api-backend"}
Infrastructure metrics:
# CPU utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk space
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
Business metrics:
- Successful checkouts per minute
- Video streams started per second
- API calls returning valid data
- Revenue per minute
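These PromQL expressions are not limited to dashboards; automated experiments can evaluate them through the Prometheus HTTP API and use the results as steady-state checks or abort conditions. A minimal Python sketch, assuming a Prometheus server at a placeholder URL:
# Evaluate a steady-state PromQL expression via the Prometheus HTTP API (sketch)
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder

def instant_query(expr: str) -> float:
    """Run an instant query and return the first sample value as a float."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
    result = payload["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    error_rate = instant_query(
        'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])')
    p95 = instant_query(
        'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))')
    print(f"error rate: {error_rate:.2%}  p95 latency: {p95 * 1000:.0f}ms")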
Dashboards for Chaos Experiments
Create dedicated chaos engineering dashboards that show:
Steady-state indicators (prominent display)
- Primary business metric (e.g., orders/min)
- Error rate
- Latency percentiles
System health (secondary display)
- Service availability (up/down status)
- Resource utilization
- Dependency health
Experiment context (annotations)
- When experiment started/stopped
- Type of failure injected
- Blast radius (affected services/regions)
Example Grafana dashboard JSON snippet:
{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Orders Per Minute (Steady State)",
        "targets": [
          {
            "expr": "rate(orders_total[1m]) * 60"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [1000],
                "type": "lt"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ]
        }
      }
    ],
    "annotations": {
      "list": [
        {
          "datasource": "Prometheus",
          "enable": true,
          "expr": "chaos_experiment_active",
          "name": "Chaos Experiments",
          "tagKeys": "experiment_type"
        }
      ]
    }
  }
}
Building a Chaos Engineering Culture
Successful Chaos Engineering requires more than tools—it requires organizational commitment and cultural change.
Start Small and Build Confidence
Phase 1: Education and Awareness
- Share Chaos Engineering concepts with teams
- Demonstrate simple experiments in staging
- Discuss benefits and address concerns
Phase 2: Manual Experiments in Non-Production
- Run manual chaos experiments in staging
- Document findings
- Build confidence in approach
Phase 3: Automated Experiments in Non-Production
- Automate recurring experiments
- Integrate with CI/CD pipelines
- Expand experiment types
Phase 4: Production Experiments (Small Blast Radius)
- Run simple experiments in production (e.g., 1% traffic)
- Carefully monitor and document
- Build organizational trust
Phase 5: Continuous Chaos in Production
- Fully automated, continuous chaos
- Multiple concurrent experiments
- Chaos as part of development workflow
Blameless Culture
Chaos Engineering reveals weaknesses. If teams fear blame for discovering issues, they won’t experiment.
Foster psychological safety:
- Celebrate findings (even when systems fail)
- Focus on systems, not individuals
- Treat failures as learning opportunities
- Reward proactive chaos testing
Measure Success
Success metrics for Chaos Engineering programs:
- Number of weaknesses discovered before customer impact
- Reduction in Mean Time To Detection (MTTD)
- Reduction in Mean Time To Recovery (MTTR)
- Increase in team confidence (surveys)
- Reduction in severity of production incidents
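MTTD and MTTR are simple averages over incident timestamps, which makes the trend easy to track alongside a chaos program. A tiny Python sketch with purely illustrative incident data; note that MTTR is measured here from detection to resolution, and definitions vary between teams:
from datetime import datetime
from statistics import mean

# Purely illustrative incident records
incidents = [
    {"started": datetime(2025, 9, 3, 10, 0), "detected": datetime(2025, 9, 3, 10, 12),
     "resolved": datetime(2025, 9, 3, 10, 55)},
    {"started": datetime(2025, 9, 21, 14, 30), "detected": datetime(2025, 9, 21, 14, 34),
     "resolved": datetime(2025, 9, 21, 15, 2)},
]

mttd_minutes = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 60
mttr_minutes = mean((i["resolved"] - i["detected"]).total_seconds() for i in incidents) / 60
print(f"MTTD: {mttd_minutes:.1f} min  MTTR: {mttr_minutes:.1f} min")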
Conclusion
Chaos Engineering transforms how we build and operate systems. By deliberately introducing failures, we move from hoping our systems are resilient to knowing they are resilient through empirical evidence.
Key takeaways:
- Chaos Engineering is scientific experimentation applied to distributed systems
- Failures are inevitable in complex systems—embrace and prepare for them
- Start small with simple experiments and gradually expand scope
- Observability is a prerequisite—you must be able to measure steady state
- Production is the best laboratory, but start with safeguards and a small blast radius
- Automation enables continuous validation of system resilience
- GameDays build confidence and prepare teams for real incidents
- Culture matters—foster blameless learning environments
Chaos Engineering is not about breaking things—it’s about building confidence in our ability to handle breakage when it inevitably occurs. Start your chaos journey today, break your systems in controlled ways, and build truly resilient systems that users can depend on.