Chaos Engineering

Chaos engineering is the discipline of deliberately injecting failures into production systems to discover weaknesses before they cause real outages. By running controlled experiments -- killing instances, injecting latency, partitioning networks -- teams build confidence that their systems can withstand turbulent conditions.

Overview

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It emerged from Netflix's migration to AWS, which began in 2008 and made clear that failures in cloud infrastructure were inevitable and frequent. Rather than hoping its systems could handle those failures, Netflix chose to inject them proactively during business hours, when engineers were available to respond, instead of being surprised by them at 3 AM. This counterintuitive approach -- deliberately breaking things to make them more reliable -- has since been adopted by Amazon, Google, Microsoft, Uber, and many other organizations.

The chaos engineering process follows the scientific method. First, define the system's steady state -- the normal behavior measured by key metrics (request success rate, p99 latency, error rate). Second, hypothesize what will happen when a specific failure is injected: 'If we kill 3 of 12 instances of the recommendation service, the system should automatically route traffic to the remaining 9 instances with no user-visible impact.' Third, inject the failure: actually kill those 3 instances in production. Fourth, observe: did the metrics stay within acceptable bounds? If yes, your hypothesis was confirmed and you have higher confidence in the system's resilience. If no, you have discovered a weakness that needs to be fixed before it causes a real outage. The key insight is that you learn from both outcomes.
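The four steps above can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions, not a real chaos tool: `FakeFleet`, its capacity figures, the latency curve, and the steady-state thresholds are all hypothetical stand-ins for a real service and its metrics pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


class FakeFleet:
    """Toy stand-in for a service fleet; every number here is illustrative."""

    def __init__(self, instances: int = 12, per_instance_rps: float = 100.0,
                 load_rps: float = 600.0) -> None:
        self.instances = instances
        self.per_instance_rps = per_instance_rps
        self.load_rps = load_rps

    def kill(self, n: int) -> None:
        self.instances -= n

    def restore(self, n: int) -> None:
        self.instances += n

    def metrics(self) -> Metrics:
        capacity = self.instances * self.per_instance_rps
        if capacity <= 0:
            return {"success_rate": 0.0, "p99_latency_ms": float("inf")}
        utilization = min(self.load_rps / capacity, 0.99)
        return {
            # Requests beyond capacity are assumed to fail.
            "success_rate": min(1.0, capacity / self.load_rps),
            # Crude curve: latency climbs as utilization approaches 1.
            "p99_latency_ms": 50.0 / (1.0 - utilization),
        }


@dataclass
class Experiment:
    hypothesis: str
    read_metrics: Callable[[], Metrics]
    inject: Callable[[], None]
    rollback: Callable[[], None]
    min_success_rate: float = 0.999    # assumed steady-state bound
    max_p99_latency_ms: float = 300.0  # assumed steady-state bound

    def in_steady_state(self, m: Metrics) -> bool:
        return (m["success_rate"] >= self.min_success_rate
                and m["p99_latency_ms"] <= self.max_p99_latency_ms)

    def run(self) -> str:
        # Step 1: only experiment on a system that is currently healthy.
        if not self.in_steady_state(self.read_metrics()):
            return "skipped: not in steady state"
        # Step 3: inject the failure; the rollback always runs.
        self.inject()
        try:
            observed = self.read_metrics()  # Step 4: observe
        finally:
            self.rollback()
        return ("hypothesis confirmed" if self.in_steady_state(observed)
                else "weakness found")


fleet = FakeFleet(load_rps=600.0)
experiment = Experiment(
    hypothesis="Killing 3 of 12 instances causes no user-visible impact",
    read_metrics=fleet.metrics,
    inject=lambda: fleet.kill(3),
    rollback=lambda: fleet.restore(3),
)
print(experiment.run())  # prints "hypothesis confirmed"
```

In this toy model the same experiment at 800 requests per second reports "weakness found", because nine instances run too close to capacity and p99 latency leaves the assumed bound. Either result is useful: one raises confidence, the other produces a concrete remediation item.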

Chaos experiments span a wide range of failure types, each testing different resilience mechanisms. Instance termination (Chaos Monkey) tests auto-scaling and load balancing. Latency injection (adding artificial delay to network calls) tests timeout configurations, circuit breakers, and graceful degradation. Network partition (blocking traffic between availability zones or services) tests CAP behavior and data replication. Resource exhaustion (filling disks, consuming CPU, allocating all memory) tests resource limits and monitoring. DNS failure tests service discovery resilience. Clock skew tests systems that depend on synchronized time. Each experiment type reveals different categories of weaknesses.
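As one illustration of what a latency-injection experiment exercises, the sketch below wraps a simulated dependency with added delay and shows how the caller's timeout decides whether users wait or get a degraded response. To stay deterministic it models latency as a number rather than really sleeping; the service, the 20 ms baseline, and the timeout values are assumptions made up for the example.

```python
from typing import Callable, List, Tuple

# A simulated call returns (latency in ms, payload) instead of sleeping.
SimulatedCall = Callable[[int], Tuple[float, List[str]]]


def recommendation_service(user_id: int) -> Tuple[float, List[str]]:
    """Hypothetical healthy dependency that answers in 20 ms."""
    return 20.0, [f"title-{user_id}-a", f"title-{user_id}-b"]


def with_injected_latency(call: SimulatedCall, extra_ms: float) -> SimulatedCall:
    """The chaos fault: add a fixed artificial delay to every call."""
    def wrapped(user_id: int) -> Tuple[float, List[str]]:
        latency_ms, payload = call(user_id)
        return latency_ms + extra_ms, payload
    return wrapped


def call_with_timeout(call: SimulatedCall, user_id: int, timeout_ms: float,
                      fallback: List[str]) -> Tuple[float, List[str], bool]:
    """Return (time the user waited, result, whether we degraded)."""
    latency_ms, payload = call(user_id)
    if latency_ms > timeout_ms:
        # The caller gives up at the timeout and serves the fallback.
        return timeout_ms, fallback, True
    return latency_ms, payload, False


slow = with_injected_latency(recommendation_service, extra_ms=2000.0)

# Timeout set too high: the call "succeeds", but the user waits 2 seconds.
print(call_with_timeout(slow, 7, timeout_ms=5000.0, fallback=["popular-1"]))
# prints (2020.0, ['title-7-a', 'title-7-b'], False)

# Tighter timeout: the user waits 250 ms and gets a degraded response.
print(call_with_timeout(slow, 7, timeout_ms=250.0, fallback=["popular-1"]))
# prints (250.0, ['popular-1'], True)
```

The first call is the kind of result that looks fine on a success-rate dashboard and only shows up in latency metrics, which is why the steady state should include both.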

Blast radius control is essential. Chaos engineering is not about randomly breaking things in production and hoping for the best. Experiments should start with the smallest possible blast radius (single instance in a non-production environment), expand gradually as confidence grows (single instance in production, then multiple instances), and always have automated rollback mechanisms. Netflix runs Chaos Monkey continuously in production, but only during business hours, with a kill switch, and with teams on standby. GameDay exercises -- coordinated, scheduled events where teams conduct larger-scale chaos experiments -- provide a structured environment for testing more destructive scenarios. Amazon runs GameDay exercises before major events like Prime Day to validate their disaster recovery procedures.
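A rough sketch of how these guardrails might be encoded when choosing targets. The stage list, the 25% cap, the 09:00-17:00 weekday window, and all function names are assumptions chosen for illustration; a real setup would load them from configuration and use the team's own time zone.

```python
import random
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical rollout stages: (environment, maximum instances per experiment).
STAGES: List[Tuple[str, int]] = [
    ("staging", 1),
    ("production", 1),
    ("production", 3),
]


def in_business_hours(now: datetime) -> bool:
    """Assumed window: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def plan_targets(stage_index: int, instances_by_env: Dict[str, List[str]],
                 now: datetime, kill_switch_on: bool, rng: random.Random,
                 max_fraction: float = 0.25) -> List[str]:
    """Pick instances to terminate, or return [] if any guardrail says no."""
    if kill_switch_on:
        return []
    env, max_targets = STAGES[stage_index]
    if env == "production" and not in_business_hours(now):
        return []
    candidates = instances_by_env[env]
    # Never touch more than max_fraction of a fleet, whatever the stage allows.
    limit = min(max_targets, int(len(candidates) * max_fraction))
    return rng.sample(candidates, limit)


envs = {
    "staging": [f"stg-{i}" for i in range(4)],
    "production": [f"prod-{i}" for i in range(12)],
}
wednesday_11am = datetime(2024, 3, 6, 11, 0)
targets = plan_targets(2, envs, wednesday_11am, kill_switch_on=False,
                       rng=random.Random(0))
print(len(targets))  # prints 3
```

Keeping the kill switch and the time window as the first checks means an operator can stop all experiments without touching the selection logic.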

Key Points
  • Chaos engineering follows the scientific method: define steady state, hypothesize about failure impact, inject the failure, observe results. You either confirm that your system handles the failure correctly or discover a weakness that needs fixing.
  • Start with a small blast radius and expand gradually. Begin in staging, move to production with single-instance experiments, and only expand to multi-instance or multi-service experiments after building confidence. Always have automated rollback.
  • Types of experiments include: instance termination (Chaos Monkey), latency injection, network partitions, resource exhaustion (disk, CPU, memory), DNS failures, and clock skew. Each type tests different resilience mechanisms.
  • Run experiments during business hours when engineers are available to respond, not at 3 AM when problems are hardest to diagnose. The goal is learning, not heroics. Netflix's Chaos Monkey only runs during US business hours.
  • GameDay exercises are planned, coordinated events where teams conduct larger-scale chaos experiments. Amazon runs GameDays before Prime Day; Google runs annual DiRT (Disaster Recovery Testing) exercises. These build organizational muscle memory for incident response.
  • Chaos engineering reveals the gap between what you think will happen and what actually happens. Common surprises include: timeouts set too high, circuit breakers never tested, fallback paths with bugs, monitoring blind spots, and auto-scaling configurations that do not work.
Simple Example

The Fire Drill Analogy

A fire drill is chaos engineering for buildings. You do not wait for a real fire to discover that the emergency exit is blocked, the fire alarm batteries are dead, or employees do not know the evacuation route. Instead, you deliberately trigger the alarm during a normal workday, observe how people respond, and fix problems while the stakes are low. Similarly, chaos engineering triggers failures during business hours when engineers are available, revealing problems (misconfigured timeouts, untested failover paths, broken alerts) that are far cheaper to fix in a controlled experiment than during a real outage at peak traffic.

Real-World Examples

Netflix

Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates virtual machine instances in production during business hours. This forces all Netflix services to be designed for instance failure from the start. Netflix expanded the concept with the Simian Army: Chaos Gorilla (simulates entire availability zone failure), Latency Monkey (injects artificial delays), and Conformity Monkey (checks for instances that do not meet best practices). Chaos Monkey runs continuously in production, killing hundreds of instances daily.

Amazon

Amazon conducts large-scale GameDay exercises before major events like Prime Day. During GameDays, teams inject failures including regional failovers, database failovers, service degradation, and network partitions to validate that the platform can handle real-world failure scenarios under production-level load. GameDays are coordinated events with all relevant teams on standby, and results feed directly into action items for improving resilience before the actual event.

Google

Google runs annual DiRT (Disaster Recovery Testing) exercises in which teams inject major failures into production infrastructure. Exercises of this kind simulate scenarios such as complete data center loss, database corruption, and cascading service failures. They are scheduled in advance with full team participation, and the findings are documented and prioritized for remediation. Such exercises uncover issues with backup restoration procedures, failover timing, and monitoring gaps that would have been invisible without deliberate testing.

Trade-Offs
Aspect | Description
Risk of User Impact vs Discovery of Weaknesses | Running chaos experiments in production carries inherent risk of user-facing impact. A poorly controlled experiment can cause real outages. However, running experiments only in staging misses production-specific issues (real traffic patterns, real data volumes, real infrastructure configurations). The trade-off is managed by starting small, expanding gradually, and maintaining automated rollback.
Engineering Time vs Resilience Confidence | Building and maintaining a chaos engineering practice requires significant engineering investment: tooling, experiment design, result analysis, and remediation of discovered weaknesses. This time competes with feature development. The payoff is measured in outages prevented, which is inherently difficult to quantify -- you cannot count incidents that did not happen.
Continuous vs Scheduled Experiments | Continuous chaos (Chaos Monkey terminating instances every business day) provides ongoing confidence that new deployments do not introduce fragility, but increases the chance of unexpected impact. Scheduled experiments (monthly GameDays) are more controlled but only test resilience at specific points in time. The best approach combines both: continuous basic experiments (instance termination) with periodic comprehensive GameDays.
Organizational Culture vs Technical Readiness | Chaos engineering requires organizational buy-in: management must accept the risk of controlled incidents, and teams must treat discovered weaknesses as learning opportunities rather than occasions for blame. Organizations with blameful incident cultures struggle to adopt chaos engineering because nobody wants to be responsible for an experiment that causes impact. Cultural readiness is as important as technical readiness.
Case Study

Netflix Chaos Monkey -- Building Resilience into the Culture

Scenario

Netflix began migrating from its own data centers to AWS in 2008. Cloud infrastructure was inherently less reliable than dedicated hardware -- individual instances could be terminated at any time, network latency was variable, and services shared infrastructure with other AWS customers. Netflix's engineers were accustomed to stable hardware and had not designed services to handle frequent instance failures. Early AWS outages revealed that many Netflix services crashed or degraded severely when even a single instance was lost.

Solution

Netflix created Chaos Monkey, a tool that randomly terminates virtual machine instances in production during business hours. The premise was simple but radical: if Netflix services must survive random instance failures in production, the best way to ensure this is to cause random instance failures constantly. Engineers who deployed a service that could not survive instance termination would discover the problem within days, during business hours, with colleagues available to help fix it. Chaos Monkey made resilience a continuous concern rather than an afterthought. Netflix expanded the concept with the Simian Army, adding specialized tools for availability zone failure, latency injection, security compliance checking, and resource cleanup.
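The premise can be illustrated with a toy simulation. This is not Netflix's implementation; `ServiceGroup`, `run_monkey`, the kill probability, and the assumption that lost instances are replaced before the next round are all invented for the sketch. It compares a service that runs a single instance with one that runs three.

```python
import random
from typing import Dict, List


class ServiceGroup:
    """Hypothetical group of identical instances behind a load balancer."""

    def __init__(self, name: str, desired: int) -> None:
        self.name = name
        self.desired = desired
        self.healthy = desired
        self.outages = 0  # rounds in which no healthy instance remained

    def terminate_one(self) -> None:
        if self.healthy > 0:
            self.healthy -= 1
        if self.healthy == 0:
            self.outages += 1

    def heal(self) -> None:
        # Assumption: auto-scaling replaces lost instances before the next round.
        self.healthy = self.desired


def run_monkey(groups: List[ServiceGroup], rounds: int,
               kill_probability: float, rng: random.Random) -> Dict[str, int]:
    """Each round, terminate one instance per group with the given probability."""
    for _ in range(rounds):
        for group in groups:
            if rng.random() < kill_probability:
                group.terminate_one()
        for group in groups:
            group.heal()
    return {group.name: group.outages for group in groups}


groups = [ServiceGroup("single-instance", 1), ServiceGroup("three-instances", 3)]
print(run_monkey(groups, rounds=50, kill_probability=0.3, rng=random.Random(42)))
```

In this model the three-instance group can never record an outage, because at most one instance is lost per round, while the single-instance group records one every time it is picked. That asymmetry is the feedback loop described above: a service that cannot tolerate losing an instance is exposed quickly and repeatedly.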

Outcome

Chaos Monkey fundamentally changed how Netflix engineers design services. Every Netflix service is built to survive instance termination because the alternative -- getting paged every time Chaos Monkey kills one of your instances -- is unacceptable. Netflix reports that Chaos Monkey-driven resilience improvements have prevented hundreds of potential outages. When real AWS outages occurred (the 2011 US-East-1 outage, for example), Netflix was one of the few major AWS customers whose service continued operating because their systems were already designed and tested for instance failure. The chaos engineering philosophy spread across the industry, inspiring similar practices at Amazon, Google, Microsoft, and hundreds of other companies.

Common Mistakes
  • Running chaos experiments in production without automated rollback. Every experiment must have a clear abort procedure and automated rollback mechanism. If the experiment causes unexpected impact, you need to be able to stop it within seconds, not minutes.
  • Starting with large blast radius experiments. Beginning by killing an entire availability zone in production is reckless. Start with a single instance in staging, graduate to a single instance in production, and only expand blast radius after building confidence and observing successful results at smaller scales.
  • Treating chaos engineering as random destruction. Chaos engineering follows the scientific method: define hypothesis, run controlled experiment, observe results, learn. Randomly breaking things without a hypothesis, without observation, or without follow-up remediation is just vandalism, not engineering.
  • Only running chaos experiments in staging environments. Staging rarely mirrors production in terms of traffic patterns, data volume, infrastructure configuration, and team response. While staging is the right place to start, experiments must eventually run in production to provide real confidence in system resilience.
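The first mistake above, running without automated rollback, can be guarded against with an abort check around the fault. The sketch below is a simplified model under stated assumptions: error-rate samples arrive as a plain sequence (in practice they would be polled from monitoring every second or so), and the function name and 1% abort threshold are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Union

Report = Dict[str, Union[bool, int, Optional[float]]]


def run_with_abort(inject: Callable[[], None], rollback: Callable[[], None],
                   error_rates_pct: Iterable[float],
                   abort_above_pct: float) -> Report:
    """Keep the fault active while samples stay under the abort threshold."""
    seen = 0
    inject()
    try:
        for rate in error_rates_pct:
            seen += 1
            if rate > abort_above_pct:
                # Stop at the first bad sample instead of waiting for a human.
                return {"aborted": True, "samples_seen": seen,
                        "trigger_pct": rate}
        return {"aborted": False, "samples_seen": seen, "trigger_pct": None}
    finally:
        # Runs on normal completion, on abort, and on unexpected exceptions.
        rollback()


events: List[str] = []
report = run_with_abort(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    error_rates_pct=[0.1, 0.2, 4.5, 9.0],
    abort_above_pct=1.0,
)
print(report)  # prints {'aborted': True, 'samples_seen': 3, 'trigger_pct': 4.5}
print(events)  # prints ['inject', 'rollback']
```

Placing the rollback in a `finally` block means the fault is removed even if the monitoring code itself raises, which is the failure mode a manual abort procedure tends to miss.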
Test Your Understanding

1. What is the correct process for conducting a chaos engineering experiment?

2. Why does Netflix run Chaos Monkey during business hours rather than overnight?

3. What is a GameDay exercise in the context of chaos engineering?
