Chaos engineering is the discipline of deliberately injecting failures into production systems to discover weaknesses before they cause real outages. By running controlled experiments -- killing instances, injecting latency, partitioning networks -- teams build confidence that their systems can withstand turbulent conditions.
More formally, chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It emerged around 2010 from Netflix's migration to AWS, during which engineers realized that failures in cloud infrastructure were inevitable and frequent. Instead of hoping their systems could handle failures, Netflix chose to inject failures proactively during business hours, when engineers were available to respond, so that nobody would be surprised by them at 3 AM. This counterintuitive approach -- deliberately breaking things to make them more reliable -- has since been adopted by Amazon, Google, Microsoft, Uber, and many other organizations.
The chaos engineering process follows the scientific method. First, define the system's steady state -- the normal behavior measured by key metrics (request success rate, p99 latency, error rate). Second, hypothesize what will happen when a specific failure is injected: 'If we kill 3 of 12 instances of the recommendation service, the system should automatically route traffic to the remaining 9 instances with no user-visible impact.' Third, inject the failure: actually kill those 3 instances in production. Fourth, observe: did the metrics stay within acceptable bounds? If yes, your hypothesis was confirmed and you have higher confidence in the system's resilience. If no, you have discovered a weakness that needs to be fixed before it causes a real outage. The key insight is that you learn from both outcomes.
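The four steps above can be sketched as a small experiment runner. This is a minimal illustration, not a real chaos tool: `SteadyState`, `run_experiment`, and `FakeCluster` are hypothetical names, the thresholds are made up, and the cluster is an in-memory stand-in for a real service and its metrics backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class SteadyState:
    """Acceptable bounds for key metrics (threshold values are illustrative)."""
    min_success_rate: float
    max_p99_latency_ms: float

    def holds(self, metrics: Metrics) -> bool:
        return (metrics["success_rate"] >= self.min_success_rate
                and metrics["p99_latency_ms"] <= self.max_p99_latency_ms)


def run_experiment(steady_state: SteadyState,
                   read_metrics: Callable[[], Metrics],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    # Step 1: confirm the system is in steady state before injecting anything.
    if not steady_state.holds(read_metrics()):
        return "aborted"
    # Step 2 (hypothesis): the steady state still holds while the failure is active.
    try:
        # Step 3: inject the failure.
        inject()
        # Step 4: observe the metrics and compare them with the hypothesis.
        observed = read_metrics()
        return "confirmed" if steady_state.holds(observed) else "weakness_found"
    finally:
        # Always undo the failure, whatever the outcome.
        rollback()


class FakeCluster:
    """In-memory stand-in for a service; 'needed' is the capacity it requires."""

    def __init__(self, total: int, needed: int) -> None:
        self.total = total
        self.healthy = total
        self.needed = needed

    def metrics(self) -> Metrics:
        ok = self.healthy >= self.needed
        return {"success_rate": 0.999 if ok else 0.90,
                "p99_latency_ms": 180.0 if ok else 950.0}

    def kill(self, count: int) -> None:
        self.healthy = max(0, self.healthy - count)

    def restore(self) -> None:
        self.healthy = self.total


steady = SteadyState(min_success_rate=0.99, max_p99_latency_ms=500.0)
cluster = FakeCluster(total=12, needed=9)
print(run_experiment(steady, cluster.metrics, lambda: cluster.kill(3), cluster.restore))
# prints "confirmed": 9 healthy instances still cover the required capacity
```

Both return values are useful: `"confirmed"` raises confidence, while `"weakness_found"` points at something to fix. The `"aborted"` result reflects a common precaution of not starting an experiment on a system that is already unhealthy.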
Chaos experiments span a wide range of failure types, each testing different resilience mechanisms. Instance termination (Chaos Monkey) tests auto-scaling and load balancing. Latency injection (adding artificial delay to network calls) tests timeout configurations, circuit breakers, and graceful degradation. Network partition (blocking traffic between availability zones or services) tests how the system trades consistency against availability (its CAP behavior) and whether data replication recovers afterward. Resource exhaustion (filling disks, consuming CPU, allocating all memory) tests resource limits and whether monitoring raises alerts in time. DNS failure tests service discovery resilience. Clock skew tests systems that depend on synchronized time. Each experiment type reveals different categories of weaknesses.
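As one concrete case, latency injection can be sketched in a few lines. The example below wraps a call with an artificial delay and checks that the caller's timeout and fallback take over. All names here (`with_injected_latency`, `call_with_timeout`, `fetch_recommendations`) are hypothetical, and the in-process delay only stands in for real network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_injected_latency(func, delay_s):
    """Return a version of func that sleeps for delay_s before running."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # the injected fault: artificial delay
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Stand-in for a remote call to a recommendation service."""
    return [f"title-{user_id}-1", f"title-{user_id}-2"]


def call_with_timeout(func, timeout_s, fallback, *args):
    """Call func, but return fallback if no result arrives within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Graceful degradation: serve a generic answer instead of failing.
        return fallback
    finally:
        # Do not wait for the slow call, or the timeout would be pointless.
        pool.shutdown(wait=False)


slow_fetch = with_injected_latency(fetch_recommendations, delay_s=0.3)
popular = ["popular-1", "popular-2"]

print(call_with_timeout(fetch_recommendations, 0.1, popular, 42))  # real results
print(call_with_timeout(slow_fetch, 0.1, popular, 42))             # fallback list
```

If the second call returned the real results or hung, that would suggest the timeout is longer than the injected delay or is missing altogether, which is the kind of misconfiguration this experiment type is meant to expose.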
Blast radius control is essential. Chaos engineering is not about randomly breaking things in production and hoping for the best. Experiments should start with the smallest possible blast radius (single instance in a non-production environment), expand gradually as confidence grows (single instance in production, then multiple instances), and always have automated rollback mechanisms. Netflix runs Chaos Monkey continuously in production, but only during business hours, with a kill switch, and with teams on standby. GameDay exercises -- coordinated, scheduled events where teams conduct larger-scale chaos experiments -- provide a structured environment for testing more destructive scenarios. Amazon runs GameDay exercises before major events like Prime Day to validate their disaster recovery procedures.
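The gradual expansion described above can be sketched as a staged loop that stops at the first sign of trouble. The stage list, `run_staged`, and the kill-switch callback are illustrative assumptions, not the behavior of any particular tool.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, int]  # (environment, number of instances to terminate)

# Smallest blast radius first, expanding only as confidence grows.
STAGES: List[Stage] = [
    ("staging", 1),
    ("production", 1),
    ("production", 3),
]


def run_staged(stages: List[Stage],
               run_stage: Callable[[str, int], bool],
               kill_switch_engaged: Callable[[], bool]) -> List[Stage]:
    """Run stages in order; stop when the kill switch is on or a stage fails.

    run_stage returns True when the steady state held during the stage.
    The return value is the list of stages that completed successfully.
    """
    completed: List[Stage] = []
    for environment, count in stages:
        if kill_switch_engaged():
            break  # an operator halted all experiments
        if not run_stage(environment, count):
            break  # weakness found: fix it before widening the blast radius
        completed.append((environment, count))
    return completed


def fake_stage(environment: str, count: int) -> bool:
    """Pretend the system tolerates losing one instance but not three."""
    return count <= 1


print(run_staged(STAGES, fake_stage, lambda: False))
# prints [('staging', 1), ('production', 1)]
```

Checking the kill switch before every stage, not only once at the start, is what lets a team on standby halt an experiment partway through.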
The Fire Drill Analogy
A fire drill is chaos engineering for buildings. You do not wait for a real fire to discover that the emergency exit is blocked, the fire alarm batteries are dead, or employees do not know the evacuation route. Instead, you deliberately trigger the alarm during a normal workday, observe how people respond, and fix problems while the stakes are low. Similarly, chaos engineering triggers failures during business hours when engineers are available, revealing problems (misconfigured timeouts, untested failover paths, broken alerts) that are far cheaper to fix in a controlled experiment than during a real outage at peak traffic.
Netflix
Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates virtual machine instances in production during business hours. This forces all Netflix services to be designed for instance failure from the start. Netflix expanded the concept with the Simian Army: Chaos Gorilla (simulates entire availability zone failure), Latency Monkey (injects artificial delays), and Conformity Monkey (checks for instances that do not meet best practices). Chaos Monkey runs continuously in production, killing hundreds of instances daily.
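The core idea, picking a random instance but only while engineers are at work, can be sketched as follows. This is not Chaos Monkey's actual code or configuration: the 09:00 to 17:00 weekday window, the function names, and the instance IDs are assumptions made for illustration.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_business_hours(now: datetime) -> bool:
    """Assumed window: Monday to Friday, 09:00 up to (not including) 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victim(instances: Sequence[str],
                now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return a random instance to terminate, or None if it is not safe to act."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


rng = random.Random()
fleet = ["i-aaa111", "i-bbb222", "i-ccc333"]  # hypothetical instance IDs

print(pick_victim(fleet, datetime(2024, 1, 2, 10, 0), rng))  # a Tuesday morning: one of the IDs
print(pick_victim(fleet, datetime(2024, 1, 6, 10, 0), rng))  # a Saturday: None
```

A real tool would then call the cloud provider's API to terminate the chosen instance; that step is left out here on purpose.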
Amazon
Amazon conducts large-scale GameDay exercises before major events like Prime Day. During GameDays, teams inject failures including regional failovers, database failovers, service degradation, and network partitions to validate that the platform can handle real-world failure scenarios under production-level load. GameDays are coordinated events with all relevant teams on standby, and results feed directly into action items for improving resilience before the actual event.
Google
Google runs annual DiRT (Disaster Recovery Testing) exercises in which it injects major failures into its production infrastructure. Past exercises have included simulating complete data center loss, database corruption, and cascading service failures. These exercises are scheduled during business hours with full team participation, and the findings are documented and prioritized for remediation. DiRT exercises have uncovered issues with backup restoration procedures, failover timing, and monitoring gaps that would have been invisible without deliberate testing.
| Trade-off | Description |
|---|---|
| Risk of User Impact vs Discovery of Weaknesses | Running chaos experiments in production carries inherent risk of user-facing impact. A poorly controlled experiment can cause real outages. However, running experiments only in staging misses production-specific issues (real traffic patterns, real data volumes, real infrastructure configurations). The trade-off is managed by starting small, expanding gradually, and maintaining automated rollback. |
| Engineering Time vs Resilience Confidence | Building and maintaining a chaos engineering practice requires significant engineering investment: tooling, experiment design, result analysis, and remediation of discovered weaknesses. This time competes with feature development. The payoff is measured in outages prevented, which is inherently difficult to quantify -- you cannot count incidents that did not happen. |
| Continuous vs Scheduled Experiments | Continuous chaos (Chaos Monkey terminating instances every business day) provides ongoing confidence that new deployments do not introduce fragility, but increases the chance of unexpected impact. Scheduled experiments (monthly GameDays) are more controlled but only test resilience at specific points in time. The two are usually combined: continuous basic experiments (instance termination) alongside periodic comprehensive GameDays. |
| Organizational Culture vs Technical Readiness | Chaos engineering requires organizational buy-in: management must accept the risk of controlled incidents, and teams must treat discovered weaknesses as learning opportunities rather than blame. Organizations with blameful incident cultures struggle to adopt chaos engineering because nobody wants to be responsible for an experiment that causes impact. Cultural readiness is as important as technical readiness. |
Netflix Chaos Monkey -- Building Resilience into the Culture
Scenario
Around 2010, Netflix was in the middle of migrating from its own data centers to AWS. Cloud infrastructure was inherently less predictable than dedicated hardware: individual instances could be terminated at any time, network latency was variable, and services shared infrastructure with other AWS customers. Netflix's engineers were accustomed to stable hardware and had not designed services to handle frequent instance failures. Early AWS outages revealed that many Netflix services crashed or degraded severely when even a single instance was lost.
Solution
Netflix created Chaos Monkey, a tool that randomly terminates virtual machine instances in production during business hours. The premise was simple but radical: if Netflix services must survive random instance failures in production, the best way to ensure this is to cause random instance failures constantly. Engineers who deployed a service that could not survive instance termination would discover the problem within days, during business hours, with colleagues available to help fix it. Chaos Monkey made resilience a continuous concern rather than an afterthought. Netflix expanded the concept with the Simian Army, adding specialized tools for availability zone failure, latency injection, security compliance checking, and resource cleanup.
Outcome
Chaos Monkey fundamentally changed how Netflix engineers design services. Every Netflix service is built to survive instance termination because the alternative -- getting paged every time Chaos Monkey kills one of your instances -- is unacceptable. Netflix reports that Chaos Monkey-driven resilience improvements have prevented hundreds of potential outages. When real AWS outages occurred (the 2011 US-East-1 outage, for example), Netflix was one of the few major AWS customers whose service continued operating because their systems were already designed and tested for instance failure. The chaos engineering philosophy spread across the industry, inspiring similar practices at Amazon, Google, Microsoft, and hundreds of other companies.