
Circuit Breaker Pattern

The circuit breaker pattern prevents cascading failures in distributed systems by wrapping calls to remote services in a state machine that trips open when failures exceed a threshold, immediately rejecting requests instead of waiting for timeouts. It enables fast failure, protects downstream services from overload during recovery, and automatically probes for recovery.

Overview

The circuit breaker pattern, popularized by Michael Nygard in his 2007 book Release It!, is a stability pattern that prevents cascading failures in distributed systems by monitoring the health of downstream service calls and short-circuiting requests when a dependency is unhealthy. It operates as a three-state state machine: Closed (normal operation, requests flow through and failures are counted), Open (the breaker has tripped, all requests are immediately rejected without calling the downstream service), and Half-Open (a limited number of probe requests are allowed through to test whether the downstream service has recovered).

The core problem the circuit breaker solves is resource exhaustion during downstream failures. Without a circuit breaker, when a downstream service becomes slow or unresponsive, upstream callers continue sending requests that block on timeouts. Each blocked request holds a thread, a connection, and memory. If the upstream service has a thread pool of 200 threads and the downstream timeout is 30 seconds, roughly 7 requests per second to the failing dependency (200 threads / 30 seconds) are enough to occupy all 200 threads within 30 seconds, and higher request rates exhaust the pool sooner. Once the pool is exhausted, the upstream service is unresponsive to all requests -- even those that do not depend on the failing downstream. This cascading failure pattern can propagate through an entire microservice graph within minutes, turning a single service degradation into a full platform outage.
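The thread-pool arithmetic above can be checked with a small calculation. This is a rough sketch under simplifying assumptions: every request to the failing dependency blocks for the full timeout, and requests arrive at a steady rate. The function name and parameters are illustrative only.

```python
from typing import Optional


def seconds_to_exhaust(pool_size: int, timeout_s: float,
                       hanging_rps: float) -> Optional[float]:
    """Time until every worker thread is blocked, or None if the pool survives.

    Assumes each request to the failing dependency blocks for the full
    timeout and that such requests arrive at a steady rate (hanging_rps).
    """
    if hanging_rps <= 0:
        return None
    # Blocked threads grow by hanging_rps per second until the first
    # timeouts fire at timeout_s; after that the count levels off.
    peak_blocked = hanging_rps * timeout_s
    if peak_blocked < pool_size:
        return None  # timeouts free threads fast enough to keep up
    return pool_size / hanging_rps


print(seconds_to_exhaust(200, 30, 10))  # 20.0
print(seconds_to_exhaust(200, 30, 5))   # None
```

With a 200-thread pool and a 30-second timeout, 10 hanging requests per second block every thread after 20 seconds, while 5 per second peak at 150 blocked threads and never exhaust the pool. Shortening the timeout raises the rate the pool can absorb, which is one reason timeouts and circuit breakers are tuned together.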

The circuit breaker prevents this by failing fast. When the failure rate crosses a configurable threshold (for example, 50% of requests failing within a 10-second window), the breaker trips to the Open state. In this state, calls return immediately with an error or fallback response, freeing threads within milliseconds instead of blocking for the full timeout duration. After a configurable cooldown period (typically 30-60 seconds), the breaker transitions to Half-Open and allows a small number of probe requests through. If these probes succeed, the breaker resets to Closed and normal traffic resumes. If they fail, the breaker returns to Open for another cooldown period.
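The state transitions described above can be sketched as a small class. This is a minimal, single-threaded illustration rather than a production implementation: the class and parameter names are hypothetical, it uses a count-based window of the last N calls instead of a time window, and it does not cap how many calls run concurrently while Half-Open.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window=10, cooldown_s=30.0,
                 probes=2, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.cooldown_s = cooldown_s
        self.probes = probes          # successes needed to close again
        self.clock = clock            # injectable so tests can fake time
        self.state = State.CLOSED
        self.results = deque(maxlen=window)  # True marks a failed call
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit is open")  # fail fast
            self.state = State.HALF_OPEN  # cooldown over: start probing
            self.probe_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe reopens the circuit
            return
        self.results.append(True)
        # Wait for a full window so one early failure cannot trip it.
        full = len(self.results) == self.results.maxlen
        if full and sum(self.results) / len(self.results) >= self.failure_rate:
            self._trip()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.probes:
                self.state = State.CLOSED
                self.results.clear()
        else:
            self.results.append(False)

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.results.clear()
```

A caller would wrap each remote call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` stands in for any remote call) and treat `CircuitOpenError` as the signal to return an error or a fallback. A real implementation would also need locking for concurrent callers.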

Implementation options include Resilience4j (the modern JVM standard, replacing Netflix Hystrix which entered maintenance mode in 2018), Polly for .NET, and custom middleware in languages like Go and Python. In a service mesh architecture, circuit breaking can be implemented at the infrastructure layer using Envoy proxy or Istio, removing the need for application-level libraries. Key metrics to monitor include error rate percentage, circuit state transitions, response time percentiles during each state, and the ratio of rejected requests during the Open state. Netflix pioneered large-scale circuit breaker adoption with Hystrix, processing over 10 billion circuit-breaker decisions per day at peak, with each of their 1000+ microservices wrapped in circuit breakers to prevent any single dependency from taking down the platform.

Key Points
  • The circuit breaker has three states: Closed (normal, failures counted), Open (requests immediately rejected), and Half-Open (limited probe requests test recovery). Transitions are driven by failure thresholds and cooldown timers.
  • Without circuit breakers, a slow downstream service can consume all threads in upstream callers through timeout accumulation. A 200-thread pool with 30-second timeouts can be fully exhausted by a single failing dependency in under a minute.
  • Failure detection uses a sliding window -- either count-based (last N calls) or time-based (last N seconds). The threshold should account for normal error rates; tripping at 1% errors when baseline is 0.5% causes unnecessary disruptions.
  • The Half-Open state is critical for automatic recovery. Without it, the circuit would require manual intervention to close after a downstream service recovers, defeating the purpose of automated resilience.
  • Circuit breakers complement but do not replace timeouts. The timeout prevents individual calls from blocking indefinitely; the circuit breaker prevents accumulation of many timed-out calls from exhausting the caller's resources.
  • In service mesh architectures (Istio, Envoy), circuit breaking moves from application code to infrastructure configuration, providing consistent behavior across all services regardless of language or framework.
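The time-based sliding window mentioned in the key points can be sketched as follows. The class name and the `min_calls` guard are assumptions made for illustration; the guard simply avoids judging the failure rate from a handful of calls. A count-based window would replace the timestamped deque with a fixed-length one holding the last N results.

```python
import time
from collections import deque


class TimeWindowFailureRate:
    """Failure rate over the last `window_s` seconds (time-based window)."""

    def __init__(self, window_s=10.0, min_calls=20, clock=time.monotonic):
        self.window_s = window_s
        self.min_calls = min_calls  # hypothetical guard against tiny samples
        self.clock = clock
        self.events = deque()       # (timestamp, failed) pairs, oldest first

    def record(self, failed: bool) -> None:
        self.events.append((self.clock(), failed))
        self._evict()

    def _evict(self) -> None:
        # Drop every event that has aged out of the window.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def failure_rate(self):
        """Return the rate, or None if there are too few calls to judge."""
        self._evict()
        if len(self.events) < self.min_calls:
            return None
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)

    def should_trip(self, threshold: float) -> bool:
        rate = self.failure_rate()
        return rate is not None and rate >= threshold
```

Setting the threshold well above the measured baseline error rate, as the key points recommend, keeps `should_trip` from firing on ordinary transient errors.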
Simple Example

The Electrical Circuit Breaker Analogy

A circuit breaker in your home works in much the same way. Normally, electricity flows through (Closed state). If too much current flows -- perhaps from a short circuit -- the breaker trips and cuts the circuit (Open state), protecting your house from fire. You do not keep pushing more electricity through a short circuit hoping it fixes itself. After some time, you flip the breaker back on to test if the problem is resolved (Half-Open state). If the current flows normally, great -- the breaker stays on. If it immediately trips again, you know the underlying problem is not fixed and the breaker stays off. The software pattern follows the same logic, with one difference: it protects your system from being taken down by a failing dependency, and it tests for recovery automatically rather than waiting for someone to flip a switch.

Real-World Examples

Netflix

Netflix developed Hystrix, one of the most widely adopted circuit breaker libraries, to protect their microservice architecture from cascading failures. At peak, Hystrix processed over 10 billion circuit-breaker decisions per day across 1000+ microservices. Every external dependency call -- from recommendation engines to user profile services -- was wrapped in a Hystrix command with per-dependency thread pools and circuit breaker thresholds. When Netflix moved Hystrix into maintenance mode in 2018, the patterns it established became the foundation for Resilience4j and similar libraries.

Uber

Uber implements circuit breakers in every microservice gateway to protect against cascading failures during peak ride-request periods. Their circuit breakers use adaptive thresholds that adjust based on current traffic volume -- a 5% error rate at 100 requests per second may be noise, but 5% at 50,000 RPS represents a real problem. Uber combines circuit breakers with load shedding to ensure that ride-matching (critical path) remains available even when non-critical services like analytics or driver earnings history experience failures.

Shopify

Shopify uses circuit breakers extensively to protect their payment processing pipeline during flash sales, when traffic can spike 100x within seconds. Their circuit breakers wrap calls to external payment providers (Stripe, PayPal, bank APIs), with per-provider breakers allowing one provider's outage to be isolated while others continue processing. During Black Friday/Cyber Monday events, circuit breakers have prevented single payment provider slowdowns from blocking the entire checkout pipeline.

Trade-Offs
  • Fast Failure vs Potential Recovery: An aggressively configured circuit breaker (low threshold, long cooldown) rejects requests quickly but may stay open longer than necessary, rejecting valid requests even after the downstream has recovered. A conservatively configured breaker (high threshold, short cooldown) allows more failing requests through before tripping, consuming more resources but avoiding false positives.
  • Complexity vs Resilience: Circuit breakers add operational complexity: each breaker needs tuned thresholds, monitoring dashboards, and alerting. In a system with hundreds of microservices, each calling multiple dependencies, the number of circuit breakers to configure and monitor can reach into the thousands. Service mesh implementations reduce per-service complexity but add infrastructure-level complexity.
  • Fallback Quality vs User Experience: When a circuit breaker is open, the system must decide what to return: an error, cached data, a default response, or a degraded feature. Cached or default responses may be stale or incomplete. Returning errors is honest but degrades user experience. The fallback strategy must be designed and tested for each dependency.
  • Per-Dependency Isolation vs Resource Efficiency: Thread-pool isolation (each dependency gets its own thread pool) provides the strongest isolation but consumes more memory and threads. Semaphore isolation (shared threads with concurrency limits) is more resource-efficient but provides weaker isolation -- a slow dependency can still block shared threads up to the semaphore limit.
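Semaphore isolation, the lighter of the two isolation styles above, can be sketched with a bounded semaphore. The class and exception names are hypothetical. The call still runs on the caller's own thread, which is why a slow dependency can hold shared threads up to the limit.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the dependency already has max_concurrent calls in flight."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so excess callers are not parked behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("concurrency limit reached")
        try:
            return fn(*args, **kwargs)  # runs on the caller's own thread
        finally:
            self._slots.release()
```

The semaphore only caps how many callers can be stuck at once; it cannot interrupt a call that is already blocked, so it relies on the call having its own timeout.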
Case Study

Netflix Hystrix -- Preventing Cascading Failure at Scale

Scenario

Netflix's microservice architecture grew to over 1000 services making billions of inter-service calls daily. A single slow or failing service -- such as the recommendation engine experiencing elevated latency due to a bad deployment -- would cause upstream services to accumulate blocked threads waiting for responses. This cascading effect could propagate through the entire call graph, turning a single-service issue into a platform-wide outage affecting millions of users.

Solution

Netflix built Hystrix, an open-source circuit breaker library that wrapped every external dependency call in a command object with its own thread pool, timeout, and circuit breaker. Each dependency was isolated: the recommendation service got a thread pool of 20 threads, the user profile service got 15, and so on. If the recommendation service slowed down, only its 20 threads would be consumed -- the remaining services continued operating normally. Circuit breakers tripped at a 50% error rate within a 10-second rolling window, rejecting further calls for 5 seconds before probing for recovery.
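The per-dependency thread pools described above can be approximated in Python with one executor per dependency. This is an illustration of the idea, not Hystrix's implementation: the pool sizes come from the example in the text, while the dependency names, `call_isolated`, and `slow_recommendations` are hypothetical. Unlike Hystrix, `ThreadPoolExecutor` queues work without bound instead of rejecting it when the pool is full.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# One pool per dependency, sized as in the example in the text.
POOLS = {
    "recommendations": ThreadPoolExecutor(max_workers=20),
    "user_profile": ThreadPoolExecutor(max_workers=15),
}


def call_isolated(dependency, fn, timeout_s, fallback=None):
    """Run fn on the dependency's own pool; return fallback on timeout."""
    future = POOLS[dependency].submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started running yet;
        # a running task keeps its pool thread until it finishes.
        future.cancel()
        return fallback


def slow_recommendations():
    time.sleep(0.3)  # stand-in for a degraded downstream service
    return ["personalized"]


print(call_isolated("recommendations", slow_recommendations,
                    timeout_s=0.05, fallback=["top-10"]))  # ['top-10']
print(call_isolated("user_profile", lambda: {"id": 1}, timeout_s=1.0))
```

A slow recommendation service ties up only threads from its own pool, so calls routed through the `user_profile` pool keep completing normally.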

Outcome

Hystrix reduced cascading failures at Netflix by over 90%. During major incidents, the blast radius was contained to the specific failing dependency and its direct consumers, while the rest of the platform -- including video playback, the most critical service -- continued operating normally. Hystrix processed over 10 billion circuit-breaker decisions per day and became the industry standard for microservice resilience, inspiring similar implementations across the industry.

Common Mistakes
  • Setting the failure threshold too low, causing the circuit breaker to trip on normal transient errors. If a service has a baseline error rate of 0.5%, setting the threshold at 1% will cause frequent false trips. Analyze baseline error rates before configuring thresholds.
  • Not implementing fallback behavior when the circuit is open. A circuit breaker that simply throws an exception when open shifts the problem upstream without improving user experience. Design meaningful fallbacks: cached data, default responses, or graceful feature degradation.
  • Using a single circuit breaker for multiple downstream services. If service A calls both the payment service and the recommendation service through one breaker, a payment service failure will also block recommendation calls. Use per-dependency circuit breakers for proper isolation.
  • Ignoring the half-open state configuration. If the half-open state allows too many probe requests, a recovering service can be overwhelmed. If it allows too few, recovery detection is slow. Typically 1-5 probe requests is appropriate for the half-open state.
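Two of the mistakes above, missing fallbacks and a shared breaker, can be avoided with a small registry that keeps one breaker per dependency and always takes a fallback. This sketch is deliberately simplified: `SimpleBreaker` opens after a number of consecutive failures rather than on a failure rate, it does not cap the number of probe calls after the cooldown, and all names are hypothetical.

```python
import time


class SimpleBreaker:
    """Deliberately tiny breaker: opens after N consecutive failures."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let calls through again as probes.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed probe


breakers = {}  # one breaker per dependency name


def call_with_fallback(dependency, fn, fallback):
    breaker = breakers.setdefault(dependency, SimpleBreaker())
    if not breaker.allow():
        return fallback()  # open: skip the call and serve the fallback
    try:
        result = fn()
    except Exception:
        breaker.record(ok=False)
        return fallback()
    breaker.record(ok=True)
    return result
```

Because each dependency name maps to its own breaker, repeated failures recorded under `"payments"` never block calls made under `"recommendations"`.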
Test Your Understanding

1. What happens when a circuit breaker transitions from Closed to Open state?

2. Why is the Half-Open state important in the circuit breaker pattern?

3. Which failure mode does the circuit breaker pattern primarily prevent?
