Reliability & Resilience

Circuit breakers, retries, chaos engineering, and graceful degradation.

Concepts

Circuit Breaker Pattern

The circuit breaker pattern prevents cascading failures in distributed systems by wrapping calls to remote services in a state machine that trips open when failures exceed a threshold, immediately rejecting requests instead of waiting for timeouts. It enables fast failure, protects downstream services from overload during recovery, and automatically probes for recovery.
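
The state machine described above can be sketched roughly as follows. This is a minimal, single-threaded illustration, not a production implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the choice to count consecutive failures are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical sketch: names and defaults are illustrative assumptions.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a timeout.
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # let this one call through as a recovery probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical client function, and treat `CircuitOpenError` as a signal to fall back or return an error immediately.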

Bulkhead Pattern [P0]

The bulkhead pattern isolates system components into independent compartments so that a failure in one component does not exhaust shared resources and bring down the entire system. Named after ship bulkheads that contain flooding, this pattern uses thread pool isolation, semaphore limits, separate processes, and connection pool partitioning to prevent cascading failures.
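
One of the techniques listed above, semaphore limits, might look like the sketch below. The `Bulkhead` class and its `max_concurrent` parameter are hypothetical names chosen for illustration. The example rejects calls when the compartment is full rather than queueing them, which is one possible policy among several.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    # Hypothetical sketch: one instance per dependency, so a slow dependency
    # can occupy only its own slots and not every worker in the process.
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each dependency its own `Bulkhead` instance keeps the limits independent, which is the point of the pattern.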

Timeouts and Deadline Propagation [P0]

Every network call must have a timeout so that a stalled dependency cannot hold threads, connections, or memory indefinitely. Deadline propagation passes the remaining time budget through the entire call chain, ensuring downstream services do not start work they cannot finish. Together, timeouts and deadlines are among the most fundamental reliability mechanisms in distributed systems.
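
A minimal sketch of a propagated time budget is shown below. The `Deadline` class and the `timeout_for_call` method are hypothetical names. A real system would typically also serialize the remaining budget into a request header or RPC metadata, which is not shown here.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when no time budget remains for further work."""


class Deadline:
    # Hypothetical sketch: created once at the edge, then passed down the call chain.
    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap_s):
        """Timeout for the next downstream call: the smaller of the
        per-call cap and whatever is left of the overall budget."""
        remaining = self.remaining()
        if remaining <= 0:
            # Do not start work that cannot finish in time.
            raise DeadlineExceeded("no time budget left")
        return min(remaining, per_call_cap_s)
```

Each hop would call `timeout_for_call` before issuing its own request, so a service late in the chain sees a shrinking budget rather than a fresh full timeout.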

Retry with Exponential Backoff and Jitter [P0]

Retrying transient failures with exponential backoff avoids overwhelming a recovering service by increasing wait times between attempts. Adding jitter randomizes retry timing to prevent thundering herd problems where all clients retry simultaneously. Combined with idempotency and retry budgets, this pattern is fundamental to reliable distributed communication.
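
The sketch below illustrates one common variant, sometimes called "full jitter", where each delay is drawn uniformly between zero and an exponentially growing cap. The `TransientError` type, the `retry` helper, and the default values are assumptions for the example. Retry budgets and idempotency checks are not shown.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for whatever errors the caller considers safe to retry."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Delay before the next retry: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry(func, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep, rng=random):
    """Call func, retrying only TransientError, up to max_attempts in total."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. The operation passed to `retry` should be idempotent, since it may run more than once.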

Graceful Degradation [P0]

Graceful degradation is the practice of serving reduced-quality but functional responses when a dependency fails, rather than returning errors. By falling back to cached data, disabling non-critical features, or returning partial results, systems maintain core functionality during partial outages and provide a significantly better user experience than hard failures.
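
As a rough illustration of the cached-data fallback, the sketch below serves the last known good result when the live dependency fails. The function `get_recommendations`, the `fetch_live` callable, and the dictionary used as a cache are all hypothetical. A real service would usually catch narrower exception types and bound the age of cached entries.

```python
def get_recommendations(user_id, fetch_live, cache):
    """Return live results when possible, otherwise a degraded response.

    fetch_live: callable taking user_id (assumed to raise on dependency failure).
    cache: dict-like store of the last successful result per user.
    """
    try:
        items = fetch_live(user_id)
        cache[user_id] = items  # remember the last good answer
        return {"items": items, "degraded": False}
    except Exception:
        # Broad catch for brevity in this sketch only.
        # Fall back to stale data, or to an empty list if nothing is cached.
        return {"items": cache.get(user_id, []), "degraded": True}
```

The `degraded` flag lets the caller decide whether to hide a non-critical feature or mark the content as possibly stale.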

Load Shedding [P0]

Load shedding intentionally rejects excess requests when a system is at or near capacity, ensuring that the requests it does process are served well. By proactively dropping traffic before the system becomes overloaded, load shedding prevents the degraded performance that affects all requests during overload.
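
One possible shedding policy is sketched below: less important requests are dropped first as utilization rises, and everything is rejected at full capacity. The priority levels and the thresholds in `SHED_THRESHOLDS` are made-up values for illustration, not recommendations.

```python
# Hypothetical thresholds: utilization at which each priority starts being shed.
# Priority 0 is the most critical; higher numbers are less important.
SHED_THRESHOLDS = {0: 1.0, 1: 0.9, 2: 0.7}


def should_shed(in_flight, capacity, priority):
    """Return True if a new request of this priority should be rejected."""
    utilization = in_flight / capacity
    # Unknown priorities are treated like the least important class.
    threshold = SHED_THRESHOLDS.get(priority, min(SHED_THRESHOLDS.values()))
    return utilization >= threshold
```

A server would typically evaluate this before doing any real work on a request and answer shed requests with a cheap rejection, for example HTTP 503.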

Rate Limiting [P0]

Rate limiting controls the number of requests a client can make within a time window, protecting services from abuse, ensuring fair resource usage, and preventing resource exhaustion. Algorithms range from simple fixed window counters to sophisticated token bucket and sliding window approaches, each with distinct trade-offs for burst handling, memory usage, and accuracy.
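
Of the algorithms mentioned above, the token bucket can be sketched as follows. The `TokenBucket` class and its parameters are illustrative names. The sketch is not thread-safe and keeps state in memory for a single client, so a real deployment would need locking and per-client buckets.

```python
import time


class TokenBucket:
    # Hypothetical sketch: `rate` tokens are added per second, up to `capacity`.
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `capacity` parameter sets how large a burst is tolerated, while `rate` sets the sustained request rate. This is the burst-handling trade-off that distinguishes the token bucket from a fixed window counter.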

Chaos Engineering [P0]

Chaos engineering is the discipline of deliberately injecting failures into production systems to discover weaknesses before they cause real outages. By running controlled experiments -- killing instances, injecting latency, partitioning networks -- teams build confidence that their systems can withstand turbulent conditions.
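
At the application level, a controlled experiment of this kind could be approximated with a small fault-injection wrapper like the one below. The `with_chaos` helper and its parameters are hypothetical. Real chaos experiments usually act on infrastructure (instances, networks) with dedicated tooling, which this sketch does not attempt to model.

```python
import random
import time


class InjectedFault(Exception):
    """Marks a failure that was injected deliberately, not a real outage."""


def with_chaos(func, failure_rate=0.0, latency_s=0.0, sleep=time.sleep, rng=random):
    """Wrap func so that calls gain extra latency and fail with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return func(*args, **kwargs)
    return wrapper
```

Keeping `failure_rate` small and the wrapper easy to switch off is one way to limit the blast radius of an experiment.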

Disaster Recovery (DR) [P0]

Disaster recovery encompasses the strategies, processes, and infrastructure for recovering from catastrophic failures such as regional outages, data corruption, or ransomware attacks. DR planning centers on two key metrics: the recovery point objective (RPO, how much data loss is acceptable) and the recovery time objective (RTO, how much downtime is acceptable), which together determine the cost and complexity of the DR strategy.

Multi-Region Deployment Strategies [P0]

Multi-region deployment runs application infrastructure across multiple geographic regions to improve availability, reduce latency for global users, and meet data sovereignty compliance requirements. Strategies range from simple active-passive failover to complex active-active architectures, each with distinct trade-offs for data consistency, operational complexity, and cost.