1Why is it better to reject 10% of requests during overload than to accept all requests?
Load shedding intentionally rejects excess requests when a system is at or near capacity, ensuring that the requests it does process are served well. By proactively dropping traffic before the system becomes overloaded, load shedding prevents the degraded performance that affects all requests during overload.
Load shedding is the practice of intentionally rejecting incoming requests when a system is at or approaching its capacity limits. The fundamental insight is counterintuitive: it is better to reject some requests outright and serve the remaining requests well than to accept all requests and serve them all poorly. Without load shedding, an overloaded system exhibits a characteristic failure pattern: as load increases beyond capacity, response latency increases for all requests, timeouts begin firing, clients retry (adding more load), latency increases further, and the system enters a death spiral where it spends most of its resources on requests it will never finish. Load shedding breaks this pattern by maintaining a clear boundary between 'in capacity' (serve well) and 'over capacity' (reject fast).
The simplest load shedding strategy is random rejection: when the system detects overload (CPU > 80%, queue depth > threshold), it randomly rejects a percentage of incoming requests with HTTP 503 or 429, along with a Retry-After header. This is fair -- no client is systematically penalized -- but undiscriminating: a critical checkout request and a low-priority analytics ping are equally likely to be rejected. Priority-based shedding improves on this by assigning priority levels to different request types and shedding from the lowest priority first. Batch jobs are shed before search queries, search queries before product browsing, and product browsing before checkout. This ensures the highest-value user journeys remain functional even during overload.
More sophisticated approaches use queue theory to make shedding decisions. LIFO (Last In, First Out) queue processing ensures that when the queue is deep, the oldest requests (which have likely already timed out on the client side) are the ones discarded, while fresh requests are processed first. This avoids wasting compute on requests whose callers have already moved on. CoDel (Controlled Delay), developed at Google and inspired by networking research, takes a queue-aware approach: it monitors how long requests spend waiting in the queue, and when the minimum queue wait time exceeds a target (e.g., 5ms) for a sustained period, it begins dropping requests that have waited too long. CoDel is self-tuning -- it adapts to the system's actual capacity rather than requiring manually configured thresholds.
Effective load shedding requires client cooperation. The server should return machine-readable rejection responses: HTTP 429 (Too Many Requests) with a Retry-After header telling the client when to try again, or HTTP 503 (Service Unavailable) for transient overload. Well-behaved clients respect the Retry-After header and implement exponential backoff with jitter. Without client cooperation, rejected requests are immediately retried, adding to the load and defeating the purpose of shedding. The client-server contract around load shedding -- 429/503 responses, Retry-After headers, and backoff strategies -- is as important as the shedding algorithm itself.
The Nightclub Bouncer Analogy
A nightclub has a fire-code capacity of 500 people. When 500 people are inside, the bouncer stops letting people in -- even if there is a long line. This is load shedding. The bouncer is not being mean; they are preventing overcrowding that would make the experience terrible for everyone inside and potentially dangerous. If the bouncer let everyone in, the club would be uncomfortably packed, the bar would be overwhelmed, and nobody would have a good time. By maintaining a capacity boundary, the 500 people inside have a great experience, and people in line know they need to wait rather than being packed into an unpleasant situation.
Google uses CoDel-based load shedding in their Borg cluster manager to protect services from overload. CoDel monitors the minimum queue wait time over sliding intervals. When the minimum exceeds a target (typically 5ms) for a sustained period, it begins dropping requests that have queued for too long. This approach is self-tuning: it adapts to the system's actual capacity without requiring manually configured thresholds, and it naturally backs off when the overload subsides.
Uber
Uber implements priority-based load shedding to protect their ride-matching pipeline. During peak demand, non-critical requests like driver earnings history, trip analytics, and promotional notifications are shed first. Ride requests, the core business function, receive the highest priority and are shed last. This ensures that even during extreme load events (New Year's Eve, major sporting events), users can still request and receive rides, even if some peripheral features are temporarily unavailable.
Cloudflare
Cloudflare's Workers platform implements configurable rate-based load shedding at the edge. When incoming request rates exceed configured thresholds, excess requests are rejected with 429 responses at the edge PoP before they reach the origin server. This protects origin servers from traffic spikes -- whether from legitimate viral traffic or DDoS attacks -- and ensures that allowed requests receive full-quality responses without competing with excess load.
| Aspect | Description |
|---|---|
| Rejected Requests vs System Stability | Load shedding explicitly trades individual request success for system-wide stability. Some requests are intentionally failed so that the majority can be served well. This trade-off must be accepted at the organizational level -- it can be uncomfortable to intentionally return errors to users, but the alternative (all users experiencing poor performance) is worse. |
| Threshold Tuning Complexity | Setting the right shedding thresholds (CPU utilization, queue depth, request rate) requires understanding system capacity under various workloads. Too aggressive causes unnecessary rejections during normal traffic. Too conservative allows the system to enter overload before shedding activates. CoDel partially addresses this by being self-tuning, but simpler strategies require careful calibration. |
| Priority Classification Overhead | Priority-based shedding requires classifying every request type by priority, which adds metadata to requests and decision logic to the shedding layer. Priority misclassification can shed important traffic while preserving low-value traffic. Regular review of priority assignments is necessary as the system evolves and new endpoints are added. |
| Client Experience During Shedding | Even with proper HTTP 429/503 responses and Retry-After headers, rejected users experience failures. If the overload is sustained, users may be shed repeatedly, leading to frustration. Combining load shedding with graceful degradation (serve a reduced response instead of a full rejection) can improve the experience for shed requests. |
Uber -- Priority-Based Load Shedding During New Year's Eve
Scenario
During New Year's Eve, Uber experiences some of the highest ride-request volumes of the year. Traffic spikes 5-10x above normal levels within a 30-minute window as partygoers request rides simultaneously. Without load management, the surge overwhelmed backend services: ride-matching, pricing, driver dispatch, and numerous downstream microservices all competed for the same compute resources, causing elevated latency and timeout errors across all functions, including critical ride-matching.
Solution
Uber implemented a multi-tier priority-based load shedding system. All request types were classified into priority bands: P0 (ride request and matching -- never shed), P1 (payment processing, driver dispatch -- shed last), P2 (real-time map updates, ETA calculations -- shed under heavy load), P3 (analytics, earnings history, promotional content -- shed first). During overload, the system progressively sheds lower-priority traffic, maintaining headroom for P0 operations. Shedding decisions are made at the API gateway layer based on real-time CPU and latency metrics from backend services.
Outcome
During subsequent New Year's Eve peaks, Uber's ride-matching success rate remained above 99% even when overall traffic exceeded 8x normal levels. P3 traffic (analytics, earnings) was fully shed during the 30-minute peak window, P2 traffic was partially shed, and P0/P1 traffic was served at normal latency. Users could still request and receive rides; they simply could not check their ride history or earnings during the busiest window. This was a far better outcome than the previous years when all operations degraded simultaneously.
See Load Shedding in action
Explore system design templates that use load shedding and run traffic simulations to see how these concepts perform under real load.
Browse Templates1Why is it better to reject 10% of requests during overload than to accept all requests?
2What is the CoDel (Controlled Delay) approach to load shedding?
3In priority-based load shedding, which requests should be shed first?