Vetora logo
🏋️Reliability & Resilience

Load Shedding

Load shedding intentionally rejects excess requests when a system is at or near capacity, ensuring that the requests it does process are served well. By proactively dropping traffic before the system becomes overloaded, load shedding prevents the degraded performance that affects all requests during overload.

Overview

Load shedding is the practice of intentionally rejecting incoming requests when a system is at or approaching its capacity limits. The fundamental insight is counterintuitive: it is better to reject some requests outright and serve the remaining requests well than to accept all requests and serve them all poorly. Without load shedding, an overloaded system exhibits a characteristic failure pattern: as load increases beyond capacity, response latency increases for all requests, timeouts begin firing, clients retry (adding more load), latency increases further, and the system enters a death spiral where it spends most of its resources on requests it will never finish. Load shedding breaks this pattern by maintaining a clear boundary between 'in capacity' (serve well) and 'over capacity' (reject fast).

The simplest load shedding strategy is random rejection: when the system detects overload (CPU > 80%, queue depth > threshold), it randomly rejects a percentage of incoming requests with HTTP 503 or 429, along with a Retry-After header. This is fair -- no client is systematically penalized -- but undiscriminating: a critical checkout request and a low-priority analytics ping are equally likely to be rejected. Priority-based shedding improves on this by assigning priority levels to different request types and shedding from the lowest priority first. Batch jobs are shed before search queries, search queries before product browsing, and product browsing before checkout. This ensures the highest-value user journeys remain functional even during overload.

More sophisticated approaches use queue theory to make shedding decisions. LIFO (Last In, First Out) queue processing ensures that when the queue is deep, the oldest requests (which have likely already timed out on the client side) are the ones discarded, while fresh requests are processed first. This avoids wasting compute on requests whose callers have already moved on. CoDel (Controlled Delay), developed at Google and inspired by networking research, takes a queue-aware approach: it monitors how long requests spend waiting in the queue, and when the minimum queue wait time exceeds a target (e.g., 5ms) for a sustained period, it begins dropping requests that have waited too long. CoDel is self-tuning -- it adapts to the system's actual capacity rather than requiring manually configured thresholds.

Effective load shedding requires client cooperation. The server should return machine-readable rejection responses: HTTP 429 (Too Many Requests) with a Retry-After header telling the client when to try again, or HTTP 503 (Service Unavailable) for transient overload. Well-behaved clients respect the Retry-After header and implement exponential backoff with jitter. Without client cooperation, rejected requests are immediately retried, adding to the load and defeating the purpose of shedding. The client-server contract around load shedding -- 429/503 responses, Retry-After headers, and backoff strategies -- is as important as the shedding algorithm itself.

Key Points
  • 1Load shedding rejects excess requests to protect system stability. It is better to serve 90% of requests well than 100% poorly. Without shedding, overload causes latency spikes for all requests, triggering retries that further increase load.
  • 2Random rejection is the simplest strategy: reject a percentage of requests when overloaded. It is fair but does not differentiate between critical and non-critical traffic. Use as a baseline when request priority is unknown.
  • 3Priority-based shedding rejects low-priority traffic first. Classify requests by business value: batch jobs before search, search before browsing, browsing before checkout. This keeps the highest-value journeys functional during overload.
  • 4CoDel (Controlled Delay) monitors queue wait times and drops requests that have waited too long. It is self-tuning and adapts to actual system capacity. Google uses CoDel-based load shedding in their Borg cluster manager.
  • 5LIFO queue processing discards the oldest requests first. Since those requests have waited the longest, their callers have likely already timed out and retried. Processing them wastes compute on responses nobody will use.
  • 6Client cooperation is essential. Return HTTP 429 with Retry-After headers so clients know when to retry. Without proper signaling, rejected clients retry immediately, adding to the load and creating a retry storm.
Simple Example

The Nightclub Bouncer Analogy

A nightclub has a fire-code capacity of 500 people. When 500 people are inside, the bouncer stops letting people in -- even if there is a long line. This is load shedding. The bouncer is not being mean; they are preventing overcrowding that would make the experience terrible for everyone inside and potentially dangerous. If the bouncer let everyone in, the club would be uncomfortably packed, the bar would be overwhelmed, and nobody would have a good time. By maintaining a capacity boundary, the 500 people inside have a great experience, and people in line know they need to wait rather than being packed into an unpleasant situation.

Real-World Examples

Google

Google uses CoDel-based load shedding in their Borg cluster manager to protect services from overload. CoDel monitors the minimum queue wait time over sliding intervals. When the minimum exceeds a target (typically 5ms) for a sustained period, it begins dropping requests that have queued for too long. This approach is self-tuning: it adapts to the system's actual capacity without requiring manually configured thresholds, and it naturally backs off when the overload subsides.

Uber

Uber implements priority-based load shedding to protect their ride-matching pipeline. During peak demand, non-critical requests like driver earnings history, trip analytics, and promotional notifications are shed first. Ride requests, the core business function, receive the highest priority and are shed last. This ensures that even during extreme load events (New Year's Eve, major sporting events), users can still request and receive rides, even if some peripheral features are temporarily unavailable.

Cloudflare

Cloudflare's Workers platform implements configurable rate-based load shedding at the edge. When incoming request rates exceed configured thresholds, excess requests are rejected with 429 responses at the edge PoP before they reach the origin server. This protects origin servers from traffic spikes -- whether from legitimate viral traffic or DDoS attacks -- and ensures that allowed requests receive full-quality responses without competing with excess load.

Trade-Offs
AspectDescription
Rejected Requests vs System StabilityLoad shedding explicitly trades individual request success for system-wide stability. Some requests are intentionally failed so that the majority can be served well. This trade-off must be accepted at the organizational level -- it can be uncomfortable to intentionally return errors to users, but the alternative (all users experiencing poor performance) is worse.
Threshold Tuning ComplexitySetting the right shedding thresholds (CPU utilization, queue depth, request rate) requires understanding system capacity under various workloads. Too aggressive causes unnecessary rejections during normal traffic. Too conservative allows the system to enter overload before shedding activates. CoDel partially addresses this by being self-tuning, but simpler strategies require careful calibration.
Priority Classification OverheadPriority-based shedding requires classifying every request type by priority, which adds metadata to requests and decision logic to the shedding layer. Priority misclassification can shed important traffic while preserving low-value traffic. Regular review of priority assignments is necessary as the system evolves and new endpoints are added.
Client Experience During SheddingEven with proper HTTP 429/503 responses and Retry-After headers, rejected users experience failures. If the overload is sustained, users may be shed repeatedly, leading to frustration. Combining load shedding with graceful degradation (serve a reduced response instead of a full rejection) can improve the experience for shed requests.
Case Study

Uber -- Priority-Based Load Shedding During New Year's Eve

Scenario

During New Year's Eve, Uber experiences some of the highest ride-request volumes of the year. Traffic spikes 5-10x above normal levels within a 30-minute window as partygoers request rides simultaneously. Without load management, the surge overwhelmed backend services: ride-matching, pricing, driver dispatch, and numerous downstream microservices all competed for the same compute resources, causing elevated latency and timeout errors across all functions, including critical ride-matching.

Solution

Uber implemented a multi-tier priority-based load shedding system. All request types were classified into priority bands: P0 (ride request and matching -- never shed), P1 (payment processing, driver dispatch -- shed last), P2 (real-time map updates, ETA calculations -- shed under heavy load), P3 (analytics, earnings history, promotional content -- shed first). During overload, the system progressively sheds lower-priority traffic, maintaining headroom for P0 operations. Shedding decisions are made at the API gateway layer based on real-time CPU and latency metrics from backend services.

Outcome

During subsequent New Year's Eve peaks, Uber's ride-matching success rate remained above 99% even when overall traffic exceeded 8x normal levels. P3 traffic (analytics, earnings) was fully shed during the 30-minute peak window, P2 traffic was partially shed, and P0/P1 traffic was served at normal latency. Users could still request and receive rides; they simply could not check their ride history or earnings during the busiest window. This was a far better outcome than the previous years when all operations degraded simultaneously.

Common Mistakes
  • Not implementing load shedding at all, relying on the system to 'handle the load.' Every system has a capacity limit. Without load shedding, exceeding that limit degrades all requests rather than protecting the majority by rejecting the excess.
  • Shedding requests without proper HTTP status codes and Retry-After headers. Returning generic 500 errors causes clients to retry immediately, adding to the load. Use 429 (Too Many Requests) or 503 (Service Unavailable) with a Retry-After header to coordinate client behavior.
  • Using only random shedding without priority differentiation. If checkout requests and analytics pings are equally likely to be shed, you may reject revenue-generating traffic while preserving low-value requests. Classify requests by business value and shed lowest priority first.
  • Setting static shedding thresholds without monitoring. System capacity changes with code deployments, infrastructure changes, and traffic pattern shifts. Static thresholds become stale. Use adaptive approaches like CoDel or regularly recalibrate thresholds based on observed capacity.
Related Concepts

See Load Shedding in action

Explore system design templates that use load shedding and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Shed excess load to protect core checkout during traffic spikes

Metrics to watch
shed_request_pctaccepted_throughput_rpsp99_latency_mserror_rate_pct
Run Simulation
Test Your Understanding

1Why is it better to reject 10% of requests during overload than to accept all requests?

2What is the CoDel (Controlled Delay) approach to load shedding?

3In priority-based load shedding, which requests should be shed first?

Deeper Reading