Retry with Exponential Backoff and Jitter

Retrying transient failures with exponential backoff avoids overwhelming a recovering service by increasing wait times between attempts. Adding jitter randomizes retry timing to prevent thundering herd problems where all clients retry simultaneously. Combined with idempotency and retry budgets, this pattern is fundamental to reliable distributed communication.

Overview

Transient failures are a fact of life in distributed systems. Network packets are dropped, TCP connections are reset, servers return 503 during rolling deployments, and garbage collection pauses cause temporary unresponsiveness. These failures are temporary by nature -- the same request that failed will likely succeed if sent again a moment later. Retrying is the correct response to transient failures, but naive retries (retrying immediately, as fast as possible) can do more harm than good: they flood a recovering service with retry traffic and can turn a brief hiccup into a prolonged outage.

Exponential backoff addresses this by increasing the wait time between retry attempts: wait 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. This gives the failing service progressively more time to recover between waves of retries. However, exponential backoff alone has a critical flaw: if 1000 clients all experience a failure at the same time (because the downstream service restarted), they all retry at exactly the same intervals -- 1s, 2s, 4s -- creating synchronized waves of traffic that hit the recovering service simultaneously. This thundering herd effect can prevent recovery entirely.
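As a minimal sketch of the schedule described above, the delay before each retry can be computed as base * 2^attempt, bounded by a maximum delay. The function name `backoff_delay` and the 30-second cap are illustrative assumptions, not values from any particular library.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with no jitter.

    `base` and `cap` are illustrative defaults chosen for this sketch.
    """
    return min(cap, base * (2 ** attempt))


# Every client that failed at the same instant computes the same schedule,
# which is exactly why backoff alone produces synchronized retry waves.
schedule = [backoff_delay(n) for n in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The cap keeps the wait from growing without bound once the exponential term exceeds it.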

Jitter solves the thundering herd by adding randomness to the retry interval. There are several jitter strategies. Full jitter randomizes the wait time between 0 and the exponential backoff value (random(0, base * 2^attempt)), spreading retries uniformly across the entire window. Equal jitter uses half the backoff value as a fixed minimum plus a random component (base * 2^attempt / 2 + random(0, base * 2^attempt / 2)), providing a guaranteed minimum wait. Decorrelated jitter (described in the AWS Architecture Blog) derives each wait from the previous wait rather than from the attempt number (min(cap, random(base, previous_wait * 3))). In AWS's published simulation, full jitter needed the fewest total calls for all clients to succeed, decorrelated jitter was close behind, and equal jitter did more work than full jitter.
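The three formulas above can be sketched as small functions. This is an illustrative sketch only: the function names, the default `base` and `cap` values, and the `rng` parameter (added so the randomness can be seeded) are assumptions made for the example. The cap is applied to all three strategies here, although the formulas in the text show it only for decorrelated jitter.

```python
import random


def full_jitter(attempt, base=1.0, cap=30.0, rng=random):
    """random(0, base * 2^attempt), bounded by `cap`."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


def equal_jitter(attempt, base=1.0, cap=30.0, rng=random):
    """Half the backoff as a fixed minimum plus a random half."""
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling / 2 + rng.uniform(0, ceiling / 2)


def decorrelated_jitter(previous_wait, base=1.0, cap=30.0, rng=random):
    """min(cap, random(base, previous_wait * 3)); depends on the last wait."""
    return min(cap, rng.uniform(base, previous_wait * 3))


rng = random.Random(7)  # seeded only to make the demo repeatable
wait = 1.0              # decorrelated jitter starts from the base delay
for attempt in range(4):
    wait = decorrelated_jitter(wait, rng=rng)
    print(attempt,
          round(full_jitter(attempt, rng=rng), 2),
          round(equal_jitter(attempt, rng=rng), 2),
          round(wait, 2))
```

Note that decorrelated jitter carries state (the previous wait) between attempts, while full and equal jitter need only the attempt number.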

Retrying safely requires two additional mechanisms: idempotency and retry budgets. Idempotency ensures that if a request is sent twice, the side effect occurs only once. GET, PUT, and DELETE operations are idempotent by definition in HTTP. For non-idempotent operations like payment charges, clients should include an idempotency key (a unique identifier per logical operation) so the server can detect and deduplicate retried requests. Stripe, for example, supports idempotency keys on all POST endpoints -- if a payment request times out and the client retries with the same key, Stripe returns the original response instead of charging the customer twice.

Retry budgets limit the total fraction of traffic that consists of retries across the entire fleet (for example, no more than 10% of all requests are retries). This prevents retry amplification: if every client retries up to 3 times, a 10% failure rate can generate up to 30% additional traffic, which can push the system past its capacity and increase the failure rate further, creating a death spiral.
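The idempotency-key deduplication described above can be sketched with an in-memory server. Everything here is hypothetical: `PaymentServer`, its `charge` method, and the response fields are invented for illustration and are not Stripe's API. A real implementation would also need durable storage, key expiration, and handling for concurrent requests that share a key.

```python
import uuid


class PaymentServer:
    """Toy server that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> response of the first request
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key means this is a retry: return the stored response
        # instead of performing the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        response = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


server = PaymentServer()
key = str(uuid.uuid4())            # one key per logical operation
first = server.charge(key, 1999)
retry = server.charge(key, 1999)   # e.g. the client timed out and retried
print(first == retry, len(server.charges))  # True 1
```

The client generates the key once per logical operation and reuses it on every retry; generating a fresh key per attempt would defeat the deduplication.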

Key Points
  • Exponential backoff increases wait time between retries (1s, 2s, 4s, 8s...) to give recovering services progressively more breathing room. Always set a maximum retry count and a maximum delay cap to bound total retry duration.
  • Jitter adds randomness to retry timing to prevent thundering herd. Without jitter, all clients that experienced a simultaneous failure retry at the same moments, creating traffic spikes that prevent recovery.
  • Full jitter (random(0, base * 2^attempt)) spreads retries most evenly and, in AWS's published analysis, needed the fewest total calls for all clients to succeed. Decorrelated jitter performs comparably. Equal jitter guarantees a minimum wait between retries.
  • Only idempotent operations are safe to retry as-is. GET, PUT, and DELETE are typically safe. For non-idempotent operations (payments, order creation), use idempotency keys so the server can deduplicate retried requests and avoid double-processing.
  • Retry budgets limit total retries across the fleet (e.g., max 10% of requests are retries). This prevents retry amplification where retries increase server load, causing more failures, causing more retries, creating a death spiral.
  • Only retry transient errors (503, 429, connection reset, timeout). Never retry other client errors (400, 401, 403, 404) or permanent server errors -- these will fail on every attempt and waste resources.
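The points above can be combined into one retry loop: a bounded number of attempts, a capped exponential delay with full jitter, and retries only for transient failures. This is a sketch under stated assumptions: `call_with_retries` and the shape of `operation` (a callable returning a status code and a body) are hypothetical, and the `sleep` and `rng` parameters exist only so the example can be exercised without real waiting.

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # the transient statuses named in the list above


def call_with_retries(operation, max_attempts=5, base=1.0, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Call `operation` until it succeeds, fails permanently, or attempts run out.

    `operation` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        try:
            status, body = operation()
        except (ConnectionError, TimeoutError):
            status, body = None, None  # connection reset or timeout: transient
        if status is not None and status < 400:
            return body
        if status is not None and status not in RETRYABLE_STATUS:
            # 400, 401, 403, 404 and similar will fail the same way every time.
            raise ValueError(f"non-retryable response: HTTP {status}")
        if attempt == max_attempts - 1:
            raise RuntimeError(f"gave up after {max_attempts} attempts")
        # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
        sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))


# Usage: two transient failures, then success. Sleeps are recorded, not waited.
responses = iter([(503, None), (429, None), (200, "ok")])
waits = []
print(call_with_retries(lambda: next(responses), sleep=waits.append))  # ok
```

The loop assumes `operation` is idempotent or carries an idempotency key; it does not make an unsafe operation safe to repeat.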
Simple Example

The Phone Call Analogy

You call a friend and get a busy signal (transient failure). If you immediately redial continuously, you are just adding to the congestion. Instead, you wait 1 minute and try again (first backoff). Still busy? Wait 2 minutes (exponential backoff). Now imagine 50 people are all trying to call the same person and all got busy at the same time. If they all wait exactly 1 minute, they all call back simultaneously and get busy again. If each person waits a random time between 0 and 2 minutes (jitter), the calls are spread out and have a much better chance of getting through. And you set a limit: you will try at most 5 times before leaving a voicemail (max retries).

Real-World Examples

AWS SDK

The AWS SDKs retry failed API calls using exponential backoff with jitter by default. After analysis published in the AWS Architecture Blog, AWS concluded that full jitter (random(0, base * 2^attempt)) offers a good balance of retry spread and completion time. The SDKs also limit the number of attempts (3 total attempts by default in the standard and adaptive retry modes; legacy-mode defaults vary by SDK) and let callers configure the maximum attempts and the retry mode. Individual service clients can override these defaults, so check the documentation for the SDK and service you use.

Stripe

Stripe supports idempotency keys on all POST endpoints to enable safe retries of payment operations. When a client includes an Idempotency-Key header, Stripe stores the result of the first request and returns it for any subsequent request with the same key within 24 hours. This allows clients to safely retry a payment charge that timed out without risking double-charging the customer. Stripe recommends using UUIDs as idempotency keys and implementing exponential backoff for retries.

Google

Google implements adaptive retry budgets across their fleet to prevent retry amplification. Instead of a fixed retry count per client, the retry budget limits total retries as a percentage of successful requests. If a service is handling 1000 requests per second and the retry budget is 10%, a maximum of 100 retries per second are allowed across all clients. When the retry budget is exhausted, additional failures are surfaced to callers without retrying, preventing the retry death spiral that can turn a partial outage into a complete one.
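A retry budget of this kind can be sketched as a pair of counters: retries are allowed only while they stay within a fixed ratio of the requests seen. The `RetryBudget` class below is a hypothetical, single-process illustration and not Google's implementation. It uses cumulative counters for simplicity, where a production version would track a sliding time window and, for a fleet-wide budget, coordinate across clients.

```python
class RetryBudget:
    """Allow retries only while they stay within `ratio` of requests seen."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0  # first-attempt requests observed
        self.retries = 0   # retries granted so far

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        # Deny the retry if granting it would exceed the budget.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


# Mirrors the numbers in the text: 1000 requests with a 10% budget.
budget = RetryBudget(ratio=0.1)
for _ in range(1000):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(150))
print(granted)  # 100
```

When `try_acquire_retry` returns False, the caller surfaces the failure instead of retrying, which is what stops the amplification loop.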

Trade-Offs
Aspect | Description
Retry Latency vs Recovery Speed | Aggressive retries (short backoff, many attempts) detect recovery faster but risk overwhelming a recovering service. Conservative retries (long backoff, few attempts) give the service more recovery time but increase latency for requests that would succeed on retry. The backoff parameters must balance user-facing latency requirements with downstream service capacity.
Jitter Spread vs Minimum Wait | Full jitter (random(0, max)) provides maximum spread but can produce very short waits (near zero), potentially hammering the service. Equal jitter guarantees a minimum wait of half the backoff value, providing more predictable spacing at the cost of slightly less spread. The choice depends on how sensitive the downstream service is to burst traffic.
Idempotency Implementation Cost | Supporting idempotency keys requires server-side storage of request results keyed by the idempotency token, deduplication logic, and expiration policies. This adds storage overhead and implementation complexity. Without idempotency, retries of non-idempotent operations risk duplicate side effects -- double charges, duplicate orders, or repeated notifications.
Retry Budget Coordination | Fleet-wide retry budgets require coordination: each client must know the total retry budget and its share, or a centralized system must track retry rates. This is straightforward with a service mesh (Envoy can enforce retry budgets) but complex to implement in application-level retry logic, especially across services written in different languages.
Case Study

AWS Exponential Backoff Analysis -- Choosing the Right Jitter Strategy

Scenario

AWS engineers observed that during service disruptions, client retries often prolonged outages rather than helping. When a DynamoDB partition became temporarily unavailable, thousands of clients would retry simultaneously with identical exponential backoff timing. These synchronized retry waves hit the recovering partition with traffic spikes that exceeded its capacity, preventing it from stabilizing. The team needed to determine which jitter strategy would minimize total recovery time while ensuring all clients eventually succeed.

Solution

AWS ran a simulation comparing backoff strategies across many competing clients: no jitter (pure exponential backoff), full jitter (random(0, base * 2^attempt)), equal jitter, and decorrelated jitter (min(cap, random(base, previous_wait * 3))). They published the results in the AWS Architecture Blog. All jittered variants substantially reduced the total number of calls compared with no jitter. Full jitter needed the fewest total calls for all clients to succeed, decorrelated jitter was close behind and finished slightly sooner, and equal jitter did more work than full jitter. AWS recommends backoff with jitter as the standard approach for its customers, and the AWS SDKs apply jittered backoff by default.
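The effect of jitter on peak load can be illustrated with a toy simulation. This is not AWS's simulation: it is a simplified sketch that assumes every client fails at time zero, every retry also fails, and load is measured as the number of retries landing in the same 100 ms bucket. The function name and all parameter values are assumptions made for the example.

```python
import random
from collections import Counter


def peak_retry_load(clients, retries, jitter, base=1.0, bucket=0.1, seed=42):
    """Largest number of retries that land in any single time bucket."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    buckets = Counter()
    for _ in range(clients):
        t = 0.0  # all clients see the initial failure at the same instant
        for attempt in range(retries):
            ceiling = base * 2 ** attempt
            # Full jitter draws from [0, ceiling]; no jitter waits the ceiling.
            t += rng.uniform(0, ceiling) if jitter else ceiling
            buckets[int(t / bucket)] += 1
    return max(buckets.values())


no_jitter_peak = peak_retry_load(1000, 4, jitter=False)
full_jitter_peak = peak_retry_load(1000, 4, jitter=True)
print(no_jitter_peak, full_jitter_peak)
```

Without jitter, all 1000 clients land in the same bucket on every retry wave, so the peak equals the client count. With full jitter the same retries are spread over the backoff window and the peak is far lower.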

Outcome

After deploying full jitter across AWS SDKs, the average duration of retry-induced overload during service disruptions decreased significantly. Client-side error rates during partial outages dropped because retries were spread out enough for the recovering service to handle them. The AWS Architecture Blog post on exponential backoff and jitter became one of the most referenced articles in distributed systems engineering, and the full jitter strategy is now considered an industry best practice.

Common Mistakes
  • Retrying without backoff. Immediate retries at full speed can turn a transient failure into a sustained outage by overwhelming the recovering service with retry traffic. Always implement exponential backoff to give the service progressively more time to recover.
  • Retrying without jitter. Even with exponential backoff, all clients that experienced a failure at the same time will retry at the same intervals, creating synchronized traffic spikes. Always add jitter to spread retries across the backoff window.
  • Retrying non-idempotent operations without idempotency keys. Retrying a payment charge without an idempotency key risks double-charging the customer. Always ensure operations are idempotent before retrying, either inherently (GET, PUT, DELETE) or via idempotency keys.
  • Not implementing a retry budget. Without a fleet-wide cap on retries, a 10% failure rate with up to 3 retries per request can generate up to 30% additional traffic. If this pushes the server past capacity, failures increase, triggering more retries -- a classic death spiral. Limit total retries to a percentage of successful traffic (e.g., 10%).
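The amplification arithmetic can be made concrete. The 30% figure above is the upper bound in which every initially failed request goes on to use all three retries. Under the simplifying assumption that each attempt fails independently with the same probability, the expected number of sends per request is a short geometric sum. The function name `expected_attempts` is illustrative.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected sends per logical request, assuming independent failures.

    Attempt k+1 happens only if the first k attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


for p in (0.1, 0.5, 1.0):
    print(p, round(expected_attempts(p, 3), 3))
# 0.1 1.111
# 0.5 1.875
# 1.0 4.0
```

At a 10% failure rate the extra load is modest, but as the failure rate climbs toward 100% the load approaches four times normal traffic. That feedback from rising failures to rising load is the loop a retry budget is meant to break.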
Test Your Understanding

1. Why is jitter added to exponential backoff in retry logic?

2. What is a retry budget and why is it important?

3. When is it safe to retry a failed HTTP POST request?
