Rate Limiting

Rate limiting controls the number of requests a client can make within a time window, protecting services from abuse, ensuring fair resource usage, and preventing resource exhaustion. Algorithms range from simple fixed window counters to sophisticated token bucket and sliding window approaches, each with distinct trade-offs for burst handling, memory usage, and accuracy.

Overview

Rate limiting is a critical mechanism for controlling the rate at which clients consume API resources. It serves three primary purposes: protecting against abuse (preventing malicious or buggy clients from overwhelming the service), ensuring fair usage (preventing one client from monopolizing shared resources), and preventing resource exhaustion (bounding the total request rate to what the system can handle). Every production API should implement rate limiting -- without it, a single client can consume all available capacity, a bug in a client library can generate unbounded requests, and denial-of-service attacks can take down the entire service.

The five fundamental rate limiting algorithms each offer different trade-offs. The fixed window counter divides time into fixed windows (e.g., 1-minute intervals) and counts requests per window. It is simple to implement (a single counter per client per window) but has the boundary burst problem: a client can send its full quota at the end of one window and again at the beginning of the next, briefly doubling the effective rate. The sliding window log tracks the exact timestamp of every request and counts requests within a trailing window. It is exact but requires storing every timestamp, so memory per client grows with the number of requests allowed per window. The sliding window counter is a hybrid that combines the fixed window counts of the current and previous windows, weighting the previous count by how much of that window still overlaps the sliding window. Because it assumes requests in the previous window were evenly spread, it is an approximation, but it largely removes the boundary burst problem with minimal memory overhead -- just two counters per client.
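The boundary burst problem and the weighted estimate used by the sliding window counter can be sketched as follows. This is a minimal, single-process illustration, not a production limiter: the class names, the `count_allowed` helper, and the explicit `now` argument (used instead of a real clock so the run is deterministic) are all assumptions made for the example.

```python
class FixedWindowLimiter:
    """One counter per client per aligned window."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        # Old windows are never evicted here; a real limiter would expire them.
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, client: str, now: float) -> bool:
        key = (client, int(now // self.window_s))
        if self.counts.get(key, 0) >= self.limit:
            return False
        self.counts[key] = self.counts.get(key, 0) + 1
        return True


class SlidingWindowCounterLimiter:
    """Estimates the trailing-window count from two fixed-window counters."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, client: str, now: float) -> bool:
        window = int(now // self.window_s)
        # Fraction of the current window that has already elapsed.
        elapsed = (now % self.window_s) / self.window_s
        previous = self.counts.get((client, window - 1), 0)
        current = self.counts.get((client, window), 0)
        # Weight the previous window by the part that still overlaps
        # the trailing window ending at `now`.
        estimate = previous * (1.0 - elapsed) + current
        if estimate + 1 > self.limit:
            return False
        self.counts[(client, window)] = current + 1
        return True


def count_allowed(limiter, times: list[float]) -> int:
    """Feed a list of request times to a limiter and count the accepted ones."""
    return sum(limiter.allow("client-a", t) for t in times)


# 10 requests just before the minute boundary, 10 more just after it.
burst = [59.0] * 10 + [61.0] * 10
fixed_allowed = count_allowed(FixedWindowLimiter(10, 60.0), burst)
sliding_allowed = count_allowed(SlidingWindowCounterLimiter(10, 60.0), burst)
print(fixed_allowed, sliding_allowed)  # 20 10
```

With a limit of 10 per minute, the fixed window accepts all 20 requests within two seconds, while the weighted estimate still attributes about 9.8 requests to the trailing window at t=61 and rejects the second batch.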

The token bucket algorithm is one of the most widely deployed rate limiting mechanisms. A bucket holds tokens, refilled at a constant rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity (the burst size), so a client that has accumulated tokens can send a short burst up to that capacity. This is ideal for APIs that need to allow occasional bursts while maintaining a sustained rate limit. The leaky bucket works inversely: requests enter a queue (the bucket) and are processed at a constant rate. If the queue is full, new requests are rejected. The leaky bucket produces the smoothest output rate, but bursts are delayed in the queue rather than served immediately, which can be too restrictive for interactive APIs.
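A token bucket needs only two pieces of state per client: the current token count and the time of the last refill. The sketch below assumes a lazy refill (tokens are topped up when a request arrives rather than by a background timer) and takes `now` as an argument so the behavior is reproducible; the class and parameter names are illustrative.

```python
class TokenBucket:
    """Token bucket with lazy refill on each request."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        if now > self.last_refill:
            earned = (now - self.last_refill) * self.rate
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=10.0, capacity=20.0)

# A burst of 25 requests at t=0: only the 20 stored tokens are available.
burst_allowed = sum(bucket.allow(now=0.0) for _ in range(25))

# Half a second later, 5 tokens have been refilled (10 per second * 0.5 s).
later_allowed = sum(bucket.allow(now=0.5) for _ in range(10))

print(burst_allowed, later_allowed)  # 20 5
```

The capacity bounds the burst and the refill rate bounds the sustained throughput, which is why the two are usually configured separately.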

In distributed systems, rate limiting must be coordinated across multiple server instances. A common approach uses Redis as a centralized counter store: each server checks and increments the client's counter in Redis before processing the request. Redis INCR and EXPIRE are each atomic on their own, but a sequence of commands is not, so the check-and-increment is usually wrapped in a Lua script, which Redis executes atomically. Rate limits are typically organized in tiers: per-IP (protecting against anonymous abuse), per-API-key or per-user (ensuring fair usage among authenticated clients), per-endpoint (protecting expensive operations), and global (bounding total system load). Rate limit status is communicated through HTTP headers. X-RateLimit-Limit (the limit), X-RateLimit-Remaining (remaining requests), and X-RateLimit-Reset (when the limit resets) are widely used conventions rather than a formal standard, while Retry-After (how long to wait before retrying) is a standard HTTP header. Enforcement happens at the API gateway (centralized, language-agnostic), application middleware (per-service, customizable), or service mesh sidecar (infrastructure-level, no code changes).
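As a rough sketch of the Redis approach, the Lua script below increments a fixed window counter and sets its expiry in one atomic step. The wrapper assumes `client` is a redis-py style object exposing `eval(script, numkeys, *keys_and_args)`; the key naming scheme and the function name `allow_request` are hypothetical, and a fixed window is used only because it is the shortest script to show.

```python
# Runs atomically inside Redis: increment the counter and, on the first
# hit in a window, start the expiry clock for that window.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""


def allow_request(client, key: str, limit: int, window_s: int) -> bool:
    """Return True if this request is within `limit` for the current window.

    `client` is assumed to behave like a redis-py client, i.e. it provides
    eval(script, numkeys, *keys_and_args).
    """
    count = client.eval(FIXED_WINDOW_LUA, 1, key, window_s)
    return int(count) <= limit


# Hypothetical usage with a real server (requires the redis package):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   allow_request(client, "ratelimit:ip:203.0.113.7", limit=100, window_s=60)
```

Because every application server calls the same script against the same key, the count reflects traffic across all instances rather than per server.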

Key Points
  • Fixed window counter is simplest but has the boundary burst problem: a client can double their effective rate by clustering requests at window boundaries. Suitable for approximate rate limiting where boundary precision is not critical.
  • Token bucket is one of the most popular algorithms: it allows controlled bursts (up to bucket capacity) while enforcing a sustained rate limit (token refill rate). AWS API Gateway, Stripe, and many other API providers use token bucket.
  • Sliding window counter largely removes the boundary burst problem by weighting requests from the current and previous windows. It provides good accuracy with minimal memory (two counters per client), making it a strong default when bursts do not need to be allowed.
  • Distributed rate limiting uses Redis (INCR + EXPIRE wrapped in a Lua script) for centralized counting across multiple server instances. Without centralized coordination, each server enforces its own limit, effectively multiplying the total limit by the number of servers.
  • Rate limits should be tiered: per-IP (abuse prevention), per-user/API-key (fair usage), per-endpoint (protect expensive operations), and global (system capacity protection). Each tier catches different types of overuse.
  • HTTP headers communicate rate limit status: the conventional X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset, plus the standard Retry-After (on 429 responses). These enable well-behaved clients to self-throttle before hitting limits.
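The header conventions in the last point can be illustrated with a small helper that turns a limiter decision into a status code and headers. This is a sketch under stated assumptions: the function name is invented for the example, and X-RateLimit-Reset is emitted as Unix epoch seconds, which is one common convention (some APIs send seconds remaining instead).

```python
import math


def rate_limit_response(
    allowed: bool, limit: int, remaining: int, reset_epoch_s: float, now: float
) -> tuple[int, dict[str, str]]:
    """Build the HTTP status and rate limit headers for one request."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        # Assumption: reset time is reported as Unix epoch seconds.
        "X-RateLimit-Reset": str(math.ceil(reset_epoch_s)),
    }
    if allowed:
        return 200, headers
    # Retry-After in delta-seconds form: how long until the window resets.
    headers["Retry-After"] = str(max(0, math.ceil(reset_epoch_s - now)))
    return 429, headers


ok_status, ok_headers = rate_limit_response(True, 100, 42, 1060.0, 1000.5)
limited_status, limited_headers = rate_limit_response(False, 100, 0, 1060.0, 1000.5)
print(limited_status, limited_headers["Retry-After"])  # 429 60
```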
Simple Example

The Highway On-Ramp Meter Analogy

A highway on-ramp meter (traffic light at the ramp) is a rate limiter. It controls how many cars enter the highway per minute to prevent congestion. Without the meter, all cars merge at once, causing stop-and-go traffic that slows everyone down. The meter releases cars at a controlled rate: each green light acts like a token that admits one car, and because waiting cars queue rather than being turned away, the behavior is closest to a leaky bucket. During rush hour, the meter is stricter (lower rate limit); during off-peak, it is more lenient. Cars waiting at the meter experience a brief delay, but the overall highway throughput is higher than if all cars merged freely and caused gridlock.

Real-World Examples

Stripe

Stripe implements sophisticated multi-tier rate limiting to protect their payment processing infrastructure. Rate limits are applied per-API-key with burst allowances using token bucket. Standard accounts receive 100 requests per second with a burst of 200. Higher tiers are available for high-volume merchants. Stripe returns X-RateLimit-Limit and X-RateLimit-Remaining headers on every response, and 429 responses include a Retry-After header with jitter to prevent synchronized retries from rate-limited clients.

GitHub API

GitHub enforces rate limits at two tiers: authenticated requests are limited to 5,000 per hour per user, while unauthenticated requests are limited to 60 per hour per IP address. This tiered approach protects against anonymous abuse (the 60/hr limit makes automated scraping impractical) while giving authenticated users generous limits for legitimate API consumption. GitHub provides X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response.

Cloudflare

Cloudflare implements rate limiting at the edge, across their global network of 300+ points of presence. Rate limit rules are configured per zone (domain) and can match on URL path, HTTP method, request headers, or response codes. Because enforcement happens at the edge, malicious traffic is rejected before it reaches the origin server. Cloudflare supports both simple threshold-based rate limiting and more sophisticated challenge-based limiting that presents CAPTCHAs to suspicious traffic before blocking.

Trade-Offs
Aspect | Description
Accuracy vs Memory Usage | Sliding window log provides perfect accuracy but stores every request timestamp, consuming O(rate_limit * window_size) memory per client. Sliding window counter uses only two counters per client (O(1) memory) with slightly less precision. Token bucket needs only two values (token count, last refill time). For millions of clients, memory-efficient algorithms are essential.
Burst Handling vs Smoothness | Token bucket allows controlled bursts (up to bucket capacity), which is user-friendly for interactive APIs. Leaky bucket enforces perfectly smooth output, which protects backends better but frustrates clients that send natural bursts. Fixed window allows uncontrolled boundary bursts. The choice depends on whether the API or the backend is the bottleneck.
Centralized vs Distributed Enforcement | Centralized rate limiting (via Redis) provides accurate global limits but adds latency (Redis round trip per request) and introduces a single point of failure. Local rate limiting (per-server) avoids the Redis dependency but effectively multiplies limits by server count and cannot enforce global limits. Hybrid approaches use local rate limiting with periodic Redis synchronization.
Strictness vs User Experience | Strict rate limiting (hard reject at the limit) is predictable but can frustrate legitimate users during traffic spikes. Soft rate limiting (allow brief overages with warnings) is friendlier but less predictable for capacity planning. Implementing graduated responses (slow down before rejecting) provides a better experience but adds implementation complexity.
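To make the accuracy versus memory trade-off concrete, here is a minimal sliding window log for a single client. It is exact, but the deque holds one timestamp per accepted request, whereas a token bucket or sliding window counter keeps two numbers. The class name and the explicit `now` argument are assumptions for the example.

```python
from collections import deque


class SlidingWindowLog:
    """Exact trailing-window limiter that stores every accepted timestamp."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.timestamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the trailing window.
        cutoff = now - self.window_s
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


log = SlidingWindowLog(limit=3, window_s=10.0)
decisions = [log.allow(t) for t in (0.0, 1.0, 2.0, 3.0, 10.5)]
print(decisions)            # [True, True, True, False, True]
print(len(log.timestamps))  # 3 stored timestamps for a limit of 3
```

With a limit of 5,000 requests per hour, the same structure could hold up to 5,000 timestamps per active client, which is the cost the table refers to.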
Case Study

Stripe -- Multi-Tier Rate Limiting for Payment API Protection

Scenario

Stripe processes millions of API requests per second for payment processing, subscription management, and account operations. Without rate limiting, a single merchant with a buggy integration could flood the API with millions of requests, consuming capacity meant for thousands of other merchants. Similarly, fraudulent actors could probe the API with high volumes of card testing requests. Stripe needed rate limiting that protected the platform while providing generous limits for legitimate high-volume merchants.

Solution

Stripe implemented multi-tier token bucket rate limiting. Each API key receives a sustained rate limit (100 RPS standard) with a burst allowance (200 requests). Enterprise merchants receive higher limits negotiated in their contracts. Rate limits are enforced per-endpoint: read endpoints have higher limits than write endpoints, and particularly expensive operations (e.g., large data exports) have lower limits. Stripe's rate limiter uses Redis with Lua scripts for atomic token bucket operations, ensuring accurate distributed enforcement across their global infrastructure. Every response includes rate limit headers, and 429 responses include Retry-After with jitter.
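The idea of layering a per-key limit with a per-endpoint limit can be sketched as a request that must pass every tier, consuming a token from each only if all of them have one. This is an illustrative, in-memory sketch and not Stripe's implementation; the `Bucket` class, the `allow` function, and the per-endpoint figures are hypothetical, while the 100 RPS / 200 burst values are the ones quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    rate: float                      # tokens added per second
    capacity: float                  # maximum burst size
    tokens: float = field(init=False)
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full

    def refill(self, now: float) -> None:
        if now > self.last_refill:
            earned = (now - self.last_refill) * self.rate
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill = now


def allow(tiers: list[Bucket], now: float) -> bool:
    """Admit the request only if every tier has a token available."""
    for bucket in tiers:
        bucket.refill(now)
    if any(bucket.tokens < 1.0 for bucket in tiers):
        return False  # rejected requests do not drain the other tiers
    for bucket in tiers:
        bucket.tokens -= 1.0
    return True


per_key = Bucket(rate=100.0, capacity=200.0)   # figures quoted in the text
per_endpoint = Bucket(rate=2.0, capacity=2.0)  # hypothetical expensive endpoint

results = [allow([per_key, per_endpoint], now=0.0) for _ in range(3)]
print(results, per_key.tokens)  # [True, True, False] 198.0
```

Checking all tiers before consuming from any of them keeps a request that is blocked by the endpoint tier from also using up the key's general allowance.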

Outcome

Stripe's multi-tier rate limiting eliminated platform-wide impact from individual merchant traffic spikes. Card testing attacks are automatically throttled at the per-IP tier before they can consume significant resources. Legitimate high-volume merchants operate within their generous limits without impacting other merchants. The transparent rate limit headers allow well-behaved client libraries (including Stripe's official SDKs) to implement client-side throttling, reducing 429 responses to near-zero for most merchants.

Common Mistakes
  • Not implementing rate limiting at all, assuming the system can handle unlimited traffic. Every production API needs rate limiting. Without it, a single buggy client, a bot, or a DDoS attack can consume all available capacity.
  • Using fixed window counters for precise rate limiting. The boundary burst problem means clients can send 2x the intended rate by clustering requests at window boundaries. Use sliding window counter or token bucket for accurate enforcement.
  • Rate limiting per-server instead of globally. If the limit is 100 RPS and there are 10 servers, each enforcing 100 RPS locally, the effective limit is 1000 RPS. Use centralized counting (Redis) to enforce accurate global limits.
  • Not returning the conventional rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset). Without these headers, clients cannot self-throttle and must discover limits through trial and error, leading to unnecessary 429 errors and a poor developer experience.
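On the client side, self-throttling from these headers might look like the sketch below. It assumes Retry-After is given in seconds (the header can also carry an HTTP date, which is not handled here), that X-RateLimit-Reset is Unix epoch seconds, and that header names appear with exactly this capitalization; the function name is illustrative.

```python
def seconds_to_wait(status: int, headers: dict[str, str], now: float) -> float:
    """Decide how long a client should pause before its next request."""
    # A 429 with Retry-After is an explicit instruction from the server.
    if status == 429 and "Retry-After" in headers:
        return max(0.0, float(headers["Retry-After"]))
    # Otherwise, pause only when the quota is used up, until the reset time.
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is not None and reset is not None and int(remaining) <= 0:
        return max(0.0, float(reset) - now)
    return 0.0


wait_after_429 = seconds_to_wait(429, {"Retry-After": "7"}, now=1000.0)
wait_when_exhausted = seconds_to_wait(
    200, {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}, now=1000.0
)
wait_with_quota = seconds_to_wait(
    200, {"X-RateLimit-Remaining": "12", "X-RateLimit-Reset": "1060"}, now=1000.0
)
print(wait_after_429, wait_when_exhausted, wait_with_quota)  # 7.0 60.0 0.0
```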
Test Your Understanding

1. What is the boundary burst problem in fixed window rate limiting?

2. How does the token bucket algorithm handle burst traffic?

3. Why should distributed rate limiting use centralized counting (e.g., Redis) instead of per-server counters?
