What is the boundary burst problem in fixed window rate limiting?
Rate limiting controls the number of requests a client can make within a time window, protecting services from abuse, ensuring fair resource usage, and preventing resource exhaustion. Algorithms range from simple fixed window counters to sophisticated token bucket and sliding window approaches, each with distinct trade-offs for burst handling, memory usage, and accuracy.
Rate limiting is a critical mechanism for controlling the rate at which clients consume API resources. It serves three primary purposes: protecting against abuse (preventing malicious or buggy clients from overwhelming the service), ensuring fair usage (preventing one client from monopolizing shared resources), and preventing resource exhaustion (bounding the total request rate to what the system can handle). Every production API should implement rate limiting -- without it, a single client can consume all available capacity, a bug in a client library can generate unbounded requests, and denial-of-service attacks can take down the entire service.
The five fundamental rate limiting algorithms each offer different trade-offs. The fixed window counter divides time into fixed windows (e.g., 1-minute intervals) and counts requests per window. It is simple to implement (a single counter per client per window) but has the boundary burst problem: a client can send its full quota at the end of one window and again at the beginning of the next, briefly doubling the effective rate. With a limit of 100 requests per minute, for example, 100 requests in the last second of one window and 100 in the first second of the next are all accepted, so 200 requests pass in about two seconds. The sliding window log tracks the exact timestamp of every request and counts requests within a trailing window. It is perfectly accurate but requires storing every timestamp, so memory per client grows with the number of requests allowed per window. The sliding window counter is a hybrid that combines the fixed window counts of the current and previous windows, weighting the previous count by its overlap with the sliding window. It largely removes the boundary burst problem with minimal memory overhead -- just two counters per client -- at the cost of being an approximation, because it assumes requests in the previous window were evenly spread.
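The boundary burst can be reproduced with a short simulation. The following is a minimal sketch, not a production limiter: the class names, the `burst_accepted` helper, and the choice of 10 requests per 60-second window are all assumptions made for illustration, and time is passed in explicitly so the run is deterministic.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per fixed window of `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # window index -> requests accepted

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if self.counts[index] >= self.limit:
            return False
        self.counts[index] += 1
        return True


class SlidingWindowCounterLimiter:
    """Two counters; the previous window is weighted by its remaining overlap."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.index = 0      # index of the current fixed window
        self.current = 0    # requests accepted in the current window
        self.previous = 0   # requests accepted in the window before it

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.index:
            # Carry the old count over only if the windows are adjacent.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        elapsed_fraction = (now % self.window) / self.window
        estimated = self.previous * (1.0 - elapsed_fraction) + self.current
        if estimated >= self.limit:
            return False
        self.current += 1
        return True


def burst_accepted(limiter) -> int:
    """Send one request per second from t=50s to t=69s, straddling the t=60s boundary."""
    times = [float(t) for t in range(50, 70)]
    return sum(limiter.allow(t) for t in times)


fixed = burst_accepted(FixedWindowLimiter(limit=10, window=60))
sliding = burst_accepted(SlidingWindowCounterLimiter(limit=10, window=60))
print(fixed, sliding)
```

Under these assumptions the fixed window accepts all 20 requests in a 20-second span, twice the nominal limit of 10 per minute, while the sliding window counter accepts 12: the 10 before the boundary and only 2 after it.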
The token bucket algorithm is one of the most widely deployed rate limiting mechanisms. A bucket holds tokens, refilled at a constant rate (e.g., 10 tokens per second) up to a maximum capacity (the burst size). Each request consumes one token; if the bucket is empty, the request is rejected. A client that has accumulated tokens can therefore send a short burst up to the bucket size, while its long-run rate stays bounded by the refill rate. This is ideal for APIs that need to allow occasional bursts while maintaining a sustained rate limit. The leaky bucket takes the opposite approach: requests enter a queue (the bucket) and are processed at a constant rate. If the queue is full, new requests are rejected. The leaky bucket produces the smoothest output rate but does not pass bursts through to the backend, which can be too restrictive for interactive APIs.
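The token bucket described above needs only two pieces of state per client. Here is a minimal single-process sketch, assuming a caller-supplied clock (`now`, in seconds) and a bucket that starts full; the class and parameter names are illustrative rather than taken from any particular library.

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up when a request arrives."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate              # tokens added per second (sustained rate)
        self.capacity = capacity      # maximum tokens held (burst size)
        self.tokens = capacity        # assumption: the bucket starts full
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        if now > self.last_refill:
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=10, capacity=20)
burst = sum(bucket.allow(0.0) for _ in range(25))   # 25 requests arrive at once
later = sum(bucket.allow(1.0) for _ in range(15))   # 15 more, one second later
print(burst, later)
```

With a rate of 10 tokens per second and a capacity of 20, the initial burst of 25 requests has 20 accepted, and one second later the 10 refilled tokens admit 10 of the next 15.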
In distributed systems, rate limiting must be coordinated across multiple server instances. The standard approach uses Redis as a centralized counter store: each server checks and increments the client's counter in Redis before processing the request. Redis INCR and EXPIRE are each atomic on their own, but a sequence of separate commands is not, so the check-and-increment is typically wrapped in a Lua script, which Redis executes atomically. Rate limits are typically organized in tiers: per-IP (protecting against anonymous abuse), per-API-key or per-user (ensuring fair usage among authenticated clients), per-endpoint (protecting expensive operations), and global (bounding total system load). Rate limit responses conventionally carry HTTP headers: X-RateLimit-Limit (the limit), X-RateLimit-Remaining (remaining requests), and X-RateLimit-Reset (when the limit resets) are widely used de facto conventions, while Retry-After (how long to wait before retrying) is a standard HTTP header. Enforcement happens at the API gateway (centralized, language-agnostic), application middleware (per-service, customizable), or service mesh sidecar (infrastructure-level, no code changes).
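The atomic check-and-increment can be sketched as a small Lua script plus a Python wrapper. This is an illustrative fixed window counter, not a specific product's implementation: the key prefix `ratelimit:` and the function name are assumptions, and `client` is assumed to be any object exposing a redis-py style `eval(script, numkeys, *keys_and_args)` method.

```python
# Lua runs atomically inside Redis: increment the counter and, on the first
# request of a window, start the expiry that ends the window.
FIXED_WINDOW_SCRIPT = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""


def check_rate_limit(client, client_id: str, limit: int, window_seconds: int) -> dict:
    """Count this request against `client_id` and report whether it is allowed.

    `client` is assumed to behave like a redis-py client; the key naming
    scheme below is hypothetical.
    """
    key = f"ratelimit:{client_id}"
    count = int(client.eval(FIXED_WINDOW_SCRIPT, 1, key, window_seconds))
    return {
        "allowed": count <= limit,
        "limit": limit,
        "remaining": max(0, limit - count),
    }
```

Because the counter is a plain fixed window, this sketch inherits the boundary burst problem; a token bucket or sliding window counter could be expressed in the same way with a longer script.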
The Highway On-Ramp Meter Analogy
A highway on-ramp meter (traffic light at the ramp) is a rate limiter. It controls how many cars enter the highway per minute to prevent congestion. Without the meter, all cars merge at once, causing stop-and-go traffic that slows everyone down. The meter releases cars at a controlled rate (token bucket -- each green light is a token), allowing the highway to flow smoothly. During rush hour, the meter is stricter (lower rate limit); during off-peak, it is more lenient. Cars waiting at the meter experience a brief delay, but the overall highway throughput is higher than if all cars merged freely and caused gridlock.
Stripe
Stripe implements sophisticated multi-tier rate limiting to protect their payment processing infrastructure. Rate limits are applied per-API-key with burst allowances using token bucket. Standard accounts receive 100 requests per second with a burst of 200. Higher tiers are available for high-volume merchants. Stripe returns X-RateLimit-Limit and X-RateLimit-Remaining headers on every response, and 429 responses include a Retry-After header with jitter to prevent synchronized retries from rate-limited clients.
GitHub API
GitHub enforces rate limits at two tiers: authenticated requests are limited to 5000 per hour per user, while unauthenticated requests are limited to 60 per hour per IP address. This tiered approach protects against anonymous abuse (the 60/hr limit makes automated scraping impractical) while giving authenticated users generous limits for legitimate API consumption. GitHub provides X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers on every response.
Cloudflare
Cloudflare implements rate limiting at the edge, across their global network of 300+ points of presence. Rate limit rules are configured per zone (domain) and can match on URL path, HTTP method, request headers, or response codes. Because enforcement happens at the edge, malicious traffic is rejected before it reaches the origin server. Cloudflare supports both simple threshold-based rate limiting and more sophisticated challenge-based limiting that presents CAPTCHAs to suspicious traffic before blocking.
| Aspect | Description |
|---|---|
| Accuracy vs Memory Usage | Sliding window log provides perfect accuracy but stores every request timestamp, so memory per client grows linearly with the number of requests allowed per window. Sliding window counter uses only two counters per client (O(1) memory) at the cost of approximating the count. Token bucket needs only two values per client (token count, last refill time). For millions of clients, memory-efficient algorithms are essential. |
| Burst Handling vs Smoothness | Token bucket allows controlled bursts (up to bucket capacity), which is user-friendly for interactive APIs. Leaky bucket enforces perfectly smooth output, which protects backends better but frustrates clients that send natural bursts. Fixed window allows uncontrolled boundary bursts. The choice depends on whether accommodating bursty clients or shielding a capacity-constrained backend matters more. |
| Centralized vs Distributed Enforcement | Centralized rate limiting (via Redis) provides accurate global limits but adds latency (Redis round trip per request) and introduces a single point of failure. Local rate limiting (per-server) avoids the Redis dependency but effectively multiplies limits by server count and cannot enforce global limits. Hybrid approaches use local rate limiting with periodic Redis synchronization. |
| Strictness vs User Experience | Strict rate limiting (hard reject at the limit) is predictable but can frustrate legitimate users during traffic spikes. Soft rate limiting (allow brief overages with warnings) is friendlier but less predictable for capacity planning. Implementing graduated responses (slow down before rejecting) provides a better experience but adds implementation complexity. |
Stripe -- Multi-Tier Rate Limiting for Payment API Protection
Scenario
Stripe processes millions of API requests per second for payment processing, subscription management, and account operations. Without rate limiting, a single merchant with a buggy integration could flood the API with millions of requests, consuming capacity meant for thousands of other merchants. Similarly, fraudulent actors could probe the API with high volumes of card testing requests. Stripe needed rate limiting that protected the platform while providing generous limits for legitimate high-volume merchants.
Solution
Stripe implemented multi-tier token bucket rate limiting. Each API key receives a sustained rate limit (100 RPS standard) with a burst allowance (200 requests). Enterprise merchants receive higher limits negotiated in their contracts. Rate limits are enforced per-endpoint: read endpoints have higher limits than write endpoints, and particularly expensive operations (e.g., large data exports) have lower limits. Stripe's rate limiter uses Redis with Lua scripts for atomic token bucket operations, ensuring accurate distributed enforcement across their global infrastructure. Every response includes rate limit headers, and 429 responses include Retry-After with jitter.
Outcome
Stripe's multi-tier rate limiting eliminated platform-wide impact from individual merchant traffic spikes. Card testing attacks are automatically throttled at the per-IP tier before they can consume significant resources. Legitimate high-volume merchants operate within their generous limits without impacting other merchants. The transparent rate limit headers allow well-behaved client libraries (including Stripe's official SDKs) to implement client-side throttling, reducing 429 responses to near-zero for most merchants.
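Client-side handling of a 429 response can be sketched as follows. This is a generic illustration, not Stripe's SDK logic: the function name, the 10% jitter on a server-supplied delay, and the backoff parameters are all assumptions, and only the delay-in-seconds form of Retry-After is handled.

```python
import random


def retry_delay(headers: dict, attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return how many seconds to wait before retrying a rate-limited request.

    Assumes `headers` is a plain dict keyed by the exact header name.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            wait = float(value)
            # Add a little jitter so clients limited together do not retry together.
            return wait + random.uniform(0.0, 0.1 * wait)
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    # No usable hint from the server: exponential backoff with full jitter.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would sleep for `retry_delay(response_headers, attempt)` seconds after each 429 and increment `attempt` on every retry.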