Vetora logo
🚦Interview Toolkit

Interview Walkthrough: Rate Limiter

A detailed interview walkthrough for designing a distributed rate limiter. Covers four algorithms (fixed window, sliding window log, sliding window counter, token bucket), Redis-based implementation, race condition handling, and placement decisions.

Overview

A rate limiter controls the number of requests a client can make to a service within a given time window. It is a critical component for protecting services from abuse, preventing resource exhaustion, ensuring fair usage among clients, and defending against denial-of-service attacks. In a system design interview, the rate limiter question tests your ability to reason about distributed state, concurrency, and trade-offs between accuracy and performance.

Four algorithms dominate the rate limiting landscape. The fixed window counter divides time into fixed intervals (e.g., one-minute windows) and counts requests per window. It is simple to implement with Redis INCR and EXPIRE but suffers from the boundary problem: a client could send the maximum allowed requests at the end of one window and the beginning of the next, effectively doubling their rate for a brief period. The sliding window log stores the timestamp of every request and counts how many fall within the trailing window. It is perfectly accurate but consumes significant memory because every request timestamp must be stored. The sliding window counter is a hybrid that combines the current window's count with a weighted portion of the previous window's count, offering near-perfect accuracy with minimal memory. The token bucket maintains a bucket of tokens that refills at a steady rate; each request consumes a token, and requests are rejected when the bucket is empty. Token bucket is the most flexible algorithm because it naturally handles burst traffic -- the bucket accumulates tokens during idle periods that can be spent during bursts.

In a distributed deployment, the central challenge is maintaining accurate counts across multiple rate limiter instances. If your API runs on 10 servers, each must agree on the current request count for a given client. The standard solution is a centralized Redis instance (or cluster) that stores counters. For the fixed window counter, a Lua script atomically increments the counter and sets an expiry, ensuring that the increment-and-check operation is atomic even under high concurrency. For the sliding window log, a Redis sorted set stores timestamps, and a Lua script atomically adds the new timestamp, removes expired entries, and checks the set size. Race conditions are the primary concern: without atomic operations, two concurrent requests could both read the same count, both decide to allow, and both increment, exceeding the limit by one. Redis Lua scripting or Redis transactions (MULTI/EXEC) solve this by executing the read-check-increment as a single atomic operation.

Placement is another key decision. The rate limiter can live at the API gateway level (centralized, easy to manage, but adds latency to every request), at the application level (more granular control, but harder to enforce uniformly), or as a service mesh sidecar (transparent to the application, but requires service mesh infrastructure). Most large-scale systems use a combination: a coarse-grained rate limiter at the API gateway for DDoS protection and a fine-grained application-level limiter for per-user or per-endpoint quotas. HTTP response headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After) communicate rate limit status to clients, enabling them to self-throttle and avoid hitting limits.

Key Points
  • 1The fixed window counter is the simplest algorithm (Redis INCR + EXPIRE) but allows request bursts at window boundaries. A client could send 2x the allowed rate by timing requests across two adjacent windows.
  • 2The token bucket algorithm handles burst traffic gracefully by accumulating tokens during idle periods. It is the most commonly used algorithm in production systems including AWS API Gateway and Stripe.
  • 3Sliding window counter provides near-perfect accuracy by weighting the previous window's count: effective_count = prev_count * overlap_percentage + current_count. It uses minimal memory (two counters per client per window).
  • 4Redis Lua scripting is essential for atomicity in distributed rate limiters. Without atomic read-check-increment operations, concurrent requests can race past the limit before any of them update the counter.
  • 5Rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After) are critical for client experience. Well-behaved clients use these headers to self-throttle, reducing the number of rejected requests.
  • 6Multi-tier rate limiting applies different limits at different levels: per-IP at the edge (anti-DDoS), per-API-key at the gateway (usage quota), and per-endpoint at the application (resource protection).
Simple Example

The Nightclub Bouncer Analogy

A rate limiter works like a nightclub bouncer who counts people entering. The club has a capacity of 100 people per hour. The bouncer uses a clicker counter and resets it every hour (fixed window). If 100 people have entered this hour, the next person is told to wait (429 Too Many Requests) or come back later (Retry-After header). A more sophisticated bouncer might use a rolling count -- checking how many people entered in the last 60 minutes regardless of clock boundaries (sliding window). A token bucket approach would let the bouncer allow a short burst (10 people enter at once) as long as the average rate stays under 100 per hour.

Real-World Examples

Stripe

Stripe implements a sophisticated multi-tier rate limiting system. Their API enforces per-key rate limits (100 requests per second for live mode, 25 for test mode) using the token bucket algorithm. Rate limits vary by endpoint: payment creation has stricter limits than payment retrieval. Stripe returns detailed rate limit headers and uses a generous burst allowance so that legitimate traffic spikes do not cause unnecessary failures.

GitHub

GitHub's REST API enforces 5,000 requests per hour for authenticated users and 60 per hour for unauthenticated requests. Their GraphQL API uses a point-based system where different query complexities consume different numbers of points from a 5,000-point hourly budget. This prevents a single expensive query from consuming the same quota as a simple one, demonstrating cost-aware rate limiting.

Cloudflare

Cloudflare provides edge-level rate limiting that operates at their global CDN points of presence. Rules can be configured per URL pattern, HTTP method, and client IP. Because the rate limiting runs at the edge, malicious traffic is blocked before it reaches the origin server. Cloudflare uses an approximate counting algorithm optimized for distributed edge execution where strict global consistency is impractical.

Trade-Offs
AspectDescription
Accuracy vs MemoryThe sliding window log provides perfect accuracy but stores every request timestamp (O(n) memory per client). The fixed window counter uses O(1) memory but allows boundary bursts. The sliding window counter offers a practical middle ground with O(1) memory and near-perfect accuracy.
Centralized vs Distributed CountingA centralized Redis counter provides exact global counts but adds a network hop to every request and creates a single point of failure. Local per-node counters avoid the network hop but can allow up to N times the rate limit across N nodes if not synchronized.
Hard vs Soft Rate LimitsHard rate limits strictly reject requests above the threshold, providing predictable resource protection but potentially dropping legitimate traffic during bursts. Soft rate limits allow temporary bursts above the threshold (with degraded priority or queuing), improving user experience but risking resource overload during sustained abuse.
Gateway vs Application PlacementGateway-level rate limiting is centralized and easy to manage but cannot enforce fine-grained per-endpoint or per-user limits. Application-level rate limiting provides granular control but must be consistently implemented across all services, creating a maintenance burden.
Case Study

Stripe's Token Bucket Rate Limiter at Scale

Scenario

Stripe processes billions of API requests per month from hundreds of thousands of merchants. Early in their growth, they used a simple fixed window counter per API key. During flash sale events, merchants would burst through their rate limits at window boundaries, causing cascading failures in downstream payment processing services. Legitimate traffic was being rejected alongside abusive traffic because the fixed window algorithm could not distinguish between a brief spike and sustained overload.

Solution

Stripe migrated to a token bucket algorithm backed by Redis. Each API key has a bucket that refills at a steady rate (e.g., 100 tokens per second) with a maximum burst capacity (e.g., 500 tokens). This allows merchants to burst up to 5x their sustained rate for short periods without being throttled. Stripe implemented the token bucket logic as a Redis Lua script to ensure atomicity across their distributed API server fleet. They added multiple rate limit tiers: a global per-key limit, per-endpoint limits for expensive operations (payment creation vs retrieval), and an emergency circuit-breaker limit at the load balancer level for DDoS protection.

Outcome

The token bucket migration reduced false-positive rate limit rejections by 60% because legitimate burst traffic was accommodated within the burst allowance. Payment processing reliability improved because downstream services were protected from sustained overload. The per-endpoint limits prevented a single merchant's heavy usage of an expensive endpoint from affecting the quota for their other API calls. Redis Lua scripting maintained sub-millisecond rate check latency even at peak traffic.

Common Mistakes
  • Implementing rate limiting only at the application level without gateway-level protection. Application-level rate limiting cannot protect against volumetric DDoS attacks that overwhelm the network layer before requests reach the application code.
  • Using a non-atomic read-then-increment pattern in a distributed system. Two concurrent requests can both read the same counter value, both decide to allow, and both increment, allowing requests to exceed the limit. Always use atomic operations (Redis Lua scripts or MULTI/EXEC).
  • Forgetting to return rate limit headers in the response. Without X-RateLimit-Remaining and Retry-After headers, clients cannot self-throttle and will blindly retry, creating a thundering herd effect when limits reset.
  • Applying the same rate limit to all endpoints regardless of cost. A lightweight health check endpoint and a heavyweight data export endpoint should have different rate limits based on the server resources each request consumes.
Related Concepts

See Interview Walkthrough: Rate Limiter in action

Explore system design templates that use interview walkthrough: rate limiter and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate rate limiter behavior under burst traffic

Metrics to watch
rejected_requests_pctp99_latency_msthroughput_rps
Run Simulation
Test Your Understanding

1What is the main weakness of the fixed window counter algorithm?

2Why is Redis Lua scripting preferred over separate GET and SET commands for distributed rate limiting?

Deeper Reading