Medium | 5 components | Interview: High

Rate Limiter — API Throttling

Design a distributed rate limiter that uses the token bucket algorithm with Redis Lua scripts to make atomic decisions in under 5ms, protecting backend services from overload and abuse.

Infrastructure, Redis, Token Bucket

Problem Statement

Rate limiting is a foundational infrastructure problem that appears in system design interviews across all levels because it sits at the intersection of distributed systems coordination, low-latency requirements, and API security. Every production API — from Stripe's payment endpoints to GitHub's REST API to AWS service quotas — relies on rate limiting to protect backend services from overload, prevent abuse, and enforce fair usage policies. Designing a rate limiter tests a candidate's understanding of distributed state management, atomicity guarantees, and the trade-offs between consistency and performance.

At scale, a rate limiter must make decisions for millions of unique API keys or user identifiers, each with potentially different quota configurations (free tier vs. premium tier). Every inbound request must pass through the rate limiter before reaching the backend, meaning the decision latency directly adds to every API call. A target of under 5ms at p99 for rate-limit decisions is typical in production systems. The limiter must also be consistent across all service instances — a user sending requests to different servers should see the same remaining quota, which rules out purely local counting approaches.

The token bucket algorithm is widely used for API rate limiting, including by AWS, Stripe, and GitHub. It naturally handles burst traffic: a user who has not made requests recently accumulates tokens up to the bucket capacity, allowing short traffic spikes without rejection. Fixed-window and sliding-window counters, by contrast, reject every request beyond the per-window limit no matter how long the client was idle beforehand, so a spike can be rejected even when the long-run average rate is well within limits. This burst tolerance makes token bucket more user-friendly and reduces false-positive rejections during legitimate traffic bursts.

Beyond the core algorithm, interviewers expect candidates to discuss failure modes (what happens when Redis is down?), multi-tier rate limiting (per-user, per-route, per-IP), distributed coordination across multiple data centers, and the operational trade-offs between consistency and availability in the rate-limiting decision path.

Architecture Overview

The rate limiter architecture places a lightweight decision layer between clients and the protected backend, ensuring that all traffic is evaluated before reaching application services. The Client sends requests to the RateLimiter component, which extracts the rate-limit key (user ID or API key) from request headers and queries LimitCache (Redis) with an atomic Lua script. The Lua script performs the entire token bucket algorithm in a single Redis round-trip: it calculates how many tokens to refill based on elapsed time since the last access and the configured refill rate, caps the token count at bucket capacity, attempts to consume one token, and returns the decision along with remaining tokens and reset timestamp.
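
The refill, cap, consume, and return steps can be sketched in plain Python. This is a minimal illustration of the logic the Lua script is described as running, not the script itself: the names BucketState and take_token are hypothetical, the state lives in an in-memory object rather than Redis, and the reset timestamp is assumed to mean the time at which the bucket would be full again.

```python
from dataclasses import dataclass


@dataclass
class BucketState:
    tokens: float        # tokens currently available
    last_refill: float   # timestamp (seconds) of the last update


def take_token(state, now, capacity=100.0, refill_rate=10.0):
    """Run one token bucket decision; returns (allowed, remaining, reset_at)."""
    # 1. Refill based on the time elapsed since the last access.
    elapsed = max(0.0, now - state.last_refill)
    # 2. Cap the token count at the bucket capacity.
    tokens = min(capacity, state.tokens + elapsed * refill_rate)
    # 3. Try to consume one token.
    allowed = tokens >= 1.0
    if allowed:
        tokens -= 1.0
    state.tokens = tokens
    state.last_refill = now
    # 4. Assumed meaning of "reset": when the bucket would be full again.
    reset_at = now + (capacity - tokens) / refill_rate
    return allowed, int(tokens), reset_at
```

In the real design this whole function body would run inside Redis as one script invocation, so that no other request can observe the state between steps 1 and 3.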

If the request is allowed (tokens were available), the RateLimiter forwards it to the Main Load Balancer, which distributes traffic across BackendService pods using round-robin. The backend processes the request normally and returns the response through the same path. Response headers include X-RateLimit-Remaining and X-RateLimit-Reset to inform clients of their quota status. If the token bucket is empty (rate limit exceeded), the RateLimiter immediately returns HTTP 429 Too Many Requests with a Retry-After header indicating when the next token will be available. No traffic reaches the backend during rejection, which is the core protective property.
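
A rough sketch of how a decision might be turned into response metadata follows. The function name rate_limit_response and its parameters are assumptions for illustration; the header names come from the description above. Retry-After is rounded up to whole seconds, with a floor of one second, because the header carries an integer number of seconds.

```python
import math


def rate_limit_response(allowed, tokens, reset_at, refill_rate):
    """Return (status, headers); status is None when the request is forwarded."""
    headers = {
        "X-RateLimit-Remaining": str(int(tokens)),
        "X-RateLimit-Reset": str(math.ceil(reset_at)),
    }
    if allowed:
        # Forward to the backend; its own status code is used.
        return None, headers
    # Rejected: time until one whole token has been refilled.
    wait_seconds = (1.0 - tokens) / refill_rate
    headers["Retry-After"] = str(max(1, math.ceil(wait_seconds)))
    return 429, headers
```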

The LimitCache (Redis) stores per-key token bucket state using approximately 100 bytes per key: the current token count and the last refill timestamp. A 3-node Redis Cluster provides high availability and handles up to 20,000 concurrent connections. Token bucket entries expire via Redis TTL after one hour of inactivity, preventing unbounded memory growth from inactive keys. At 100 million active keys, the total memory footprint is approximately 10 GB (100 million keys × 100 bytes), which fits within a 13 GB Redis instance with roughly 3 GB of headroom.
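
The storage estimate is simple arithmetic and can be checked as follows. The 100-byte figure is the approximation given above; actual per-key memory in Redis depends on the encoding and key length, so treat the result as an order-of-magnitude estimate. The function name footprint_gb is hypothetical.

```python
ACTIVE_KEYS = 100_000_000   # 100 million active token buckets
BYTES_PER_KEY = 100         # approximate state size per key (from the text)


def footprint_gb(keys, bytes_per_key):
    """Total footprint in decimal gigabytes (1 GB = 10**9 bytes)."""
    return keys * bytes_per_key / 1e9


total_gb = footprint_gb(ACTIVE_KEYS, BYTES_PER_KEY)   # 10.0
headroom_gb = 13 - total_gb                           # 3.0 on a 13 GB instance
```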

The RateLimiter itself is stateless and horizontally scalable. Multiple instances can run behind a load balancer, all querying the same centralized Redis cluster for consistent rate-limit state. The token bucket configuration (capacity of 100 tokens, refill rate of 10 tokens per second) is tunable per key or per tier, enabling differentiated rate limits for free and premium API consumers.

Key Design Decisions

Rate Limiting Algorithm

Choice: Token bucket with configurable capacity and refill rate

Rationale: Token bucket naturally accommodates burst traffic: a user who has been idle accumulates tokens up to the bucket capacity, allowing short spikes without rejection. Sliding window counters reject every request above the per-window limit, even when the longer-term average rate is within bounds. Token bucket is widely used, including by AWS API Gateway, Stripe, and GitHub, because it provides a better developer experience with fewer false-positive rejections.

State Store

Choice: Centralized Redis with atomic Lua scripts

Rationale: Rate-limit decisions must be globally consistent: a user cannot bypass limits by hitting different server instances. Redis provides a centralized, low-latency store visible to all rate limiter instances. The Lua script executes the entire check-refill-decrement operation atomically within Redis, eliminating the race condition inherent in separate GET-check-SET sequences where two concurrent requests could both read tokens=1 and both pass. One Lua script call equals one network round-trip equals one atomic decision.

Placement in Architecture

Choice: Rate limiter positioned before the load balancer

Rationale: Placing the rate limiter in front of the load balancer ensures that rejected traffic never reaches backend services, protecting them from overload during abuse spikes. The rate limiter is a lightweight component with no business logic, capable of handling much higher throughput than the backend. This placement is critical when the backend has limited capacity and expensive per-request processing costs.

Failure Policy

Choice: Fail-open (allow all traffic) when Redis is unreachable

Rationale: A fail-closed policy would cause a complete service outage if Redis goes down, because no requests would be processed. Fail-open allows traffic through temporarily without rate limiting, preserving service availability at the cost of temporarily disabled throttling. This trade-off is appropriate for most APIs where availability is more important than strict rate enforcement. For security-critical endpoints (login, 2FA), a fail-closed policy can be configured per route.

Scale & Performance

Target RPS: 20,000 peak rate-limit decisions/s

Latency (p99): <5ms (rate-limit decision); <100ms (backend response)

Storage: ~10 GB (100M active token buckets at ~100 bytes each)

Availability: 99.95%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

What is the difference between token bucket and sliding window rate limiting?

Token bucket maintains a counter of available tokens that refills at a steady rate up to a maximum capacity. Each request consumes one token; when tokens are exhausted, requests are rejected. Sliding window counts the requests made during a rolling interval that ends at the current time (for example, the last 60 seconds) and rejects a request once that count reaches the limit; a fixed window, by contrast, resets its counter at fixed boundaries. The key difference is burst handling: token bucket allows short bursts up to the bucket capacity (tokens accumulated during idle periods), while sliding window strictly enforces the per-window limit. Token bucket provides a smoother, more forgiving experience for API consumers and is a common choice at AWS, Stripe, and GitHub.
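
The burst-handling difference can be seen in a small side-by-side sketch. Both classes below are simplified, single-process illustrations with hypothetical names; the sliding window is implemented as a timestamp log, which is one of several possible sliding-window variants. Both limiters are configured for a sustained rate of 10 requests per second, and the token bucket is given a burst capacity of 30.

```python
from collections import deque


class TokenBucket:
    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity, self.refill_rate = capacity, refill_rate
        self.tokens, self.last = float(capacity), now

    def allow(self, now):
        # Refill for the elapsed time, capped at capacity, then try to consume.
        refilled = self.tokens + (now - self.last) * self.refill_rate
        self.tokens = min(self.capacity, refilled)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SlidingWindowLog:
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.events = deque()  # timestamps of accepted requests

    def allow(self, now):
        # Drop timestamps that have slid out of the window ending at `now`.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False


def accepted_in_burst(limiter, now, n):
    """Send n requests at the same instant and count how many are accepted."""
    return sum(limiter.allow(now) for _ in range(n))


# After a long idle period, 30 requests arrive at t=100.
bucket_accepted = accepted_in_burst(TokenBucket(30, 10.0), 100.0, 30)      # 30
window_accepted = accepted_in_burst(SlidingWindowLog(10, 1.0), 100.0, 30)  # 10
```

The token bucket absorbs the whole burst because the idle client had a full bucket, while the sliding window accepts only its per-window limit.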

Why use Redis Lua scripts instead of separate GET and SET commands?

A separate GET-check-SET sequence introduces a race condition: two concurrent requests could both read tokens=1, both determine the request is allowed, both decrement, and both pass — exceeding the rate limit. The Lua script executes atomically within Redis, meaning no other command can interleave between the refill calculation, token check, and decrement. This eliminates the race condition without requiring distributed locks. A single Lua script call completes in one network round-trip (approximately 2ms), making it both correct and fast.
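
The race can be reproduced deterministically without threads by writing out the problematic interleaving by hand. FakeStore is a hypothetical in-memory stand-in for Redis, used only to show the ordering of reads and writes; atomic_take stands in for what a single script invocation achieves.

```python
class FakeStore:
    """In-memory stand-in for a key-value store (illustration only)."""

    def __init__(self, tokens):
        self.tokens = tokens


def racy_get_check_set(store):
    """Two clients interleave GET, check, SET; returns how many were allowed."""
    seen_a = store.tokens   # client A: GET
    seen_b = store.tokens   # client B: GET, before A has written anything
    allowed = 0
    for seen in (seen_a, seen_b):
        if seen >= 1:                  # check against a stale read
            store.tokens = seen - 1    # SET
            allowed += 1
    return allowed


def atomic_take(store):
    """Check and decrement as one indivisible step."""
    if store.tokens >= 1:
        store.tokens -= 1
        return True
    return False


racy_allowed = racy_get_check_set(FakeStore(tokens=1))   # 2: limit exceeded
store = FakeStore(tokens=1)
atomic_results = [atomic_take(store), atomic_take(store)]  # [True, False]
```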

How does a distributed rate limiter maintain consistency across multiple servers?

By storing all token bucket state in a centralized Redis cluster, every rate limiter instance queries the same source of truth. When a user sends requests that are load-balanced across different server instances, each instance checks and updates the same Redis key. The atomic Lua script ensures that concurrent updates from different servers are serialized correctly. This centralized approach guarantees that users see a consistent remaining quota regardless of which server handles their request, at the cost of a Redis round-trip on every request.

What happens when the Redis backing store goes down?

The rate limiter implements a configurable failure policy. The default is fail-open: if Redis is unreachable, all requests are allowed through to the backend without rate limiting. This preserves service availability during Redis outages but temporarily disables throttling. For security-critical endpoints (authentication, payment), a fail-closed policy can be configured where all requests are blocked if rate limiting cannot be enforced. Redis Cluster with replication provides high availability to minimize the frequency and duration of such failure scenarios.
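
A per-route failure policy might look like the following sketch. Everything here is assumed for illustration: the route names, the FAIL_CLOSED_ROUTES set, and the check callable that performs the Redis-backed decision. Python's built-in ConnectionError is used as a placeholder; a real client library raises its own exception types, which would need to be caught instead.

```python
# Hypothetical routes where requests are blocked if limits cannot be enforced.
FAIL_CLOSED_ROUTES = {"/login", "/payments"}


def is_allowed(check, key, route):
    """Apply the rate-limit check, falling back to the route's failure policy."""
    try:
        return check(key)
    except ConnectionError:
        # Store unreachable: fail-open by default, fail-closed for listed routes.
        return route not in FAIL_CLOSED_ROUTES


def store_is_down(key):
    """Simulates an unreachable backing store."""
    raise ConnectionError("rate-limit store unreachable")


is_allowed(store_is_down, "user:42", "/v1/items")   # True  (fail-open)
is_allowed(store_is_down, "user:42", "/login")      # False (fail-closed)
```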

How do you implement tiered rate limits for free vs. premium API users?

Each API key or user ID maps to a tier configuration that specifies the token bucket parameters: capacity (maximum burst size) and refill rate (sustained requests per second). Free-tier users might have a capacity of 20 tokens with a refill rate of 2 per second, while premium users get a capacity of 200 tokens with a refill rate of 50 per second. The bucket is created with the appropriate parameters the first time a key is encountered. Because the Lua script reads the current configuration alongside the bucket state on every call, rather than only when the bucket is created, tier changes take effect on the next request.
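
A tier lookup could be sketched as below. The Tier class, the TIERS table, the key_tiers mapping, and the example API key are all hypothetical; the capacities and refill rates are the example values from the answer above. Resolving the parameters on every request, rather than caching them with the bucket, is what lets a tier change apply immediately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    capacity: int        # maximum burst size
    refill_rate: float   # sustained requests per second


TIERS = {
    "free": Tier(capacity=20, refill_rate=2.0),
    "premium": Tier(capacity=200, refill_rate=50.0),
}


def bucket_params(api_key, key_tiers, default="free"):
    """Resolve the (capacity, refill_rate) pair to pass with each decision."""
    tier = TIERS[key_tiers.get(api_key, default)]
    return tier.capacity, tier.refill_rate


key_tiers = {"key_abc": "free"}              # hypothetical key-to-tier mapping
before = bucket_params("key_abc", key_tiers)  # (20, 2.0)
key_tiers["key_abc"] = "premium"              # user upgrades
after = bucket_params("key_abc", key_tiers)   # (200, 50.0)
```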
