Hard · 10 components · Interview: High

Flash Sale — Queue-Based Commerce

Design a flash sale system that handles millions of concurrent buyers with virtual queues, token-gated reservations, and atomic inventory counters to guarantee zero overselling.

queues · rate-limiting · redis · e-commerce
Problem Statement

Flash sales represent one of the most extreme traffic patterns in system design. When a limited-quantity sale opens, millions of users attempt to purchase simultaneously within a narrow window of seconds. The system must handle traffic spikes of 500K requests per second or more at the moment the sale begins, then taper rapidly as inventory is exhausted. This burst pattern, often called a thundering herd, is fundamentally different from steady-state traffic and demands specialized architecture.

This problem appears frequently in system design interviews at companies that operate e-commerce platforms, ticketing services, or any marketplace with limited inventory. Interviewers are looking for candidates who understand demand shaping, atomic inventory operations, and the trade-offs between fairness and throughput. A naive approach where every user hits the inventory database simultaneously will collapse under load, so the candidate must articulate a strategy for throttling demand before it reaches the backend.

In production, companies like Ticketmaster, Amazon (Prime Day lightning deals), and Shopify handle flash sale events that generate traffic orders of magnitude above their normal baseline. The November 2022 Taylor Swift Eras Tour presale famously overwhelmed Ticketmaster, with roughly 14 million users hitting the site at once. Failures like this illustrate why virtual queues, token-based admission control, and atomic inventory counters are essential rather than optional.

The key challenges are preventing overselling (selling more units than exist in stock) with zero tolerance, ensuring fairness so that earlier arrivals are served first, maintaining availability under 1000x normal load, and processing payments asynchronously so that the reservation acknowledgment stays fast even when payment gateways are slow.

Architecture Overview

The architecture uses a virtual queue with controlled token release to shape demand from millions of concurrent users down to a sustainable backend rate. When the sale opens, all users enter a virtual queue backed by a Redis sorted set (ZADD with timestamp score for FIFO ordering). A dedicated TokenWorker releases purchase tokens at a controlled rate of 1,000 per second by reading the next batch from the queue via ZPOPMIN. Only users holding a valid token can proceed to the reservation endpoint.
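The queue mechanics can be sketched as follows. This is a minimal, hedged illustration that uses an in-memory stand-in for the Redis sorted set so it runs without a server; VirtualQueue, zadd_nx, TokenWorker.tick, and the token format are hypothetical names chosen for this sketch. Using the NX flag (so a second join cannot reset a user's position) is an assumption beyond the description above, though NX is a real ZADD option.

```python
import secrets
from bisect import bisect_left, insort


class VirtualQueue:
    """In-memory stand-in for the QueueCache sorted set (no Redis server needed).

    Entries are kept ordered by (score, member), mirroring how Redis breaks
    score ties by comparing members lexicographically.
    """

    def __init__(self):
        self._scores = {}   # member -> score
        self._ordered = []  # sorted list of (score, member) pairs

    def zadd_nx(self, member, score):
        """Like ZADD key NX score member: an existing entry is never updated,
        so re-joining cannot move a user. Returns 1 if added, else 0."""
        if member in self._scores:
            return 0
        self._scores[member] = score
        insort(self._ordered, (score, member))
        return 1

    def zrank(self, member):
        """0-based queue position, or None if the member is absent (like ZRANK)."""
        if member not in self._scores:
            return None
        return bisect_left(self._ordered, (self._scores[member], member))

    def zpopmin(self, count):
        """Remove and return up to `count` lowest-score entries (like ZPOPMIN)."""
        popped, self._ordered = self._ordered[:count], self._ordered[count:]
        for _, member in popped:
            del self._scores[member]
        return [(member, score) for score, member in popped]


class TokenWorker:
    """Hypothetical TokenWorker: admits users at a fixed rate in small batches."""

    def __init__(self, queue, rate_per_sec=1000):
        self.queue = queue
        self.rate_per_sec = rate_per_sec
        self.tokens = {}  # token -> user_id (QueueCache would store these with a TTL)

    def tick(self, elapsed_sec):
        """Pop one batch sized to the elapsed interval and issue one token per user."""
        batch = round(self.rate_per_sec * elapsed_sec)  # e.g. 100 tokens per 100 ms
        issued = {}
        for user_id, _score in self.queue.zpopmin(batch):
            token = secrets.token_urlsafe(16)
            self.tokens[token] = user_id
            issued[user_id] = token
        return issued


q = VirtualQueue()
q.zadd_nx("alice", 1_700_000_000.001)  # arrival timestamp as the score
q.zadd_nx("bob", 1_700_000_000.002)
q.zadd_nx("alice", 1_700_000_009.0)    # ignored: alice keeps her original place
worker = TokenWorker(q, rate_per_sec=1000)
admitted = worker.tick(0.001)           # 1 ms at 1,000/s admits exactly one user
```

Releasing many small batches (for example 100 tokens every 100 ms) rather than one batch of 1,000 per second smooths the load the reservation path sees within each second.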

The system is split into two services with distinct scaling profiles. QueueService handles the initial burst of 500K RPS queue-join requests. It is intentionally simple, performing a single Redis ZADD per request, and runs on 20 pods to absorb the burst. ReservationService handles the token-gated flow at a controlled 1K RPS. It validates the purchase token, performs an atomic DECR on an InventoryCache Redis key to prevent overselling, persists the order to a PostgreSQL database, and publishes a checkout event to Kafka.
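The reservation path described above can be sketched as one handler. This is a hedged illustration, not the template's implementation: plain Python containers stand in for QueueCache (tokens), InventoryCache (the counter), OrderDB (orders), and Kafka (events), and the class name, status strings, and order ID format are hypothetical. Releasing the unit and restoring the token when the database write fails is an assumed compensation step that the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class ReservationService:
    """Hypothetical sketch of the token-gated reservation flow."""

    inventory: int                                # InventoryCache counter stand-in
    tokens: dict = field(default_factory=dict)    # token -> user_id (QueueCache)
    orders: dict = field(default_factory=dict)    # order_id -> record (OrderDB)
    events: list = field(default_factory=list)    # checkout topic (Kafka)
    db_available: bool = True                     # toggle to simulate a DB outage

    def _decr(self):
        # Stand-in for Redis DECR: decrement and return the new value.
        self.inventory -= 1
        return self.inventory

    def _incr(self):
        # Stand-in for Redis INCR.
        self.inventory += 1
        return self.inventory

    def _persist(self, order_id, user_id):
        if not self.db_available:
            raise RuntimeError("OrderDB unavailable")
        self.orders[order_id] = {"user_id": user_id, "status": "RESERVED"}

    def reserve(self, token):
        user_id = self.tokens.pop(token, None)  # tokens are single-use
        if user_id is None:
            return "INVALID_TOKEN"
        if self._decr() < 0:
            self._incr()                        # stock exhausted: restore the counter
            return "SOLD_OUT"
        order_id = f"order-{user_id}"
        try:
            self._persist(order_id, user_id)
        except RuntimeError:
            self._incr()                        # assumed compensation: release the unit
            self.tokens[token] = user_id        # and let the user retry with the same token
            return "RETRY_LATER"
        self.events.append({"type": "checkout", "order_id": order_id, "user_id": user_id})
        return order_id


svc = ReservationService(inventory=1, tokens={"t1": "alice", "t2": "bob"})
first = svc.reserve("t1")   # takes the last unit
second = svc.reserve("t2")  # counter would go negative, so this is refused
```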

The data layer separates concerns across two Redis clusters and one relational database. QueueCache (a 6-node Redis cluster) stores the virtual queue and purchase tokens, handling 500K ZADD operations per second at sale start. InventoryCache (a 3-node Redis cluster) holds atomic inventory counters where DECR guarantees that stock can never go negative. OrderDB (PostgreSQL with 16 partitions and 3 replicas) provides durable storage for reservation records with strong consistency.

Asynchronous payment processing is handled by a Kafka-backed checkout pipeline. ReservationService publishes checkout events to a 32-partition Kafka topic, and CheckoutWorker instances consume these events to process payments via an external gateway. Failed payments trigger an INCR on InventoryCache to release the reserved stock back to the pool. This decoupling ensures that users see a fast reservation acknowledgment while payment settlement happens in the background.
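A minimal sketch of the CheckoutWorker loop, assuming a `queue.Queue` as a stand-in for one Kafka partition, a one-element dict for the InventoryCache key, and a `charge` callable for the external payment gateway; the function name and status values are hypothetical. The status check before settling is an assumption added because Kafka consumers commonly see at-least-once delivery, so a redelivered event must not release the same unit twice.

```python
import queue


def run_checkout_worker(events, inventory, orders, charge):
    """Drain checkout events; on payment failure, INCR the counter to release stock."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return
        order = orders[event["order_id"]]
        if order["status"] != "RESERVED":
            continue  # already settled: ignore a duplicate delivery
        if charge(event["user_id"]):
            order["status"] = "PAID"
        else:
            order["status"] = "PAYMENT_FAILED"
            inventory["stock"] += 1  # Redis INCR: return the unit to the pool


events = queue.Queue()
orders = {"o1": {"status": "RESERVED"}, "o2": {"status": "RESERVED"}}
for order_id, user_id in [("o1", "alice"), ("o2", "bob"), ("o2", "bob")]:  # o2 delivered twice
    events.put({"order_id": order_id, "user_id": user_id})
inventory = {"stock": 0}
run_checkout_worker(events, inventory, orders, charge=lambda uid: uid == "alice")
```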

Key Design Decisions
Virtual Queue for Demand Shaping

Choice

Redis sorted set with controlled token release at 1,000/sec

Rationale

Without a queue, 10 million users hitting the inventory key at once would generate on the order of 10M operations per second against a single key, a thundering herd that overwhelms even Redis. The virtual queue cuts the load reaching the reservation backend by roughly 10,000x (from ~10M requests to 1,000 per second) while maintaining FIFO fairness. Users join the queue in O(log N) time via ZADD, and the TokenWorker releases tokens at a controlled rate that the reservation backend can sustain.

Atomic Redis DECR for Inventory

Choice

Redis DECR instead of database transactions

Rationale

Redis executes commands on a single thread, so DECR is atomic and typically completes in under 1ms. A database transaction using SELECT FOR UPDATE followed by UPDATE and COMMIT can take 50-100ms under load, because 1,000 concurrent requests all serialize on the same inventory row lock. The atomic DECR guarantees zero overselling because the decrement and the value it returns are indivisible, which eliminates the read-then-write race between concurrent buyers.
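The DECR-then-check pattern can be exercised under real thread concurrency. In this hedged sketch a lock around each counter method imitates Redis running one command at a time; AtomicCounter, try_reserve, and simulate are hypothetical names. However many buyers race, the number of successes never exceeds the starting stock, and every failed attempt's decrement is undone.

```python
import threading


class AtomicCounter:
    """Imitates Redis DECR/INCR on one key. Redis runs each command to
    completion on a single thread; a lock around each method models that."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def decr(self):
        with self._lock:
            self._value -= 1
            return self._value

    def incr(self):
        with self._lock:
            self._value += 1
            return self._value

    def get(self):
        with self._lock:
            return self._value


def try_reserve(counter):
    """Succeed only if the value returned by DECR is still >= 0."""
    if counter.decr() >= 0:
        return True
    counter.incr()  # went negative: stock was already gone, so restore it
    return False


def simulate(stock, buyers):
    """Race `buyers` threads for `stock` units; return (successes, final counter)."""
    counter = AtomicCounter(stock)
    results = []
    results_lock = threading.Lock()

    def buyer():
        ok = try_reserve(counter)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=buyer) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), counter.get()


successes, remaining = simulate(stock=10, buyers=200)
```

With 200 buyers and 10 units, exactly 10 reservations succeed and the counter ends at 0, because the first failure can only happen once every unit has already been claimed.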

Separate Queue and Reservation Services

Choice

Two independent services with different scaling profiles

Rationale

QueueService handles 500K RPS of simple queue joins (a single ZADD to Redis), while ReservationService handles 1K RPS of complex reservation logic (token validation, inventory DECR, DB write, Kafka publish). Separating them prevents the 500K-RPS queue-join flood from starving the reservation path. Each service scales independently based on its own throughput requirements.

Async Payment via Kafka

Choice

Kafka event stream for checkout processing

Rationale

Payment gateway calls take 200-500ms and can fail. Blocking the reservation response on payment would add that 200-500ms to every reservation, well beyond the 50ms p99 target, and would cause timeouts during peak load. Kafka decouples reservation acknowledgment from payment processing, so users see a confirmed reservation immediately while CheckoutWorker settles payment asynchronously. Failed payments release stock back to inventory via INCR.

Scale & Performance

Target RPS

500K RPS burst at sale start (queue joins); 1K RPS sustained (token-gated reservations)

Latency (p99)

p99 < 50ms for reservation; < 15ms for queue join

Storage

~2 GB Redis working set for 10M queue entries; PostgreSQL for durable order records

Availability

99.99% during sale window with pre-provisioned capacity and no cold starts
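A quick back-of-envelope check of the figures above. It reads "~2 GB" as decimal gigabytes and takes the 10,000-item sale size mentioned in the FAQ as an assumed baseline; the constant names are illustrative only. The last figure also assumes each admitted buyer reserves exactly one unit and no payments fail.

```python
QUEUE_JOIN_RPS = 500_000             # burst at sale start
QUEUE_SERVICE_PODS = 20              # QueueService replicas
TOKEN_RATE_PER_SEC = 1_000           # TokenWorker release rate
QUEUE_ENTRIES = 10_000_000           # users waiting in the virtual queue
REDIS_WORKING_SET_BYTES = 2 * 10**9  # "~2 GB", read as decimal gigabytes
SALE_ITEMS = 10_000                  # assumed sale size

per_pod_rps = QUEUE_JOIN_RPS / QUEUE_SERVICE_PODS          # joins each pod absorbs
bytes_per_entry = REDIS_WORKING_SET_BYTES / QUEUE_ENTRIES  # memory budget per queued user
seconds_to_allocate = SALE_ITEMS / TOKEN_RATE_PER_SEC      # token time to cover every unit

print(f"{per_pod_rps:,.0f} joins/s per pod")                     # 25,000 joins/s per pod
print(f"{bytes_per_entry:,.0f} bytes per queue entry")           # 200 bytes per queue entry
print(f"{seconds_to_allocate:,.0f} s to cover {SALE_ITEMS:,} units")  # 10 s to cover 10,000 units
```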

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does the system prevent overselling during a flash sale?

The system uses atomic Redis DECR operations on inventory counters. When a user with a valid purchase token calls the reservation endpoint, ReservationService performs DECR on the inventory key in Redis. If the returned value is greater than or equal to zero, the reservation succeeds. If it is negative (stock exhausted), the reservation fails and the counter is restored with INCR. Because Redis executes commands one at a time on a single thread, the decrement and its return value are indivisible, so two concurrent requests can never both claim the last unit: one sees 0 and succeeds, the other sees -1 and fails. This guarantees zero overselling regardless of concurrency level.

Why use a virtual queue instead of a simple rate limiter?

A rate limiter rejects excess requests, meaning millions of users get error responses and must retry, creating even more load. A virtual queue accepts all users, assigns them a position, and serves them in order. This provides fairness (FIFO ordering by arrival time), reduces retry storms (users wait instead of retrying), and shapes demand to the rate the backend can handle. The queue also provides a better user experience because users can see their position and estimated wait time rather than receiving opaque error messages.
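The estimated wait shown to a queued user can be derived from their position and the token release rate. This is a minimal sketch assuming a steady 1,000 tokens per second with one token per user ahead; the function name is hypothetical, and the estimate ignores pauses in release and the point at which stock sells out.

```python
import math


def estimated_wait_seconds(position, release_rate_per_sec=1000):
    """Upper-bound wait, in whole seconds, for a user at 0-based `position`."""
    return math.ceil((position + 1) / release_rate_per_sec)


# A user at position 2,499,999 is admitted in the 2,500th second of release.
wait = estimated_wait_seconds(2_499_999)
```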

What happens if a user reserves an item but never completes payment?

Each reservation has a 5-minute TTL. If the CheckoutWorker does not receive a successful payment confirmation within 5 minutes, the reservation expires. A background process detects expired reservations, updates their status to EXPIRED in the orders database, and performs an INCR on the Redis inventory counter to release the stock back to the pool. This ensures that abandoned reservations do not permanently drain inventory. The TTL is configurable per sale event to balance between giving users enough time to pay and releasing stock quickly for other buyers.
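The expiry sweep can be sketched as below, under stated assumptions: reservations are plain in-memory records, a dict stands in for the Redis inventory key, and the names Reservation and sweep_expired are hypothetical. Releasing stock only for rows still in RESERVED status is an assumption meant to keep a late payment and the sweeper from both acting on the same order; in OrderDB that check would be a conditional UPDATE.

```python
from dataclasses import dataclass

RESERVATION_TTL_SEC = 5 * 60  # 5-minute TTL; configurable per sale event


@dataclass
class Reservation:
    order_id: str
    created_at: float          # epoch seconds
    status: str = "RESERVED"   # RESERVED -> PAID | EXPIRED


def sweep_expired(reservations, inventory, now, ttl_sec=RESERVATION_TTL_SEC):
    """Mark unpaid reservations older than the TTL as EXPIRED and return their
    units to the counter (Redis INCR in the real system). Returns units released."""
    released = 0
    for r in reservations:
        if r.status == "RESERVED" and now - r.created_at >= ttl_sec:
            r.status = "EXPIRED"      # OrderDB: UPDATE ... WHERE status = 'RESERVED'
            inventory["stock"] += 1   # INCR only after the status change succeeds
            released += 1
    return released


rs = [Reservation("a", 0.0), Reservation("b", 200.0), Reservation("c", 0.0, "PAID")]
inv = {"stock": 0}
released = sweep_expired(rs, inv, now=300.0)  # only "a" is unpaid and past its TTL
```

Because expired rows change status, running the sweep again releases nothing, so it is safe to run on a short schedule.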

How does this architecture handle bot traffic?

The API Gateway enforces a 600K RPS global rate limit to prevent raw volumetric attacks. Per-user rate limiting on the queue-join endpoint ensures each user can only join a sale queue once. However, bots can still game queue position by joining microseconds after the sale opens using automated scripts. Production-grade mitigation requires additional layers such as CAPTCHA at queue join, device fingerprinting, proof-of-work challenges, or lottery-based admission. These measures are not modeled in the simulation but are critical for real-world deployments.
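Of the mitigations listed above, a proof-of-work challenge is the simplest to illustrate. This hedged sketch uses a hashcash-style scheme: the server issues a per-user challenge string, the client searches for a nonce whose SHA-256 digest has enough leading zero bits, and the server verifies it with a single hash before accepting the queue join. The challenge format, difficulty, and function names are all hypothetical; this raises the cost of mass automated joins but does not stop a single well-resourced bot.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits in a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify_pow(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve_pow(challenge: str, difficulty_bits: int) -> int:
    """Client side: about 2**difficulty_bits hashes on average."""
    for nonce in itertools.count():
        if verify_pow(challenge, nonce, difficulty_bits):
            return nonce


nonce = solve_pow("sale-42:alice", difficulty_bits=12)
```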

Can this architecture scale beyond 10,000 items per sale?

Yes, but the single Redis inventory key becomes a hot spot at very high reservation rates. At the template's 1K reservations per second, a single Redis key handles the load easily. For sales with 100K+ items and 10K+ reservations per second, the inventory counter should be sharded across multiple Redis keys (sub-bucket inventory pattern). Each shard holds a portion of the total stock, and reservation requests are distributed across shards. This trades some complexity for horizontal scalability on the inventory path.
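The sub-bucket pattern can be sketched as follows, assuming each list slot would be its own Redis key (ideally on different cluster nodes) and that each DECR/INCR is atomic as in Redis; this in-memory version is single-threaded, and ShardedInventory is a hypothetical name. The fall-through loop is an assumed detail: without it, a request landing on an empty shard would be refused while other shards still hold stock.

```python
import random


class ShardedInventory:
    """Total stock split across N counters, one per (hypothetical) Redis key."""

    def __init__(self, total, shards):
        base, extra = divmod(total, shards)
        self.counters = [base + (1 if i < extra else 0) for i in range(shards)]

    def _decr(self, i):
        self.counters[i] -= 1
        return self.counters[i]

    def _incr(self, i):
        self.counters[i] += 1

    def reserve(self, rng=random):
        """Try a random shard first, then fall through the others so a request
        is refused only once every shard is empty. Returns shard index or None."""
        n = len(self.counters)
        start = rng.randrange(n)
        for offset in range(n):
            i = (start + offset) % n
            if self._decr(i) >= 0:
                return i
            self._incr(i)  # this shard is empty: undo and try the next one
        return None


inv = ShardedInventory(total=10, shards=3)
initial = list(inv.counters)
results = [inv.reserve(random.Random(7)) for _ in range(12)]
```

The trade-off is that near sellout a single request may touch every shard before being refused, so the shard count should stay modest.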
