Hard | 13 components | Interview: High

Ticketmaster — CDN Edge + Waiting Room

Taylor Swift-scale ticket booking: CloudFront CDN absorbs 90% of browse traffic at the edge, a virtual WaitingRoomService gates onsale access, a 6-node Redis cluster handles per-seat holds, and a Kafka-updated AvailabilityCache keeps the seat map fresh for 5M concurrent viewers.

Transactions | CDN | Redis Cluster | Kafka | Waiting Room | FAANG Scale

Problem Statement

Taylor Swift scale means something precise: 5 million fans attempting to buy tickets for 60,000 seats the millisecond onsale opens. This is an 83:1 demand-to-supply ratio. Without queuing and demand shaping, every user simultaneously fires a seat hold request to SeatService at the same instant — and SeatHoldCache sees 5 million concurrent SETNX operations at time zero. Even a well-provisioned Redis cluster cannot handle 5 million simultaneous connections in a single burst. The result is connection pool exhaustion, lock-up, and total unavailability during the most critical 60 seconds of the event's commercial life. V2 exists to solve this specific problem: how do you serve 5 million users fairly and reliably at a moment when demand is 83 times supply?

The CDN layer is the first line of defense. Event browse pages contain content that changes rarely: artist biography, venue map images, pricing tier descriptions, event date and time, and seat map tile images (static venue diagrams). CloudFront caches this content at edge locations worldwide. When 10 million users open the event page simultaneously, 90% of requests are served from the nearest CDN edge with sub-10ms latency and zero origin load. The remaining 10% (cache misses for just-updated content, or requests for dynamic data) reach the origin API Gateway. Without the CDN, all 10 million simultaneous requests would hit origin infrastructure directly, requiring roughly 10x the origin capacity to handle the same load. The CDN is not just a performance optimization; it is the capacity multiplier that makes sub-100ms event page loads achievable at 10M concurrent users.
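The edge-versus-origin arithmetic in this paragraph can be checked with a short sketch. The function name and the integer-percent parameter are illustrative choices, not part of the design; the 10M request figure and the 90% hit ratio are the ones quoted above.

```python
def origin_rps(edge_rps: int, cache_hit_percent: int) -> int:
    """Requests per second that miss the edge cache and reach origin."""
    if not 0 <= cache_hit_percent <= 100:
        raise ValueError("cache_hit_percent must be between 0 and 100")
    return edge_rps * (100 - cache_hit_percent) // 100


with_cdn = origin_rps(10_000_000, 90)    # 90% of requests are served at the edge
without_cdn = origin_rps(10_000_000, 0)  # every request reaches origin
print(with_cdn, without_cdn, without_cdn // with_cdn)
```

With these inputs, about 1M requests per second reach origin behind the CDN, versus the full 10M without it.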

The waiting room is the second architectural layer and the one that requires the most careful design. Even with CDN absorbing 90% of browse traffic, 5 million users simultaneously fire seat hold requests to SeatService at onsale open. WaitingRoomService implements a virtual queue: users receive a queue position when they arrive (ZADD with join timestamp as score), see a real-time queue position display (ZRANK for their position), and receive an admission token when they reach the front (ZPOPMIN batch at 5,000/sec). The admission token is a signed JWT with a 15-minute TTL that authorizes one seat hold attempt. SeatService validates the JWT before processing any SETNX — users without a valid token are rejected immediately. The 5,000/sec release rate is matched to SeatService's sustainable SETNX throughput on the Redis cluster, preventing the thundering herd from ever reaching Redis.
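The queue mechanics described above can be sketched without a Redis server. The class below is an in-memory stand-in for one sorted set (queue:{event_id}), not the actual WaitingRoomService; the class and method names are hypothetical. One assumption beyond the text: a re-join keeps the user's original position (the behaviour of ZADD with the NX flag), whereas a plain ZADD would overwrite the score.

```python
import bisect
from typing import Optional


class WaitingRoomQueue:
    """In-memory model of a Redis sorted set scored by join timestamp."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}          # user_id -> join timestamp
        self._ordered: list[tuple[float, str]] = []  # sorted by (score, user_id)

    def join(self, user_id: str, joined_at: float) -> bool:
        """Like ZADD ... NX: add the user unless they are already queued."""
        if user_id in self._scores:
            return False
        self._scores[user_id] = joined_at
        bisect.insort(self._ordered, (joined_at, user_id))
        return True

    def position(self, user_id: str) -> Optional[int]:
        """Like ZRANK: 0-indexed position, or None if the user is not queued."""
        score = self._scores.get(user_id)
        if score is None:
            return None
        return bisect.bisect_left(self._ordered, (score, user_id))

    def admit(self, batch_size: int) -> list[str]:
        """Like ZPOPMIN with a count: remove and return the earliest arrivals."""
        batch = self._ordered[:batch_size]
        del self._ordered[:batch_size]
        for _, user_id in batch:
            del self._scores[user_id]
        return [user_id for _, user_id in batch]


queue = WaitingRoomQueue()
queue.join("carol", 100.2)
queue.join("alice", 100.0)
queue.join("bob", 100.1)
print(queue.position("carol"))  # carol arrived last of the three
print(queue.admit(2))           # the two earliest arrivals are admitted
print(queue.position("carol"))  # carol is now at the front
```

In the design described here, each user returned by the admit step would then be issued an admission token.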

The AvailabilityCache architecture is the third critical V2 addition. In V1, the seat availability bitmap had a 2-second polling TTL, which is acceptable at moderate scale but inadequate when 5 million users each need sub-1-second freshness. The V2 approach replaces TTL polling with event-driven push updates. SeatService publishes a seat-state-change event to Kafka on every SETNX success and every hold release. NotificationWorker consumes these events and executes an atomic bit-flip update on the AvailabilityCache bitmap (SETBIT key position 1 for hold, SETBIT key position 0 for release). Each bit-flip takes under 1ms and keeps the bitmap continuously current. Users see seat state changes reflected in under 1 second (the Kafka consumer lag) rather than up to 2 seconds with TTL polling. At 5M concurrent seat map viewers, this avoids the worst-case polling alternative, in which every viewer re-reads every seat's state once per second (5M × 60K = 300 billion Redis reads/sec), in favor of one bit-flip per state change event.
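The bit-flip update can be illustrated with a small in-memory model of the bitmap. This is a sketch, not the NotificationWorker implementation: the class name, the seat_index field, and the RELEASED state label are assumptions (the text names only HELD). Bit numbering follows Redis SETBIT, where offset 0 is the most significant bit of the first byte and the call returns the previous bit value.

```python
class AvailabilityBitmap:
    """In-memory model of one availability bitmap (1 = held, 0 = available)."""

    def __init__(self, total_seats: int) -> None:
        self.total_seats = total_seats
        self._bits = bytearray((total_seats + 7) // 8)  # one bit per seat

    def size_bytes(self) -> int:
        return len(self._bits)

    def set_bit(self, seat_index: int, held: bool) -> int:
        """Like SETBIT: write the bit and return its previous value."""
        if not 0 <= seat_index < self.total_seats:
            raise IndexError("seat_index out of range")
        byte_index, offset = divmod(seat_index, 8)
        mask = 0x80 >> offset  # offset 0 is the most significant bit
        previous = 1 if self._bits[byte_index] & mask else 0
        if held:
            self._bits[byte_index] |= mask
        else:
            self._bits[byte_index] &= ~mask & 0xFF
        return previous

    def is_held(self, seat_index: int) -> bool:
        byte_index, offset = divmod(seat_index, 8)
        return bool(self._bits[byte_index] & (0x80 >> offset))

    def held_count(self) -> int:
        """Like BITCOUNT: number of seats currently marked held."""
        return sum(bin(byte).count("1") for byte in self._bits)


def apply_event(bitmap: AvailabilityBitmap, event: dict) -> None:
    """Apply one seat-state-change event (field names are assumed)."""
    bitmap.set_bit(event["seat_index"], event["state"] == "HELD")


bitmap = AvailabilityBitmap(60_000)
apply_event(bitmap, {"seat_index": 1234, "state": "HELD"})
print(bitmap.size_bytes(), bitmap.held_count())
```

A 60,000-seat venue fits in 7,500 bytes at one bit per seat, which is why a single bitmap per event is cheap to keep current.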

The trade-offs of V2 are real and non-trivial. Thirteen components require dedicated SRE expertise: Kafka cluster management (broker sizing, partition count tuning, consumer group lag alerting), Redis cluster operations (resharding, failover testing, slot migration), and CDN configuration (cache-control headers, origin shield, Lambda@Edge function deployment). The waiting room creates queue position anxiety — users who join the queue at position 800,000 and see the 5K/sec release rate can calculate they have a 160-second wait, which generates its own support load. CDN stale-while-revalidate policies mean a seat sold in the last CDN TTL window may still appear available in the edge cache. These are real operational and product trade-offs that interviewers expect senior candidates to acknowledge and reason about explicitly.

Architecture Overview

The V2 architecture uses 13 components: BuyerClient, CDN (CloudFront), LoadBalancer, SearchService, EventCache, EventDB, WaitingRoomService, SeatService, SeatHoldCache (Redis cluster), OrderDB, TicketStream (Kafka), TicketWorker, NotificationWorker, and AvailabilityCache. The key structural differences from V1 are the CDN layer (absorbing 90% of browse load), WaitingRoomService (gating seat hold access), the Redis cluster (replacing the single-node SeatHoldCache), AvailabilityCache (a Kafka-updated bitmap), and NotificationWorker (a dual-purpose Kafka consumer).

The onsale request flow follows two distinct phases. In the pre-onsale phase (before the onsale timestamp), users browse events and seat maps through the CDN and SearchService. All event detail pages are warm in the CDN edge cache. AvailabilityCache shows all seats as available. No WaitingRoom activity.

At onsale open, the flow bifurcates. Users attempting to purchase are redirected to the WaitingRoom: their browser calls WaitingRoomService, which executes ZADD queue:{event_id} {timestamp} {user_id} and returns a queue position. WaitingRoomService runs a background job once per second that executes ZPOPMIN queue:{event_id} 5000, dequeueing 5,000 users per second and issuing each a signed JWT admission token (HS256, expiry = now + 15 minutes). The JWT payload contains user_id, event_id, and a nonce. Users receive a WebSocket push notification when their token is issued.

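Users with a valid admission JWT call SeatService. SeatService first validates the JWT (signature, expiry, and a nonce check to prevent replay), then executes SET seat:{event_id}:{seat_id} {user_id} NX EX 600 on the Redis cluster. This is the atomic form of SETNX with a 600-second TTL; plain SETNX cannot set an expiry. Redis Cluster assigns each key to one of 16,384 hash slots (CRC16 of the key, modulo 16384), so using event_id as the shard key means wrapping the event_id portion of the key in a hash tag (literal braces); all seats for one event then land in the same slot and therefore on the same shard. This is critical for multi-seat bookings (4 seats for the same event are all on the same shard, enabling atomic Lua script operations without cross-shard coordination).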
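To make the shard-key behaviour concrete: Redis Cluster documents its key placement as CRC16 (XMODEM variant) of the key modulo 16384, and when a key contains a non-empty {...} hash tag only the tagged substring is hashed. The sketch below reproduces that rule with the standard library; the function name and the example keys are illustrative assumptions, not taken from the design.

```python
import binascii


def key_hash_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC16/XMODEM checksum.
    return binascii.crc_hqx(data, 0) % 16384


same_event = [key_hash_slot(f"seat:{{evt-42}}:{seat}") for seat in ("A1", "A2", "B7")]
print(same_event)                       # three identical slot numbers
print(key_hash_slot("seat:evt-42:A1"))  # untagged key: the whole key is hashed
```

Because every tagged key for evt-42 hashes only the string evt-42, all of that event's seat keys share one slot, which is the property multi-key Lua scripts rely on.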

SeatService publishes a seat-state-change event to Kafka on every SETNX success: { event_id, seat_id, state: HELD, user_id, timestamp }. NotificationWorker consumes from Kafka and performs two parallel actions: (1) SETBIT availability:{event_id} {seat_index} 1 on AvailabilityCache (atomic bit-flip, ~1ms), updating the seat map for all concurrent viewers within the Kafka consumer lag; and (2) optionally, sending a push notification to the purchaser's device via APNs/FCM. This dual-purpose consumer design eliminates a separate consumer group for availability updates.

TicketWorker consumes from a separate TicketStream Kafka topic (seat_confirmed events published by SeatService on confirmed purchases) and generates QR codes + sends emails asynchronously. TicketWorker is the same as V1 but scaled to handle higher confirmation throughput via additional consumer instances. The TicketStream is partitioned by event_id to ensure per-event ordering of ticket events.

Key Design Decisions

CDN Edge vs Application-Level Browse Caching

Choice

CloudFront CDN with Lambda@Edge for waiting room logic at edge locations

Rationale

CDN edge handles 10M+ RPS without hitting any origin server: each request is served from the nearest PoP with sub-10ms latency. Application-level caching (even EventCache in V1) still requires a request to reach origin infrastructure. With a 90% edge hit ratio, serving 10M concurrent users without the CDN would require roughly 10x the origin capacity. Lambda@Edge allows the waiting room queue check to run at the CDN edge for users attempting to access seat hold flows, rejecting users without tokens before any request reaches origin.

Redis Cluster Sharded by event_id

Choice

6-node Redis cluster (hash-slot sharding), event_id as shard key via a hash tag

Rationale

Sharding by event_id ensures all seats for one event land on the same Redis shard. This is critical for multi-seat bookings: 4 seats from the same event are all on the same node, allowing atomic Lua script operations (SETNX all 4, release all 4 on failure) without cross-shard coordination. Alternative sharding by seat_id would distribute seats across shards, requiring cross-shard atomic operations which Redis Cluster does not support natively.

Kafka-Pushed AvailabilityCache vs TTL Polling

Choice

NotificationWorker consumes seat-state-change events and does O(1) SETBIT updates

Rationale

TTL polling (V1's 2-second bitmap refresh) creates periodic read bursts on SeatHoldCache and always has up to 2 seconds of staleness. Kafka push updates the AvailabilityCache bitmap within the consumer lag (~100-500ms) with zero polling pressure on SeatHoldCache. Each SETBIT is a single O(1) Redis operation regardless of event seat count. At 5M concurrent seat map viewers, every 100ms of reduced staleness is worth the additional Kafka consumer complexity.

NotificationWorker Dual Purpose (Availability + Push Notifications)

Choice

Single Kafka consumer group handles both seat map bitmap updates and mobile push notifications

Rationale

Seat-state-change events are needed by both the availability bitmap update path (SETBIT on AvailabilityCache) and the mobile push notification path (notify the purchaser their seat is confirmed). Running two separate consumer groups for the same Kafka topic would double consumer group lag management complexity and double the Kafka partition assignment overhead. A single consumer handles both in one pass, reading each event once and performing both actions in parallel.

Virtual WaitingRoom vs Hard Rate Limiting

Choice

Queue-based waiting room with Redis sorted set and JWT admission tokens

Rationale

Hard rate limiting (e.g., 5K RPS via API Gateway throttling) rejects users randomly and immediately — users get 429 errors and must retry, creating a thundering retry herd. A virtual waiting room accepts all users into a FIFO queue, gives them a position number and estimated wait time, and releases them in order. This converts 5M simultaneous hostile requests into a controlled 5K/sec orderly stream. Users with a queue position have a concrete expectation of admission time, reducing support load compared to random rejection.

Scale & Performance

Target RPS

10M RPS at CDN edge; ~1M reaches origin; 5K seat holds/sec (waiting room controlled)

Latency (p99)

5ms CDN edge hit, 2ms seat hold (SETNX cluster), <1s seat map staleness, <30s ticket delivery

Storage

~500 GB; Redis cluster 16GB for seat holds + availability across thousands of events

Availability

99.99% for hold path (Redis cluster HA + circuit breakers); 99.9% for browse (CDN failover)

Database Schema (HLD)

events and seats (EventDB — PostgreSQL)

Event catalog and seat geometry store. Events contains event metadata (artist, venue, date, pricing tiers). Seats contains seat geometry (section, row, number) used to render the seat map grid. Neither table stores real-time availability — AvailabilityCache (Redis bitmap) owns that. EventDB is read-mostly at V2 scale since CDN caches event pages and SearchService caches seat geometry.

events:
event_id UUID PK (events table primary key)
name VARCHAR (event display name)
artist VARCHAR (performer or team name)
venue_name VARCHAR (venue name)
city VARCHAR (for geographic filtering)
event_date TIMESTAMPTZ (onsale and event datetime)
total_seats INT (total seats in venue)

seats:
seat_id UUID PK (seats table primary key)
section VARCHAR (section name: Floor, Orchestra, Upper)
row_num VARCHAR (row identifier)
seat_number INT (seat number within row)
price_cents INT (seat price in cents)

Indexes: idx_events_city_date ON events(city, event_date) — filtered browse, idx_seats_event ON seats(event_id) — seat geometry fetch for map

At V2 scale, EventDB is accessed only on CDN/SearchService cache miss (~1% of browse traffic). EventDB is not on the critical path for seat holds — SeatHoldCache (Redis cluster) handles that entirely. Read replicas provide additional read capacity for cache warming after deploy.

orders (OrderDB — PostgreSQL RDS Multi-AZ)

Immutable confirmed purchase records written by SeatService on successful payment confirmation. Source of truth for financial records, refunds, and ticketing. TicketWorker reads from TicketStream (Kafka) rather than this table to avoid polling amplification. At V2 scale, OrderDB receives only confirmed purchase writes — approximately 5K inserts/sec at onsale peak for a single major event.

order_id UUID PK (unique order identifier)
user_id UUID (purchasing user)
seat_id UUID FK (purchased seat)
event_id UUID FK (event)
status VARCHAR (CONFIRMED / CANCELLED / REFUNDED)
price_paid_cents INT (price at time of purchase)
confirmed_at TIMESTAMPTZ (payment confirmation timestamp)
ticket_generated BOOL (updated by TicketWorker on QR delivery)

Indexes: idx_orders_user ON (user_id) — user purchase history, idx_orders_event ON (event_id) — event capacity and sales reporting, idx_orders_seat UNIQUE ON (seat_id) — enforce single confirmed order per seat

Multi-AZ PostgreSQL for 99.99% availability. PgBouncer connection pooler handles 5K concurrent inserts without exhausting the DB connection pool. The UNIQUE index on seat_id provides a database-level double-booking guard as a last-resort backstop behind Redis SETNX.

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

How does the virtual waiting room work technically?

WaitingRoomService uses a Redis sorted set per event: ZADD queue:{event_id} {join_timestamp} {user_id}. Using the join timestamp as the score creates FIFO ordering: users who arrive earlier get lower scores and are dequeued first. Users receive their position via ZRANK queue:{event_id} {user_id}, which returns their 0-indexed position in the sorted set. A background job runs once per second and executes ZPOPMIN queue:{event_id} 5000, dequeueing 5,000 users per second. Each dequeued user receives a signed JWT (HS256, payload: user_id, event_id, nonce, exp = now + 900 seconds) via WebSocket push. SeatService validates the JWT signature, expiry, and nonce (stored in a short-lived Redis SET for replay prevention) before processing any SETNX call. Users attempting SETNX without a valid token receive HTTP 403 immediately.
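The token checks listed above (signature, expiry, nonce) can be sketched with the standard library alone. This is a minimal illustration of HS256 signing under stated assumptions, not the service's code: the function names are hypothetical, the used_nonces Python set stands in for the short-lived Redis SET, and a production system would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, user_id: str, event_id: str, nonce: str,
                now: int, ttl_seconds: int = 900) -> str:
    """Build header.payload.signature with exp = now + ttl_seconds."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "event_id": event_id,
               "nonce": nonce, "exp": now + ttl_seconds}
    parts = [_b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, payload)]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url_encode(_sign(secret, signing_input))


def verify_token(secret: bytes, token: str, event_id: str, now: int,
                 used_nonces: set) -> Optional[dict]:
    """Return the claims if the token is valid and unused, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = _sign(secret, header_b64 + "." + payload_b64)
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    if header.get("alg") != "HS256":
        return None
    if payload.get("event_id") != event_id or payload.get("exp", 0) <= now:
        return None
    nonce = payload.get("nonce")
    if nonce is None or nonce in used_nonces:
        return None  # replayed or missing nonce
    used_nonces.add(nonce)  # mark the token as spent
    return payload
```

A caller mapping this to the behaviour described above would return HTTP 403 whenever verify_token returns None, and proceed to the seat hold otherwise.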

Why does the seat map need sub-1-second freshness and how is it achieved?

When a user sees a seat displayed as available on the map, clicks it, and immediately receives SEAT_UNAVAILABLE, the user experience degrades sharply, especially at onsale when every millisecond matters. The expectation at Taylor Swift scale is that the map reflects the actual state as closely as possible. With 5M concurrent viewers and thousands of seat holds per second at onsale peak (the waiting room admits 5,000 users/sec), the map changes rapidly. V2 achieves sub-1-second staleness via Kafka push: SeatService publishes a seat-state-change event on every successful SETNX, and NotificationWorker consumes it and executes SETBIT on AvailabilityCache within the Kafka consumer lag (typically 100-500ms). The entire pipeline from SETNX to map update takes under 1 second without any polling, compared to V1's fixed 2-second TTL polling window.

What is the failure mode if Kafka goes down during onsale?

SeatService continues operating normally: SETNX on SeatHoldCache is independent of Kafka, so seat holds still work with zero double-booking risk. The consequences of Kafka failure are limited to three areas: AvailabilityCache stops receiving bit-flip updates, so seat map staleness grows beyond 1 second (mitigated by a fallback 5-second TTL on AvailabilityCache entries); NotificationWorker stops sending push notifications; and TicketWorker stops generating QR codes and sending ticket emails. When Kafka recovers, consumers catch up from their last committed offset, so no event that reached Kafka is lost (Kafka retains events for the configured retention period, 7 days by default). State changes that SeatService could not publish during the outage are not in the log, which is another reason the fallback TTL refresh matters. A circuit breaker on TicketStream prevents SeatService from blocking on Kafka publishes if brokers become unavailable.
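The circuit breaker mentioned above can be sketched as a small wrapper around the publish call. This is one common shape of the pattern under assumed parameters, not the actual SeatService code: the class name, the failure threshold, and the reset interval are all hypothetical, and the publish callable stands in for a Kafka producer send.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip publish attempts for a while after repeated failures."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, publish: Callable[[], None]) -> bool:
        """Return True if publish ran and succeeded, False if skipped or failed."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return False  # open: fail fast without touching the broker
            # Reset interval elapsed: fall through and allow one trial call.
        try:
            publish()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return False
        self._failures = 0
        self._opened_at = None  # a success closes the breaker
        return True


def broken_publish() -> None:
    raise ConnectionError("broker unavailable")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=5.0)
print(breaker.call(broken_publish), breaker.call(broken_publish))  # two failures open it
print(breaker.call(lambda: None))  # skipped while the breaker is open
```

The seat hold itself never passes through the breaker; only the event publish does, which is what keeps the hold path independent of Kafka.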

How does multi-seat booking work with a Redis cluster sharded by event_id?

Sharding by event_id ensures all seats for one event map to the same Redis cluster shard: Redis Cluster hashes each key into one of 16,384 hash slots, and a hash tag on the event_id portion of the key puts every seat key for the event in the same slot. A user booking 4 seats together triggers 4 SETNX operations, all landing on the same shard. SeatService pipelines all 4 SETNX calls in a single Redis pipeline batch, reducing round-trip overhead from 4 × 2ms to approximately one 2ms pipeline. If all 4 succeed, the booking proceeds. If any fail, a Lua script (EVAL) atomically releases all successful holds: the script GETs each key, verifies user ownership, and DELetes the keys within a single atomic Redis command. Because all 4 seat keys share a hash slot, the Lua script executes without cross-shard coordination; Redis Cluster rejects multi-key commands and scripts whose keys span different slots.
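The all-or-nothing hold logic can be modelled with a plain dictionary standing in for one Redis shard. This sketch only shows the control flow; it has no TTLs, no pipelining, and none of the atomicity the Lua script provides. The function names and the hash-tagged key format are assumptions for illustration.

```python
def seat_key(event_id: str, seat_id: str) -> str:
    """Key with the event_id in a hash tag so an event's seats share a slot."""
    return f"seat:{{{event_id}}}:{seat_id}"


def release_if_owner(shard: dict[str, str], keys: list[str], user_id: str) -> int:
    """Model of the release script: delete only keys still held by user_id."""
    released = 0
    for key in keys:
        if shard.get(key) == user_id:
            del shard[key]
            released += 1
    return released


def hold_seats(shard: dict[str, str], event_id: str,
               seat_ids: list[str], user_id: str) -> bool:
    """Try to hold every seat; roll back this user's holds if any seat is taken."""
    keys = [seat_key(event_id, seat_id) for seat_id in seat_ids]
    acquired = []
    for key in keys:  # the design sends these as one pipelined batch
        if key not in shard:  # models SETNX succeeding
            shard[key] = user_id
            acquired.append(key)
    if len(acquired) == len(keys):
        return True
    release_if_owner(shard, acquired, user_id)
    return False


shard = {seat_key("evt-42", "A3"): "someone-else"}
print(hold_seats(shard, "evt-42", ["A1", "A2", "A3", "A4"], "u1"))  # A3 is taken
print(sorted(shard))  # only the other user's hold on A3 remains
```

The ownership check in the release step matters because a hold may have expired and been re-acquired by another user between the failed batch and the rollback.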

When would you add a CDN waiting room layer instead of an application-level waiting room?

An application-level waiting room (WaitingRoomService behind API Gateway) can handle up to approximately 1 million simultaneous queue join requests before API Gateway and the application service tier become the bottleneck — even before any SETNX requests reach Redis. When peak traffic exceeds 1M RPS per AZ for queue join operations alone, the waiting room itself becomes the failure point. Lambda@Edge (CDN waiting room) handles queue join logic at the CDN edge, capable of processing 400 Gbps of traffic across hundreds of PoPs before any request reaches origin. For most events — even large stadium concerts — an application-level waiting room is sufficient. Lambda@Edge is appropriate only for truly global events where CDN PoP capacity is needed to absorb the queue join traffic itself.

Related Templates

Ticketmaster — Ticket Booking System (Parent)
Ticketmaster — V0: Naive (SELECT FOR UPDATE)
Ticketmaster — V1: Per-Seat Redis Hold
