An e-commerce checkout system processes 5,000 orders/s during flash sales by coordinating inventory reservation, payment processing, and shipping through the Saga pattern with compensating transactions. This 7-component architecture uses queue-based order buffering to absorb 50-100x traffic spikes, idempotent payment keys to prevent double-charging, and TTL-based inventory holds that auto-expire in 15 minutes to avoid deadlocks.
The e-commerce checkout system is a cornerstone system design interview problem because it exercises distributed transaction management, consistency guarantees, and graceful degradation under extreme load. Unlike simple CRUD applications, a checkout pipeline must coordinate multiple independent services (inventory, payment, shipping, and notification) so that the outcome is all-or-nothing: either the entire order succeeds, or the side effects of every completed step are undone by compensating actions.
At scale, an e-commerce platform like Amazon or Shopify processes thousands of orders per second during peak events (Black Friday, Prime Day, flash sales). The checkout flow must handle inventory contention where hundreds of users attempt to purchase the last remaining items simultaneously. It must integrate with third-party payment processors that have their own latency profiles and failure modes. And it must provide real-time order status updates to users while maintaining data consistency across all services.
The problem becomes especially interesting during flash sales, where traffic can spike 50-100x above baseline within seconds. Naive approaches that lock inventory rows suffer severe lock contention and can deadlock under this load. Candidates must design reservation-based inventory management, implement saga patterns for distributed transactions, and use queue-based order processing to smooth traffic bursts without dropping orders.
This template models the complete checkout architecture: an API gateway with rate limiting, an order orchestrator service, an inventory service with reservation and confirmation states, a payment service with idempotency, a shipping service that creates shipments, a notification service for confirmation emails, and a message queue connecting the asynchronous steps. The simulation demonstrates how queue depth grows during traffic spikes and how the saga pattern handles partial failures.
## How the Checkout Saga Coordinates Distributed Services
The checkout architecture implements the Saga pattern for distributed transaction coordination. When a user submits an order, the API Gateway authenticates the request, applies rate limiting (critical during flash sales), and forwards it to the Order Orchestrator service. The orchestrator is the central coordinator that drives the multi-step checkout process. Unlike two-phase commit, the Saga pattern breaks the transaction into local transactions, each with a compensating action for rollback, enabling coordination across independently deployed microservices without distributed locks.
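The idea of pairing each local transaction with a compensating action can be sketched as a small generic runner. This is a minimal illustration under stated assumptions, not the template's implementation: `SagaStep`, `SagaOutcome`, and `runSaga` are hypothetical names, and the steps are synchronous in-memory stubs for brevity, whereas real service calls would be asynchronous.

```typescript
// One saga step: a local transaction plus the action that undoes it.
interface SagaStep {
  label: string;
  action: () => void; // throws to signal failure
  compensate: () => void; // should be idempotent
}

interface SagaOutcome {
  ok: boolean;
  completed: string[]; // steps whose action succeeded
  compensated: string[]; // steps undone, in the order they were undone
  failure?: string;
}

function runSaga(steps: SagaStep[]): SagaOutcome {
  const done: SagaStep[] = [];
  try {
    for (const step of steps) {
      step.action();
      done.push(step);
    }
    return { ok: true, completed: done.map((s) => s.label), compensated: [] };
  } catch (error) {
    const compensated: string[] = [];
    // Undo only the steps that finished, newest first.
    for (const step of done.slice().reverse()) {
      step.compensate();
      compensated.push(step.label);
    }
    return {
      ok: false,
      completed: done.map((s) => s.label),
      compensated,
      failure: error instanceof Error ? error.message : String(error),
    };
  }
}

// Hypothetical in-memory steps; the third one fails.
const trace: string[] = [];
const outcome = runSaga([
  { label: "reserveInventory", action: () => { trace.push("reserve"); }, compensate: () => { trace.push("release"); } },
  { label: "chargePayment", action: () => { trace.push("charge"); }, compensate: () => { trace.push("refund"); } },
  { label: "createShipment", action: () => { throw new Error("carrier unavailable"); }, compensate: () => { trace.push("cancel"); } },
]);
```

Because the shipment step never completed, its compensation is skipped; only the charge and the reservation are undone, in that order.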
## Inventory Reservation and Payment Processing Steps
The checkout saga proceeds in ordered steps:

1. Reserve inventory: the Inventory Service places a temporary hold on the requested items with a TTL (typically 10-15 minutes). This is not a hard lock; it is a soft reservation that expires if not confirmed, preventing abandoned carts from permanently blocking inventory.
2. Process payment: the Payment Service submits a charge to the external payment processor with idempotency keys to prevent double-charging on retries.
3. Confirm inventory: the Inventory Service converts the reservation into a committed deduction.
4. Create shipment: the Shipping Service generates a shipping label and estimated delivery date.
5. Send notification: the Notification Service fires a confirmation email and push notification.
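The soft reservation in the first and third steps can be sketched as an in-memory store. This is only an illustration: `ReservationStore`, its method names, and the injected `nowMs` clock are assumptions made for the example, and a production service would keep this state in a database or cache rather than in process memory.

```typescript
type HoldState = "held" | "confirmed" | "released";

interface StockHold {
  sku: string;
  quantity: number;
  expiresAtMs: number;
  state: HoldState;
}

class ReservationStore {
  private available: Record<string, number | undefined>;
  private holds: Record<string, StockHold | undefined> = {};
  private nextId = 1;

  constructor(stock: Record<string, number>, private readonly ttlMs: number) {
    this.available = { ...stock };
  }

  // Saga step 1: place a soft hold, or return null if stock is short.
  reserve(sku: string, quantity: number, nowMs: number): string | null {
    this.expireStale(nowMs);
    const current = this.available[sku] ?? 0;
    if (current < quantity) return null;
    this.available[sku] = current - quantity;
    const holdId = `hold-${this.nextId++}`;
    this.holds[holdId] = { sku, quantity, expiresAtMs: nowMs + this.ttlMs, state: "held" };
    return holdId;
  }

  // Saga step 3: turn a live hold into a committed deduction.
  confirm(holdId: string, nowMs: number): boolean {
    this.expireStale(nowMs);
    const hold = this.holds[holdId];
    if (!hold || hold.state !== "held") return false;
    hold.state = "confirmed";
    return true;
  }

  // Compensation: idempotent, because only a live hold returns stock.
  release(holdId: string): void {
    const hold = this.holds[holdId];
    if (!hold || hold.state !== "held") return;
    hold.state = "released";
    this.available[hold.sku] = (this.available[hold.sku] ?? 0) + hold.quantity;
  }

  availableCount(sku: string): number {
    return this.available[sku] ?? 0;
  }

  // Holds past their TTL give their stock back automatically.
  private expireStale(nowMs: number): void {
    for (const holdId of Object.keys(this.holds)) {
      const hold = this.holds[holdId];
      if (hold && hold.state === "held" && hold.expiresAtMs <= nowMs) this.release(holdId);
    }
  }
}

const FIFTEEN_MIN_MS = 15 * 60 * 1000;
const store = new ReservationStore({ "sku-1": 1 }, FIFTEEN_MIN_MS);
const first = store.reserve("sku-1", 1, 0); // takes the last unit
const blocked = store.reserve("sku-1", 1, 1000); // null: nothing left
const afterExpiry = store.reserve("sku-1", 1, FIFTEEN_MIN_MS); // first hold has lapsed
```

An abandoned cart therefore blocks the unit for at most one TTL, after which the next buyer's reservation succeeds without any manual cleanup.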
## Compensating Transactions and Rollback Strategy
If any step fails, the orchestrator executes compensating transactions in reverse order. A payment failure triggers an inventory release. A shipping failure triggers a payment refund and inventory release. Each compensating action is itself idempotent to handle the case where the orchestrator crashes mid-rollback and restarts. The orchestrator persists the saga state to a durable log after each step, enabling recovery from any failure point. This idempotent compensation design is what makes the Saga pattern safe for production systems where partial failures are inevitable.
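Recovery from the durable log might look like the following sketch. It assumes a simplified log format (`SagaLogEntry`) and hypothetical names (`pendingCompensations`, `resumeRollback`). An in-memory array stands in for durable storage, and the compensators are stubs that count their effects.

```typescript
type StepName = "reserveInventory" | "chargePayment" | "createShipment";

interface SagaLogEntry {
  step: StepName;
  kind: "completed" | "compensated";
}

// Steps that completed but have no "compensated" entry yet, newest first.
function pendingCompensations(log: SagaLogEntry[]): StepName[] {
  const undone = log.filter((e) => e.kind === "compensated").map((e) => e.step);
  return log
    .filter((e) => e.kind === "completed" && undone.indexOf(e.step) === -1)
    .map((e) => e.step)
    .reverse();
}

// Run on orchestrator restart for a saga that was rolling back.
function resumeRollback(log: SagaLogEntry[], compensators: Record<StepName, () => void>): void {
  for (const step of pendingCompensations(log)) {
    compensators[step](); // may be a repeat after a crash, so it must be idempotent
    log.push({ step, kind: "compensated" }); // persist progress after each step
  }
}

// Hypothetical state: shipment creation failed after the first two steps completed.
const sagaLog: SagaLogEntry[] = [
  { step: "reserveInventory", kind: "completed" },
  { step: "chargePayment", kind: "completed" },
];

let refundIssued = false;
let refundCount = 0;
let holdLive = true;
let stockReturned = 0;
const compensators: Record<StepName, () => void> = {
  reserveInventory: () => { if (holdLive) { holdLive = false; stockReturned += 1; } },
  chargePayment: () => { if (!refundIssued) { refundIssued = true; refundCount += 1; } },
  createShipment: () => {},
};

// Simulated crash: the refund ran, but the process died before logging it.
compensators.chargePayment();
// After restart the refund is replayed harmlessly and the rollback finishes.
resumeRollback(sagaLog, compensators);
```

The replayed refund is a no-op, which is the property that makes a crash between "perform compensation" and "record compensation" safe.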
## Queue-Based Order Buffering During Flash Sales
A message queue (such as RabbitMQ or Amazon SQS) sits between the orchestrator and downstream services to absorb traffic spikes. During a flash sale, the queue buffers incoming orders and the downstream services process them at their maximum sustainable throughput. Users see an "order received" confirmation immediately, with status updates pushed via WebSocket as each saga step completes. Queue depth is the key operational metric during spikes: if the queue grows faster than consumers can drain it, processing latency increases, but no orders are dropped as long as the queue stays within its storage limits. This architecture trades added processing latency for dramatically higher throughput and resilience.
The checkout flow implements an orchestration-based Saga pattern where the Order Orchestrator coordinates a multi-step distributed transaction across independent services. Unlike two-phase commit (2PC), each saga step is a local transaction with a compensating action for rollback. The critical insight is that the user receives an acknowledgment as soon as the order is accepted into the queue, not when it is fulfilled; the actual saga execution happens asynchronously.
During flash sales, the message queue absorbs traffic spikes that would otherwise overwhelm downstream services. The queue depth becomes the key metric: if it grows faster than consumers can drain it, processing latency increases, but no orders are dropped while the queue has capacity. Each downstream service processes at its maximum sustainable throughput regardless of inbound traffic.
The compensation flow is the most complex part of the design. If payment fails after inventory has been reserved, the orchestrator must release the reservation. If shipping fails after payment succeeds, the orchestrator must issue a refund AND release inventory. Every compensating action is idempotent — the orchestrator may crash and restart mid-rollback, replaying compensations safely.
## Step-by-Step Walkthrough
Pseudocode
```
// Saga Orchestrator: drives the checkout lifecycle
async function executeCheckoutSaga(order: Order):
  let reservation, payment, shipment
  try:
    // Step 1: Reserve inventory (soft hold, 15-min TTL)
    reservation = await inventoryService.reserve(
      order.items, { ttl: "15m" }
    ) // ~80ms

    // Step 2: Charge payment (idempotent)
    payment = await paymentService.charge({
      amount: order.total,
      method: order.paymentMethod,
      idempotencyKey: `order-${order.id}`
    }) // ~500ms (external processor)

    // Step 3: Confirm inventory (convert reservation → deduction)
    await inventoryService.confirm(reservation.id) // ~30ms

    // Step 4: Create shipment
    shipment = await shippingService.create({
      items: order.items, address: order.address
    }) // ~200ms

    // Step 5: Notify user
    await notificationService.send(order.userId, "order_confirmed", {
      orderId: order.id, trackingId: shipment.trackingId
    })
    order.status = "COMPLETED"
  catch (error):
    // Compensate in reverse order
    await compensate(order, { reservation, payment, shipment }, error)

async function compensate(order, refs, error):
  if refs.shipment:
    await shippingService.cancel(refs.shipment.id) // idempotent
  if refs.payment:
    await paymentService.refund(refs.payment.transactionId) // idempotent
  if refs.reservation:
    await inventoryService.release(refs.reservation.id) // idempotent
  order.status = "FAILED"
  order.failureReason = error.message
```
Choice
Orchestration-based Saga with compensating transactions
Rationale
Two-phase commit (2PC) does not scale across independently deployed microservices and creates a distributed lock that blocks all participants. The Saga pattern breaks the transaction into local transactions with compensating actions for rollback. Orchestration (vs. choreography) keeps the flow logic centralized in one service, making it easier to reason about, monitor, and debug.
Choice
Reservation-based with TTL expiry
Rationale
Hard locks on inventory rows cause deadlocks during flash sales. Reservation-based management uses optimistic concurrency: a reservation temporarily reduces available count, and if not confirmed within the TTL, the count is automatically restored. This prevents abandoned carts from blocking inventory while maintaining consistency.
Choice
Client-generated idempotency keys per order
Rationale
Network failures during payment processing can leave the system uncertain whether a charge succeeded. Idempotency keys ensure that retrying such a request does not double-charge: the caller sends the same key on every retry, and the payment service returns the original result instead of charging again. The order ID serves as a natural idempotency key (`order-${order.id}` in the pseudocode), and the payment service deduplicates on it.
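The deduplication on the payment side can be sketched as a lookup keyed by the idempotency key. This is a minimal in-memory illustration; `PaymentLedger`, `ChargeResult`, and the `txn-N` identifiers are hypothetical, and a real service would persist the key-to-result mapping durably before calling the processor.

```typescript
interface ChargeResult {
  transactionId: string;
  amountCents: number;
}

class PaymentLedger {
  private byKey: Record<string, ChargeResult | undefined> = {};
  private processorCalls = 0;

  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    const earlier = this.byKey[idempotencyKey];
    if (earlier) {
      // Same key seen before: return the stored result instead of charging again.
      if (earlier.amountCents !== amountCents) {
        throw new Error("idempotency key reused with a different amount");
      }
      return earlier;
    }
    this.processorCalls += 1; // stands in for the call to the external processor
    const result: ChargeResult = { transactionId: `txn-${this.processorCalls}`, amountCents };
    this.byKey[idempotencyKey] = result;
    return result;
  }

  chargesMade(): number {
    return this.processorCalls;
  }
}

const ledger = new PaymentLedger();
const firstTry = ledger.charge("order-42", 1999);
const retry = ledger.charge("order-42", 1999); // e.g. after a lost acknowledgment
```

Both calls return the same transaction, and the processor is contacted only once.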
Choice
Queue-based order buffering with backpressure
Rationale
During flash sales, order volume can exceed downstream service capacity by 50-100x. A message queue absorbs the burst, and each service consumes at its sustainable rate. Backpressure signals propagate upstream: when the queue reaches a depth threshold, the API gateway begins returning 'order queued' responses with estimated wait times rather than rejecting requests.
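The threshold behavior described above can be sketched as a small admission function at the gateway. The names `QueueSnapshot`, `AdmissionDecision`, and `admitOrder` are hypothetical, the HTTP 202 status is an assumption (the text does not specify one), and the wait estimate is simply queue depth divided by the current drain rate.

```typescript
interface QueueSnapshot {
  depth: number; // messages currently waiting
  drainPerSecond: number; // observed consumer throughput (assumed > 0)
}

interface AdmissionDecision {
  httpStatus: number;
  message: "order received" | "order queued";
  estimatedWaitSeconds: number;
}

function admitOrder(queue: QueueSnapshot, depthThreshold: number): AdmissionDecision {
  const estimatedWaitSeconds = Math.ceil(queue.depth / queue.drainPerSecond);
  // Past the threshold the order is still accepted, but the client is told to expect a wait.
  const message = queue.depth < depthThreshold ? "order received" : "order queued";
  return { httpStatus: 202, message, estimatedWaitSeconds };
}

const calm = admitOrder({ depth: 1000, drainPerSecond: 5000 }, 100000);
const spike = admitOrder({ depth: 2700000, drainPerSecond: 5000 }, 100000);
```

With a backlog of 2.7 million orders draining at 5,000 orders/s, the gateway would report an estimated wait of 540 seconds rather than rejecting the request.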
Choice
WebSocket push with polling fallback
Rationale
Users expect real-time updates as their order progresses through the saga steps. WebSocket connections push status changes instantly. For clients that cannot maintain WebSocket connections (corporate firewalls, mobile backgrounding), a polling endpoint provides the same information with slightly higher latency.
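One way to serve both delivery paths from the same state is a small status hub, sketched below. Everything here is an assumption for illustration: `OrderStatusHub`, the stage names, and the route mentioned in the comment are hypothetical, and plain callbacks stand in for open WebSocket connections.

```typescript
type OrderStage = "RECEIVED" | "INVENTORY_RESERVED" | "PAID" | "SHIPPED" | "COMPLETED" | "FAILED";
type StageListener = (stage: OrderStage) => void;

class OrderStatusHub {
  private latest: Record<string, OrderStage | undefined> = {};
  private listeners: Record<string, StageListener[] | undefined> = {};

  // Stands in for a client holding an open WebSocket for this order.
  subscribe(orderId: string, listener: StageListener): void {
    const existing = this.listeners[orderId] ?? [];
    existing.push(listener);
    this.listeners[orderId] = existing;
  }

  // Called by the orchestrator as each saga step completes.
  publish(orderId: string, stage: OrderStage): void {
    this.latest[orderId] = stage;
    for (const listener of this.listeners[orderId] ?? []) listener(stage);
  }

  // Polling fallback: what a GET /orders/:id/status handler (hypothetical route) would return.
  poll(orderId: string): OrderStage | undefined {
    return this.latest[orderId];
  }
}

const hub = new OrderStatusHub();
const pushed: OrderStage[] = [];
hub.subscribe("order-1", (stage) => { pushed.push(stage); });
hub.publish("order-1", "RECEIVED");
hub.publish("order-1", "PAID");
hub.publish("order-2", "RECEIVED"); // no socket open; this client must poll
```

Since every publish also updates the latest stage, a polling client sees the same information as a subscribed one, only later.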
| Metric | Target |
| --- | --- |
| Target RPS | 5,000 orders/s (peak flash sale) |
| Latency (p99) | <2s (order acceptance) |
| Storage | ~500 GB/year (order records) |
| Availability | 99.95% |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
The Saga pattern is a distributed transaction technique where a long-lived transaction is broken into a sequence of local transactions, each with a compensating action for rollback. In e-commerce, the checkout flow spans multiple services (inventory, payment, shipping) that cannot participate in a single database transaction. The Saga pattern coordinates these services with eventual consistency, rolling back completed steps if a later step fails.
Flash sales require a multi-layered approach: (1) Rate limiting at the API gateway to cap incoming request rate. (2) Queue-based order buffering to decouple acceptance from processing. (3) Reservation-based inventory with TTLs to prevent overselling without hard locks. (4) Pre-warming caches with product data before the sale starts. (5) Auto-scaling compute resources based on queue depth metrics. The key insight is accepting orders into a queue instantly and processing them asynchronously.
Overselling is prevented with an atomic conditional update rather than application-level locks. The inventory service decrements the available count only if it is greater than or equal to the requested quantity, in the style of a compare-and-swap. This is a single database statement (UPDATE ... WHERE available >= quantity), so the database serializes concurrent requests on the row, and a request that finds too little stock updates zero rows and is rejected.
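The effect of that conditional update can be illustrated with an in-memory stand-in. The `StockRow` shape and `conditionalDecrement` are hypothetical, and the SQL in the comment uses assumed table and column names. In a real system the atomicity comes from the database executing the statement as one operation; here it comes trivially from single-threaded execution.

```typescript
interface StockRow {
  sku: string;
  available: number;
}

// Mirrors the shape of:
//   UPDATE inventory SET available = available - :qty
//   WHERE sku = :sku AND available >= :qty
// Returns the number of rows affected: 1 on success, 0 when stock is insufficient.
function conditionalDecrement(row: StockRow, quantity: number): number {
  if (row.available >= quantity) {
    row.available -= quantity;
    return 1;
  }
  return 0;
}

// 100 buyers compete for the last 3 units.
const lastUnits: StockRow = { sku: "sku-1", available: 3 };
let successes = 0;
for (let buyer = 0; buyer < 100; buyer++) {
  successes += conditionalDecrement(lastUnits, 1);
}
```

Exactly three requests succeed and the count never goes negative, which is the guarantee the inventory service relies on.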
How a payment failure is handled depends on its type. If the failure is transient (timeout, network error), the orchestrator first retries with the same idempotency key; if the payment processor had confirmed the charge but the acknowledgment was lost, that key prevents double-charging on the retry. If the failure is final (for example a declined card, or retries are exhausted), the saga's compensating transactions run in reverse order: the orchestrator releases the inventory reservation (restoring the available count), marks the order as failed, and notifies the user.
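The retry behavior on the orchestrator side can be sketched as follows. `ChargeAttempt`, `chargeWithRetry`, and `flakyProcessor` are hypothetical names, the processor is a stub, and backoff between attempts is left out for brevity.

```typescript
type ChargeAttempt =
  | { kind: "success"; transactionId: string }
  | { kind: "transient"; reason: string }
  | { kind: "declined"; reason: string };

function chargeWithRetry(
  idempotencyKey: string,
  attemptCharge: (key: string) => ChargeAttempt,
  maxAttempts: number
): ChargeAttempt {
  let last: ChargeAttempt = { kind: "transient", reason: "not attempted" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = attemptCharge(idempotencyKey); // the same key on every attempt
    if (last.kind !== "transient") return last; // success or a final decline
  }
  return last; // still transient: retries exhausted, caller should compensate
}

// Stub processor that times out twice before succeeding.
const keysSeen: string[] = [];
function flakyProcessor(key: string): ChargeAttempt {
  keysSeen.push(key);
  return keysSeen.length < 3
    ? { kind: "transient", reason: "timeout" }
    : { kind: "success", transactionId: "txn-1" };
}

const finalAttempt = chargeWithRetry("order-42", flakyProcessor, 5);
```

Any result other than `success` is the signal for the orchestrator to start compensating, beginning with the inventory release.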
Asynchronous processing is strongly preferred for production e-commerce systems. Synchronous checkout holds the HTTP connection open while coordinating multiple services, creating long request chains vulnerable to cascading timeouts. Asynchronous processing accepts the order immediately (returning an order ID), buffers it in a queue, and processes the saga steps independently. Users receive real-time status updates via WebSocket or polling.
In orchestration, a central Order Orchestrator drives each step sequentially and owns the rollback logic, making the flow easy to trace, monitor, and debug from a single service. In choreography, each service listens for events and triggers the next step independently, eliminating the single coordinator but scattering the transaction logic across services. Orchestration is preferred for checkout flows because the five-step sequence has strict ordering and the failure modes are complex: a refund must be issued only if the charge actually went through, and inventory must be released only if it was actually reserved. Choreography shines for loosely coupled workflows like post-purchase analytics where ordering is less critical.
Start with the peak inbound rate: 50,000 orders/s during a flash sale. If downstream services (inventory, payment, shipping) process at a combined throughput of 5,000 orders/s, the queue grows at 45,000 orders/s. A 60-second spike fills the queue to 2.7 million orders. At approximately 2 KB per order message, that is 5.4 GB of queue storage. Processing the backlog takes 2.7M / 5,000 = 540 seconds (9 minutes), assuming inbound traffic stops when the spike ends; any traffic that continues uses part of the drain capacity and lengthens the recovery. Interviewers want to see that you can reason about queue depth as a function of arrival rate minus drain rate and translate it into concrete infrastructure requirements.
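The same arithmetic can be written as a small calculator, which makes it easy to try other rates. `SpikeScenario`, `SpikeEstimate`, and `estimateSpike` are hypothetical names. The sketch uses 2,000 bytes for the "approximately 2 KB" message and assumes no new orders arrive while the backlog drains.

```typescript
interface SpikeScenario {
  inboundPerSecond: number;
  drainPerSecond: number;
  spikeSeconds: number;
  bytesPerMessage: number;
}

interface SpikeEstimate {
  growthPerSecond: number; // arrival rate minus drain rate
  peakDepth: number; // messages queued when the spike ends
  peakBytes: number; // queue storage needed at the peak
  drainSeconds: number; // time to clear the backlog with no new arrivals
}

function estimateSpike(s: SpikeScenario): SpikeEstimate {
  const growthPerSecond = Math.max(0, s.inboundPerSecond - s.drainPerSecond);
  const peakDepth = growthPerSecond * s.spikeSeconds;
  return {
    growthPerSecond,
    peakDepth,
    peakBytes: peakDepth * s.bytesPerMessage,
    drainSeconds: peakDepth / s.drainPerSecond,
  };
}

const estimate = estimateSpike({
  inboundPerSecond: 50000,
  drainPerSecond: 5000,
  spikeSeconds: 60,
  bytesPerMessage: 2000,
});
// growthPerSecond = 45000, peakDepth = 2700000, peakBytes = 5400000000, drainSeconds = 540
```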
The primary failure mode is TTL expiry during slow payment processing: if the payment provider takes longer than the 15-minute reservation window, the inventory is released while the charge is still pending. Mitigation involves heartbeat-based TTL extension, where the orchestrator renews the reservation every 5 minutes while waiting for payment. A second failure mode is clock skew across services causing premature expiry. Keeping expiry in one centralized store (for example Redis EXPIRE) rather than comparing per-service clocks avoids this. A third risk is the orchestrator crashing between payment confirmation and inventory confirmation, leaving a charged order whose stock is still only reserved and not yet committed; idempotent retry of the confirmation step on orchestrator restart resolves this.
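The heartbeat mitigation can be sketched as a renewal that pushes the expiry forward while payment is pending. `RenewableHold`, `renewHold`, and `holdSurvivesPayment` are hypothetical names, time is simulated with plain numbers rather than timers, and the 22-minute payment is an invented scenario.

```typescript
const MINUTE_MS = 60 * 1000;

interface RenewableHold {
  expiresAtMs: number;
}

// Extends a live hold by a full TTL; an already-expired hold cannot be revived.
function renewHold(hold: RenewableHold, nowMs: number, ttlMs: number): boolean {
  if (nowMs >= hold.expiresAtMs) return false;
  hold.expiresAtMs = nowMs + ttlMs;
  return true;
}

// Simulates a hold placed at t = 0 and a payment that finishes at t = paymentMs.
// Returns true if the hold is still live when the payment completes.
function holdSurvivesPayment(paymentMs: number, ttlMs: number, heartbeatMs: number | null): boolean {
  const hold: RenewableHold = { expiresAtMs: ttlMs };
  if (heartbeatMs !== null && heartbeatMs > 0) {
    for (let t = heartbeatMs; t < paymentMs; t += heartbeatMs) {
      if (!renewHold(hold, t, ttlMs)) return false;
    }
  }
  return paymentMs < hold.expiresAtMs;
}

// A 22-minute payment against a 15-minute TTL.
const withoutHeartbeat = holdSurvivesPayment(22 * MINUTE_MS, 15 * MINUTE_MS, null);
const withHeartbeat = holdSurvivesPayment(22 * MINUTE_MS, 15 * MINUTE_MS, 5 * MINUTE_MS);
```

Without renewal the hold lapses at minute 15; with a 5-minute heartbeat the last renewal at minute 20 keeps it live until minute 35, so the confirmation step can still succeed.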