Medium · 7 components · Interview: High

E-Commerce Checkout

An e-commerce checkout system processes 5,000 orders/s during flash sales by coordinating inventory reservation, payment processing, and shipping through the Saga pattern with compensating transactions. This 7-component architecture uses queue-based order buffering to absorb 50-100x traffic spikes, idempotent payment keys to prevent double-charging, and TTL-based inventory holds that auto-expire after 15 minutes so abandoned checkouts cannot block stock indefinitely.

Transactions · Queues · Saga

Problem Statement

The e-commerce checkout system is a cornerstone system design interview problem because it exercises distributed transaction management, consistency guarantees, and graceful degradation under extreme load. Unlike simple CRUD applications, a checkout pipeline must coordinate multiple independent services — inventory, payment, shipping, and notification — in a way that guarantees atomicity: either the entire order succeeds, or all side effects are rolled back.

At scale, an e-commerce platform like Amazon or Shopify processes thousands of orders per second during peak events (Black Friday, Prime Day, flash sales). The checkout flow must handle inventory contention where hundreds of users attempt to purchase the last remaining items simultaneously. It must integrate with third-party payment processors that have their own latency profiles and failure modes. And it must provide real-time order status updates to users while maintaining data consistency across all services.

The problem becomes especially interesting during flash sales, where traffic can spike 50-100x above baseline within seconds. Naive approaches that hold locks on inventory rows for the length of a checkout serialize every buyer of a hot item and can deadlock under this load. Candidates must design reservation-based inventory management, implement saga patterns for distributed transactions, and use queue-based order processing to smooth traffic bursts without dropping orders.

This template models the complete checkout architecture: API gateway with rate limiting, an order orchestrator service, inventory service with reservation and confirmation states, payment service with idempotency, notification service for confirmation emails, and a message queue connecting the asynchronous steps. The simulation demonstrates how queue depth grows during traffic spikes and how the saga pattern handles partial failures.

Architecture Overview

## How the Checkout Saga Coordinates Distributed Services

The checkout architecture implements the Saga pattern for distributed transaction coordination. When a user submits an order, the API Gateway authenticates the request, applies rate limiting (critical during flash sales), and forwards it to the Order Orchestrator service. The orchestrator is the central coordinator that drives the multi-step checkout process. Unlike two-phase commit, the Saga pattern breaks the transaction into local transactions, each with a compensating action for rollback, enabling coordination across independently deployed microservices without distributed locks.

## Inventory Reservation and Payment Processing Steps

The checkout saga proceeds in ordered steps: (1) Reserve inventory — the Inventory Service places a temporary hold on the requested items with a TTL (typically 10-15 minutes). This is not a hard lock; it is a soft reservation that expires if not confirmed, preventing abandoned carts from permanently blocking inventory. (2) Process payment — the Payment Service submits a charge to the external payment processor with idempotency keys to prevent double-charging on retries. (3) Confirm inventory — converts the reservation into a committed deduction. (4) Create shipment — the Shipping Service generates a shipping label and estimated delivery date. (5) Send notification — the Notification Service fires a confirmation email and push notification.

## Compensating Transactions and Rollback Strategy

If any step fails, the orchestrator executes compensating transactions in reverse order. A payment failure triggers an inventory release. A shipping failure triggers a payment refund and inventory release. Each compensating action is itself idempotent to handle the case where the orchestrator crashes mid-rollback and restarts. The orchestrator persists the saga state to a durable log after each step, enabling recovery from any failure point. This idempotent compensation design is what makes the Saga pattern safe for production systems where partial failures are inevitable.
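The log-then-compensate behavior described above can be sketched as follows. This is a minimal illustration, not the template's actual orchestrator: `SagaStep`, `SagaLog`, `runSaga`, and `compensateFromLog` are hypothetical names, the log is an in-memory stand-in for a durable store, and steps are synchronous to keep the example short.

```typescript
// Illustrative only: SagaStep, SagaLog, runSaga and compensateFromLog are hypothetical names.
type SagaStep = {
  name: string;
  action: () => void;      // local transaction; throws on failure
  compensate: () => void;  // rollback for this step; must be idempotent
};

// In-memory stand-in for the durable saga log.
class SagaLog {
  private completed: string[] = [];
  private compensated: string[] = [];

  recordCompleted(step: string): void {
    this.completed.push(step);
  }

  recordCompensated(step: string): void {
    this.compensated.push(step);
  }

  // Completed steps that have not been compensated yet, newest first.
  pendingCompensations(): string[] {
    return this.completed
      .slice()
      .reverse()
      .filter((name) => this.compensated.indexOf(name) === -1);
  }
}

// Safe to call again after a restart: steps already marked compensated are skipped.
// A crash between compensate() and recordCompensated() replays that one step,
// which is why each compensation has to be idempotent.
function compensateFromLog(steps: SagaStep[], log: SagaLog): void {
  for (const name of log.pendingCompensations()) {
    const step = steps.find((s) => s.name === name);
    if (step === undefined) continue;
    step.compensate();
    log.recordCompensated(name);
  }
}

function runSaga(steps: SagaStep[], log: SagaLog): "COMPLETED" | "FAILED" {
  for (const step of steps) {
    try {
      step.action();
      log.recordCompleted(step.name);
    } catch (err) {
      compensateFromLog(steps, log);
      return "FAILED";
    }
  }
  return "COMPLETED";
}

// Example: shipping fails after inventory and payment succeeded.
const sagaTrace: string[] = [];
const checkoutSteps: SagaStep[] = [
  { name: "reserve", action: () => { sagaTrace.push("reserve"); }, compensate: () => { sagaTrace.push("release"); } },
  { name: "charge", action: () => { sagaTrace.push("charge"); }, compensate: () => { sagaTrace.push("refund"); } },
  { name: "ship", action: () => { throw new Error("label service unavailable"); }, compensate: () => { sagaTrace.push("cancel"); } },
];
const sagaLog = new SagaLog();
const sagaResult = runSaga(checkoutSteps, sagaLog);
```

In this run the trace ends with the refund followed by the inventory release, which is the reverse of the order in which the steps completed.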

## Queue-Based Order Buffering During Flash Sales

A message queue (RabbitMQ or SQS) buffers accepted orders ahead of saga processing to absorb traffic spikes. During a flash sale, the queue holds incoming orders while the orchestrator and downstream services process them at their maximum sustainable throughput. Users see an "order received" confirmation immediately, with status updates pushed via WebSocket as each saga step completes. Queue depth is the key operational metric during spikes: if it grows faster than consumers can drain it, processing latency increases but no orders are dropped, provided the queue is durable and sized for the backlog. This architecture trades added processing latency (seconds under moderate load, minutes in an extreme spike) for dramatically higher throughput and resilience.

Request Flow — Checkout Saga

The checkout flow implements an orchestration-based Saga pattern where the Order Orchestrator coordinates a multi-step distributed transaction across independent services. Unlike two-phase commit (2PC), each saga step is a local transaction with a compensating action for rollback. The critical insight is that the user receives an order confirmation immediately after the order is accepted into the queue — the actual saga execution happens asynchronously.

During flash sales, the message queue absorbs traffic spikes that would otherwise overwhelm downstream services. The queue depth becomes the key metric: if it grows faster than consumers can drain it, processing latency increases but no orders are dropped. Each downstream service processes at its maximum sustainable throughput regardless of inbound traffic.

The compensation flow is the most complex part of the design. If payment fails after inventory has been reserved, the orchestrator must release the reservation. If shipping fails after payment succeeds, the orchestrator must issue a refund AND release inventory. Every compensating action is idempotent — the orchestrator may crash and restart mid-rollback, replaying compensations safely.


Step-by-Step Walkthrough

  1. User submits checkout with items, payment method, and shipping address. The API Gateway applies flash-sale rate limiting (token bucket, 1000 req/s per product) and enqueues the order into the message queue.
  2. The gateway returns 202 Accepted with an order ID immediately; the user doesn't wait for saga completion. A WebSocket connection pushes real-time status updates as each step completes.
  3. The Order Orchestrator dequeues the order and begins the saga. Step 1, reserve inventory: the Inventory Service places a soft hold with a 15-minute TTL. If the saga doesn't confirm within 15 minutes, the reservation auto-expires.
  4. Step 2, process payment: the Payment Service charges the user's payment method with an idempotency key derived from the order ID. If the charge was already processed (retry after orchestrator crash), the payment provider returns the original result.
  5. Step 3, confirm inventory: converts the soft reservation into a committed deduction. The items are now permanently removed from available stock.
  6. Step 4, create shipment: the Shipping Service generates a shipping label and tracking number based on the delivery address and package dimensions.
  7. Step 5, send notification: the Notification Service sends an order confirmation email with the tracking ID. The orchestrator marks the order as COMPLETED.
  8. If any step fails, the orchestrator executes compensating transactions in reverse: payment failure → release inventory reservation; shipping failure → refund payment + release inventory. Each compensation is idempotent for crash safety.
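The per-product token bucket mentioned in the first step could look roughly like this. It is a sketch under assumptions: `TokenBucket` and `ProductRateLimiter` are hypothetical names, the burst capacity is assumed to equal one second of tokens, and the clock is passed in explicitly so the behavior is deterministic.

```typescript
// Illustrative sketch: TokenBucket and ProductRateLimiter are hypothetical names.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full so an initial burst is admitted
    this.lastRefillMs = nowMs;
  }

  tryAcquire(nowMs: number): boolean {
    // Refill in proportion to elapsed time, capped at capacity.
    if (nowMs > this.lastRefillMs) {
      const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per product, created lazily on first request.
class ProductRateLimiter {
  private buckets = new Map<string, TokenBucket>();
  private readonly ratePerSec: number;

  constructor(ratePerSec: number) {
    this.ratePerSec = ratePerSec;
  }

  allow(productId: string, nowMs: number): boolean {
    let bucket = this.buckets.get(productId);
    if (bucket === undefined) {
      // Assumption: burst capacity equals one second of tokens.
      bucket = new TokenBucket(this.ratePerSec, this.ratePerSec, nowMs);
      this.buckets.set(productId, bucket);
    }
    return bucket.tryAcquire(nowMs);
  }
}

// 1,500 requests for one product arrive in the same instant; 1,000 are admitted.
const limiter = new ProductRateLimiter(1000);
let admittedAtT0 = 0;
for (let i = 0; i < 1500; i++) {
  if (limiter.allow("sku-flash", 0)) admittedAtT0++;
}
```

Because buckets are keyed by product, a hot flash-sale item exhausting its bucket does not throttle checkouts for other products.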

Pseudocode

// Saga Orchestrator — drives the checkout lifecycle
async function executeCheckoutSaga(order: Order):
    let reservation, payment, shipment

    try:
        // Step 1: Reserve inventory (soft hold, 15-min TTL)
        reservation = await inventoryService.reserve(
            order.items, { ttl: "15m" }
        )   // ~80ms

        // Step 2: Charge payment (idempotent)
        payment = await paymentService.charge({
            amount: order.total,
            method: order.paymentMethod,
            idempotencyKey: `order-${order.id}`
        })   // ~500ms (external processor)

        // Step 3: Confirm inventory (convert reservation → deduction)
        await inventoryService.confirm(reservation.id)  // ~30ms

        // Step 4: Create shipment
        shipment = await shippingService.create({
            items: order.items, address: order.address
        })   // ~200ms

        // Step 5: Notify user
        await notificationService.send(order.userId, "order_confirmed", {
            orderId: order.id, trackingId: shipment.trackingId
        })

        order.status = "COMPLETED"

    catch (error):
        // Compensate in reverse order
        await compensate(order, { reservation, payment, shipment }, error)

async function compensate(order, refs, error):
    if refs.shipment:
        await shippingService.cancel(refs.shipment.id)  // idempotent
    if refs.payment:
        await paymentService.refund(refs.payment.transactionId)  // idempotent
    if refs.reservation:
        await inventoryService.release(refs.reservation.id)  // idempotent
    order.status = "FAILED"
    order.failureReason = error.message

Key Design Decisions

Transaction Coordination Pattern

Choice

Orchestration-based Saga with compensating transactions

Rationale

Two-phase commit (2PC) does not scale across independently deployed microservices and creates a distributed lock that blocks all participants. The Saga pattern breaks the transaction into local transactions with compensating actions for rollback. Orchestration (vs. choreography) keeps the flow logic centralized in one service, making it easier to reason about, monitor, and debug.

Inventory Management

Choice

Reservation-based with TTL expiry

Rationale

Holding hard locks on inventory rows for the duration of a checkout causes heavy contention, and can cause deadlocks, during flash sales. Reservation-based management uses short atomic updates instead (a form of optimistic concurrency): a reservation temporarily reduces the available count, and if it is not confirmed within the TTL, the count is automatically restored. This prevents abandoned carts from blocking inventory while maintaining consistency.
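A minimal in-memory sketch of the reservation lifecycle follows, assuming lazy expiry (expired holds are swept whenever the store is touched). `ReservationStore` and its method names are hypothetical, and a real service would keep this state in a database or cache rather than in process memory.

```typescript
// Illustrative sketch: ReservationStore and its method names are hypothetical.
type Hold = {
  sku: string;
  qty: number;
  expiresAtMs: number;
  state: "HELD" | "CONFIRMED" | "RELEASED";
};

class ReservationStore {
  private available = new Map<string, number>();
  private holds = new Map<string, Hold>();
  private nextId = 1;

  setStock(sku: string, qty: number): void {
    this.available.set(sku, qty);
  }

  availableFor(sku: string, nowMs: number): number {
    this.expireHolds(nowMs);
    return this.available.get(sku) ?? 0;
  }

  // Returns a reservation ID, or null when stock is insufficient.
  reserve(sku: string, qty: number, ttlMs: number, nowMs: number): string | null {
    const free = this.availableFor(sku, nowMs);
    if (free < qty) return null;
    this.available.set(sku, free - qty);
    const id = `res-${this.nextId++}`;
    this.holds.set(id, { sku, qty, expiresAtMs: nowMs + ttlMs, state: "HELD" });
    return id;
  }

  // Converts a live hold into a committed deduction; false if it has expired or was released.
  confirm(id: string, nowMs: number): boolean {
    this.expireHolds(nowMs);
    const hold = this.holds.get(id);
    if (hold === undefined || hold.state === "RELEASED") return false;
    hold.state = "CONFIRMED";
    return true;
  }

  // Idempotent compensation: calling it twice restores the count only once.
  release(id: string): void {
    const hold = this.holds.get(id);
    if (hold === undefined || hold.state === "RELEASED") return;
    hold.state = "RELEASED";
    this.available.set(hold.sku, (this.available.get(hold.sku) ?? 0) + hold.qty);
  }

  // Only unconfirmed holds expire; confirmed deductions stay committed.
  private expireHolds(nowMs: number): void {
    this.holds.forEach((hold, id) => {
      if (hold.state === "HELD" && hold.expiresAtMs <= nowMs) this.release(id);
    });
  }
}

const FIFTEEN_MIN_MS = 15 * 60 * 1000;
const store = new ReservationStore();
store.setStock("sku-1", 1);
const firstHold = store.reserve("sku-1", 1, FIFTEEN_MIN_MS, 0);     // holds the last unit
const secondHold = store.reserve("sku-1", 1, FIFTEEN_MIN_MS, 1000); // null while the hold is live
```

Once the 15-minute TTL passes without a confirmation, the unit becomes available again and a late `confirm` on the first hold is rejected.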

Payment Idempotency

Choice

Client-generated idempotency keys per order

Rationale

Network failures during payment processing can leave the system uncertain whether a charge succeeded. Idempotency keys ensure that retrying a failed payment request does not result in double-charging. The order ID serves as a natural idempotency key, and the payment service deduplicates based on it.
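The deduplication described here can be illustrated with a small stand-in for the payment side. `FakePaymentProcessor` is a hypothetical class, not any real provider's API. Rejecting a reused key whose amount differs is an assumption of this sketch rather than something the template specifies.

```typescript
// Illustrative sketch: FakePaymentProcessor is a hypothetical stand-in for a payment provider.
type ChargeResult = { transactionId: string; amountCents: number };

class FakePaymentProcessor {
  private results = new Map<string, ChargeResult>();
  chargeCount = 0; // number of charges actually executed

  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    const existing = this.results.get(idempotencyKey);
    if (existing !== undefined) {
      // Assumption: a reused key with different parameters is a caller bug.
      if (existing.amountCents !== amountCents) {
        throw new Error("idempotency key reused with a different amount");
      }
      return existing; // replay the original result without charging again
    }
    this.chargeCount += 1;
    const result: ChargeResult = { transactionId: `txn-${this.chargeCount}`, amountCents };
    this.results.set(idempotencyKey, result);
    return result;
  }
}

const processor = new FakePaymentProcessor();
const orderId = "1001";
const firstAttempt = processor.charge(`order-${orderId}`, 4999);
const retryAttempt = processor.charge(`order-${orderId}`, 4999); // e.g. retry after a timeout
```

Both attempts return the same transaction ID, and only one charge is recorded.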

Traffic Spike Management

Choice

Queue-based order buffering with backpressure

Rationale

During flash sales, order volume can exceed downstream service capacity by 50-100x. A message queue absorbs the burst, and each service consumes at its sustainable rate. Backpressure signals propagate upstream: when the queue reaches a depth threshold, the API gateway begins returning 'order queued' responses with estimated wait times rather than rejecting requests.
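One way to express the depth-threshold rule is a small admission function. The names (`admit`, `AdmissionDecision`) and the threshold value are hypothetical, and the wait estimate simply assumes the queue drains at a constant rate.

```typescript
// Illustrative sketch: admit() and the threshold value are hypothetical.
type AdmissionDecision =
  | { kind: "accepted" }
  | { kind: "queued"; estimatedWaitSec: number };

function admit(
  queueDepth: number,
  drainRatePerSec: number,
  depthThreshold: number
): AdmissionDecision {
  if (queueDepth < depthThreshold) {
    return { kind: "accepted" };
  }
  // Above the threshold the order is still accepted into the queue,
  // but the response carries an estimated wait instead of an implied instant result.
  const estimatedWaitSec = Math.ceil(queueDepth / drainRatePerSec);
  return { kind: "queued", estimatedWaitSec };
}

const calm = admit(1000, 5000, 50000);    // well under the threshold
const spike = admit(2700000, 5000, 50000); // deep backlog during a flash sale
```

With a backlog of 2.7 million orders and a drain rate of 5,000 orders/s, the estimate comes out to 540 seconds.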

Order Status Communication

Choice

WebSocket push with polling fallback

Rationale

Users expect real-time updates as their order progresses through the saga steps. WebSocket connections push status changes instantly. For clients that cannot maintain WebSocket connections (corporate firewalls, mobile backgrounding), a polling endpoint provides the same information with slightly higher latency.
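A rough sketch of how push and polling can share one source of truth is shown below. `OrderStatusFeed` is a hypothetical in-process class: `subscribe` stands in for a WebSocket connection and `poll` for the polling endpoint. Status names other than COMPLETED and FAILED are illustrative.

```typescript
// Illustrative sketch: OrderStatusFeed is a hypothetical in-process stand-in.
type StatusListener = (status: string) => void;

class OrderStatusFeed {
  private latest = new Map<string, string>();
  private listeners = new Map<string, StatusListener[]>();

  // Called by the orchestrator after each saga step.
  publish(orderId: string, status: string): void {
    this.latest.set(orderId, status);
    (this.listeners.get(orderId) ?? []).forEach((listener) => listener(status));
  }

  // Push path (stands in for a WebSocket); returns an unsubscribe function.
  subscribe(orderId: string, listener: StatusListener): () => void {
    const current = this.listeners.get(orderId) ?? [];
    this.listeners.set(orderId, current.concat([listener]));
    return () => {
      const remaining = (this.listeners.get(orderId) ?? []).filter((l) => l !== listener);
      this.listeners.set(orderId, remaining);
    };
  }

  // Fallback path (stands in for the polling endpoint).
  poll(orderId: string): string | undefined {
    return this.latest.get(orderId);
  }
}

const feed = new OrderStatusFeed();
const pushed: string[] = [];
const unsubscribe = feed.subscribe("order-42", (status) => { pushed.push(status); });
feed.publish("order-42", "INVENTORY_RESERVED");
unsubscribe();                    // e.g. the socket dropped
feed.publish("order-42", "PAID"); // missed by push, still visible to polling
```

Because every update is written to the latest-status map before it is pushed, a client that loses its connection can recover the current state with a single poll.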

Scale & Performance

Target RPS: 5,000 orders/s (peak flash sale)
Latency (p99): <2s (order acceptance)
Storage: ~500 GB/year (order records)
Availability: 99.95%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

What is the Saga pattern and why is it used in e-commerce?

The Saga pattern is a distributed transaction technique where a long-lived transaction is broken into a sequence of local transactions, each with a compensating action for rollback. In e-commerce, the checkout flow spans multiple services (inventory, payment, shipping) that cannot participate in a single database transaction. The Saga pattern coordinates these services with eventual consistency, rolling back completed steps if a later step fails.

How do you handle flash sale traffic spikes in an e-commerce system?

Flash sales require a multi-layered approach: (1) Rate limiting at the API gateway to cap incoming request rate. (2) Queue-based order buffering to decouple acceptance from processing. (3) Reservation-based inventory with TTLs to prevent overselling without hard locks. (4) Pre-warming caches with product data before the sale starts. (5) Auto-scaling compute resources based on queue depth metrics. The key insight is accepting orders into a queue instantly and processing them asynchronously.

How do you prevent overselling during high-concurrency checkout?

Overselling prevention uses optimistic concurrency control with atomic inventory operations. The inventory service performs an atomic compare-and-swap: decrement available count only if it is greater than or equal to the requested quantity. This is implemented as a single atomic database operation (UPDATE ... WHERE available >= quantity) that serializes concurrent requests at the database level without application-level locks.
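The conditional update can be modeled in a few lines. This sketch mimics the semantics of `UPDATE ... WHERE available >= quantity` in memory. `InventoryTable` and `decrementIfAvailable` are hypothetical names, and the atomicity here comes from single-threaded execution, whereas a database provides it by serializing writes to the row.

```typescript
// Illustrative sketch: models the semantics of
//   UPDATE inventory SET available = available - :qty
//   WHERE sku = :sku AND available >= :qty
// InventoryTable and decrementIfAvailable are hypothetical names.
class InventoryTable {
  private available = new Map<string, number>();

  setStock(sku: string, qty: number): void {
    this.available.set(sku, qty);
  }

  stockOf(sku: string): number {
    return this.available.get(sku) ?? 0;
  }

  // Returns the number of rows affected: 1 on success, 0 when stock is insufficient.
  decrementIfAvailable(sku: string, qty: number): number {
    const current = this.available.get(sku);
    if (current === undefined || current < qty) return 0;
    this.available.set(sku, current - qty);
    return 1;
  }
}

// 100 buyers race for the last unit; exactly one update affects a row.
const table = new InventoryTable();
table.setStock("sku-last", 1);
let winners = 0;
for (let buyer = 0; buyer < 100; buyer++) {
  winners += table.decrementIfAvailable("sku-last", 1);
}
```

The caller treats an affected-row count of zero as "sold out" and fails the reservation step, so the available count never goes negative.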

What happens if the payment service fails mid-transaction?

Payment failures trigger the saga's compensating transactions in reverse order. The orchestrator releases the inventory reservation (restoring the available count), marks the order as failed, and notifies the user. If the failure is transient (timeout, network error), the orchestrator retries with the same idempotency key. If the payment processor confirms the charge but the acknowledgment is lost, the idempotency key prevents double-charging on retry.
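The lost-acknowledgment case can be sketched as a retry loop that reuses one key. Everything here is hypothetical (`chargeWithRetry`, `transientError`, and the deliberately flaky `flakyCharge`). A production version would also add backoff between attempts, which is omitted to keep the example synchronous.

```typescript
// Illustrative sketch: all names here are hypothetical.
function transientError(message: string): Error {
  const err = new Error(message);
  err.name = "TransientError";
  return err;
}

function isTransient(err: unknown): boolean {
  return err instanceof Error && err.name === "TransientError";
}

// Retries transient failures with the SAME idempotency key; rethrows anything else
// so the saga can run its compensations.
function chargeWithRetry<T>(
  charge: (idempotencyKey: string) => T,
  idempotencyKey: string,
  maxAttempts: number
): T {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return charge(idempotencyKey);
    } catch (err) {
      if (!isTransient(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}

// A processor that commits the charge but loses the first acknowledgment.
const ledger = new Map<string, string>(); // idempotency key -> transaction ID
let ackAlreadyLost = false;
function flakyCharge(key: string): string {
  let txn = ledger.get(key);
  if (txn === undefined) {
    txn = `txn-${ledger.size + 1}`;
    ledger.set(key, txn); // the charge is committed here
  }
  if (!ackAlreadyLost) {
    ackAlreadyLost = true;
    throw transientError("timed out waiting for the processor");
  }
  return txn;
}

const retriedTxn = chargeWithRetry(flakyCharge, "order-42", 3);
```

The first attempt commits the charge and then times out. The second attempt finds the existing entry under the same key and returns the original transaction instead of charging again.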

Should you use synchronous or asynchronous checkout processing?

Asynchronous processing is strongly preferred for production e-commerce systems. Synchronous checkout holds the HTTP connection open while coordinating multiple services, creating long request chains vulnerable to cascading timeouts. Asynchronous processing accepts the order immediately (returning an order ID), buffers it in a queue, and processes the saga steps independently. Users receive real-time status updates via WebSocket or polling.

How would you explain the difference between orchestration and choreography sagas to an interviewer?

In orchestration, a central Order Orchestrator drives each step sequentially and owns the rollback logic, making the flow easy to trace, monitor, and debug from a single service. In choreography, each service listens for events and triggers the next step independently, eliminating the single coordinator but scattering the transaction logic across services. Orchestration is preferred for checkout flows because the 4-5 step sequence has strict ordering and the failure modes are complex: a refund must only be issued if the charge actually succeeded, and inventory must only be released if it was actually reserved. Choreography shines for loosely coupled workflows like post-purchase analytics where ordering is less critical.

How would you estimate the queue depth during a flash sale in an interview?

Start with an assumed peak inbound rate, for example 50,000 orders/s during a flash sale. If downstream services (inventory, payment, shipping) sustain an end-to-end throughput of 5,000 orders/s, the queue grows at 45,000 orders/s. A 60-second spike fills the queue to 2.7 million orders. At approximately 2 KB per order message, that is 5.4 GB of queue storage. Once the spike ends, draining the backlog takes 2.7M / 5,000 = 540 seconds (9 minutes), assuming new arrivals are negligible while it drains. Interviewers want to see that you can reason about queue depth as a function of arrival rate minus drain rate and translate it into concrete infrastructure requirements.
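The same estimate can be written as a small function so the inputs are easy to vary. `estimateBacklog` is a hypothetical helper. It takes 2 KB as 2,000 bytes and assumes arrivals stop once the spike ends.

```typescript
// Illustrative sketch: estimateBacklog is a hypothetical helper.
type BacklogEstimate = {
  growthPerSec: number; // net queue growth while the spike lasts
  peakDepth: number;    // messages queued when the spike ends
  storageGB: number;    // queue storage at peak depth
  drainSec: number;     // time to clear the backlog afterwards
};

function estimateBacklog(
  arrivalPerSec: number,
  drainPerSec: number,
  spikeSec: number,
  bytesPerMessage: number
): BacklogEstimate {
  const growthPerSec = Math.max(0, arrivalPerSec - drainPerSec);
  const peakDepth = growthPerSec * spikeSec;
  const storageGB = (peakDepth * bytesPerMessage) / 1e9;
  // Assumes no new arrivals while the backlog drains.
  const drainSec = peakDepth / drainPerSec;
  return { growthPerSec, peakDepth, storageGB, drainSec };
}

// 50,000 orders/s in, 5,000 orders/s out, 60-second spike, ~2 KB per message.
const flashSale = estimateBacklog(50000, 5000, 60, 2000);
```

For these inputs the function reproduces the figures above: growth of 45,000 orders/s, a peak depth of 2.7 million orders, about 5.4 GB of storage, and a 540-second drain.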

What are the failure modes of inventory reservation with TTLs and how do you mitigate them?

The primary failure mode is TTL expiry during slow payment processing: if the payment provider takes longer than the 15-minute reservation window, the inventory is released while the charge is still pending. Mitigation involves heartbeat-based TTL extension, where the orchestrator renews the reservation every 5 minutes while waiting for payment. A second failure mode is clock skew across services causing premature expiry. Using a centralized TTL service (Redis EXPIRE) rather than per-service clocks sidesteps this, because a single clock decides when a hold expires. A third risk is the orchestrator crashing between payment confirmation and inventory confirmation, leaving a charged order whose reservation is still unconfirmed and can expire; idempotent retry of the confirmation step on orchestrator restart resolves this.
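Heartbeat-based extension can be simulated with explicit timestamps. `RenewableHold` and `holdSurvivesPayment` are hypothetical names. The cap on how long renewals may continue (`maxRenewMs`) is an added assumption, included so that a stuck saga cannot hold stock forever.

```typescript
// Illustrative sketch: RenewableHold and holdSurvivesPayment are hypothetical names.
class RenewableHold {
  private expiresAtMs: number;

  constructor(nowMs: number, ttlMs: number) {
    this.expiresAtMs = nowMs + ttlMs;
  }

  isLive(nowMs: number): boolean {
    return nowMs < this.expiresAtMs;
  }

  // Extends only a live hold; an expired hold may already have been given to another buyer.
  extend(nowMs: number, ttlMs: number): boolean {
    if (!this.isLive(nowMs)) return false;
    this.expiresAtMs = nowMs + ttlMs;
    return true;
  }
}

// Simulates waiting paymentMs for the processor while renewing every heartbeatMs.
// Renewals stop after maxRenewMs (an assumed safety cap, not part of the template).
function holdSurvivesPayment(
  paymentMs: number,
  ttlMs: number,
  heartbeatMs: number,
  maxRenewMs: number
): boolean {
  if (heartbeatMs <= 0) throw new Error("heartbeatMs must be positive");
  const hold = new RenewableHold(0, ttlMs);
  for (let t = heartbeatMs; t < paymentMs && t <= maxRenewMs; t += heartbeatMs) {
    hold.extend(t, ttlMs);
  }
  return hold.isLive(paymentMs);
}

const MIN_MS = 60 * 1000;
// A 20-minute payment with a 15-minute TTL, renewed every 5 minutes.
const withHeartbeat = holdSurvivesPayment(20 * MIN_MS, 15 * MIN_MS, 5 * MIN_MS, 60 * MIN_MS);
// The same payment with no renewal before it completes.
const withoutHeartbeat = holdSurvivesPayment(20 * MIN_MS, 15 * MIN_MS, 30 * MIN_MS, 60 * MIN_MS);
```

With renewals the hold is still live when the payment returns. Without them it expires at the 15-minute mark and the unit can be sold to someone else while the charge is pending.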
