The simplest e-commerce checkout: a single monolithic service wrapping the entire checkout flow in one PostgreSQL ACID transaction. Demonstrates why payment processing inside a database transaction is an anti-pattern at scale.
E-commerce checkout is one of the most frequently asked system design interview questions because it combines transactional correctness, concurrent access control, payment integration, and flash-sale scalability into a single problem. When a shopper clicks 'Place Order,' the system must atomically reserve inventory, process payment, and create the order record. If any step fails, all previous steps must be undone. The naive approach solves this with a single ACID database transaction — elegant in theory, catastrophic at scale.
The fundamental tension in checkout design is between transactional correctness and throughput. ACID transactions guarantee that either all steps complete or none do — there can never be a partial order where inventory is decremented but payment fails. This guarantee comes from holding row-level locks on inventory rows for the duration of the transaction. The problem is that payment processing (calling Stripe or PayPal) takes 200-500 milliseconds, and the database locks are held for that entire duration.
During a flash sale, where thousands of shoppers compete for the same limited-stock item, this lock-holding pattern becomes the bottleneck. The 1,000th shopper waiting to buy the same SKU is blocked behind 999 other transactions, each holding locks for 200-500ms. The cumulative lock wait time exceeds any reasonable timeout. The database connection pool exhausts, the service queue fills, and the system effectively stops processing orders — a total failure triggered by the very traffic the business wants most.
This template makes these failure modes visible in simulation. By running at increasing concurrency levels, you can observe the exact inflection point where lock contention degrades throughput, how connection pool exhaustion cascades into service-level failures, and why the checkout p99 latency explodes non-linearly. The comparison with the Saga Orchestrator and Distributed Saga variants quantifies the dramatic improvement from moving payment processing outside the transaction.
Amazon, Shopify, Stripe, and Square all ask variants of this question in their system design interviews. Interviewers expect candidates to identify the lock contention anti-pattern, propose separating payment from the inventory transaction, discuss reservation-based inventory management, and reason about consistency models. This template provides concrete simulation data to support those discussions with measurable evidence.
The naive e-commerce checkout is a simple five-component linear architecture: Shopper Client, Load Balancer, Monolith Service, PostgreSQL Database, and Redis Cache (for product catalog only). There is no message queue, no separate inventory service, no async payment processing, and no saga orchestration.
All traffic enters through the Load Balancer, which distributes requests across Monolith Service pods using round-robin. The Load Balancer adds approximately 1.5ms of routing latency and handles up to 15,000 RPS — well above the 10,000 peak — so it is never the bottleneck. The database will saturate long before the load balancer does.
The Monolith Service handles everything: product browsing, cart validation, inventory management, payment processing, and order creation. For product browsing (70% of traffic), it reads from the Redis product cache (85% hit rate) or falls through to PostgreSQL. This path is fast and scales well. The bottleneck is the checkout path (20% of traffic), which executes a five-step ACID transaction: BEGIN TRANSACTION → validate cart items exist → SELECT ... FOR UPDATE on inventory rows (acquires exclusive row-level locks) → UPDATE inventory SET stock = stock - quantity → call Stripe/PayPal API (200-500ms, locks held the entire time) → INSERT INTO orders → COMMIT.
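The traffic split above implies a rough load on each path. The following is a back-of-the-envelope sketch, assuming the 10,000 RPS peak quoted for the load balancer applies to the whole system; the function name `path_load` and its parameters are illustrative and not part of the template.

```python
def path_load(peak_rps: float, browse_share: float, checkout_share: float,
              cache_hit_rate: float) -> dict[str, float]:
    """Estimate requests per second reaching each backend path."""
    browse_rps = peak_rps * browse_share
    return {
        "browse_rps": browse_rps,
        # Only cache misses fall through to PostgreSQL.
        "browse_db_reads_rps": browse_rps * (1.0 - cache_hit_rate),
        "checkout_rps": peak_rps * checkout_share,
    }


load = path_load(peak_rps=10_000, browse_share=0.70,
                 checkout_share=0.20, cache_hit_rate=0.85)
for name, rps in load.items():
    print(f"{name}: {rps:.0f}")
```

Under these assumptions roughly 1,050 browse reads per second reach PostgreSQL while about 2,000 checkout attempts per second arrive, each of which wants a lock-holding transaction.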
The critical insight is the SELECT ... FOR UPDATE pattern. This PostgreSQL clause acquires an exclusive row-level lock on the selected rows. Plain SELECT statements can still read them (PostgreSQL uses MVCC), but any other transaction that tries to lock, update, or delete those rows blocks until the current transaction commits or rolls back. Because every checkout issues the same SELECT ... FOR UPDATE, checkouts for the same SKU run one at a time. When payment processing takes 350ms on average, each checkout holds these locks for at least 350ms. With 100 concurrent checkouts on the same SKU, the 100th transaction waits roughly 35 seconds just for its lock, which is completely unacceptable.
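The 35-second figure is simple queueing arithmetic. A minimal sketch, assuming every checkout holds the lock for the same average time and that all checkouts target the same SKU; `worst_case_lock_wait_s` is a hypothetical helper name.

```python
def worst_case_lock_wait_s(concurrent_checkouts: int, lock_hold_ms: float) -> float:
    """Approximate wait for the last transaction in a fully serialized queue.

    Follows the text's approximation: the n-th transaction waits n * hold time.
    Strictly it queues behind n - 1 holders, so this overestimates by one hold.
    """
    return concurrent_checkouts * lock_hold_ms / 1000.0


for n in (10, 100, 1000):
    print(n, worst_case_lock_wait_s(n, 350))
# prints:
# 10 3.5
# 100 35.0
# 1000 350.0
```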
PostgreSQL stores four tables: products (catalog), inventory (stock counts), orders (order records), and payments (payment records). All four tables live on a single database instance with no read replicas. Product browsing queries (GROUP BY category, full-text search) compete with checkout writes (UPDATE inventory, INSERT orders) for the same I/O bandwidth, buffer pool, and connection pool. Under flash-sale load, checkout transactions monopolize the connection pool, starving product browsing — shoppers cannot even browse products while others are checking out.
The architecture has zero redundancy at the data layer. If PostgreSQL fails, the entire system is down — no browsing, no checkout, no order status. There is no deduplication mechanism for payment retries (a network timeout during the Stripe call could lead to a double-charge if the shopper retries), no circuit breaker for payment provider outages (the system keeps trying and holding locks), and no async processing of any kind.
This sequence diagram traces the life of a single checkout from the shopper's click to the database commit. The critical insight is that the database transaction spans the entire flow, including the external payment API call (200-500ms). Row-level locks acquired by SELECT ... FOR UPDATE are held for the full duration. Any other transaction attempting to check out the same SKU blocks until this one completes.
The degradation under concurrency is the key teaching moment: at 10 concurrent checkouts on the same SKU, the last one waits ~3.5 seconds. At 100 concurrent, it waits ~35 seconds. The system breaks not because any single component fails, but because lock serialization creates a queue whose worst-case wait grows linearly with concurrency. The total waiting time summed over all queued checkouts therefore grows quadratically, and once waits exceed timeouts, retries and connection pool exhaustion push latency up faster still.
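The serialization effect can be reproduced without a database. This is a toy sketch, assuming a `threading.Lock` is an acceptable stand-in for the inventory row lock and a shortened `time.sleep` stands in for the payment call; it is not the simulator this template uses.

```python
import threading
import time


def simulate_hot_sku(concurrent_checkouts: int, hold_s: float) -> list[float]:
    """Return per-checkout completion times (seconds since start), sorted."""
    row_lock = threading.Lock()          # stands in for the inventory row lock
    finished: list[float] = []
    finished_guard = threading.Lock()    # protects the results list
    start = time.monotonic()

    def checkout() -> None:
        with row_lock:                   # SELECT ... FOR UPDATE
            time.sleep(hold_s)           # payment call while the lock is held
        with finished_guard:
            finished.append(time.monotonic() - start)

    threads = [threading.Thread(target=checkout) for _ in range(concurrent_checkouts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(finished)


times = simulate_hot_sku(concurrent_checkouts=20, hold_s=0.005)
print(f"first done after {times[0]:.3f}s, last done after {times[-1]:.3f}s")
```

Even though all 20 threads start together, the last one cannot finish before 20 lock holds have elapsed, which is the same shape as the 3.5-second and 35-second figures above.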
Step-by-Step Walkthrough
Pseudocode
    // Checkout handler: THE anti-pattern (payment inside the DB transaction)
    // cart_total is assumed to be computed from cart_items before this call
    async function handleCheckout(cart_items, payment_method):
        tx = await db.beginTransaction()
        try:
            // Step 1: Acquire exclusive row locks on inventory
            inventory = await tx.execute(
                "SELECT * FROM inventory WHERE sku_id = ANY($1) FOR UPDATE",
                [cart_items.map(i => i.sku_id)]
            )  // Blocks if another txn holds locks on these rows

            // Step 2: Validate stock availability (inventory keyed by sku_id)
            for item in cart_items:
                if inventory[item.sku_id].stock < item.quantity:
                    await tx.rollback()  // Releases locks
                    return 409  // Insufficient stock

            // Step 3: Decrement stock (still holding locks)
            for item in cart_items:
                await tx.execute(
                    "UPDATE inventory SET stock = stock - $1 WHERE sku_id = $2",
                    [item.quantity, item.sku_id]
                )  // ~50ms total for batch

            // Step 4: Charge payment, 200-500ms WHILE HOLDING DB LOCKS
            payment = await stripe.charge({
                amount: cart_total,
                payment_method: payment_method
            })  // This is the scalability killer

            // Step 5: Create order + payment records
            order = await tx.execute(
                "INSERT INTO orders (id, items, total, payment_id, status)
                 VALUES ($1, $2, $3, $4, 'CREATED')", [...])
            await tx.execute(
                "INSERT INTO payments (id, order_id, amount, provider_txn_id)
                 VALUES ($1, $2, $3, $4)", [...])

            // Step 6: Commit, which releases all row locks
            await tx.commit()
            return 200  // ~800ms at low load, 30+ seconds under contention
        catch (error):
            // Releases locks and restores inventory; does NOT undo a charge that already succeeded
            await tx.rollback()
            return 500

All four tables live on a single PostgreSQL instance. The inventory table is the contention hotspot: its rows are locked via SELECT ... FOR UPDATE during every checkout. The orders and payments tables are write-once per checkout. The products table is read-heavy and mostly served from the Redis cache.
The critical observation is that there is no reservation table, no outbox table, no saga_state table — the ACID transaction eliminates the need for any of these intermediate state management mechanisms. The simplicity is appealing but comes at the cost of scalability.
Step-by-Step Walkthrough
Choice
All checkout steps in one PostgreSQL transaction with SELECT ... FOR UPDATE
Rationale
This is the simplest possible consistency model: if any step fails, the entire transaction rolls back. No compensating transactions, no eventual consistency, no orphaned state. The trade-off is that row-level locks are held for the entire transaction duration (200-500ms due to payment processing), which serializes all concurrent checkouts for the same SKU and creates catastrophic lock contention under flash-sale traffic.
Choice
Call Stripe/PayPal API within the database transaction
Rationale
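Keeping payment inside the transaction means the database never commits an inventory decrement without a successful payment, and never records a payment without an order. Inside the database there is no partial state. The guarantee stops at the database boundary, however: a charge that Stripe has already accepted cannot be rolled back by the transaction, so a failure after the payment call can still leave the customer billed without an order (see the double-charge failure scenario). The cost is also enormous: a 200-500ms external API call holds database locks for 10-20x longer than a typical database operation, reducing effective throughput by an order of magnitude.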
Choice
Read inventory directly from PostgreSQL, not Redis
Rationale
Inventory must be strongly consistent to prevent overselling. A cached inventory count could show stock available when the database is actually at zero, leading to an oversell that requires manual refunds and customer service intervention. The trade-off is that every checkout hits the database for inventory reads, adding latency and load to the already-contended database.
Choice
One service handles browsing, checkout, and order management
Rationale
A monolith eliminates distributed systems complexity: no service discovery, no RPC, no distributed tracing. One deployment unit, one log stream, one health check. The cost is that all workloads share the same thread pool and connection pool — a flash-sale checkout spike starves product browsing, and a slow product query can delay checkout responses.
Target throughput: ~100 orders/min (ceiling)
Latency (p99): 800ms-3s per checkout
Storage: ~200 GB/year at moderate traffic
Availability: ~99% (single DB instance, no redundancy)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Checkout (ACID transaction) | O(k) per checkout where k = items in cart | O(1) per transaction (row-level locks) | Lock hold time is O(1) per SKU but payment processing adds 200-500ms constant. Total lock time = k * lock_acquisition + payment_latency. |
| Product browsing (cache hit) | O(1) Redis GET | O(n) for n cached products | 85% hit rate. Miss triggers O(log n) B-tree index scan on PostgreSQL. |
| Flash sale contention | O(c * payment_latency) where c = concurrent checkouts per SKU | O(c) lock queue entries | Worst-case wait grows linearly in c: at c=100, wait time = 100 * 350ms = 35 seconds. Total wait summed over the queue grows as O(c^2 * payment_latency). Connection pool exhaustion at c > pool_size. |
inventory: Stock counts per SKU with row-level locking for concurrent access control. The SELECT ... FOR UPDATE pattern acquires exclusive locks during checkout, preventing overselling but serializing all concurrent transactions for the same SKU. At 200-500ms lock hold time per checkout (payment processing), this table is the primary bottleneck.
Indexes: PK on sku_id, idx_product ON (product_id)
Notes: SELECT ... FOR UPDATE on sku_id during checkout holds row-level lock for 200-500ms. At 100+ concurrent checkouts on the same SKU, lock wait time exceeds 30 seconds.
orders: Order records created within the same ACID transaction as inventory decrement and payment charge. Each row represents a completed checkout with all line items stored as JSONB. The order is only visible after the transaction commits; there is no intermediate 'pending' state.
Indexes: PK on order_id, idx_user_orders ON (user_id, created_at DESC)
Notes: Status transitions: CREATED (on commit) → SHIPPED → DELIVERED. No PENDING state because the transaction either commits fully or rolls back.
payments: Payment records from Stripe/PayPal created within the checkout transaction. The payment API call happens while database locks are held, making this the most expensive step in the transaction. No idempotency key means a network timeout during payment can lead to double-charging if the shopper retries.
Indexes: PK on payment_id, idx_order_payment ON (order_id)
Notes: No idempotency_key column, a gap that the Saga variant addresses. Double-charge risk on network timeout + retry.
products: Product catalog table. Read-heavy workload served mostly from Redis cache (85% hit rate). Changes infrequently (admin updates). Not involved in checkout lock contention.
Indexes: PK on product_id, idx_category ON (category_id)
Notes: 85% of reads served from Redis cache with 300s TTL. DB fallback for cache misses and cold starts.
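The four tables and their indexes can be sketched as DDL. This is an illustrative schema only: it runs against in-memory SQLite as a stand-in for PostgreSQL, the index names and key columns come from the descriptions above, and every other column, type, and constraint is an assumption (the pseudocode, for instance, names the order key `id` rather than `order_id`).

```python
import sqlite3

# SQLite stand-in for PostgreSQL; columns beyond those named in the text are assumed.
SCHEMA = """
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL,
    name        TEXT NOT NULL
);
CREATE INDEX idx_category ON products (category_id);

CREATE TABLE inventory (
    sku_id     INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products (product_id),
    stock      INTEGER NOT NULL
);
CREATE INDEX idx_product ON inventory (product_id);

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    items      TEXT NOT NULL,          -- JSONB in PostgreSQL
    total      INTEGER NOT NULL,       -- minor currency units (assumption)
    payment_id INTEGER,
    status     TEXT NOT NULL DEFAULT 'CREATED',
    created_at TEXT NOT NULL
);
CREATE INDEX idx_user_orders ON orders (user_id, created_at DESC);

CREATE TABLE payments (
    payment_id      INTEGER PRIMARY KEY,
    order_id        INTEGER NOT NULL REFERENCES orders (order_id),
    amount          INTEGER NOT NULL,
    provider_txn_id TEXT NOT NULL      -- note: no idempotency_key column
);
CREATE INDEX idx_order_payment ON payments (order_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
index_names = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_%'"
    )
)
print(index_names)
```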
Payment provider (Stripe) goes down for 30 seconds
Impact
All checkout transactions hang for 30 seconds holding database locks. Connection pool exhausts within 5 seconds. Product browsing degrades because all connections are held by pending checkouts. After timeout, all transactions roll back simultaneously, causing a thundering herd of retries.
Mitigation
The Saga variant moves payment to an async worker with its own connection pool and circuit breaker. Payment outages affect only the payment step, not inventory or browsing.
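A circuit breaker of the kind mentioned in this mitigation can be sketched in a few lines. This is a minimal, single-threaded illustration; the class name, the threshold of 3 failures, and the 30-second reset are assumptions, not values taken from the Saga variant.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("payment provider circuit is open")
            self._opened_at = None            # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        self._failures = 0                    # success closes the circuit
        return result


def failing_charge() -> str:
    raise TimeoutError("provider timed out")


now = [0.0]                                   # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0, clock=lambda: now[0])
outcomes = []
for _ in range(5):
    try:
        breaker.call(failing_charge)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)
```

After three timeouts the breaker rejects calls immediately, so a checkout fails fast instead of waiting on the provider while holding resources.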
Flash sale: 1,000 users checkout the same limited SKU simultaneously
Impact
SELECT ... FOR UPDATE serializes all 1,000 transactions. With 350ms average lock hold time, the 1,000th user waits ~350 seconds. With lock_timeout set to 30s, only ~85 transactions (30s / 350ms) get through before the rest time out, so ~915 fail. Only ~85 orders succeed per 30-second window. Users see timeout errors and retry, amplifying the problem.
Mitigation
The Saga variant uses reservation-based inventory with TTL — no row-level locks held during payment. Queue-based admission control limits concurrent checkouts per SKU.
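Reservation-based inventory with a TTL can be illustrated with an in-memory store. This is a sketch under stated assumptions: the class and method names are hypothetical, the 600-second TTL is an arbitrary example value, and a real implementation would perform each `reserve` as a short atomic database or cache operation rather than a Python dict update.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReservationStore:
    stock: dict[str, int]
    ttl_s: float = 600.0                       # assumed TTL, not from the template
    clock: Callable[[], float] = time.monotonic
    _holds: dict[str, tuple[str, int, float]] = field(default_factory=dict)

    def _expire(self) -> None:
        now = self.clock()
        for rid, (sku, qty, expires_at) in list(self._holds.items()):
            if expires_at <= now:
                self.stock[sku] += qty         # return unpaid stock to the pool
                del self._holds[rid]

    def reserve(self, reservation_id: str, sku: str, qty: int) -> bool:
        """Short operation: no lock is held while payment runs afterwards."""
        self._expire()
        if self.stock.get(sku, 0) < qty:
            return False
        self.stock[sku] -= qty
        self._holds[reservation_id] = (sku, qty, self.clock() + self.ttl_s)
        return True

    def confirm(self, reservation_id: str) -> bool:
        """Call after payment succeeds; the stock stays decremented."""
        self._expire()
        return self._holds.pop(reservation_id, None) is not None


now = [0.0]                                    # fake clock for the demo
store = ReservationStore(stock={"sku-1": 1}, ttl_s=600.0, clock=lambda: now[0])
first = store.reserve("r1", "sku-1", 1)        # takes the last unit
second = store.reserve("r2", "sku-1", 1)       # rejected: nothing left
now[0] = 601.0                                 # r1 was never paid and expires
third = store.reserve("r3", "sku-1", 1)        # stock is available again
print(first, second, third)
```

The point of the design is that the stock decrement is brief and the slow payment call happens while only a soft hold exists, so other shoppers are rejected quickly rather than queued behind a lock.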
Network timeout during payment API call leads to double-charge
Impact
The checkout transaction times out and rolls back (inventory restored, no order created). The shopper retries. But the first payment actually succeeded at Stripe — the timeout was on the response, not the request. The retry creates a second charge. The customer is billed twice with only one order.
Mitigation
The Saga variant uses idempotency keys on every payment request. Stripe's idempotency API ensures the second call returns the original result without creating a new charge.
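The effect of an idempotency key can be shown with an in-memory stand-in for the payment provider. `FakePaymentProvider` is hypothetical and exists only for illustration; with Stripe the key is sent with the request (as an `Idempotency-Key` header) rather than handled by client code like this.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakePaymentProvider:
    """In-memory stand-in for a provider that honors idempotency keys."""
    charges: list[dict] = field(default_factory=list)
    _by_key: dict[str, dict] = field(default_factory=dict)

    def charge(self, amount: int, payment_method: str, idempotency_key: str) -> dict:
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]     # replay: no new charge
        result = {"id": f"ch_{len(self.charges) + 1}", "amount": amount,
                  "payment_method": payment_method}
        self.charges.append(result)
        self._by_key[idempotency_key] = result
        return result


provider = FakePaymentProvider()
key = str(uuid.uuid4())      # generated once per checkout, reused on every retry
first = provider.charge(4999, "pm_card", idempotency_key=key)
retry = provider.charge(4999, "pm_card", idempotency_key=key)   # e.g. after a timeout
print(first["id"], retry["id"], len(provider.charges))
```

The retry returns the original charge, so the customer is billed once even though the client sent the request twice. The key must be stored with the checkout attempt so a retry reuses it instead of generating a new one.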
PostgreSQL primary crashes during peak checkout traffic
Impact
Total system outage — no browsing, no checkout, no order status. All in-flight transactions are lost and must be retried by users. No read replicas means zero read availability. Recovery depends on RDS automated backup restoration (5-30 minutes).
Mitigation
The Saga variant uses per-service databases with replicas. A single DB failure affects only one service (e.g., inventory). Other services continue operating. Saga orchestrator retries the failed step when the DB recovers.
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| PostgreSQL Database | Connection pool exhaustion | All requests (browse + checkout) fail with connection timeout. System-wide outage. | PgBouncer connection pooler, separate pools for read/write, but fundamentally requires moving to async architecture. |
| PostgreSQL Database | Row-level lock contention on hot SKU | Checkout latency degrades non-linearly. At 100 concurrent, p99 > 30s. At 500 concurrent, most transactions time out. | Reservation-based inventory (Saga variant) eliminates lock contention by using TTL-based soft locks instead of DB row locks. |
| Monolith Service | Thread pool exhaustion from long-held checkout transactions | Product browsing requests queue behind blocked checkout threads. All service endpoints degrade. | Separate services for browsing and checkout (Saga variant). Independent thread pools and scaling. |
| Payment Provider (Stripe/PayPal) | Timeout or elevated latency (>1s) | Database locks held for extended duration. Lock contention amplified. Cascading failures across all checkouts. | Async payment processing via Kafka worker (Saga variant). Circuit breaker on payment calls. |
Vertical scaling of PostgreSQL (larger instance) is the primary lever. Move from db.r7g.xlarge to db.r7g.2xlarge for 2x connections and I/O. Add read replicas for product browsing queries (offload 70% of traffic). Horizontal scaling of the monolith service (more ECS pods) helps with CPU-bound product browsing but does not solve database lock contention. The fundamental ceiling is ~100 orders/minute per hot SKU due to row-level lock serialization. Beyond this, the architecture must change — which is exactly what the Saga Orchestrator variant does.
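The per-SKU ceiling follows directly from the lock hold time. A small sketch, with one loudly stated assumption: the ~100 orders/minute figure is reproduced here by assuming a total hold of about 600ms per checkout (payment plus the SQL steps), which the template does not state explicitly.

```python
def orders_per_minute(lock_hold_ms: float) -> float:
    """Max committed checkouts per minute for one SKU when locks serialize."""
    return 60_000.0 / lock_hold_ms


ceiling_payment_only = orders_per_minute(350)   # average payment latency from the text
ceiling_full_txn = orders_per_minute(600)       # ASSUMED total hold incl. SQL steps
print(f"{ceiling_payment_only:.0f}/min at 350ms, {ceiling_full_txn:.0f}/min at 600ms")
```

No instance size appears in the formula, which is why a larger database does not raise the ceiling for a single hot SKU.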
Key metrics to monitor: (1) PostgreSQL lock wait time per transaction — alert at p99 > 1 second, (2) connection pool utilization — alert at > 80%, (3) checkout p99 latency — alert at > 3 seconds, (4) payment provider response time — alert at p99 > 1 second, (5) inventory stock levels for flash-sale SKUs — alert when approaching zero. Dashboard should show real-time lock contention heatmap by SKU, connection pool usage over time, and checkout success/failure rate split by error type (lock timeout, connection timeout, payment declined, insufficient stock).
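The numeric thresholds above translate into a simple alert check. This is a sketch only; the metric names are invented for illustration, and the stock-level alert is omitted because the text gives no numeric threshold for "approaching zero".

```python
# Thresholds taken from the list above; metric names are hypothetical.
ALERT_THRESHOLDS = {
    "lock_wait_p99_s": 1.0,
    "pool_utilization": 0.80,
    "checkout_p99_s": 3.0,
    "payment_p99_s": 1.0,
}


def firing_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that strictly exceed their threshold."""
    return sorted(name for name, limit in ALERT_THRESHOLDS.items()
                  if metrics.get(name, 0.0) > limit)


sample = {"lock_wait_p99_s": 2.4, "pool_utilization": 0.65,
          "checkout_p99_s": 3.0, "payment_p99_s": 1.2}
alerts = firing_alerts(sample)
print(alerts)   # ['lock_wait_p99_s', 'payment_p99_s']
```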
Minimal infrastructure cost due to simplicity: single RDS db.r7g.xlarge (~$350/month), single ElastiCache Redis cache.r7g.large (~$150/month), 4 ECS Fargate pods at 4 vCPU/8 GB (~$400/month), ALB (~$25/month). Total: ~$925/month. This is the cheapest variant by far. However, the cost of flash-sale failures is substantial: lost revenue from failed checkouts, customer service costs for double-charges, and reputational damage from timeout errors during the busiest shopping periods. The Distributed Saga + Outbox variant costs ~3x more in infrastructure (~$2,800/month) but sustains roughly 100x the order throughput (10K+ versus ~100 orders/min); the Saga Orchestrator variant sits in between at ~$1,800/month.
PCI compliance is the primary concern. The monolith handles payment tokens (Stripe/PayPal payment method IDs) in the same process that serves product browsing — expanding PCI scope to the entire application. All code, dependencies, and infrastructure must meet PCI DSS requirements. Recommendation: at minimum, use tokenized payment methods (never handle raw card numbers), enforce TLS everywhere, and encrypt payment_id columns at rest. The Saga variant isolates payment into a dedicated microservice, reducing PCI scope to a single service boundary.
Single monolith deployment: build Docker image, run ECS rolling update (one pod at a time). Rollback: revert to previous task definition. Database migrations run before service deploy (forward-compatible schema changes only). Zero-downtime deployment requires at least 2 healthy pods during rolling update. No blue/green needed at this scale: the simplicity of a single service makes rolling updates sufficient. Canary deployments are possible with a monolith but need weighted routing at the load balancer (for example, ALB weighted target groups) to send 10% of traffic to a new version, which this template does not configure.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Monolith + ACID) | T1 | 800ms-3s p99 | ~100 orders/min | ~$925/month | Low | 99% (single DB) |
| Saga Orchestrator | T2 | 200-500ms p99 | ~2,500 orders/min | ~$1,800/month | Medium | 99.9% |
| Distributed Saga + Outbox | T4 | < 3s end-to-end | 10K+ orders/min | ~$2,800/month | Very High | 99.99% |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
E-commerce checkout combines several hard problems: distributed transactions (inventory + payment + order atomicity), concurrent access control (flash-sale traffic on limited stock), payment idempotency (preventing double-charges), and saga compensation (undoing partial work on failure). Companies like Amazon, Shopify, Stripe, and Square ask this because it directly maps to their core business logic and exposes candidates' understanding of consistency, concurrency, and failure handling.
With SELECT ... FOR UPDATE, the first transaction acquires the row lock on the popular SKU. The next 999 transactions queue behind it, each waiting for the lock. Since each transaction holds the lock for ~350ms (payment processing time), the 1,000th transaction waits approximately 350 seconds, nearly 6 minutes. PostgreSQL's lock_timeout defaults to 0 (wait indefinitely), so in practice it is set explicitly; with the 30-second value used in this template, most transactions fail with a lock timeout error and the shopper sees an error page. Only ~85 orders succeed per 30-second window on a single SKU.
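The ~85 figure is the number of lock holds that fit into one timeout window. A minimal sketch, assuming every checkout arrives at the same instant and holds the lock for exactly the average time; the exact count can differ by one depending on whether the first holder's zero wait is counted, and `flash_sale_outcome` is a hypothetical helper.

```python
def flash_sale_outcome(shoppers: int, lock_hold_ms: int,
                       lock_timeout_ms: int) -> tuple[int, int]:
    """Return approximate (succeeded, timed_out) for one simultaneous burst."""
    # How many full lock holds fit in one timeout window.
    succeeded = min(shoppers, lock_timeout_ms // lock_hold_ms)
    return succeeded, shoppers - succeeded


succeeded, timed_out = flash_sale_outcome(1000, 350, 30_000)
print(succeeded, timed_out)   # 85 915
```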
Vertical scaling (bigger CPU, more RAM) helps with connection pool limits and query throughput but does not solve the fundamental lock contention problem. Row-level locks are serialized regardless of hardware — two transactions cannot both hold an exclusive lock on the same row simultaneously. A db.r7g.16xlarge handles more concurrent connections but the lock wait time for the same SKU is identical to a db.r7g.large. The solution must change the architecture, not the hardware.
Migrate when checkout p99 latency exceeds your SLO (typically 3 seconds), when database lock wait time exceeds 1 second during normal peaks, or when you see connection pool exhaustion errors in your logs. In this simulation, the inflection point is around 100 orders/minute. Below that, the naive approach is simpler and cheaper to operate. Above that, you need to move payment processing out of the transaction — which is exactly what the Saga Orchestrator variant does.
PCI DSS scope covers every system component that stores, processes, or transmits cardholder data, plus anything connected to those components. Segmentation is not strictly mandated by the standard, but it is the usual way to shrink that scope. Embedding payment API calls inside a monolithic service that also handles product browsing and order management pulls the entire application into PCI scope. Separating payment into its own service (as in the Saga variant) reduces PCI scope to just the payment microservice, simplifying audits and reducing compliance costs.