Hard · 11 components · Interview: Very High

Payment System — Transaction Processing

Design a Stripe-like payment processing system with idempotent charges, sharded ledger, saga orchestration, and async card network integration handling 100K TPS at peak.

Transactions, Saga, Idempotency, Financial

Problem Statement

Payment system design is one of the highest-stakes system design interview questions because it combines strict correctness requirements with high-throughput demand. Unlike most distributed systems where occasional data loss or duplication is tolerable, a payment system must guarantee that no customer is ever double-charged, every transaction is recorded in an auditable ledger, and refunds are processed reliably. This makes it a favorite question at fintech companies, FAANG payment teams, and any organization that processes financial transactions at scale.

At production scale, this design targets a Stripe-like platform that must absorb on the order of 100,000 requests per second during peak events such as Black Friday or Singles' Day. The charge API must return a response in under 200 milliseconds, yet the actual card network authorization (Visa, Mastercard, Amex) takes 200 to 500 milliseconds and is inherently unreliable. This tension between API latency expectations and downstream processing time forces an asynchronous architecture in which the charge is acknowledged immediately and card network resolution happens out-of-band.

The correctness challenges are multifaceted. Idempotency must be guaranteed at the API layer — if a merchant retries a charge due to a network timeout, the system must detect the duplicate and return the original result rather than creating a second charge. The ledger must maintain double-entry accounting integrity: every debit has a corresponding credit, and the sum of all entries must always balance. Saga orchestration must handle partial failures gracefully — if the card network declines a charge after the ledger has recorded it as pending, compensating transactions must reverse the state without leaving the system inconsistent.
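The double-entry invariant described above can be sketched in a few lines. This is a minimal illustration, not the template's actual ledger schema: the `Entry` type, the `post` and `is_balanced` helpers, and the account names are all hypothetical, and amounts are assumed to be integers in minor units (cents).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account: str
    debit: int = 0   # amount in minor units (cents); assumption for this sketch
    credit: int = 0


def post(ledger: list, entries: list) -> None:
    """Append a group of entries only if its debits equal its credits."""
    debits = sum(e.debit for e in entries)
    credits = sum(e.credit for e in entries)
    if debits != credits:
        raise ValueError(f"unbalanced posting: debits={debits} credits={credits}")
    ledger.extend(entries)


def is_balanced(ledger: list) -> bool:
    """Check the global invariant: total debits equal total credits."""
    return sum(e.debit for e in ledger) == sum(e.credit for e in ledger)
```

Because every posting is rejected unless it balances on its own, the ledger as a whole stays balanced after any sequence of successful postings.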

Beyond core processing, interviewers expect candidates to address PCI-DSS compliance (card data isolation), settlement reconciliation (matching charges against bank settlement files, which typically arrive around T+2), webhook delivery to merchants with at-least-once semantics, and fraud detection integration. This template provides a foundation for discussing all of these concerns in an interview setting.

Architecture Overview

The payment system architecture centers on an idempotent sharded ledger with saga orchestration for distributed transaction management. The flow begins when a merchant sends a charge request with a client-supplied idempotency key through the API Gateway, which authenticates merchant API keys, enforces per-merchant rate limits to prevent card-testing attacks, and routes to the Main Load Balancer. The load balancer distributes traffic between ChargeService (handling charges, status checks, and reports) and RefundService (handling refunds and webhook replays).

The critical charge path executes three steps in sequence. First, ChargeService claims the idempotency key in the IdempotencyCache (Redis) with an atomic SETNX operation, which checks for the key and stores it in a single step. If the key already exists, the previous result is returned immediately; this is what provides the at-most-once charging guarantee. If the key is new, it is stored before any processing begins, so a retry that arrives while the first attempt is still in flight cannot create a second charge. Second, ChargeService writes the charge record to LedgerDB (PostgreSQL, sharded by merchant_id across 64 partitions) with status "pending." This ledger write is the single source of truth for all financial records, using strong consistency with synchronous replication to two replicas. Third, ChargeService publishes a "charge_created" event to PaymentStream (Kafka with 64 partitions) for async card network processing, and returns the payment ID with "pending" status to the merchant.
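The ordering of the three steps can be sketched as follows. This is an illustration under loud assumptions: plain Python containers stand in for Redis, LedgerDB, and Kafka, the `create_charge` function and the `pay_` ID format are hypothetical, and the sketch is single-threaded, so it does not model real concurrency or the failure case where the ledger write succeeds but the publish does not.

```python
import uuid

idempotency_cache = {}  # stand-in for Redis (IdempotencyCache)
ledger = {}             # stand-in for the sharded LedgerDB
payment_stream = []     # stand-in for the Kafka topic (PaymentStream)


def create_charge(merchant_id: str, idempotency_key: str, amount_cents: int) -> dict:
    cache_key = (merchant_id, idempotency_key)
    payment_id = f"pay_{uuid.uuid4().hex}"  # hypothetical ID format

    # Step 1: claim the key. setdefault stores our ID only if the key is new
    # and otherwise returns the ID stored by the first attempt.
    claimed_id = idempotency_cache.setdefault(cache_key, payment_id)
    if claimed_id != payment_id:
        record = ledger.get(claimed_id)
        status = record["status"] if record else "pending"
        return {"payment_id": claimed_id, "status": status}

    # Step 2: write the pending charge to the ledger (source of truth).
    ledger[payment_id] = {
        "merchant_id": merchant_id,
        "amount_cents": amount_cents,
        "status": "pending",
    }

    # Step 3: publish the event for asynchronous card network processing.
    payment_stream.append({"type": "charge_created", "payment_id": payment_id})
    return {"payment_id": payment_id, "status": "pending"}
```

Calling `create_charge` twice with the same merchant and key returns the same payment ID and leaves exactly one ledger row and one published event.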

The async payment pipeline is powered by three specialized workers consuming from Kafka. CardNetworkWorker (30 instances) routes charges to the appropriate card network (Visa, Mastercard, Amex) based on the card BIN, handling the 200-500ms network calls with circuit breakers per network for fault isolation. On authorization result, it updates the ledger status and publishes a "payment_result" event. SettlementWorker (5 instances) performs daily batch reconciliation, matching charges against card network settlement files and flagging discrepancies. WebhookWorker (15 instances) delivers payment events to merchant webhook URLs with at-least-once delivery, exponential backoff retry up to 72 hours, and HMAC-SHA256 signatures for merchant verification.
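BIN-based routing in CardNetworkWorker might look roughly like the sketch below. The `network_for_bin` function name is hypothetical, and the prefix rules cover only the three networks named in this template (Visa cards start with 4, Amex with 34 or 37, Mastercard with 51 to 55 or 2221 to 2720). A production router would more likely consult a maintained BIN table than hard-coded ranges.

```python
def network_for_bin(bin_prefix: str) -> str:
    """Map the leading digits of a card number (the BIN) to a card network."""
    if not bin_prefix.isdigit() or len(bin_prefix) < 6:
        raise ValueError("expected at least six leading digits")
    if bin_prefix.startswith("4"):
        return "visa"
    if bin_prefix[:2] in {"34", "37"}:
        return "amex"
    if 51 <= int(bin_prefix[:2]) <= 55 or 2221 <= int(bin_prefix[:4]) <= 2720:
        return "mastercard"
    raise ValueError("unsupported card network")
```

Passing only the BIN prefix rather than the full card number keeps this routing step away from complete card data.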

Key Design Decisions
Idempotency Strategy

Choice

Client-supplied idempotency keys stored in Redis SETNX before processing

Rationale

Network timeouts between the merchant and the payment API cause retries that could result in double charges. By storing the idempotency key in Redis atomically (SETNX) before any charge processing begins, a retry during processing finds the key and returns the existing result. Stripe's public API accepts client-supplied idempotency keys for the same reason. Redis provides sub-millisecond lookup on the critical path; at 100K RPS, every millisecond saved per request relative to a database-backed check removes about 100 seconds of aggregate request latency for each second of peak traffic.
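A minimal sketch of the claim-then-record pattern follows. To keep it runnable without a server, `InMemoryStore` is a hypothetical stand-in whose `set(..., nx=True)` mimics set-if-not-exists semantics; with redis-py the corresponding call would be `client.set(key, value, nx=True, ex=ttl)`. The `claim_or_replay` and `record_result` helpers, the `in_flight` sentinel, and the 24-hour TTL are assumptions for illustration.

```python
IN_FLIGHT = "in_flight"  # hypothetical sentinel stored while a charge is processing


class InMemoryStore:
    """Stand-in for Redis; ignores expiry, which real Redis would enforce."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None  # key already present: nothing written
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)


def claim_or_replay(store, key, ttl_seconds=86_400):
    """Return (True, None) if this caller claimed the key, else (False, stored value)."""
    if store.set(key, IN_FLIGHT, nx=True, ex=ttl_seconds):
        return True, None
    return False, store.get(key)


def record_result(store, key, result, ttl_seconds=86_400):
    """Overwrite the sentinel with the final result so later retries replay it."""
    store.set(key, result, ex=ttl_seconds)
```

A retry that sees the sentinel knows the first attempt is still running and can return "pending" instead of starting a second charge.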

Ledger Sharding

Choice

PostgreSQL sharded by merchant_id across 64 partitions

Rationale

Payment queries are almost always scoped to a single merchant (list charges, check balance, generate statements). Sharding by merchant_id means most queries hit a single shard, avoiding expensive cross-shard joins. At 100K TPS across 64 shards, each shard handles approximately 1,560 TPS on average (100,000 / 64), assuming merchants are spread evenly, which is a load a well-provisioned PostgreSQL primary can typically sustain. Strong consistency with synchronous replication ensures the ledger is never in an inconsistent state, which is a non-negotiable requirement for financial systems.
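One way to map a merchant to a shard, together with the per-shard load arithmetic, is sketched below. The template does not say how merchant_id is mapped to a partition, so the SHA-256 based hashing and the `shard_for_merchant` name are assumptions made here for illustration.

```python
import hashlib

NUM_SHARDS = 64
PEAK_TPS = 100_000


def shard_for_merchant(merchant_id: str) -> int:
    """Map a merchant to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Average load per shard if merchants are spread evenly: 100,000 / 64 = 1562.5
per_shard_tps = PEAK_TPS / NUM_SHARDS
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would route the same merchant to different shards on different hosts.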

Async Card Network Integration

Choice

Kafka-based event pipeline with dedicated CardNetworkWorker consumers

Rationale

Card network authorization calls take 200-500ms and can time out unpredictably. Making them synchronous would push charge API latency above 500ms and tightly couple system availability to card network uptime. Async processing via Kafka means the charge API returns in approximately 70ms with status "pending," and the merchant receives the final result via webhook. Decoupling of this kind is a common way for large payment platforms to keep charge API latency well under a second.

Service Separation

Choice

Dedicated ChargeService and RefundService with independent scaling

Rationale

Charges and refunds have fundamentally different scaling profiles (roughly 50K charge RPS vs 5K refund RPS at peak), failure semantics (charge failure means do not charge; refund failure means must retry), and compliance requirements (charges touch card tokens; refunds reference existing payment IDs only). Separating them enables independent scaling, isolated deployment, distinct PCI scope boundaries, and prevents refund processing from competing with charge throughput during peak load.

Scale & Performance

Target RPS

100,000 peak (50K charges, 30K status, 5K refunds, 15K other)

Latency (p99)

<200ms (charge API); 200-500ms (card network, async)

Storage

~5 TB/year (ledger + audit logs)

Availability

99.99%

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How do you prevent double-charging in a distributed payment system?

Double-charge prevention relies on client-supplied idempotency keys stored in Redis using the SETNX (set-if-not-exists) atomic operation. Before any charge processing begins, the system attempts to store the idempotency key; SETNX performs the existence check and the write as one step, so two concurrent requests cannot both succeed. If the key already exists, the previous result is returned immediately without creating a new charge. If it does not exist, the key is now recorded and processing proceeds. This guarantees at-most-once charging even when merchants retry due to network timeouts, because the key is in place before any side effects occur.

What is the saga pattern and why is it used in payment processing?

The saga pattern is a distributed transaction technique where a long-running transaction is decomposed into a sequence of local transactions, each paired with a compensating action for rollback. In payment processing, the saga coordinates the charge lifecycle: ledger write, card network authorization, settlement, and webhook delivery. If the card network declines a charge after the ledger records it as pending, the saga executes compensating transactions — updating the ledger status to failed, releasing any merchant balance holds, and sending a payment_failed webhook. Each compensating action is idempotent to handle the case where the saga orchestrator crashes mid-rollback and restarts.
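The compensation flow can be sketched with a small orchestrator. This is a simplified illustration, not the template's orchestrator: `run_saga`, the step names, and the in-memory `charge` record are hypothetical, and the sketch omits persistence of saga state, which a real orchestrator needs in order to resume after a crash.

```python
class CardDeclined(Exception):
    pass


def run_saga(steps):
    """Run (name, action, compensation) steps; on failure undo completed steps in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


charge = {"status": None}


def write_pending():
    charge["status"] = "pending"


def mark_failed():
    # Idempotent: running it twice leaves the same state.
    charge["status"] = "failed"


def authorize():
    raise CardDeclined("declined by card network")


ok, log = run_saga([
    ("ledger_write", write_pending, mark_failed),
    ("authorize", authorize, lambda: None),
])
```

Here the decline in the second step triggers the compensation for the first, leaving the charge marked as failed rather than stuck in "pending".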

Why shard the payment ledger by merchant_id instead of payment_id?

Payment operations are overwhelmingly scoped to a single merchant: listing charges, computing balances, generating statements, and running reports all filter by merchant. Sharding by merchant_id ensures these queries hit a single shard without cross-shard coordination. If sharded by payment_id, a merchant's statement query would need to scatter-gather across all 64 shards, dramatically increasing latency and resource consumption. The trade-off is that very large merchants may create hot shards, which can be addressed by sub-sharding within a merchant's partition.

How does a payment system handle card network outages?

Card network integration uses circuit breakers per network (Visa, Mastercard, Amex) for fault isolation. When consecutive failures to a specific network exceed a threshold, the circuit opens and charges for that network are held in the Kafka queue rather than being attempted and failing. The circuit half-opens periodically to test whether the network has recovered. Meanwhile, charges for other networks continue processing normally. This prevents a Visa outage from affecting Mastercard transactions and avoids overwhelming a degraded network with retry storms.
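A per-network circuit breaker along these lines might look like the sketch below. The `CircuitBreaker` class, the threshold of 5 consecutive failures, and the 30-second reset timeout are illustrative assumptions. One simplification to note: once the timeout elapses, this version lets every request through until a result is recorded, whereas a stricter half-open state would admit a single probe.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the timeout, allow traffic again to probe recovery.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe


# One breaker per network, so a Visa outage does not block Mastercard or Amex.
breakers = {name: CircuitBreaker() for name in ("visa", "mastercard", "amex")}
```

A worker would call `breakers[network].allow_request()` before each authorization attempt and leave the message unconsumed (or pause the consumer) when it returns False.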

How do payment webhooks guarantee delivery to merchants?

Webhook delivery uses at-least-once semantics with exponential backoff retry over a 72-hour window. Each delivery attempt includes an HMAC-SHA256 signature header so merchants can verify authenticity. If the merchant endpoint returns a 2xx status, delivery is marked successful. On failure (timeout, 5xx, connection refused), the worker retries with increasing intervals (1s, 2s, 4s, 8s, up to hours between attempts). After exhausting the retry window, undeliverable events are moved to a dead letter queue. Merchants must deduplicate by event_id since at-least-once delivery means the same event may arrive more than once.
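The retry schedule and signature check can be sketched with the standard library. The function names, the one-hour cap on the delay between attempts, and the hex-encoded signature format are assumptions for illustration; the sketch also omits jitter, which real retry schedules usually add.

```python
import hashlib
import hmac


def retry_delays(base=1.0, cap=3600.0, window=72 * 3600.0):
    """Yield delays that double from `base`, capped at `cap`, within a total `window`."""
    elapsed = 0.0
    delay = base
    while elapsed + delay <= window:
        yield delay
        elapsed += delay
        delay = min(delay * 2, cap)


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Compute the HMAC-SHA256 signature sent alongside the webhook body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, signature: str) -> bool:
    """Merchant-side check; compare_digest avoids leaking timing information."""
    return hmac.compare_digest(sign_payload(secret, payload), signature)
```

The merchant recomputes the signature over the raw request body with the shared secret and rejects the event if the values differ.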
