Saga Pattern

Learn how the saga pattern manages distributed transactions across microservices using a sequence of local transactions with compensating actions, avoiding the need for distributed ACID transactions.

Overview

In a monolithic application, a business operation like 'place an order' can be a single ACID transaction: debit the customer's account, reserve inventory, and create the order record all succeed or all fail atomically. In a microservices architecture where each service owns its own database, this ACID guarantee disappears. The Payment Service, Inventory Service, and Order Service each have their own database, and the only way to wrap a transaction across all three is distributed two-phase commit (2PC), which adds latency, blocks participants if the coordinator fails, and is not supported by many modern datastores and message brokers.

The saga pattern, originally described by Hector Garcia-Molina and Kenneth Salem in 1987, solves this problem by decomposing the distributed transaction into a sequence of local transactions, each executed within a single service's database. If the payment succeeds but inventory reservation fails, a compensating transaction refunds the payment. Each step either advances the saga forward or triggers compensating actions to undo all previously completed steps, bringing the system back to a consistent state.
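The core mechanism can be sketched in a few lines. This is a minimal illustration, not a production implementation: `run_saga`, the step tuples, and the simulated out-of-stock failure are all hypothetical names chosen for the example, and each "local transaction" is just an in-memory function call.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); the shape is an assumption made for this sketch
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step]) -> bool:
    """Run each local transaction in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
            completed.append(step)
        except Exception:
            # The failed step did not commit, so only earlier steps are compensated.
            for _, _, compensate in reversed(completed):
                compensate()
            return False
    return True

log: List[str] = []

def reserve_inventory() -> None:
    raise RuntimeError("out of stock")  # simulated failure

steps: List[Step] = [
    ("payment", lambda: log.append("charge"), lambda: log.append("refund")),
    ("inventory", reserve_inventory, lambda: log.append("release")),
    ("order", lambda: log.append("create order"), lambda: log.append("cancel order")),
]
ok = run_saga(steps)
print(ok, log)
```

In this run the payment is charged, the inventory step fails, and the only compensation executed is the refund, so `log` ends as `['charge', 'refund']` and `ok` is `False`.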

There are two coordination approaches. In choreography-based sagas, each service listens for events and reacts independently. The Order Service publishes 'OrderCreated'; the Payment Service hears it, charges the customer, and publishes 'PaymentCompleted'; the Inventory Service hears that and reserves stock. If any step fails, the failing service publishes a failure event that triggers compensating actions in the services that already completed their steps. Choreography works well for simple sagas with 3-4 steps but becomes difficult to understand and debug as the number of steps grows because the flow logic is distributed across services.
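A choreographed flow might look roughly like the sketch below. The `EventBus` class is a synchronous in-memory stand-in for a real message broker, and the event names 'InventoryReserved', 'InventoryFailed', and 'PaymentRefunded' are assumptions added for illustration; only 'OrderCreated' and 'PaymentCompleted' come from the description above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]

class EventBus:
    """Synchronous in-memory stand-in for a message broker (hypothetical)."""
    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.published.append(event)
        for handler in self.handlers[event]:
            handler(payload)

def wire_services(bus: EventBus, stock: Dict[str, int]) -> None:
    def on_order_created(order: Dict) -> None:       # Payment Service: charge
        bus.publish("PaymentCompleted", order)

    def on_payment_completed(order: Dict) -> None:   # Inventory Service: reserve
        if stock.get(order["sku"], 0) > 0:
            stock[order["sku"]] -= 1
            bus.publish("InventoryReserved", order)
        else:
            bus.publish("InventoryFailed", order)

    def on_inventory_failed(order: Dict) -> None:    # Payment Service: compensate
        bus.publish("PaymentRefunded", order)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("PaymentCompleted", on_payment_completed)
    bus.subscribe("InventoryFailed", on_inventory_failed)

bus = EventBus()
wire_services(bus, stock={"widget": 0})  # no stock, so the reservation fails
bus.publish("OrderCreated", {"order_id": "o-1", "sku": "widget"})
print(bus.published)
```

No component in this sketch knows the whole sequence: the flow only emerges from the subscriptions, which is the property that makes longer choreographed sagas hard to trace.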

Orchestration-based sagas use a central saga coordinator (orchestrator) that explicitly defines the step sequence. The orchestrator tells each service what to do and what to do if it fails, like a workflow engine. The Order Saga Orchestrator calls the Payment Service, waits for the result, calls the Inventory Service, waits for the result, and if any step fails, it sends compensating commands to services that have already completed their steps. Orchestration is easier to understand, test, and monitor because the entire flow is visible in one place, but it introduces a coordination point that can become a bottleneck or single point of failure.
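As a rough sketch of the orchestrated variant, the coordinator below owns the step order and keeps a per-saga record that can be inspected at any time. `OrderSagaOrchestrator`, `SagaRecord`, the status strings, and the dictionary used as a store are all hypothetical; a real orchestrator would persist the record durably and call services over the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (step name, execute command, compensating command); hypothetical shape
Participant = Tuple[str, Callable[[Dict], bool], Callable[[Dict], None]]

@dataclass
class SagaRecord:
    saga_id: str
    status: str = "STARTED"
    completed: List[str] = field(default_factory=list)

class OrderSagaOrchestrator:
    def __init__(self, participants: List[Participant], store: Dict[str, SagaRecord]) -> None:
        self.participants = participants
        self.store = store  # stand-in for durable storage of saga state

    def run(self, saga_id: str, order: Dict) -> SagaRecord:
        record = self.store.setdefault(saga_id, SagaRecord(saga_id))
        for name, execute, _ in self.participants:
            if execute(order):
                record.completed.append(name)
                continue
            # A step failed: send compensating commands in reverse order.
            record.status = "COMPENSATING"
            compensators = {n: c for n, _, c in self.participants}
            for done in reversed(record.completed):
                compensators[done](order)
            record.status = "COMPENSATED"
            return record
        record.status = "COMPLETED"
        return record

calls: List[str] = []

def charge(order: Dict) -> bool:
    calls.append("charge")
    return True

def reserve(order: Dict) -> bool:
    calls.append("reserve")
    return False  # simulated failure

participants: List[Participant] = [
    ("payment", charge, lambda order: calls.append("refund")),
    ("inventory", reserve, lambda order: calls.append("release")),
]
store: Dict[str, SagaRecord] = {}
record = OrderSagaOrchestrator(participants, store).run("saga-1", {"order_id": "o-1"})
print(record.status, record.completed, calls)
```

Unlike the event-driven style, the whole flow and its current state live in one place (`store["saga-1"]`), which is what makes this approach easier to monitor.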

Key Points
  • Each saga step is a local ACID transaction within a single service's database. The saga as a whole provides eventual consistency, not ACID atomicity. Between steps, the system is in a partially-completed intermediate state.
  • Compensating transactions must be defined for every step except the last. A compensation undoes the semantic effect of a step (e.g., refund a payment, release reserved inventory) but does not roll back the database transaction itself.
  • Choreography-based sagas coordinate through events with no central controller. Each service reacts to events and publishes its own. This is decoupled but hard to understand beyond 3-4 steps because the flow logic is scattered.
  • Orchestration-based sagas use a central coordinator that defines the step sequence explicitly. The orchestrator calls each service and handles failures. This is easier to reason about but introduces a coordination dependency.
  • All saga participants must be idempotent because network failures can cause duplicate deliveries. A service must produce the same result whether it processes a step once or multiple times.
  • Semantic locks or 'pending' states are needed to handle concurrent operations during saga execution. An order in the middle of a saga should be marked as 'PENDING' so that concurrent modifications are rejected until the saga completes.
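The idempotency requirement in the points above is commonly met with an idempotency key. The sketch below is one possible shape, assuming a hypothetical `PaymentService` that stores results in memory; in a real service the key lookup and the charge would need to happen inside the same local transaction.

```python
from typing import Dict

class PaymentService:
    """Hypothetical participant that deduplicates commands by idempotency key."""
    def __init__(self) -> None:
        self.total_charged = 0
        self._results: Dict[str, str] = {}  # idempotency key -> stored result

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A repeated key means a retry or duplicate delivery: replay the stored result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.total_charged += amount
        result = f"charged:{amount}"
        self._results[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.charge("saga-1:payment", 50)
second = svc.charge("saga-1:payment", 50)  # duplicate delivery of the same command
print(first, second, svc.total_charged)
```

Both calls return the same result and the customer is charged once. Deriving the key from the saga ID and step name is an assumption of this example, not a requirement of the pattern.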
Simple Example

The Travel Booking Analogy

Booking a vacation involves three separate services: flight booking, hotel reservation, and car rental. In a saga, you book the flight first (step 1), then reserve the hotel (step 2), then rent the car (step 3). If the car rental fails because no cars are available, you need to compensate: cancel the hotel reservation (undo step 2) and cancel the flight booking (undo step 1). Each booking system is independent with its own database -- there is no way to wrap all three in a single transaction. The saga ensures that either all three bookings succeed, or any completed bookings are cancelled, leaving you in a consistent state (no partial vacation bookings).

Real-World Examples

Uber

Uber uses an orchestration-based saga (built on Cadence, the workflow engine Uber developed and from which Temporal was later forked) for trip processing. The saga coordinates rider matching, driver notification, fare calculation, payment processing, and receipt generation. If payment fails after a completed ride, a compensating action adjusts the rider's balance and notifies the driver about payment issues. The orchestrator processes millions of trip sagas daily with full visibility into each step's status.

Airbnb

Airbnb's booking system uses sagas to coordinate across the Reservation Service, Payment Service, Host Notification Service, and Calendar Service. When a guest books a stay, the saga reserves the dates, processes the payment, and notifies the host. If the host declines within the acceptance window, compensating actions refund the payment and release the calendar dates. Their saga orchestrator handles approximately 1 million booking attempts per day across 220 countries.

DoorDash

DoorDash uses Cadence (the workflow engine from which Temporal was later forked) to orchestrate order fulfillment sagas. A single order saga coordinates across 10+ services, including order validation, restaurant confirmation, driver assignment, payment authorization, delivery tracking, and final settlement. Each step has a timeout and compensating action. If the restaurant cannot fulfill the order after payment is authorized, the saga triggers a payment reversal and then a customer notification, in that order.

Trade-Offs
Aspect | Description
Availability vs Consistency | Sagas avoid distributed locks, allowing each service to remain available and responsive even during multi-step transactions. The trade-off is that the system is in an inconsistent intermediate state between steps. A customer might briefly see a debited account before inventory is confirmed, requiring careful UX to manage expectations.
Choreography vs Orchestration | Choreography is more decoupled and avoids a central coordinator, but the flow logic is scattered across services, making it hard to understand, test, and debug. Orchestration centralizes the flow logic for visibility and testability, but introduces a coordination point that all participants depend on.
Simplicity vs Compensation Complexity | Sagas avoid the complexity of distributed 2PC transactions, but compensating transactions can be surprisingly difficult to implement. Some operations are not easily reversible (e.g., sending an email, shipping a package). Semantic compensation ('send a cancellation email') is often the only option, and designing these for all failure scenarios requires careful domain analysis.
Scalability vs Observability | Choreography-based sagas scale well because there is no central bottleneck, but observing the state of a saga requires correlating events across multiple services. Orchestration-based sagas have a single point of visibility (the orchestrator's state machine) but the coordinator must scale to handle all saga instances.
Case Study

Order Processing Saga at a Large Retailer

Scenario

A major online retailer's order processing involved five services: Order, Payment, Inventory, Shipping, and Loyalty. The original design used synchronous REST calls in a chain: Order -> Payment -> Inventory -> Shipping -> Loyalty. If any downstream service was slow or unavailable, the entire chain blocked. Payment timeouts caused duplicate charges, and partial failures left orders in inconsistent states -- charged but not shipped, or shipped but not charged. Error rates during peak traffic reached 5%, resulting in thousands of customer complaints daily.

Solution

The team implemented an orchestration-based saga using a dedicated Saga Orchestrator service backed by a durable state machine (Temporal). Each step became an asynchronous operation with explicit timeouts, retry policies, and compensating actions. The Payment step had a 'refund' compensator, the Inventory step had a 'release reservation' compensator, and the Shipping step had a 'cancel shipment' compensator. All participants were made idempotent using idempotency keys. The orchestrator persisted its state durably, so in-flight sagas could survive orchestrator restarts.
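The retry policy described here could be approximated as follows. This is a generic sketch and not Temporal's API: `call_with_retry`, `StepFailed`, and the backoff parameters are hypothetical, and the `sleep` function is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

class StepFailed(Exception):
    """Raised when a step exhausts its retries; the caller should then compensate."""

def call_with_retry(step: Callable[[], T], max_attempts: int = 3,
                    base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TimeoutError:
            if attempt < max_attempts:
                sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise StepFailed(f"gave up after {max_attempts} attempts")

attempts = {"n": 0}

def flaky_payment() -> str:
    # Simulated participant that times out twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("payment service timed out")
    return "charged"

delays: List[float] = []
result = call_with_retry(flaky_payment, sleep=delays.append)
print(result, attempts["n"], len(delays))
```

Retrying after a timeout is only safe because the participants are idempotent: the first attempt may have succeeded on the server even though the caller never saw the response.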

Outcome

Order processing error rates dropped from 5% to 0.01%. During peak traffic events (Black Friday), the saga-based system processed 50,000 orders per minute without leaving orders in a permanently inconsistent state. Partial failures were automatically compensated within seconds, eliminating the backlog of inconsistent orders that previously required manual intervention. The orchestrator's state machine provided full visibility into every order's saga progress, reducing customer support investigation time from 15 minutes to 30 seconds.

Common Mistakes
  • Not designing compensating actions for every saga step. Every step except the last must have a compensating action defined and tested. Discovering that a step cannot be compensated after it is in production leads to data inconsistency that requires manual intervention.
  • Ignoring idempotency. Network retries and duplicate message delivery mean saga participants will receive the same command multiple times. Without idempotency keys, this causes duplicate charges, double inventory reservations, or duplicate shipments.
  • Using choreography for complex sagas with many steps. Beyond 3-4 steps, choreography-based sagas become nearly impossible to reason about because the flow logic is distributed. Switch to orchestration when the step count or branching logic exceeds the team's ability to mentally trace the event flow.
  • Treating saga intermediate states as invisible to users. During a saga, the system is in a partially-completed state. If a user queries their order during this window, they might see inconsistent data. Use explicit 'PENDING' or 'PROCESSING' states in the UI to set expectations.
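An explicit 'PENDING' state can double as a semantic lock. The sketch below is one way to express that, assuming a hypothetical `Order` class and status names ('CONFIRMED', 'CANCELLED') that are not taken from any particular system.

```python
class OrderLockedError(Exception):
    """Raised when an order is modified while its saga is still running."""

class Order:
    def __init__(self, order_id: str, quantity: int) -> None:
        self.order_id = order_id
        self.quantity = quantity
        self.status = "PENDING"  # saga in flight; also what the UI would display

    def change_quantity(self, quantity: int) -> None:
        # Reject concurrent modifications unless the saga has completed successfully.
        if self.status != "CONFIRMED":
            raise OrderLockedError(f"order {self.order_id} is {self.status}")
        self.quantity = quantity

    def finish_saga(self, succeeded: bool) -> None:
        self.status = "CONFIRMED" if succeeded else "CANCELLED"

order = Order("o-1", quantity=1)
try:
    order.change_quantity(2)
    rejected = False
except OrderLockedError:
    rejected = True

order.finish_saga(succeeded=True)
order.change_quantity(2)  # allowed once the saga has completed
print(rejected, order.status, order.quantity)
```

The same status field serves both purposes: it blocks conflicting writes while the saga runs and gives the UI an honest value to show the user.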
Test Your Understanding

1. What is the primary difference between a saga and a distributed two-phase commit (2PC) transaction?

2. Why must saga participants be idempotent?
