What happens in a saga when Step 3 of a 5-step process fails?
The Saga pattern manages distributed transactions in microservices by breaking them into a sequence of local transactions, each with a compensating action (undo). If a step fails, previously completed steps are compensated in reverse order. Sagas provide eventual consistency without the blocking behavior of two-phase commit (2PC).
The Saga pattern, originally proposed by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived transactions in databases, has been adopted as the standard approach for distributed transactions in microservices architectures. The core insight is that a distributed transaction spanning multiple services can be decomposed into a sequence of local transactions, each confined to a single service's database. If all steps succeed, the saga completes successfully. If any step fails, the saga executes compensating transactions in reverse order to undo the effects of previously completed steps.
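The reverse-order compensation described above can be sketched in a few lines. This is a minimal in-process illustration, not a production implementation: `run_saga`, `make_step`, and the step names are hypothetical, and each "local transaction" is just a Python callable standing in for a call to a real service.

```python
from typing import Callable, List, Tuple

# Each step pairs a local transaction with its compensating action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # The failed step made no durable change, so only earlier steps are undone.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
        completed.append(step)
        log.append(f"done:{name}")
    return True


def make_step(name: str, fail: bool = False) -> Step:
    """Hypothetical helper: builds a step that optionally fails."""
    def action() -> None:
        if fail:
            raise RuntimeError(f"{name} failed")

    def compensate() -> None:
        pass  # a real compensation would undo the local transaction

    return (name, action, compensate)


log: List[str] = []
steps = [make_step(f"step{i}", fail=(i == 3)) for i in range(1, 6)]
succeeded = run_saga(steps, log)
```

With Step 3 of five failing, the log shows steps 1 and 2 completing, step 3 failing, and then step 2 and step 1 being compensated in that order. Steps 4 and 5 never run.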
There are two coordination styles. Choreography-based sagas use event-driven communication: each service performs its local transaction and publishes an event (e.g., 'OrderCreated'), which triggers the next service (e.g., Payment service subscribes to 'OrderCreated' and processes payment). If a step fails, the failing service publishes a failure event, and previously completed services subscribe to that event and execute their compensating transactions. Choreography is decentralized and loosely coupled but can become hard to understand and debug as the number of services grows.
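A rough sketch of the choreography style follows, assuming a synchronous in-memory `EventBus` as a stand-in for a real message broker. The bus, the handler names, and the decline rule (amounts above 100 fail) are all hypothetical; only the 'OrderCreated' event name comes from the text above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}


def create_order(order_id: str, amount: int) -> None:
    # Order service: local transaction, then publish an event.
    orders[order_id] = "pending"
    bus.publish("OrderCreated", {"order_id": order_id, "amount": amount})


def on_order_created(event: Dict) -> None:
    # Payment service: local transaction, then publish success or failure.
    if event["amount"] > 100:  # hypothetical decline rule
        bus.publish("PaymentFailed", {"order_id": event["order_id"]})
    else:
        bus.publish("PaymentCompleted", {"order_id": event["order_id"]})


def on_payment_failed(event: Dict) -> None:
    # Order service compensates its own earlier local transaction.
    orders[event["order_id"]] = "cancelled"


def on_payment_completed(event: Dict) -> None:
    orders[event["order_id"]] = "paid"


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("PaymentFailed", on_payment_failed)
bus.subscribe("PaymentCompleted", on_payment_completed)

create_order("o-1", 50)
create_order("o-2", 500)
```

No component holds the whole flow: the order service only knows that it must react to 'PaymentFailed', which is what makes the design loosely coupled and also what makes the end-to-end path harder to trace.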
Orchestration-based sagas use a central coordinator (saga orchestrator) that tells each service what to do via commands. The orchestrator maintains the saga's state machine -- which steps have completed, which is current, and what to compensate if a failure occurs. If a step fails, the orchestrator invokes compensating transactions in reverse order. Orchestration is centralized and easier to understand but introduces a coordinator dependency. Workflow engines and frameworks such as Temporal, Cadence (originally from Uber), and Axon persist saga state and handle retries, which removes much of the implementation burden.
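The orchestrator's state machine might look roughly like the sketch below. It is an in-memory illustration under stated assumptions: `SagaStatus`, `SagaState`, `Command`, and `Orchestrator` are hypothetical names, commands are plain callables rather than messages to remote services, and nothing here reflects the API of Temporal, Cadence, or Axon.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class SagaStatus(Enum):
    RUNNING = "running"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    COMPENSATED = "compensated"


@dataclass
class SagaState:
    status: SagaStatus = SagaStatus.RUNNING
    current: int = 0  # index of the step being executed
    completed: List[str] = field(default_factory=list)
    transitions: List[str] = field(default_factory=list)


@dataclass
class Command:
    name: str
    execute: Callable[[], bool]  # returns False when the service rejects
    compensate: Callable[[], None]


class Orchestrator:
    def __init__(self, commands: List[Command]) -> None:
        self.commands = commands
        self.state = SagaState()

    def _move(self, status: SagaStatus) -> None:
        self.state.status = status
        self.state.transitions.append(status.value)

    def run(self) -> SagaState:
        by_name: Dict[str, Command] = {c.name: c for c in self.commands}
        for index, command in enumerate(self.commands):
            self.state.current = index
            if command.execute():
                self.state.completed.append(command.name)
                continue
            # A step was rejected: undo completed steps, most recent first.
            self._move(SagaStatus.COMPENSATING)
            while self.state.completed:
                by_name[self.state.completed.pop()].compensate()
            self._move(SagaStatus.COMPENSATED)
            return self.state
        self._move(SagaStatus.COMPLETED)
        return self.state


undone: List[str] = []
commands = [
    Command("create_order", lambda: True, lambda: undone.append("cancel_order")),
    Command("charge_payment", lambda: True, lambda: undone.append("refund_payment")),
    Command("reserve_inventory", lambda: False, lambda: undone.append("release_reservation")),
]
state = Orchestrator(commands).run()
```

Because the whole flow lives in `run`, the saga's progress can be read off `state` at any point. In a real deployment that state would have to be persisted after every transition so a restarted orchestrator can resume.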
Sagas provide eventual consistency, not the atomic, isolated commit of a 2PC or ACID transaction. During saga execution, intermediate states are visible to other transactions. For example, after the order is created but before payment is processed, the order exists in a 'pending' state. If the payment fails and the saga compensates, the order is cancelled -- but between creation and cancellation, other services could read an order that is later invalidated. This requires careful handling: semantic locks (marking resources as 'pending' or 'processing' while a saga is in flight), read-side countermeasures (queries that filter out pending items), and idempotent compensation (ensuring an undo can be safely re-applied when a message is retried).
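Two of these techniques can be illustrated with a small sketch: a read path that hides orders still marked 'pending', and a refund that is safe to apply twice. The data, the `visible_orders` and `refund` functions, and the in-memory `refunded` set are assumptions made for illustration; a real service would keep this bookkeeping in its own database.

```python
from typing import Dict, List, Set

# 'pending' acts as a semantic lock: a saga is still working on these orders.
orders: Dict[str, str] = {"o-1": "pending", "o-2": "confirmed", "o-3": "pending"}
refunded: Set[str] = set()  # payment ids already compensated
balance: Dict[str, int] = {"cust-1": 60}  # after a 40-unit charge for payment p-1


def visible_orders() -> List[str]:
    """Countermeasure: readers skip orders still locked by an in-flight saga."""
    return sorted(oid for oid, status in orders.items() if status != "pending")


def refund(payment_id: str, customer: str, amount: int) -> bool:
    """Idempotent compensation: a repeated refund for the same payment is a no-op."""
    if payment_id in refunded:
        return False
    refunded.add(payment_id)
    balance[customer] += amount
    return True


first = refund("p-1", "cust-1", 40)
second = refund("p-1", "cust-1", 40)  # e.g., a retried compensation message
```

The second call changes nothing, so a compensation message delivered more than once cannot refund the customer twice.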
E-Commerce Order Saga
A customer places an order. Step 1: Order Service creates an order (compensate: cancel order). Step 2: Payment Service charges the credit card (compensate: refund payment). Step 3: Inventory Service reserves items (compensate: release reservation). Step 4: Shipping Service schedules delivery (compensate: cancel shipment). If Step 3 fails (items out of stock), the saga compensates: refund the payment (Step 2 undo), cancel the order (Step 1 undo). The customer sees 'Order cancelled -- items out of stock.' The saga ensures the system returns to a consistent state without 2PC.
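One way to express this order saga in Python is with `contextlib.ExitStack`, which runs registered callbacks in last-in, first-out order when an exception escapes the `with` block. This is a sketch under stated assumptions: the four services are simulated by appending to an `events` list, and `place_order`, `reserve_items`, and `OutOfStock` are hypothetical names.

```python
from contextlib import ExitStack
from typing import List

events: List[str] = []


class OutOfStock(Exception):
    pass


def reserve_items(in_stock: bool) -> None:
    if not in_stock:
        raise OutOfStock("items out of stock")
    events.append("reserve items")


def place_order(in_stock: bool) -> str:
    try:
        with ExitStack() as compensations:
            events.append("create order")
            compensations.callback(events.append, "cancel order")

            events.append("charge card")
            compensations.callback(events.append, "refund payment")

            reserve_items(in_stock)
            compensations.callback(events.append, "release reservation")

            events.append("schedule delivery")
            compensations.callback(events.append, "cancel shipment")

            # Every step succeeded: discard the compensations instead of running them.
            compensations.pop_all()
            return "Order confirmed"
    except OutOfStock:
        # ExitStack has already run the registered compensations in reverse order.
        return "Order cancelled -- items out of stock"


result = place_order(in_stock=False)
```

Each compensation is registered only after its step succeeds, so a failure at Step 3 triggers exactly the refund and the order cancellation, in that order, and nothing for the steps that never ran.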
Uber (Cadence / Temporal)
Uber built Cadence, which it open-sourced, to orchestrate complex, long-running business processes across hundreds of microservices; Temporal is a later fork of Cadence. A ride-hailing saga involves: match rider to driver, authorize payment hold, start trip, complete trip, charge payment, update driver earnings. Each step runs as a workflow activity paired with a compensating activity. If payment authorization fails, the workflow's failure handling runs the compensations for the steps already completed. The workflow state is durably persisted, so it survives process crashes.
Netflix
Netflix uses choreography-based sagas for its content delivery pipeline. When a new title is ingested, events flow through encoding, quality-check, metadata-tagging, and CDN-distribution services. Each service publishes completion events that trigger the next step. If quality-check fails, a compensation event triggers deletion of the encoded assets and metadata cleanup. Netflix's event-driven architecture and robust message bus (Kafka) make choreography natural at their scale.
Axon Framework (CQRS + Sagas)
Axon Framework provides built-in support for saga orchestration in Java/Kotlin applications. Developers define saga classes with event handlers for each step and compensating handlers for failures. Axon persists saga state, handles retries, and manages the lifecycle. It integrates with Axon Server for event routing and saga coordination. Many banking and insurance systems use Axon sagas for multi-step business processes like loan approval and claims processing.
| Aspect | Description |
|---|---|
| Eventual Consistency vs Atomicity | Sagas provide eventual consistency -- the system converges to a consistent state after all steps (or compensations) complete. But intermediate states are visible, unlike 2PC where the transaction is atomic (all or nothing, visible only after commit). Applications must handle the visibility of partial saga states. |
| Choreography vs Orchestration | Choreography is decentralized, loosely coupled, and naturally scalable. But with 5+ services, the event flow becomes hard to trace and debug ('event spaghetti'). Orchestration centralizes the flow in a coordinator, making it explicit and debuggable, but introduces a dependency and potential bottleneck. |
| Compensation Complexity | Not all operations are easily compensable. Sending a notification, calling an external API, or publishing to a partner system may have irreversible side effects. Design sagas to perform irreversible steps last, and accept that some compensations are approximations (e.g., send a 'correction' notification rather than truly unsending the original). |
| Saga State Management | Orchestration-based sagas require durable state persistence (which step is current, which have completed, what are the compensation parameters). This adds storage and operational overhead. Workflow engines (Temporal, Cadence) handle this automatically but add infrastructure dependencies. |
Uber's Cadence: From Fragile Scripts to Durable Workflows
Scenario
Uber's microservices architecture required coordinating complex multi-step business processes: ride booking (match, authorize, trip, charge), driver onboarding (background check, vehicle inspection, documentation), and payment processing (charge, split, transfer). Early implementations used ad-hoc scripts with retry logic, cron jobs, and database flags. These were fragile: a server crash mid-process left the saga in an inconsistent state, requiring manual intervention.
Solution
Uber built Cadence (later forked to Temporal) -- a durable workflow engine that provides saga orchestration as a core primitive. Developers write workflow code in Go or Java that looks like normal sequential code, while Cadence durably records each workflow step and activity result. If a worker or server crashes, the workflow is replayed from that recorded history: completed activities return their recorded results, and only activities that had not completed are invoked again. Compensation is expressed as normal code: a try/catch that triggers compensating activities on failure.
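The replay idea can be illustrated with a toy journal. This is not the Cadence or Temporal API; `journal`, `activity`, and the simulated crash are hypothetical, and a plain dictionary stands in for durable storage.

```python
from typing import Callable, Dict, List

journal: Dict[str, str] = {}  # stands in for durable storage
calls: List[str] = []  # records real activity invocations


def activity(name: str, fn: Callable[[], str]) -> str:
    """Run fn once; on replay, return the journaled result instead."""
    if name in journal:
        return journal[name]
    result = fn()
    journal[name] = result
    return result


def authorize() -> str:
    calls.append("authorize")
    return "auth-ok"


crash_once = {"armed": True}


def start_trip() -> str:
    calls.append("start_trip")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("worker crashed")  # simulated process crash
    return "trip-started"


def workflow() -> List[str]:
    # Reads like sequential code; durability comes from the journal.
    return [activity("authorize", authorize), activity("start_trip", start_trip)]


try:
    workflow()
except RuntimeError:
    pass  # first attempt dies mid-workflow

results = workflow()  # replay: 'authorize' is served from the journal
```

On the second run `authorize` is not invoked again because its result is already journaled, while `start_trip`, which never completed, is retried. Real engines add timeouts, retry policies, and deterministic-replay constraints that this sketch leaves out.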
Outcome
Cadence replaced thousands of fragile cron jobs and ad-hoc retry mechanisms across Uber's microservices. Workflow reliability improved from ~99% (manual escalations for the 1% that failed) to 99.99%. The average saga involving 5 services completes in 2-5 seconds with automatic compensation on failure. Temporal (the fork of Cadence) has been adopted by Stripe, Netflix, Snap, and Datadog for similar saga orchestration needs.