Saga Pattern: Distributed Transactions Without Two-Phase Commit

A customer places an order. Your system needs to: reserve inventory, charge the payment, and schedule shipping. These operations span three separate microservices with three separate databases. If the payment fails after inventory is reserved, you need to release the inventory. If shipping fails after payment is charged, you need to refund the payment.

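You cannot use a single database transaction across three services that each own their database. Two-phase commit (2PC) would work in theory, but it adds latency, holds locks while participants wait for the coordinator, and makes the coordinator a single point of failure. The Saga pattern is the practical alternative.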

What a Saga is

A Saga is a sequence of local transactions. Each step in the sequence updates one service’s database and publishes an event or sends a command to trigger the next step. If a step fails, the Saga executes compensating transactions to undo the previous steps.

Key insight: Instead of one atomic distributed transaction, you have a series of smaller transactions that are individually atomic. Consistency is achieved eventually through the sequence of transactions and compensations.
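
As a minimal sketch of this idea, the runner below executes a list of steps and, when one fails, runs the compensations of the completed steps in reverse order. The names `Step` and `run_saga` are hypothetical, and each "local transaction" is simulated by a function that appends to an in-memory log rather than by a real service call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # local transaction in one service
    compensate: Callable[[], None]   # new transaction that reverses it


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")


ok = run_saga([
    Step("inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("payment", fail_payment, lambda: log.append("refund")),
    Step("shipping", lambda: log.append("schedule"), lambda: log.append("cancel")),
])
print(ok, log)  # False ['reserve', 'release']
```

Only the inventory step completed, so only its compensation runs. The failed payment step is not compensated because it never took effect.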

Two Saga implementations

Choreography-based Saga

Services react to events, and there is no central coordinator. Each service listens for events and decides what to do next.

How it works:

Order service creates order, publishes OrderCreated event
Inventory service listens, reserves inventory, publishes InventoryReserved event
Payment service listens, charges payment, publishes PaymentCharged event
Shipping service listens, schedules shipping, publishes ShippingScheduled event

On failure:

Payment fails: payment service publishes PaymentFailed event
Inventory service listens, releases reservation, publishes InventoryReleased event
Order service listens, marks order as failed

graph LR
subgraph choreography["Choreography Saga - Event-Driven"]
  OS["Order service"] -->|"OrderCreated"| INV["Inventory service"]
  INV -->|"InventoryReserved"| PAY["Payment service"]
  PAY -->|"PaymentCharged"| SHIP["Shipping service"]
  SHIP -->|"ShippingScheduled"| OS
  PAY -->|"PaymentFailed"| INV2["Inventory service<br/>(compensate)"]
  INV2 -->|"InventoryReleased"| OS2["Order service<br/>(mark failed)"]
end

style OS fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style INV fill:#E1F5EE,stroke:#0F6E56,color:#085041
style PAY fill:#E1F5EE,stroke:#0F6E56,color:#085041
style SHIP fill:#E1F5EE,stroke:#0F6E56,color:#085041
style INV2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style OS2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
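
The event flow above can be sketched with an in-process publish/subscribe table. This is an illustration only: a real system would use a message broker, and the handler names and the `card_ok` flag used to simulate a declined payment are assumptions, not part of any real API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
published: List[str] = []   # event names in publish order, for inspection only
orders: Dict[str, str] = {}  # order service's local state


def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    handlers[event_type].append(handler)


def publish(event_type: str, payload: dict) -> None:
    published.append(event_type)
    for handler in handlers[event_type]:
        handler(payload)


def payment_on_inventory_reserved(event: dict) -> None:
    # The payment service decides locally which event to publish next.
    publish("PaymentCharged" if event["card_ok"] else "PaymentFailed", event)


def set_order_status(status: str) -> Callable[[dict], None]:
    def handler(event: dict) -> None:
        orders[event["order_id"]] = status
    return handler


# Each service subscribes only to the events it cares about.
subscribe("OrderCreated", lambda e: publish("InventoryReserved", e))
subscribe("InventoryReserved", payment_on_inventory_reserved)
subscribe("PaymentCharged", lambda e: publish("ShippingScheduled", e))
subscribe("ShippingScheduled", set_order_status("completed"))
subscribe("PaymentFailed", lambda e: publish("InventoryReleased", e))
subscribe("InventoryReleased", set_order_status("failed"))


def create_order(order_id: str, card_ok: bool) -> None:
    orders[order_id] = "pending"
    publish("OrderCreated", {"order_id": order_id, "card_ok": card_ok})


create_order("o-1", card_ok=True)
create_order("o-2", card_ok=False)
print(orders)  # {'o-1': 'completed', 'o-2': 'failed'}
```

Note that no single function describes the whole workflow. The sequence exists only as the sum of the subscriptions, which is why overall Saga state is hard to track in this style.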

Pros: Simple, no central coordinator, services are loosely coupled.

Cons: Hard to track the overall state of a Saga. Difficult to debug. Cyclic dependencies can emerge. Hard to add new steps.

Orchestration-based Saga

A central coordinator (the Saga orchestrator) tells each service what to do. The orchestrator tracks the state of the Saga and handles failures.

How it works:

Order service creates order, starts the Saga orchestrator
Orchestrator sends ReserveInventory command to inventory service
Inventory service reserves inventory, replies with success
Orchestrator sends ChargePayment command to payment service
Payment service charges payment, replies with success
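Orchestrator sends ScheduleShipping command to shipping service
Shipping service schedules shipping, replies with success, and the orchestrator marks the Saga as completed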

On failure:

Payment fails: payment service replies with failure
Orchestrator sends ReleaseInventory command to inventory service
Orchestrator marks the Saga as failed

graph TB
subgraph orchestration["Orchestration Saga - Central Coordinator"]
  ORCH["Saga Orchestrator<br/>Tracks state<br/>Sends commands"]
  ORCH -->|"ReserveInventory"| INV3["Inventory service"]
  INV3 -->|"success"| ORCH
  ORCH -->|"ChargePayment"| PAY2["Payment service"]
  PAY2 -->|"failure"| ORCH
  ORCH -->|"ReleaseInventory<br/>(compensate)"| INV3
  ORCH -->|"MarkFailed"| OS3["Order service"]
end

style ORCH fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style INV3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style PAY2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
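
One way to picture the orchestrator is as a small state machine: each reply from a service moves the Saga to a new state and may trigger the next command. The sketch below assumes hypothetical state names and a `RefundPayment` command for the shipping-failure path, and it records commands in a list instead of sending them over a network.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class State(Enum):
    RESERVING_INVENTORY = 1
    CHARGING_PAYMENT = 2
    SCHEDULING_SHIPPING = 3
    REFUNDING_PAYMENT = 4
    RELEASING_INVENTORY = 5
    COMPLETED = 6
    FAILED = 7


# (current state, reply was a success?) -> (next state, command to send or None)
TRANSITIONS: Dict[Tuple[State, bool], Tuple[State, Optional[str]]] = {
    (State.RESERVING_INVENTORY, True): (State.CHARGING_PAYMENT, "ChargePayment"),
    (State.RESERVING_INVENTORY, False): (State.FAILED, None),
    (State.CHARGING_PAYMENT, True): (State.SCHEDULING_SHIPPING, "ScheduleShipping"),
    (State.CHARGING_PAYMENT, False): (State.RELEASING_INVENTORY, "ReleaseInventory"),
    (State.SCHEDULING_SHIPPING, True): (State.COMPLETED, None),
    (State.SCHEDULING_SHIPPING, False): (State.REFUNDING_PAYMENT, "RefundPayment"),
    (State.REFUNDING_PAYMENT, True): (State.RELEASING_INVENTORY, "ReleaseInventory"),
    (State.RELEASING_INVENTORY, True): (State.FAILED, None),
}


class OrderSaga:
    def __init__(self) -> None:
        self.state = State.RESERVING_INVENTORY
        self.sent: List[str] = ["ReserveInventory"]  # commands issued so far

    def on_reply(self, success: bool) -> None:
        key = (self.state, success)
        if key not in TRANSITIONS:
            # e.g. a failed compensation; not handled in this sketch
            raise RuntimeError(f"no transition for {key}; needs manual handling")
        self.state, command = TRANSITIONS[key]
        if command is not None:
            self.sent.append(command)


saga = OrderSaga()
saga.on_reply(True)    # inventory reserved
saga.on_reply(False)   # payment failed
saga.on_reply(True)    # inventory released
print(saga.state.name, saga.sent)
# FAILED ['ReserveInventory', 'ChargePayment', 'ReleaseInventory']
```

Because the whole workflow lives in one transition table, adding a step means adding rows here rather than rewiring subscriptions across services.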

Pros: Clear visibility into Saga state. Easier to debug. Easier to add new steps. Business logic is centralized.

Cons: The orchestrator is a central point of coupling and must itself be highly available. Services must accept its commands and reply to it.

Compensating transactions

A compensating transaction undoes the effect of a previous transaction. It is not a rollback - it is a new transaction that reverses the effect.

Examples:

Reserve inventory -> Release inventory
Charge payment -> Refund payment
Send email -> Cannot unsend, but can send a cancellation email
Create shipping label -> Cancel shipping label

Not all operations can be compensated: a sent email cannot be unsent. For these, use a “pivot transaction” - a point of no return. Operations before the pivot can be compensated. Operations after the pivot cannot be rolled back, so they must be retried until they succeed.
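
A rough sketch of how a pivot changes failure handling: failures up to and including the pivot trigger compensation, while failures after it trigger retries. The function `run_with_pivot` is hypothetical, and treating the payment charge as the pivot with a confirmation email as the retriable step is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[], None]


def run_with_pivot(
    compensatable: List[Tuple[Action, Action]],  # (action, compensation) pairs
    pivot: Action,                               # point of no return
    retriable: List[Action],                     # must eventually succeed
    max_attempts: int = 5,
) -> str:
    undo: List[Action] = []
    try:
        for action, compensate in compensatable:
            action()
            undo.append(compensate)
        pivot()
    except Exception:
        # Nothing is committed yet, so undo completed steps in reverse order.
        for compensate in reversed(undo):
            compensate()
        return "compensated"

    # Past the pivot: no compensation, only retries.
    for action in retriable:
        for _ in range(max_attempts):
            try:
                action()
                break
            except Exception:
                continue
        else:
            return "needs_manual_intervention"
    return "completed"


events: List[str] = []
attempts: Dict[str, int] = {"email": 0}


def send_email() -> None:
    attempts["email"] += 1
    if attempts["email"] < 3:  # fail twice, then succeed
        raise ConnectionError("mail server unavailable")
    events.append("email sent")


result = run_with_pivot(
    compensatable=[(lambda: events.append("reserve"), lambda: events.append("release"))],
    pivot=lambda: events.append("charge"),
    retriable=[send_email],
)
print(result, events)  # completed ['reserve', 'charge', 'email sent']
```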

Where it breaks or gets interesting

Saga state persistence

The orchestrator must persist its state. If it crashes mid-Saga, it must be able to resume from where it left off. Store Saga state in a database. Use the outbox pattern so that the state update and the outgoing message are committed in one local transaction, and a separate relay publishes the message afterwards.
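
The following sketch shows the shape of the outbox pattern using an in-memory SQLite database. The table layout and the `advance` and `relay` function names are assumptions for illustration; a production orchestrator would use its own schema and a real message broker.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE saga (saga_id TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        saga_id TEXT NOT NULL,
        command TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def advance(saga_id: str, new_state: str, command: str, payload: dict) -> None:
    """Write the new Saga state and the outgoing command in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT OR REPLACE INTO saga (saga_id, state) VALUES (?, ?)",
            (saga_id, new_state),
        )
        conn.execute(
            "INSERT INTO outbox (saga_id, command, payload) VALUES (?, ?, ?)",
            (saga_id, command, json.dumps(payload)),
        )


def relay(publish: Callable[[str, dict], None]) -> int:
    """Publish unsent rows. A crash between publish and UPDATE causes a re-publish."""
    rows = conn.execute(
        "SELECT id, command, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, command, payload in rows:
        publish(command, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
advance("saga-1", "reserving_inventory", "ReserveInventory", {"order_id": "o-1"})
relay(lambda command, payload: sent.append((command, payload)))
print(sent)  # [('ReserveInventory', {'order_id': 'o-1'})]
```

The relay gives at-least-once delivery: a command may be published twice but is never lost, so the receiving service has to tolerate duplicates.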

Idempotency is critical

Steps in a Saga may be retried. If the inventory service receives ReserveInventory twice (due to a retry), it should not reserve twice. Every step must be idempotent. Use idempotency keys (the Saga ID + step ID) to detect and skip duplicate requests.
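
A minimal sketch of duplicate detection with an idempotency key built from the Saga ID and step ID. The `InventoryService` class is hypothetical and keeps its state in memory; a real service would store the key and the stock change in the same local database transaction.

```python
from typing import Dict


class InventoryService:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.processed: Dict[str, str] = {}  # idempotency key -> stored result

    def reserve(self, saga_id: str, step_id: str, quantity: int) -> str:
        key = f"{saga_id}:{step_id}"
        if key in self.processed:
            # Duplicate delivery: skip the work and return the earlier result.
            return self.processed[key]
        if quantity > self.stock:
            result = "rejected"
        else:
            self.stock -= quantity
            result = "reserved"
        self.processed[key] = result
        return result


service = InventoryService(stock=10)
first = service.reserve("saga-1", "reserve-inventory", 3)
retry = service.reserve("saga-1", "reserve-inventory", 3)  # same key, retried
print(first, retry, service.stock)  # reserved reserved 7
```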

Isolation: the lack thereof

Database transactions provide isolation - concurrent transactions do not see each other’s intermediate state. Sagas do not. Between steps, the data is in an intermediate state that other transactions can see.

Example: inventory is reserved (step 1) but payment has not been charged yet (step 2). Another transaction might see the reserved inventory and make decisions based on it. This is called a “dirty read” at the Saga level.

Mitigations: use semantic locks (mark records as “pending” during the Saga), use optimistic locking (check that the state has not changed before each step), or accept the inconsistency and design the business logic to handle it.
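
The first two mitigations can be sketched together on a single record. The `OrderRecord` type, the `PENDING` status used as a semantic lock, and the `version` counter used for optimistic locking are all illustrative assumptions.

```python
from dataclasses import dataclass


class StaleVersionError(Exception):
    """Raised when the record changed since the caller last read it."""


@dataclass
class OrderRecord:
    status: str = "PENDING"  # semantic lock: a Saga is still working on this order
    version: int = 0


def update_status(record: OrderRecord, expected_version: int, new_status: str) -> None:
    """Optimistic lock: apply the change only if nobody modified the record."""
    if record.version != expected_version:
        raise StaleVersionError(
            f"expected version {expected_version}, found {record.version}"
        )
    record.status = new_status
    record.version += 1


def can_cancel(record: OrderRecord) -> bool:
    """Other transactions treat PENDING as in flight and back off."""
    return record.status != "PENDING"


order = OrderRecord()
seen_version = order.version          # two writers both read version 0
blocked_while_pending = not can_cancel(order)

update_status(order, seen_version, "APPROVED")  # first writer wins
try:
    update_status(order, seen_version, "CANCELLED")  # second writer is stale
    stale_rejected = False
except StaleVersionError:
    stale_rejected = True

print(blocked_while_pending, order.status, stale_rejected)  # True APPROVED True
```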

Long-running Sagas

A Saga that takes minutes or hours (e.g., a multi-step approval workflow) must handle timeouts. If a step does not complete within a timeout, the Saga should either retry or compensate. Use a timeout mechanism in the orchestrator.
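
One possible shape for the timeout check is a function the orchestrator calls periodically for each step that has not replied. The `PendingStep` fields and the decision strings are hypothetical; the current time is passed in explicitly so the logic can be exercised without waiting.

```python
from dataclasses import dataclass


@dataclass
class PendingStep:
    name: str
    started_at: float   # seconds, from whatever clock the orchestrator uses
    timeout_s: float
    retries_left: int


def on_tick(step: PendingStep, now: float) -> str:
    """Return the orchestrator's decision for a step that has not replied yet."""
    if now - step.started_at < step.timeout_s:
        return "wait"
    if step.retries_left > 0:
        step.retries_left -= 1
        step.started_at = now  # restart the timer for the retry
        return "retry"
    return "compensate"


step = PendingStep("ChargePayment", started_at=0.0, timeout_s=30.0, retries_left=1)
decisions = [on_tick(step, t) for t in (10.0, 31.0, 40.0, 62.0)]
print(decisions)  # ['wait', 'retry', 'wait', 'compensate']
```

Retrying a timed-out step is only safe if the step is idempotent, because the original request may still arrive late.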

Real-world systems

Eventuate Tram - Java framework for transactional messaging, with the companion Eventuate Tram Sagas library for orchestration. Choreography can be built on its domain events. Built-in support for the outbox pattern.

Temporal - Workflow orchestration platform. Handles Saga state persistence, retries, and timeouts automatically. Grew out of Uber’s Cadence project; publicly cited users include Netflix and Stripe.

AWS Step Functions - Managed workflow service. Orchestrates Lambda functions and other AWS services. Supports error handling and compensation.

Axon Framework - Java framework for event-driven microservices with built-in Saga support.

Netflix Conductor - Open-source workflow orchestration engine, originally built at Netflix for complex multi-step workflows.

How to apply it in practice

Choosing choreography vs orchestration

Use choreography when:

The workflow is simple (3-4 steps)
Services are truly independent
You want loose coupling
The team is comfortable with event-driven design

Use orchestration when:

The workflow is complex (5+ steps)
You need clear visibility into workflow state
You need to handle complex failure scenarios
You want centralized business logic

For most production systems, orchestration is easier to reason about and debug.

Designing compensating transactions

For every step in the Saga, design the compensating transaction before implementing the step. Ask: “If this step succeeds but a later step fails, how do I undo this?”

Some operations cannot be compensated (sending an email, publishing a public post). For these, place them at the end of the Saga (after the point of no return) or accept that they cannot be undone.

Testing Sagas

Test the happy path (all steps succeed) and every failure scenario (each step fails). Verify that compensating transactions correctly undo the previous steps. Use chaos engineering to inject failures and verify the Saga handles them correctly.
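
A small sketch of the "fail each step in turn" approach. The toy `run_order_saga` function stands in for the Saga under test and takes a hypothetical `fail_at` parameter for failure injection; with a real Saga the same loop would drive fakes or fault-injecting stubs of each service.

```python
from typing import List, Optional, Tuple

STEPS = ["inventory", "payment", "shipping"]


def run_order_saga(fail_at: Optional[str]) -> Tuple[bool, List[str]]:
    """Toy Saga under test: returns (succeeded, trace of actions taken)."""
    trace: List[str] = []
    done: List[str] = []
    for step in STEPS:
        if step == fail_at:
            # Injected failure: compensate completed steps in reverse order.
            for previous in reversed(done):
                trace.append(f"undo:{previous}")
            return False, trace
        trace.append(f"do:{step}")
        done.append(step)
    return True, trace


def check_all_scenarios() -> int:
    """Assert the happy path and one failure scenario per step."""
    ok, trace = run_order_saga(fail_at=None)
    assert ok and trace == [f"do:{s}" for s in STEPS]

    for i, failing in enumerate(STEPS):
        ok, trace = run_order_saga(fail_at=failing)
        before = STEPS[:i]
        expected = [f"do:{s}" for s in before] + [f"undo:{s}" for s in reversed(before)]
        assert not ok and trace == expected, (failing, trace)
    return 1 + len(STEPS)


scenarios_checked = check_all_scenarios()
print(scenarios_checked)  # 4
```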

FAQ

Q: Is a Saga eventually consistent?

Yes. Between steps, the system is in an intermediate state. Other transactions might see this intermediate state. The system reaches a consistent state only after the Saga completes (either successfully or after all compensations). This is weaker than ACID isolation but is the practical tradeoff for distributed systems.

Q: What is the difference between a Saga and a workflow?

A Saga is specifically about managing distributed transactions with compensating actions. A workflow is a more general concept for orchestrating a sequence of steps. Sagas are a type of workflow. Workflow engines (Temporal, AWS Step Functions) can implement Sagas, but they also support workflows that do not need compensation (approval workflows, data pipelines).

Q: How do you handle a Saga that is stuck (a step never completes)?

Use timeouts. If a step does not complete within a timeout, the orchestrator retries or compensates. For external services that might be slow, use a generous timeout (minutes, not seconds). For steps that should complete quickly, use a short timeout. Monitor stuck Sagas and alert when they exceed a maximum duration. Provide a manual intervention mechanism for Sagas that cannot be automatically resolved.

Interview questions

Q1: Design the Saga for an e-commerce order: reserve inventory, charge payment, schedule shipping. What are the compensating transactions?

Strong answer: Use orchestration. The Saga orchestrator manages the state. Steps: 1) Reserve inventory (compensating: release inventory). 2) Charge payment (compensating: refund payment). 3) Schedule shipping (compensating: cancel shipping). Failure scenarios: if inventory reservation fails, mark order as failed (no compensation needed, nothing was done). If payment fails, release inventory, mark order as failed. If shipping fails, refund payment, release inventory, mark order as failed. The orchestrator persists its state after each step. If it crashes, it resumes from the last completed step. All steps are idempotent (use Saga ID + step ID as idempotency key). Because every step here has a compensation, the Saga can be undone until shipping is scheduled. An alternative design treats the payment step as the pivot: once payment is charged, we are committed to completing the order, so a shipping failure is retried rather than refunded.

Q2: Your Saga orchestrator crashes mid-Saga. How do you ensure the Saga completes correctly?

Strong answer: The orchestrator must persist its state durably before each step. Use the outbox pattern: in the same database transaction that updates the Saga state, insert a command into the outbox table. The outbox relay publishes the command to the target service. If the orchestrator crashes after committing that transaction but before the command is published, the outbox relay will publish it on restart. If the relay crashes after publishing but before marking the outbox row as sent, it will re-publish on restart, so all steps must be idempotent. When the orchestrator restarts, it reads its persisted state and resumes from the last completed step. Use a distributed lock to prevent multiple orchestrator instances from running the same Saga simultaneously.

Q3: How do you handle the case where a compensating transaction also fails?

Strong answer: This is the hardest case in Saga design. If the compensating transaction fails, you have a partially compensated Saga - some steps are undone, some are not. First, retry the compensating transaction with exponential backoff (compensating transactions should be idempotent, so retrying is safe, and most transient failures clear eventually). If retries are exhausted, move the Saga to a “manual intervention required” state and alert an operator. Provide tooling for operators to manually complete or compensate the Saga. For critical operations (payment refunds), ensure the compensating transaction is highly reliable (use a dedicated refund service with its own retry logic). Design compensating transactions to be simpler and more reliable than the forward transactions.
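
The retry-then-escalate approach can be sketched as below. The function name, the default delays, and the status strings are assumptions; the `sleep` parameter is injectable so the backoff schedule can be observed without actually waiting.

```python
import time
from typing import Callable, Dict, List


def compensate_with_retry(
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry an idempotent compensation with exponential backoff, then escalate."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return "compensated"
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return "manual_intervention_required"


delays: List[float] = []
calls: Dict[str, int] = {"refund": 0}


def flaky_refund() -> None:
    calls["refund"] += 1
    if calls["refund"] < 3:  # fail twice, then succeed
        raise TimeoutError("refund service timed out")


status = compensate_with_retry(flaky_refund, sleep=delays.append)
print(status, delays)  # compensated [0.5, 1.0]
```

In practice the escalation branch would also persist the Saga's new state and raise an alert, so an operator can find and resolve it.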