Saga Pattern: Distributed Transactions

Stripe processes a payment as a sequence of steps: (1) charge the card, (2) update the merchant balance, (3) send a webhook, (4) update analytics, (5) calculate the fee. If step 4 fails, the money has already been charged in step 1. How do you roll back something that has already happened? The Saga Pattern is the answer, without 2PC and without locks.

  • **Stripe** uses Saga for payment processing: each step is a local transaction with a compensation. On failure at any step, an automatic refund is issued.
  • **Uber** orchestrates the Trip through an internal workflow engine: trip creation, driver matching, tracking, payment - all as one Saga with compensations on cancellation.
  • **DoorDash** migrated to Temporal for order management: a delivery can take hours, and the workflow lives for the entire duration with automatic retries and escalations.

The Distributed Transaction Problem

In a monolith, an ACID transaction guarantees atomicity: all or nothing. In microservices, each service owns its own database, so a single transaction cannot span services with standard tooling. 2PC (Two-Phase Commit) technically solves this, but participants hold locks on resources during coordination, and if the coordinator fails they stay blocked until it recovers.

Hector Garcia-Molina and Kenneth Salem introduced sagas in 1987 for long-lived transactions. In the microservices era, the saga became a standard pattern for cross-service coordination.

Why is 2PC (Two-Phase Commit) rarely used in microservice architecture?

Choreography Saga

Choreography - each service knows which event to publish and which events to react to. There is no central coordinator: services communicate through events in a broker (Kafka, RabbitMQ). Loose coupling, but the overall flow is hard to track.

Choreography works well for simple flows with 2-3 steps. With 5+ steps it becomes difficult to answer 'what state is the saga in right now?' without additional monitoring.
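The event flow described above can be sketched with an in-memory stand-in for the broker. Everything here is hypothetical and for illustration only: the `EventBus` class, the event names (`OrderCreated`, `PaymentCharged`, and so on), and the two services are assumptions, not any real broker's API. The point is that no function sees the whole flow; each handler only knows which event it reacts to and which one it publishes.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a broker such as Kafka or RabbitMQ (assumption)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published event type, in publication order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_services(bus, inventory_available):
    # Payment service: reacts to OrderCreated, publishes PaymentCharged.
    def payment_on_order_created(order):
        bus.publish("PaymentCharged", order)

    # Inventory service: reacts to PaymentCharged, publishes success or failure.
    def inventory_on_payment_charged(order):
        if inventory_available:
            bus.publish("StockReserved", order)
        else:
            bus.publish("StockReservationFailed", order)

    # Payment service again: compensates by refunding when the reservation fails.
    def payment_on_reservation_failed(order):
        bus.publish("PaymentRefunded", order)

    bus.subscribe("OrderCreated", payment_on_order_created)
    bus.subscribe("PaymentCharged", inventory_on_payment_charged)
    bus.subscribe("StockReservationFailed", payment_on_reservation_failed)


def run_order(inventory_available):
    bus = EventBus()
    wire_services(bus, inventory_available)
    bus.publish("OrderCreated", {"order_id": "o-1"})
    return bus.log
```

In this sketch the only way to learn the state of the saga is to read the event log, which mirrors the observability problem in longer choreographed flows.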

The main drawback of Choreography Saga with many steps is that no single place holds the saga's current state, so the overall flow is hard to trace.

Orchestration Saga

Orchestration - a central Saga Orchestrator manages each step: it calls services, receives responses, and decides what to do next. The entire flow is visible in one place. Netflix, Uber, and Stripe use this approach.

The orchestrator must be idempotent and persistent: on restart it must restore state and continue from the last step. Temporal.io solves this: workflow code is written as regular code, and the platform handles persistence.
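A minimal sketch of a persistent orchestrator, assuming a SQLite table as the durable store. The step names, the `saga` table, and the `crash_before` parameter (used only to simulate a crash) are hypothetical. After each step the orchestrator records the index of the next step, so a restarted instance resumes instead of starting over.

```python
import sqlite3

STEPS = ["create_trip", "match_driver", "charge_payment"]  # hypothetical steps


def open_store(path=":memory:"):
    # A file path would survive a process restart; ":memory:" keeps the demo simple.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS saga "
        "(id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    db.commit()
    return db


def run_saga(db, saga_id, actions, crash_before=None):
    row = db.execute(
        "SELECT next_step FROM saga WHERE id = ?", (saga_id,)
    ).fetchone()
    if row is None:
        db.execute("INSERT INTO saga (id, next_step) VALUES (?, 0)", (saga_id,))
        db.commit()
        next_step = 0
    else:
        next_step = row[0]  # restored state: continue from the last recorded step

    for index in range(next_step, len(STEPS)):
        name = STEPS[index]
        if name == crash_before:
            raise RuntimeError(f"orchestrator crashed before {name}")
        actions[name](saga_id)
        # A crash between the action and this commit repeats the action on
        # restart, which is why each step should itself be idempotent.
        db.execute(
            "UPDATE saga SET next_step = ? WHERE id = ?", (index + 1, saga_id)
        )
        db.commit()
    return "completed"


def demo():
    db = open_store()
    executed = []
    actions = {n: (lambda saga_id, n=n: executed.append(n)) for n in STEPS}
    try:
        run_saga(db, "trip-1", actions, crash_before="charge_payment")
    except RuntimeError:
        pass  # simulated crash
    run_saga(db, "trip-1", actions)  # "restart": resumes at charge_payment
    return executed
```

With state held only in memory, the second `run_saga` call would have no way to know that the first two steps had already run.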

Why does saga state need to be stored in a database (persistently) rather than only in memory?

Compensating Transactions

A compensating transaction reverses the business effect of an already-committed local transaction. This is not a SQL ROLLBACK - the data is already committed. A compensation creates a new record that cancels the previous one: a refund instead of reversing a charge.

If a compensation also fails, a retry mechanism with backoff is needed. Eventually the system either compensates successfully or requires manual intervention (Dead Letter Queue, on-call alert).
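The retry-then-escalate behaviour can be sketched as follows. This is an illustration under stated assumptions: the `retry` and `run_saga` helpers, the status strings, and the `dead_letters` list (a stand-in for a Dead Letter Queue) are all hypothetical. Completed steps are compensated in reverse order, each compensation is retried with exponential backoff, and anything that still fails is parked for manual handling.

```python
import time


def retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def run_saga(steps, dead_letters, sleep=time.sleep):
    """steps is a list of (name, action, compensation) tuples."""
    done = []
    for name, action, compensation in steps:
        try:
            action()
            done.append((name, compensation))
        except Exception:
            status = "compensated"
            # Undo the already-committed steps, newest first.
            for done_name, compensate in reversed(done):
                try:
                    retry(compensate, sleep=sleep)
                except Exception:
                    dead_letters.append(done_name)  # escalate to a human
                    status = "needs_manual_intervention"
            return status
    return "completed"


def demo():
    ledger = []
    remaining_failures = {"refund": 1}  # the refund fails once, then succeeds

    def charge():
        ledger.append("charge")

    def refund():
        if remaining_failures["refund"] > 0:
            remaining_failures["refund"] -= 1
            raise ConnectionError("gateway timeout")
        ledger.append("refund")

    def reserve_stock():
        raise RuntimeError("out of stock")

    def release_stock():
        ledger.append("release")

    dead = []
    status = run_saga(
        [("charge", charge, refund), ("reserve", reserve_stock, release_stock)],
        dead,
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    return status, ledger, dead
```

Note that the failed step itself (`reserve`) is not compensated here, since it never committed; only steps that completed are undone.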

The compensating transaction for a card charge is a refund.

Temporal: Durable Workflow Engine

Temporal.io is a platform for durable workflows: workflow code is written as regular code (TypeScript/Go/Java), while Temporal automatically handles persistence, retries, timeouts, and recovery from failures. Used at Stripe, Netflix, DoorDash, and Instacart.

Temporal stores the full execution history of a workflow in its own event store. When a worker crashes, a new worker replays the history and continues from the last point. A workflow can run for hours, days, or weeks.
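The replay idea can be illustrated with a toy model. This is not the Temporal SDK: `WorkflowContext`, `execute_activity`, and the activity names are hypothetical, and a real event history records far more than activity results. The sketch only shows the core mechanism: on replay, results already in the history are returned without re-running the side effect, which also means the workflow code has to make the same calls in the same order every time.

```python
class WorkflowContext:
    """Toy replay context; the history list stands in for a durable event store."""

    def __init__(self, history):
        self.history = history  # recorded activity results, in call order
        self.position = 0       # index of the next activity call in this run

    def execute_activity(self, activity, *args):
        if self.position < len(self.history):
            result = self.history[self.position]  # replay: reuse recorded result
        else:
            result = activity(*args)              # first execution: run and record
            self.history.append(result)
        self.position += 1
        return result


def order_workflow(ctx, order_id, activities):
    payment_id = ctx.execute_activity(activities["charge"], order_id)
    courier = ctx.execute_activity(activities["assign_courier"], order_id)
    return f"{order_id}: {payment_id}, {courier}"


def demo():
    calls = []
    history = []

    def charge(order_id):
        calls.append("charge")
        return "pay-1"

    def crashing_assign(order_id):
        raise RuntimeError("worker crashed")

    def assign(order_id):
        calls.append("assign_courier")
        return "courier-7"

    try:
        order_workflow(
            WorkflowContext(history),
            "o-1",
            {"charge": charge, "assign_courier": crashing_assign},
        )
    except RuntimeError:
        pass  # first worker dies after the charge was recorded

    # A new worker replays the same history; the card is not charged again.
    result = order_workflow(
        WorkflowContext(history),
        "o-1",
        {"charge": charge, "assign_courier": assign},
    )
    return result, calls
```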

A common misconception is that Orchestration Saga and Choreography Saga are mutually exclusive. They can be combined: an orchestrator manages the top-level flow, while choreography through events handles the steps inside it.

Real systems are often hybrid: Temporal orchestrates the main order flow while status notifications propagate via Kafka (choreography). The deciding factor is flow complexity and observability requirements.

How does Temporal recover a workflow after a worker crash?

Key Ideas

  • **Choreography vs Orchestration** - choreography via events (loose coupling, hard to debug), orchestration via a central coordinator (whole flow visible, single control point).
  • **Compensations** are business operations, not SQL ROLLBACKs. They must be idempotent - the system may call them multiple times on retry.
  • **Temporal** removes much of the orchestration boilerplate: workflows are written as regular code while the platform handles persistence, retries, and recovery.
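The idempotency requirement on compensations can be sketched with an idempotency key. The `PaymentLedger` class and its method names are hypothetical; in practice the key lookup would be a database table with a unique constraint rather than a dictionary. A refund is recorded as a new ledger entry, and a repeated call with the same key returns the earlier entry instead of refunding twice.

```python
class PaymentLedger:
    """Append-only ledger; amounts are in cents, refunds are negative entries."""

    def __init__(self):
        self.entries = []
        self.refunds_by_key = {}  # idempotency key -> refund entry already made

    def charge(self, charge_id, amount_cents):
        self.entries.append(("charge", charge_id, amount_cents))

    def refund(self, charge_id, amount_cents, idempotency_key):
        # A retried compensation carries the same key and becomes a no-op.
        if idempotency_key in self.refunds_by_key:
            return self.refunds_by_key[idempotency_key]
        entry = ("refund", charge_id, -amount_cents)
        self.entries.append(entry)
        self.refunds_by_key[idempotency_key] = entry
        return entry

    def balance(self):
        return sum(amount for _, _, amount in self.entries)
```

A natural choice for the key is something derived from the saga and step, for example the saga ID plus the step name, so that every retry of the same compensation produces the same key.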

Related Topics

The Saga Pattern runs on top of event-driven architecture and requires reliable event delivery:

  • Event-Driven Architecture — Choreography Saga is built on events - understanding EDA is necessary for implementing choreography
  • Outbox Pattern and CDC — Outbox ensures reliable event publishing from Saga steps - critical for at-least-once delivery
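A minimal sketch of the Outbox idea, assuming SQLite and hypothetical `orders` and `outbox` tables. The business row and the event row are written in one local transaction, and a separate relay later publishes unpublished rows. The `publish` callback is a stand-in for a broker client.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def create_order(db, order_id):
    # One local transaction: both rows commit together or neither does.
    with db:
        db.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'created')", (order_id,)
        )
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(db, publish):
    """Publish pending outbox rows in order; returns how many were sent."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        # If the process dies before this update, the row is sent again on the
        # next pass: delivery is at-least-once, so consumers must deduplicate.
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

A CDC-based setup would replace the polling relay with a reader of the database's change log, but the transactional write in `create_order` stays the same.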

Questions for Reflection

  • How can idempotency of a compensating transaction be guaranteed when the external system (e.g. a bank) does not support idempotency keys?
  • At what number of steps in a saga should one switch from choreography to orchestration? What metrics help make that decision?
  • What happens if a compensating transaction also fails? How should an escalation policy be structured?

Related Lessons

  • dist-07-transactions