Saga Pattern
§1 Step 2: High-Level Design
Manage distributed transactions without two-phase commit. Choreography vs orchestration sagas.
Add Kafka to carry saga step events between the checkout API and downstream microservices.
Kafka carries the saga events between participants: OrderPlaced, InventoryReserved, PaymentCharged, and their compensating events (InventoryReleased, PaymentRefunded) for rollback.
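A checkout involves multiple services (inventory, payment, fulfillment) that each have their own database. Two-phase commit (2PC) across them is fragile: participants hold locks while they wait on the coordinator, so one slow or failed node can stall the whole checkout. The Saga pattern instead runs a sequence of local transactions, each paired with a compensating transaction. If payment fails, the saga publishes InventoryReleased to roll back the reservation.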
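The compensation flow described above can be sketched with an in-memory stand-in for Kafka. This is a minimal illustration, not a production design: `InMemoryBus`, the handler names, the `card_declined` flag, and the `PaymentFailed` event are assumptions introduced for the example (the text only names the success and compensation events).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    type: str                       # e.g. "OrderPlaced", "InventoryReleased"
    order_id: str
    payload: dict = field(default_factory=dict)


class InMemoryBus:
    """Hypothetical stand-in for a Kafka topic: synchronous fan-out to subscribers."""

    def __init__(self) -> None:
        self.handlers = defaultdict(list)
        self.log: list[Event] = []  # every event published, in order

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event: Event) -> None:
        self.log.append(event)
        for handler in list(self.handlers[event.type]):
            handler(event)


bus = InMemoryBus()
reserved: set[str] = set()          # order ids that currently hold stock


def on_order_placed(e: Event) -> None:          # inventory-svc
    reserved.add(e.order_id)
    bus.publish(Event("InventoryReserved", e.order_id, e.payload))


def on_inventory_reserved(e: Event) -> None:    # payment-svc
    outcome = "PaymentFailed" if e.payload.get("card_declined") else "PaymentCharged"
    bus.publish(Event(outcome, e.order_id, e.payload))


def on_payment_failed(e: Event) -> None:        # inventory-svc compensation
    reserved.discard(e.order_id)
    bus.publish(Event("InventoryReleased", e.order_id))


bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("InventoryReserved", on_inventory_reserved)
bus.subscribe("PaymentFailed", on_payment_failed)

bus.publish(Event("OrderPlaced", "o-1", {"card_declined": True}))
trace = [e.type for e in bus.log]
```

Each service reacts only to events and touches only its own state, which is the property that lets the participants keep separate databases.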
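Sagas are eventually consistent: there is a brief window where inventory is reserved but payment has not yet been processed. For inventory, this means 'soft' reservations that expire if the saga never completes. Compensating transactions must be idempotent, because a retried or redelivered event can trigger the same compensation more than once.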
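A minimal sketch of both ideas, assuming a single-process inventory with one SKU per order. The `Inventory` class, its method names, and the 600-second TTL are illustrative assumptions, not part of any real service.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reservation:
    order_id: str
    sku: str
    qty: int
    expires_at: float               # soft reservation lapses after this time


class Inventory:
    def __init__(self, stock: dict, ttl_seconds: float = 600.0) -> None:
        self.stock = dict(stock)
        self.ttl = ttl_seconds
        self.reservations: dict = {}    # order_id -> Reservation

    def reserve(self, order_id: str, sku: str, qty: int,
                now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if order_id in self.reservations:   # duplicate delivery: already held
            return True
        if self.stock.get(sku, 0) < qty:
            return False
        self.stock[sku] -= qty
        self.reservations[order_id] = Reservation(order_id, sku, qty, now + self.ttl)
        return True

    def release(self, order_id: str) -> bool:
        """Idempotent compensation: stock is returned once, repeats are no-ops."""
        res = self.reservations.pop(order_id, None)
        if res is None:
            return False
        self.stock[res.sku] += res.qty
        return True

    def expire(self, now: Optional[float] = None) -> list:
        """Release every reservation whose TTL has passed; returns their order ids."""
        now = time.time() if now is None else now
        stale = [oid for oid, r in self.reservations.items() if r.expires_at <= now]
        for oid in stale:
            self.release(oid)
        return stale


inv = Inventory({"sku-1": 5})
inv.reserve("o-1", "sku-1", 2, now=0.0)
first = inv.release("o-1")          # returns the 2 units
second = inv.release("o-1")         # redelivered event: nothing happens
```

Keying the reservation on the order id is what makes `release` safe to repeat. A real service would also need to remember released orders so that a late, redelivered reserve request does not hold stock again.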
Shopify uses Saga-like patterns for order processing. Amazon uses sagas for order fulfillment. Temporal.io provides a framework for durable saga orchestration.
1K orders/second × 5 saga events/order = 5K events/second Kafka throughput. At 1KB per event, that's 5MB/second — trivially handled by a 3-broker Kafka cluster.
Add Postgres to store the saga instance state — which steps are complete, which are pending, and any compensation history.
Postgres stores the authoritative saga state: which steps have completed, which are in progress, what data was used at each step, and the compensation history for rolled-back sagas.
Without persistent saga state, if the orchestrator crashes after inventory is reserved but before payment is charged, the saga is lost — inventory stays reserved forever. Postgres stores the saga state so the orchestrator can resume or compensate after restart.
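The recovery path can be sketched as follows. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for Postgres; the table layout, the status values, and the function names are assumptions made for illustration.

```python
import json
import sqlite3

STEPS = ["reserve_inventory", "charge_payment", "send_notification"]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS saga (
               saga_id   TEXT PRIMARY KEY,
               status    TEXT NOT NULL,    -- RUNNING or DONE
               completed TEXT NOT NULL     -- JSON list of finished steps
           )"""
    )
    return db


def start_saga(db: sqlite3.Connection, saga_id: str) -> None:
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO saga VALUES (?, 'RUNNING', '[]')", (saga_id,))


def mark_completed(db: sqlite3.Connection, saga_id: str, step: str) -> None:
    """Record a finished step before the orchestrator moves on to the next one."""
    with db:
        (raw,) = db.execute(
            "SELECT completed FROM saga WHERE saga_id = ?", (saga_id,)
        ).fetchone()
        done = json.loads(raw) + [step]
        status = "DONE" if done == STEPS else "RUNNING"
        db.execute(
            "UPDATE saga SET completed = ?, status = ? WHERE saga_id = ?",
            (json.dumps(done), status, saga_id),
        )


def recover(db: sqlite3.Connection) -> dict:
    """After a restart, map each unfinished saga to the next step it should run."""
    plan = {}
    rows = db.execute("SELECT saga_id, completed FROM saga WHERE status = 'RUNNING'")
    for saga_id, raw in rows:
        plan[saga_id] = STEPS[len(json.loads(raw))]
    return plan


db = open_store()
start_saga(db, "s-1")
mark_completed(db, "s-1", "reserve_inventory")
# Simulated crash here: inventory is reserved, payment has not been charged.
plan = recover(db)
```

With the completed steps on disk, the restarted orchestrator can either resume at the returned step or walk the completed list in reverse and compensate.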
Each saga participant (inventory-svc, payment-svc) has its own local Postgres for its state. Saga coordination happens via Kafka events — not a shared database. This is the key difference from monolithic transaction management.
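Eventuate Tram Sagas (framework) keeps saga state in a relational database such as Postgres or MySQL. Temporal.io stores workflow state in Cassandra, MySQL, or Postgres. NestJS's CQRS module models sagas as event streams and leaves any state persistence (for example in Redis or Postgres) to the application.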
1K sagas/second × 500 bytes per saga record = 500KB/second Postgres throughput. Small by any standard. Sagas complete in seconds — active saga table is small (< 1M rows at any time).
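The back-of-envelope numbers for Kafka and Postgres can be reproduced in a few lines. The 5-second average saga duration is an assumption (the text only says sagas complete in seconds), and 1 KB is taken as 1,000 bytes.

```python
ORDERS_PER_SEC = 1_000
EVENTS_PER_ORDER = 5
EVENT_BYTES = 1_000            # 1 KB per event
SAGA_RECORD_BYTES = 500
AVG_SAGA_SECONDS = 5           # assumption: "sagas complete in seconds"
SECONDS_PER_DAY = 86_400

# Kafka: events and bytes per second, plus the daily volume that retention must hold.
kafka_events_per_sec = ORDERS_PER_SEC * EVENTS_PER_ORDER
kafka_bytes_per_sec = kafka_events_per_sec * EVENT_BYTES
kafka_gb_per_day = kafka_bytes_per_sec * SECONDS_PER_DAY / 1e9

# Postgres: write throughput for saga records.
postgres_bytes_per_sec = ORDERS_PER_SEC * SAGA_RECORD_BYTES

# Little's law: rows in flight = arrival rate x average time in the system.
active_sagas = ORDERS_PER_SEC * AVG_SAGA_SECONDS

print(kafka_events_per_sec, kafka_bytes_per_sec, kafka_gb_per_day)
print(postgres_bytes_per_sec, active_sagas)
```

Under these assumptions the active saga table holds about 5,000 rows, far below the 1M-row bound, so the bound only becomes relevant if sagas stall for many minutes.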
§2 Step 3: Deep Dive
| Pattern | Availability | Consistency | Complexity | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Saga (orchestration) | High (no blocking) | Eventual | Medium | Microservices, long-lived txns ✓ | Medium | High |
| Saga (choreography) | High | Eventual | Low (hard to debug) | Simple 2-3 step sagas | Medium | Medium |
| 2PC (two-phase commit) | Low (coordinator blocks) | Strong (ACID) | Medium | Co-located DBs, same vendor | Low | High |
| Outbox + eventual | High | Eventual (at-least-once) | Low | Event-driven, publish/subscribe | Medium | Medium |
| TCC (try-confirm-cancel) | High | Strong-eventual | High | Financial, inventory hold-confirm | Low | High |
Distributed transaction patterns — Saga for availability, 2PC for strong consistency.
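
    import logging

    logger = logging.getLogger(__name__)

    # inventory_service, payment_service, and notification_service are assumed to be
    # client objects for the participant services, provided elsewhere in the application.


    class OrderSaga:
        # Maps each forward step to the step that undoes it (None = nothing to undo).
        COMPENSATIONS = {
            'reserve_inventory': 'release_inventory',
            'charge_payment': 'refund_payment',
            'send_notification': None,
        }

        def __init__(self, order: dict):
            self.order = order
            self.completed = []  # forward steps that succeeded, in order

        def execute(self) -> bool:
            """Run each step in order; on the first failure, compensate and return False."""
            steps = ['reserve_inventory', 'charge_payment', 'send_notification']
            for step in steps:
                try:
                    getattr(self, f"_do_{step}")()
                    self.completed.append(step)
                except Exception as e:
                    logger.error("Step %s failed: %s", step, e)
                    self._compensate()
                    return False
            return True

        def _compensate(self):
            """Undo completed steps in reverse order, continuing past failed compensations."""
            for step in reversed(self.completed):
                comp = self.COMPENSATIONS.get(step)
                if comp:
                    try:
                        getattr(self, f"_do_{comp}")()
                    except Exception as e:
                        logger.error("Compensation %s failed: %s", comp, e)

        def _do_reserve_inventory(self):
            inventory_service.reserve(self.order['items'])

        def _do_charge_payment(self):
            payment_service.charge(self.order['user_id'], self.order['total'])

        def _do_send_notification(self):
            # Hypothetical client call: the notification API is an assumption.
            notification_service.notify(self.order['user_id'])

        def _do_release_inventory(self):
            inventory_service.release(self.order['items'])

        def _do_refund_payment(self):
            payment_service.refund(self.order['user_id'], self.order['total'])

| Component | Why Add It | Tradeoff |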
|---|---|---|
| Message Queue for Saga Events | A checkout involves multiple services (inventory, payment, fulfillment) that each have their own database. | Sagas are eventually consistent — there's a brief window where inventory is reserved but payment hasn't processed. |
| Postgres for Saga State | Without persistent saga state, if the orchestrator crashes after inventory is reserved but before payment is charged, the saga is lost — inventory stays reserved forever. | Each saga participant (inventory-svc, payment-svc) has its own local Postgres for its state. |
Design decision tradeoffs
Payment fails after inventory is reserved. The saga's compensating transaction must release the inventory reservation.
The order saga fails at fulfillment-svc and needs to compensate by refunding payment. But payment-svc is unreachable due to a network partition. The customer is charged but the order is cancelled. How do you implement a compensation dead-letter queue and retry with exponential backoff?
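One possible shape for an answer, as a sketch: retry the compensation with capped exponential backoff and jitter, and park it in a dead-letter list once the attempts run out. The function name, the parameters, and the use of a plain list as the dead-letter queue are all assumptions for illustration; in the architecture above the dead-letter queue would more likely be a dedicated Kafka topic.

```python
import random
import time
from typing import Callable


def compensate_with_retry(
    action: Callable[[], None],
    saga_id: str,
    dead_letters: list,
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run a compensation; back off between failures; dead-letter when exhausted."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            action()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff (0.5s, 1s, 2s, ...) capped at `cap`,
                # with full jitter so retries from many sagas do not align.
                delay = min(cap, base * 2 ** attempt)
                sleep(random.uniform(0, delay))
    # Exhausted: park the saga for a background worker or an operator.
    dead_letters.append(
        {"saga_id": saga_id, "error": repr(last_error), "attempts": max_attempts}
    )
    return False


def refund_unreachable() -> None:
    raise ConnectionError("payment-svc unreachable")


dlq: list = []
ok = compensate_with_retry(
    refund_unreachable, "saga-42", dlq, max_attempts=3, sleep=lambda s: None
)
```

Because the refund may be attempted several times, and again later when the dead-lettered entry is replayed, this only works if the refund itself is idempotent, for example keyed on the saga id.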
A flash sale triggers 10K simultaneous order sagas, all trying to reserve inventory in inventory-svc. The service is overwhelmed and starts timing out. Saga orchestrators retry, making it worse. How do you implement rate limiting, backpressure, and saga concurrency limits?
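One way to approach this, sketched under assumptions: gate saga starts behind both a concurrency cap and a token-bucket rate limit, and reject immediately when either is exhausted so that callers back off instead of queueing retries. `TokenBucket`, `SagaGate`, and the specific limits are hypothetical names and values chosen for the example.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` starts per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now
        self.last = now()
        self.lock = threading.Lock()

    def try_take(self) -> bool:
        with self.lock:
            t = self.now()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
            self.last = t
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


class SagaGate:
    """Admits a saga only if a concurrency slot and a rate token are both free."""

    def __init__(self, max_in_flight: int, bucket: TokenBucket) -> None:
        self.slots = threading.BoundedSemaphore(max_in_flight)
        self.bucket = bucket

    def try_start(self) -> bool:
        if not self.slots.acquire(blocking=False):
            return False              # too many sagas in flight: shed load
        if not self.bucket.try_take():
            self.slots.release()      # give the slot back before rejecting
            return False
        return True

    def finish(self) -> None:
        self.slots.release()


clock = [0.0]                         # fake clock so the example is deterministic
gate = SagaGate(max_in_flight=2, bucket=TokenBucket(2, 3, now=lambda: clock[0]))
admitted = [gate.try_start(), gate.try_start(), gate.try_start()]
```

A rejected start is the backpressure signal: the checkout API can return a "try again shortly" response or leave the order event unconsumed in Kafka, rather than adding another retrying saga to an overloaded inventory-svc.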
§3 Step 4: Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Message Queue for Saga Events | Kafka carries the saga events between participants: OrderPlaced, InventoryReserved, PaymentCharged, and their compensating events (InventoryReleased, PaymentRefunded) for rollback. | A checkout involves multiple services (inventory, payment, fulfillment) that each have their own database. |
| Postgres for Saga State | Postgres stores the authoritative saga state: which steps have completed, which are in progress, what data was used at each step, and the compensation history for rolled-back sagas. | Without persistent saga state, if the orchestrator crashes after inventory is reserved but before payment is charged, the saga is lost — inventory stays reserved forever. |
Key design decisions