Event-Driven Architecture
§1 Step 2 — High-Level Design
Replace synchronous calls with events. Design event schemas, ordering guarantees, and consumer groups.
Place an Event Bus between the order service and downstream consumers (inventory, email, analytics) to decouple them.
An Event Bus is a publish-subscribe messaging system where producers emit events and any number of consumers receive them independently.
Without an event bus, order-svc must call inventory-svc, email-svc, and analytics-svc synchronously. If any downstream service is slow or down, order processing fails. The event bus decouples producers from consumers — order-svc doesn't know or care who's listening.
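To make the decoupling concrete, here is a minimal in-process sketch of publish-subscribe fan-out. The `EventBus` class and the three handler functions are hypothetical stand-ins for a real broker and real services. Delivery here is synchronous and in memory, which a real event bus would not be.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-process pub/sub bus (hypothetical, not a real broker client)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> list[str]:
        """Deliver the event to every subscriber; return names of failed handlers."""
        failed = []
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception:
                # A failing consumer must not block the producer or other consumers.
                failed.append(handler.__name__)
        return failed


reserved, tracked = [], []


def reserve_inventory(event: dict) -> None:
    reserved.append(event["order_id"])


def send_email(event: dict) -> None:
    raise RuntimeError("email provider is down")


def track_analytics(event: dict) -> None:
    tracked.append(event["order_id"])


bus = EventBus()
for handler in (reserve_inventory, send_email, track_analytics):
    bus.subscribe("order.created", handler)

# order-svc publishes once and does not know who is listening.
failed = bus.publish("order.created", {"order_id": 42})
```

Even though the email handler raises, the order is still published and the other two consumers still receive it. That isolation is what the synchronous call chain lacks.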
Event-driven systems are harder to trace and debug (no synchronous call stack). You need event schema versioning and dead-letter queues for failed processing.
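A rough sketch of how schema versioning and a dead-letter queue might fit together on the consumer side. The `schema_version` field, the supported-version set, the retry budget, and the `DeadLetterQueue` class are all assumptions made for illustration. A real system would park failed events in a separate topic or queue rather than in a Python list.

```python
from dataclasses import dataclass, field
from typing import Callable

SUPPORTED_SCHEMA_VERSIONS = {1, 2}  # assumption: versions this consumer understands
MAX_ATTEMPTS = 3                    # assumption: retry budget before dead-lettering


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a dead-letter topic."""
    entries: list[tuple[dict, str]] = field(default_factory=list)

    def put(self, event: dict, reason: str) -> None:
        self.entries.append((event, reason))


def consume(event: dict, handler: Callable[[dict], None], dlq: DeadLetterQueue) -> bool:
    """Return True if the event was processed, False if it was dead-lettered."""
    version = event.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        # Retrying cannot fix an unknown schema, so park the event immediately.
        dlq.put(event, f"unsupported schema_version {version!r}")
        return False
    last_error = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = f"attempt {attempt}: {exc}"
    dlq.put(event, last_error)
    return False


def always_fails(event: dict) -> None:
    raise ValueError("inventory DB timeout")


dlq = DeadLetterQueue()
ok = consume({"schema_version": 1, "order_id": 1}, lambda e: None, dlq)
poison = consume({"schema_version": 1, "order_id": 2}, always_fails, dlq)
unknown = consume({"schema_version": 9, "order_id": 3}, lambda e: None, dlq)
```

Dead-lettered events keep the failure reason alongside the payload, which partly compensates for the missing synchronous call stack when debugging.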
Shopify processes order events through Kafka. Airbnb uses event-driven architecture for booking confirmations. Amazon relies heavily on SNS/SQS for event-driven communication between its services.
A well-provisioned Kafka cluster can sustain 1M+ events/second. Each service gets its own consumer group and reads independently at its own pace.
At high traffic, add a durable message queue between producers and consumers to decouple them and handle backpressure.
A message queue stores events durably and delivers them to consumers at a rate they can handle.
Without a queue, a traffic spike overwhelms consumers. The queue provides backpressure, buffering, and at-least-once delivery guarantees.
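The buffering and backpressure behaviour can be sketched with a bounded in-memory queue from the standard library. This is only an illustration: the `offer` helper and the capacity of 3 are made up, and a real broker persists events durably instead of holding them in process memory.

```python
import queue


def offer(buffer: queue.Queue, event: dict) -> bool:
    """Try to enqueue without blocking; False tells the producer to back off."""
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        return False


buffer = queue.Queue(maxsize=3)

# A burst of 5 events hits a buffer with room for 3.
burst = [offer(buffer, {"order_id": i}) for i in range(5)]

# The consumer drains one event at its own pace, which frees a slot...
first = buffer.get_nowait()

# ...so the producer's retry of a previously rejected event now succeeds.
retry = offer(buffer, {"order_id": 3})
```

The consumer never sees more than it can hold, and the producer gets an explicit signal to slow down or retry instead of silently overwhelming the consumer.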
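Messages may be processed out of order across partitions. Partition by key (user_id, entity_id) so that all events for one entity land in the same partition, which guarantees per-entity ordering.
Kafka powers event pipelines at LinkedIn, Airbnb, and Uber. SQS is a managed AWS alternative for simpler queueing use cases.
As a rough sizing figure, assume a well-provisioned Kafka broker sustains on the order of 1M small events/second; a 3-broker cluster then tops out near 3M events/second, and lower once replication traffic is counted.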
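Key-based partitioning can be sketched as a stable hash of the key modulo the partition count. The `partition_for` function below uses MD5 purely as a deterministic stand-in; Kafka's default partitioner uses a different hash (murmur2), so the partition numbers here will not match a real cluster.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


NUM_PARTITIONS = 4
partitions: list[list[tuple[str, str]]] = [[] for _ in range(NUM_PARTITIONS)]

events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "paid"),
    ("user-1", "shipped"),
]

# Each partition is an append-only log, so events sharing a key keep their order.
for key, event_type in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, event_type))

user1_partition = partitions[partition_for("user-1", NUM_PARTITIONS)]
user1_events = [event_type for key, event_type in user1_partition if key == "user-1"]
```

Ordering holds only within a partition. Events for `user-1` and `user-2` may still be processed in any relative order, which is usually acceptable because they concern different entities.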
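At peak, scale consumer workers horizontally so the event backlog drains faster; a consumer group (or a load balancer for push-based delivery) spreads the work across them.
With Kafka, the consumer group plays the load balancer role: it assigns each partition to exactly one worker instance in the group and rebalances when workers join or leave.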
At peak, a single consumer can't process events fast enough. Adding workers scales throughput linearly up to the partition count.
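The partition-count ceiling is easier to see with a simplified round-robin assignment. The `assign_partitions` function is a hypothetical illustration, not Kafka's actual assignor (Kafka ships several strategies, such as range, round-robin, and sticky), but the ceiling applies to all of them.

```python
def assign_partitions(num_partitions: int, workers: list[str]) -> dict[str, list[int]]:
    """Spread partitions across workers round-robin; each partition has one owner."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: dict[str, list[int]] = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        owner = workers[partition % len(workers)]
        assignment[owner].append(partition)
    return assignment


two_workers = assign_partitions(4, ["w1", "w2"])
six_workers = assign_partitions(4, ["w1", "w2", "w3", "w4", "w5", "w6"])

# Workers beyond the partition count receive nothing and sit idle.
idle = [worker for worker, owned in six_workers.items() if not owned]
```

With 4 partitions, going from 2 to 4 workers doubles parallelism, but the fifth and sixth workers add no throughput. To scale further, the topic needs more partitions.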
Consumer lag may grow during sudden spikes. Monitor consumer lag and auto-scale based on lag metrics.
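One simple way to turn lag into a scaling decision is to ask how many workers would drain the current backlog within a target time. The `desired_workers` function and its parameters are assumptions for illustration. It deliberately ignores the incoming event rate, so a production autoscaler would need to add that on top.

```python
import math


def desired_workers(
    total_lag: int,
    events_per_worker_per_sec: int,
    target_drain_seconds: int,
    num_partitions: int,
    min_workers: int = 1,
) -> int:
    """Workers needed to drain the backlog in time, capped at the partition count."""
    drain_capacity_per_worker = events_per_worker_per_sec * target_drain_seconds
    needed = math.ceil(total_lag / drain_capacity_per_worker)
    # More workers than partitions would sit idle, so cap the result.
    return max(min_workers, min(needed, num_partitions))


steady = desired_workers(0, 50_000, 10, num_partitions=20)
spike = desired_workers(3_000_000, 50_000, 10, num_partitions=20)
overload = desired_workers(100_000_000, 50_000, 10, num_partitions=20)
```

Here a backlog of 3M events at 50K events/second per worker needs 6 workers to drain in 10 seconds, while the 100M backlog hits the 20-partition cap and will take longer regardless of how many workers are added.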
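Stream processors such as Apache Flink and Spark Streaming are commonly scaled based on Kafka consumer lag. AWS Lambda can scale to thousands of concurrent invocations when consuming from SQS; with a Kafka source, its concurrency is capped by the partition count.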
If the consumer for each partition processes ~50K events/second, 20 partitions give 1M events/second of consumer capacity.
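The same arithmetic can be turned around to size a topic. The `partitions_needed` helper and the `headroom` factor are hypothetical; the 50K events/second figure is the per-partition estimate from the text, not a measured value.

```python
import math


def partitions_needed(
    target_events_per_sec: int,
    per_partition_events_per_sec: int,
    headroom: float = 1.0,
) -> int:
    """Partitions required to reach a target consumer throughput."""
    return math.ceil(target_events_per_sec * headroom / per_partition_events_per_sec)


baseline = partitions_needed(1_000_000, 50_000)                    # 20 partitions
with_headroom = partitions_needed(1_000_000, 50_000, headroom=1.5)  # 30 partitions
```

Because adding partitions to a live topic changes which partition a key maps to, it is usually safer to provision with headroom up front than to resize later.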
§2 Step 3 — Deep Dive
| Pattern | Coupling | Message replay? | Fan-out | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Direct call (sync) | Tight | No | No | Simple 2-service flows | Low | Low |
| Task queue (RabbitMQ) | Loose | No (acked = gone) | Partial | Job queues, work distribution | Medium | Medium |
| Event stream (Kafka) | Loose | Yes (log retention) | Yes (consumer groups) | Event sourcing, audit log ✓ | High | High |
| Pub/Sub (Redis) | Loose | No (fire-and-forget) | Yes | Real-time notifications, low volume | Medium | Medium |
| Outbox pattern | Loose | Yes | Yes | At-least-once with DB consistency | Medium | Medium |
Messaging patterns — Kafka for event streaming, RabbitMQ for task queues.

    import json

    # Outbox pattern: write the event to the DB in the same transaction as the
    # business data. A poller relays outbox rows to Kafka (at-least-once delivery).
    # db_conn is assumed to be a thin DB wrapper, not a specific driver.
    def place_order(order_data: dict, db_conn):
        with db_conn.transaction():
            # 1. Write business data
            order_id = db_conn.execute(
                "INSERT INTO orders (user_id, total) VALUES (%s, %s) RETURNING id",
                (order_data['user_id'], order_data['total']),
            ).scalar()
            # 2. Write event to outbox IN THE SAME TRANSACTION
            db_conn.execute(
                """INSERT INTO outbox (aggregate_id, event_type, payload, status)
                   VALUES (%s, 'OrderPlaced', %s, 'pending')""",
                (order_id, json.dumps({'order_id': order_id, **order_data})),
            )
        # Transaction commits atomically: both rows or neither
        return order_id

    # Outbox relay runs every 100ms
    def relay_outbox(db_conn, kafka_producer):
        rows = db_conn.execute(
            "SELECT * FROM outbox WHERE status = 'pending' ORDER BY id LIMIT 100"
        ).fetchall()
        for row in rows:
            kafka_producer.produce('orders', key=str(row.aggregate_id), value=row.payload)
        # Wait for outstanding deliveries before marking rows as sent. A crash
        # before the UPDATE re-sends the rows on the next run: duplicates, not loss.
        kafka_producer.flush()
        for row in rows:
            db_conn.execute("UPDATE outbox SET status='sent' WHERE id=%s", (row.id,))

| Component | Why Add It | Tradeoff |
|---|---|---|
| Event Bus | Without an event bus, order-svc must call inventory-svc, email-svc, and analytics-svc synchronously. | Event-driven systems are harder to trace and debug (no synchronous call stack). |
| Message Queue | Without a queue, a traffic spike overwhelms consumers. | Messages may be processed out of order. |
| Load Balancer for Consumers | At peak, a single consumer can't process events fast enough. | Consumer lag may grow during sudden spikes. |
Design decision tradeoffs
inventory-svc crashes after receiving an 'order.created' event but before publishing 'inventory.reserved'. The event is lost. Orders are created but inventory never reserved. How do you ensure at-least-once delivery and idempotency?
A single high-volume event triggers cascading fan-out: one 'order.created' event triggers inventory, email, analytics, shipping — all simultaneously. Downstream services are overwhelmed. How do you apply back-pressure or prioritize consumers?
The message broker loses quorum: some brokers can receive events but not replicate them. Producers think writes succeeded; consumers don't see them. How do you detect and recover from split-brain message loss?
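For the first scenario, at-least-once delivery means the broker redelivers 'order.created' after the crash, so the consumer must tolerate duplicates. A minimal sketch of an idempotent consumer follows, assuming each event carries a unique `event_id` (a hypothetical field). In production, the processed-ID record and the stock update would need to be written in the same database transaction; the in-memory set here is only for illustration.

```python
class InventoryConsumer:
    """Idempotent handler: a redelivered event is detected and skipped."""

    def __init__(self, stock: dict[str, int]) -> None:
        self.stock = dict(stock)
        self.processed_event_ids: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Apply the event once; return False when it is a duplicate."""
        if event["event_id"] in self.processed_event_ids:
            return False
        self.stock[event["sku"]] -= event["quantity"]
        # Record the ID only after the side effect has been applied.
        self.processed_event_ids.add(event["event_id"])
        return True


consumer = InventoryConsumer({"sku-1": 10})
event = {"event_id": "evt-001", "sku": "sku-1", "quantity": 2}

first_delivery = consumer.handle(event)
redelivery = consumer.handle(event)  # broker retries after a missed ack
```

The consumer acknowledges the event only after `handle` returns, so a crash mid-processing leads to a redelivery that is either applied for the first time or skipped as a duplicate.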
§3 Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Event Bus | An Event Bus is a publish-subscribe messaging system where producers emit events and any number of consumers receive them independently. | Without an event bus, order-svc must call inventory-svc, email-svc, and analytics-svc synchronously. |
| Message Queue | A message queue stores events durably and delivers them to consumers at a rate they can handle. | Without a queue, a traffic spike overwhelms consumers. |
| Load Balancer for Consumers | A Kafka consumer group (or, for push-based queues, a load balancer) distributes partitions or messages across multiple consumer worker instances. | At peak, a single consumer can't process events fast enough. |
Key design decisions