Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Message Queue (Kafka)
§1Step 2 — High-Level Design
Build a distributed message queue. Partitioning, consumer groups, at-least-once delivery guarantees.
Place Kafka between the producers and consumers to decouple message sending from processing.
Apache Kafka is a distributed message queue organized into topics and partitions. Producers append messages to topics; consumer groups read from topics, with each partition consumed by exactly one consumer in the group.
Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait. Kafka buffers unlimited messages, absorbs traffic spikes, and lets consumers process at their own pace.
Kafka messages are durable but ordered only within a partition. For global ordering, use a single partition (limits parallelism). For parallelism, use multiple partitions (loses cross-partition ordering).
LinkedIn uses Kafka for 7 trillion messages/day. Uber uses Kafka for trip lifecycle events. Confluent's hosted Kafka serves Netflix, Twitter, Goldman Sachs.
A single Kafka broker handles 100MB/second throughput. A 3-broker cluster with replication handles 200MB/second writes and millions of messages/second.
Add worker services that form a consumer group, each processing a partition of messages from Kafka.
Consumer workers form a consumer group that collectively reads all partitions of a Kafka topic. Kafka assigns partitions to workers — each partition is owned by exactly one worker, ensuring ordered processing within a partition.
One consumer can't keep up with high-throughput topics. Kafka's consumer group model enables parallel processing proportional to the number of partitions. Add workers = add throughput.
Max parallelism = number of partitions. Plan partition count upfront (you can't decrease partitions). 5 partitions × 5 workers is ideal; more workers than partitions means some workers sit idle.
Shopify uses consumer groups for order processing. Stripe uses Kafka consumers for webhook delivery. Pinterest uses Kafka consumer groups for notification fanout.
Each worker processes 10K-100K messages/second depending on message complexity. 5 workers with simple transformations can process 500K messages/second from a 5-partition topic.
Add Redis to store consumer offset tracking, deduplication state, and dead-letter queue metadata.
Redis stores metadata for the consumer workers: processed message offsets (to resume after restart), deduplication keys (to prevent double-processing), and metrics (processing rate, lag).
Kafka stores offsets but Redis provides faster reads for consumer state management. Redis is particularly useful for exactly-once deduplication: SET dedup:{msgId} 1 NX — only process if the key didn't exist.
Redis deduplication state has a TTL — messages deduped for 24 hours may be reprocessed after the TTL expires. Size the TTL based on maximum expected message redelivery window.
Confluent's Kafka consumer uses Redis for consumer group coordination in some deployments. Most Kafka consumers use Redis alongside Kafka for idempotency tracking.
1M message IDs in Redis dedup set × 50 bytes each = 50MB. Consumer offset state: one key per topic-partition-group = negligible. Total Redis memory: < 1GB.
§2Step 3 — Deep Dive
Apache Kafka is a distributed message queue organized into topics and partitions. Producers append messages to topics; consumer groups read from topics, with each partition consumed by exactly one consumer in the group.
| System | Throughput | Message replay? | Ordering | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Kafka | 1M+ msg/sec/broker | Yes (log retention) | Per-partition | High throughput, event sourcing ✓ | Medium | High |
| RabbitMQ | 50K msg/sec | No (acked = deleted) | Per-queue | Complex routing, low latency | Medium | Medium |
| AWS SQS | 3K msg/sec (standard) | No | Best-effort | Serverless, AWS-native | Medium | Low |
| Redis Streams | 500K msg/sec | Yes (stream log) | Per-stream | Low latency, small messages | Medium | Low |
| Pulsar | 1M+ msg/sec | Yes (tiered storage) | Per-partition | Multi-tenancy, geo-replication | High | High |
Message queue systems — Kafka for high throughput + replay, RabbitMQ for complex routing.
from kafka import KafkaConsumer
import redis, json
consumer = KafkaConsumer(
'orders',
group_id='order-processor',
bootstrap_servers=['kafka:9092'],
enable_auto_commit=False,
auto_offset_reset='earliest'
)
dedup = redis.Redis()
for message in consumer:
msg_id = f"{message.partition}:{message.offset}"
if dedup.set(f"dedup:{msg_id}", 1, nx=True, ex=86400):
try:
process_order(json.loads(message.value))
consumer.commit()
except Exception as e:
log.error(f"Failed to process {msg_id}: {e}")
else:
consumer.commit() # already processed| Component | Why Add It | Tradeoff |
|---|---|---|
| Message Queue (Kafka Broker) | Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait. | Kafka messages are durable but ordered only within a partition. |
| Consumer Worker Services | One consumer can't keep up with high-throughput topics. | Max parallelism = number of partitions. |
| Redis for Consumer State | Kafka stores offsets but Redis provides faster reads for consumer state management. | Redis deduplication state has a TTL — messages deduped for 24 hours may be reprocessed after the TTL expires. |
Design decision tradeoffs
mq-1 (a Kafka broker) crashes. Partitions whose leader was on mq-1 are unavailable until a follower is elected leader. Producers get errors for those partitions. How do you set min.insync.replicas and acks=all to prevent data loss and trigger fast leader election?
Producers suddenly emit 5x normal message volume. Consumer groups fall behind; lag grows to millions of messages. How do you implement consumer group scaling, partition rebalancing, and lag-based autoscaling to catch up?
consumer-1 loses network access to mq-1 but the ZooKeeper/KRaft quorum remains reachable. The consumer's partition assignment isn't revoked immediately. consumer-2 doesn't pick up the stalled partitions. How do you tune session.timeout.ms and heartbeat.interval.ms for fast rebalance?
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Message Queue (Kafka Broker) | Apache Kafka is a distributed message queue organized into topics and partitions. | Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait. |
| Consumer Worker Services | Consumer workers form a consumer group that collectively reads all partitions of a Kafka topic. | One consumer can't keep up with high-throughput topics. |
| Redis for Consumer State | Redis stores metadata for the consumer workers: processed message offsets (to resume after restart), deduplication keys (to prevent double-processing), and metrics (processing rate, lag). | Kafka stores offsets but Redis provides faster reads for consumer state management. |
Key design decisions