Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Message Queue (Kafka)

intermediateMessagingDistribution

Messaging·55 min read

Message Queue (Kafka)

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

MessagingDistribution

§1Step 2 — High-Level Design

2High-Level Design

Build a distributed message queue. Partitioning, consumer groups, at-least-once delivery guarantees.

System architecture overview

Stage 1 of 4Starting state — the problem to solve

Progressive build — add each component step by step

Add a Message Queue (Kafka Broker)

Place Kafka between the producers and consumers to decouple message sending from processing.

What it does

Apache Kafka is a distributed message queue organized into topics and partitions. Producers append messages to topics; consumer groups read from topics, with each partition consumed by exactly one consumer in the group.

Why it matters

Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait. Kafka buffers unlimited messages, absorbs traffic spikes, and lets consumers process at their own pace.

Trade-off

Kafka messages are durable but ordered only within a partition. For global ordering, use a single partition (limits parallelism). For parallelism, use multiple partitions (loses cross-partition ordering).

Real world

LinkedIn uses Kafka for 7 trillion messages/day. Uber uses Kafka for trip lifecycle events. Confluent's hosted Kafka serves Netflix, Twitter, Goldman Sachs.

Capacity math

A single Kafka broker handles 100MB/second throughput. A 3-broker cluster with replication handles 200MB/second writes and millions of messages/second.

In the real world: LinkedIn uses Kafka for 7 trillion messages/day. Uber uses Kafka for trip lifecycle events. Confluent's hosted Kafka serves Netflix, Twitter, Goldman Sachs.

Add Consumer Worker Services

Add worker services that form a consumer group, each processing a partition of messages from Kafka.

What it does

Consumer workers form a consumer group that collectively reads all partitions of a Kafka topic. Kafka assigns partitions to workers — each partition is owned by exactly one worker, ensuring ordered processing within a partition.

Why it matters

One consumer can't keep up with high-throughput topics. Kafka's consumer group model enables parallel processing proportional to the number of partitions. Add workers = add throughput.

Trade-off

Max parallelism = number of partitions. Plan partition count upfront (you can't decrease partitions). 5 partitions × 5 workers is ideal; more workers than partitions means some workers sit idle.

Real world

Shopify uses consumer groups for order processing. Stripe uses Kafka consumers for webhook delivery. Pinterest uses Kafka consumer groups for notification fanout.

Capacity math

Each worker processes 10K-100K messages/second depending on message complexity. 5 workers with simple transformations can process 500K messages/second from a 5-partition topic.

In the real world: Shopify uses consumer groups for order processing. Stripe uses Kafka consumers for webhook delivery. Pinterest uses Kafka consumer groups for notification fanout.

Add Redis for Consumer State

Add Redis to store consumer offset tracking, deduplication state, and dead-letter queue metadata.

What it does

Redis stores metadata for the consumer workers: processed message offsets (to resume after restart), deduplication keys (to prevent double-processing), and metrics (processing rate, lag).

Why it matters

Kafka stores offsets but Redis provides faster reads for consumer state management. Redis is particularly useful for exactly-once deduplication: SET dedup:{msgId} 1 NX — only process if the key didn't exist.

Trade-off

Redis deduplication state has a TTL — messages deduped for 24 hours may be reprocessed after the TTL expires. Size the TTL based on maximum expected message redelivery window.

Real world

Confluent's Kafka consumer uses Redis for consumer group coordination in some deployments. Most Kafka consumers use Redis alongside Kafka for idempotency tracking.

Capacity math

1M message IDs in Redis dedup set × 50 bytes each = 50MB. Consumer offset state: one key per topic-partition-group = negligible. Total Redis memory: < 1GB.

In the real world: Confluent's Kafka consumer uses Redis for consumer group coordination in some deployments. Most Kafka consumers use Redis alongside Kafka for idempotency tracking.

Kafka Broker Crash: mq-1 (a Kafka broker) crashes. Partitions whose leader was on mq-1 are unavailable until a follower is elected leader. Producers get errors for those partitions. How do you set min.insync.replicas and acks=all to prevent data loss and trigger fast leader election?

§2Step 3 — Deep Dive

3Deep Dive

System	Throughput	Message replay?	Ordering	Best for	Cost	Ops burden
Kafka	1M+ msg/sec/broker	Yes (log retention)	Per-partition	High throughput, event sourcing ✓	Medium	High
RabbitMQ	50K msg/sec	No (acked = deleted)	Per-queue	Complex routing, low latency	Medium	Medium
AWS SQS	3K msg/sec (standard)	No	Best-effort	Serverless, AWS-native	Medium	Low
Redis Streams	500K msg/sec	Yes (stream log)	Per-stream	Low latency, small messages	Medium	Low
Pulsar	1M+ msg/sec	Yes (tiered storage)	Per-partition	Multi-tenancy, geo-replication	High	High

Message queue systems — Kafka for high throughput + replay, RabbitMQ for complex routing.

pythonKafka consumer — manual offset commit with idempotent deduplication

from kafka import KafkaConsumer
import redis, json

consumer = KafkaConsumer(
    'orders',
    group_id='order-processor',
    bootstrap_servers=['kafka:9092'],
    enable_auto_commit=False,
    auto_offset_reset='earliest'
)
dedup = redis.Redis()

for message in consumer:
    msg_id = f"{message.partition}:{message.offset}"

    if dedup.set(f"dedup:{msg_id}", 1, nx=True, ex=86400):
        try:
            process_order(json.loads(message.value))
            consumer.commit()
        except Exception as e:
            log.error(f"Failed to process {msg_id}: {e}")
    else:
        consumer.commit()  # already processed

Component	Why Add It	Tradeoff
Message Queue (Kafka Broker)	Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait.	Kafka messages are durable but ordered only within a partition.
Consumer Worker Services	One consumer can't keep up with high-throughput topics.	Max parallelism = number of partitions.
Redis for Consumer State	Kafka stores offsets but Redis provides faster reads for consumer state management.	Redis deduplication state has a TTL — messages deduped for 24 hours may be reprocessed after the TTL expires.

Design decision tradeoffs

Kafka Broker Crash

mq-1 (a Kafka broker) crashes. Partitions whose leader was on mq-1 are unavailable until a follower is elected leader. Producers get errors for those partitions. How do you set min.insync.replicas and acks=all to prevent data loss and trigger fast leader election?

Producer Throughput Spike

Producers suddenly emit 5x normal message volume. Consumer groups fall behind; lag grows to millions of messages. How do you implement consumer group scaling, partition rebalancing, and lag-based autoscaling to catch up?

Consumer-Broker Partition

consumer-1 loses network access to mq-1 but the ZooKeeper/KRaft quorum remains reachable. The consumer's partition assignment isn't revoked immediately. consumer-2 doesn't pick up the stalled partitions. How do you tune session.timeout.ms and heartbeat.interval.ms for fast rebalance?

Kafka partitions a topic into N ordered sub-logs. Messages with the same partition key always go to the same partition (preserving order per key). Add a Message Queue with 5 partitions.

A consumer group has up to N instances (N = partition count). Each instance owns a subset of partitions exclusively. This guarantees ordered per-key processing with horizontal parallelism.

Offset management: each consumer tracks its position (offset) in each partition. Committing an offset means "I have processed all messages up to this point." On restart, consumers resume from committed offset — enabling replay and exactly-once processing.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Message Queue (Kafka Broker)	Apache Kafka is a distributed message queue organized into topics and partitions.	Without a queue, producer-2 calling consumer workers directly creates tight coupling: if a consumer is slow, the producer must wait.
Consumer Worker Services	Consumer workers form a consumer group that collectively reads all partitions of a Kafka topic.	One consumer can't keep up with high-throughput topics.
Redis for Consumer State	Redis stores metadata for the consumer workers: processed message offsets (to resume after restart), deduplication keys (to prevent double-processing), and metrics (processing rate, lag).	Kafka stores offsets but Redis provides faster reads for consumer state management.

Key design decisions

If the interviewer asks to scale 10×: Fan out without falling over. Partition your message stream by a key that balances consumer parallelism (topic + user shard, not random).

10× Target10.0M RPSwhere your architecture must hold

What's next