Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Apache Kafka Deep Dive

advancedMessagingStreaming

Distributed Systems·75 min read

Apache Kafka Deep Dive

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

MessagingStreaming

§1Step 2 — High-Level Design

2High-Level Design

Process 10 trillion events per day. Log compaction, exactly-once semantics, and Kafka Streams.

System architecture overview

Interactive diagram locked

Upgrade to Pro to build and run this system.

Stage 1 of 5Starting state — the problem to solve

Progressive build — add each component step by step

Interactive diagram locked

Upgrade to Pro to build and run this system.

Add Write-Ahead Log Partitions

Add write-ahead log brokers as the core Kafka partition storage. Each partition is an append-only WAL stored on disk and replicated across brokers.

What it does

Each Kafka partition is a write-ahead log: an ordered, immutable sequence of records on disk. Producers append to the leader partition; brokers replicate to followers. Consumers read by offset, never deleting records until log compaction or retention expiry.

Why it matters

Sequential disk writes are 100x faster than random writes. By treating storage as an append-only log, Kafka saturates disk throughput and enables zero-copy transfers via sendfile() — avoiding kernel-user space copies.

Trade-off

Append-only logs consume disk space proportional to retention period. Kafka uses retention policies (time or size) and log compaction (keep only latest value per key) to bound disk usage.

Real world

LinkedIn's Kafka handles 7 trillion events/day. Each partition achieves ~200MB/s write throughput. Zero-copy sendfile enables 700MB/s consumer throughput per broker.

Capacity math

Single partition throughput: 100–200 MB/s write. Per-broker storage: typically 2–10 TB. Retention: 7 days default = ~100 GB/day at 10 MB/s ingest.

In the real world: LinkedIn's Kafka handles 7 trillion events/day. Each partition achieves ~200MB/s write throughput. Zero-copy sendfile enables 700MB/s consumer throughput per broker.

Add KRaft Consensus Nodes

Add KRaft (Kafka Raft) consensus nodes to replace ZooKeeper for controller election and cluster metadata management.

What it does

KRaft controllers form a Raft quorum (typically 3 or 5 nodes) that stores cluster metadata — broker membership, partition assignments, topic configurations, ISR state — in a replicated log. The active controller propagates metadata changes to all brokers.

Why it matters

ZooKeeper was a separate system with its own operational burden. KRaft eliminates ZooKeeper, reducing failure modes, improving partition scalability (ZK struggled beyond 200K partitions), and enabling fast controller failover.

Trade-off

KRaft controllers must be carefully separated from broker JVMs to avoid GC pauses affecting quorum latency. In large clusters, the metadata log can grow large and requires snapshotting.

Real world

Kafka 3.x defaults to KRaft mode. Confluent Cloud runs KRaft at multi-million partition scale across thousands of brokers.

Capacity math

KRaft metadata log: typically <1 GB even for large clusters. Controller election: <2 seconds. Metadata propagation to 1000 brokers: <500ms.

In the real world: Kafka 3.x defaults to KRaft mode. Confluent Cloud runs KRaft at multi-million partition scale across thousands of brokers.

Add Consumer Group Workers

Add worker services representing consumer groups that read from partitions in parallel. Each consumer in a group is assigned a subset of partitions.

What it does

Consumer group workers each own a disjoint subset of partitions. The group coordinator (a broker) runs the group membership protocol: when a consumer joins or leaves, a rebalance reassigns partitions. Each worker tracks its own offset per partition.

Why it matters

Parallel consumption scales throughput linearly with partition count. Workers can checkpoint their offsets to Kafka's internal __consumer_offsets topic, enabling exactly-once processing semantics when combined with transactional producers.

Trade-off

Rebalances pause consumption. A consumer crashing during rebalance causes an 'assignment storm'. Kafka 3.x incremental cooperative rebalancing reduces pause duration by only reassigning affected partitions.

Real world

Uber's real-time pipelines run consumer groups at 10,000+ consumers across hundreds of services. LinkedIn uses consumer groups for feed generation and metrics aggregation at petabyte scale.

Capacity math

Consumer throughput: ~100–500 MB/s per consumer. Max consumers per group = partition count. Rebalance duration: 1–30 seconds depending on group size.

In the real world: Uber's real-time pipelines run consumer groups at 10,000+ consumers across hundreds of services. LinkedIn uses consumer groups for feed generation and metrics aggregation at petabyte scale.

Add Object Storage for Tiered Storage

Add object storage (S3-compatible) for Kafka tiered storage — offloading cold partition segments from broker disk to cheap object storage.

What it does

Kafka tiered storage uploads completed log segments to object storage while the broker keeps only the hot tail on local disk. Consumers seeking old offsets trigger a transparent fetch from object storage.

Why it matters

Without tiered storage, long retention requires proportionally large broker disks. Tiered storage breaks this coupling — you can have 30-day retention without 30x the disk cost by storing cold data in S3 at 10x lower cost/GB.

Trade-off

Object storage fetch latency (10–100ms) is higher than local disk (1–5ms). Tiered storage adds complexity: segment upload race conditions, consistency between local and remote state, and partial segment handling.

Real world

Confluent introduced Infinite Storage (tiered storage) in 2021. Customers run 90-day retention pipelines at 1/10th the cost of equivalent all-local-disk deployments.

Capacity math

Object storage cost: ~$0.023/GB/month (S3). Local disk: ~$0.10/GB/month (EBS). For 1 PB 30-day retention: object storage saves ~$77K/month vs local disk.

In the real world: Confluent introduced Infinite Storage (tiered storage) in 2021. Customers run 90-day retention pipelines at 1/10th the cost of equivalent all-local-disk deployments.

Broker Network Isolation: A broker becomes unreachable from the controller and other brokers. In-sync replicas (ISR) shrink as the broker cannot acknowledge writes. The leader for affected partitions remains, but min.insync.replicas may force write rejections if ISR falls below threshold. Consumers block until the broker rejoins or leadership transfers.

§2Step 3 — Deep Dive

3Deep Dive

System	Throughput	Message Retention	Consumer Groups	Best for	Cost	Ops burden
Kafka (KRaft)	1M+ msg/sec/broker	Configurable (log)	Yes (offset tracking)	Event sourcing, audit log, high throughput ✓	Medium	High
Apache Pulsar	1M+ msg/sec	Tiered (S3 offload)	Yes (subscription)	Multi-tenancy, geo-replication, cloud-native	High	High
RabbitMQ	50K msg/sec	Until ACK	No (queues)	Complex routing, task queues, low latency	Medium	Medium
AWS Kinesis	1MB/sec/shard	Up to 7 days	Yes (shard iterator)	AWS-native, serverless, simple setup	Medium	Low
Redis Streams	500K msg/sec	Configurable trim	Yes (consumer group)	Low latency, small messages, ephemeral	Low	Low

Message streaming platforms — Kafka wins for high-throughput durable event logs.

pythonExactly-once producer with idempotence + transactions

from kafka import KafkaProducer
from kafka.errors import KafkaError

# Idempotent producer: retries won't duplicate messages
producer = KafkaProducer(
    bootstrap_servers=['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
    enable_idempotence=True,           # PID + sequence number dedup
    acks='all',                        # wait for all in-sync replicas
    max_in_flight_requests_per_connection=5,
    retries=2147483647,
    transactional_id='order-producer-1',
)

# Transactional exactly-once (producer + consumer in one atomic batch)
producer.init_transactions()

def process_batch(records):
    producer.begin_transaction()
    try:
        offsets = {}
        for record in records:
            result = transform(record)
            producer.send('output-topic', value=result.encode())
            offsets[record.partition] = record.offset + 1

        # Commit offsets atomically with the produce -- exactly-once
        producer.send_offsets_to_transaction(offsets, group_id='my-consumer-group')
        producer.commit_transaction()
    except KafkaError as e:
        producer.abort_transaction()
        raise

# Partition assignment: round-robin for throughput, sticky for latency
# Replication factor 3, min.insync.replicas=2 -> survive 1 broker failure

Component	Why Add It	Tradeoff
Write-Ahead Log Partitions	Sequential disk writes are 100x faster than random writes.	Append-only logs consume disk space proportional to retention period.
KRaft Consensus Nodes	ZooKeeper was a separate system with its own operational burden.	KRaft controllers must be carefully separated from broker JVMs to avoid GC pauses affecting quorum latency.
Consumer Group Workers	Parallel consumption scales throughput linearly with partition count.	Rebalances pause consumption.
Object Storage for Tiered Storage	Without tiered storage, long retention requires proportionally large broker disks.	Object storage fetch latency (10–100ms) is higher than local disk (1–5ms).

Design decision tradeoffs

Broker Network Isolation

A broker becomes unreachable from the controller and other brokers. In-sync replicas (ISR) shrink as the broker cannot acknowledge writes. The leader for affected partitions remains, but min.insync.replicas may force write rejections if ISR falls below threshold. Consumers block until the broker rejoins or leadership transfers.

Hot Partition: Unbalanced Throughput

A single partition receives disproportionate traffic due to poor key distribution. That partition's broker becomes a bottleneck; replicas lag as they cannot keep pace with replication. ISR shrinks due to replica.lag.max.messages threshold violations. Consumers on this partition experience higher latency and potential read gaps.

Consumer Rebalance Storm

A broker crashes, triggering a cascading rebalance across all consumer groups. If rebalance.delay.ms is short, the thundering herd of rebalance operations consumes broker CPU and network, delaying message delivery. New leader election for affected partitions is delayed. Demonstrate exactly-once semantics resilience: offset commits to __consumer_offsets topic must complete before rebalance settles.

Log structure: each partition is a directory of 1GB segment files. Each segment has a .log (actual messages), .index (sparse offset→byte_position index, every ~4KB), and .timeindex (timestamp→offset index). Read at offset N: binary search .index → seek to byte position → scan .log. Zero-copy: OS sendfile() transfers segment bytes from page cache to network without copying to userspace.

ISR (In-Sync Replicas): the set of replicas that are within replica.lag.max.messages of the leader. acks=all means all ISR replicas must acknowledge before the producer receives confirmation. If a follower falls behind (network issue), it's removed from ISR. When it catches up, it rejoins. Minimum ISR size (min.insync.replicas) prevents accepting writes if too few replicas are in sync.

Exactly-once transactions: producer starts transaction → writes to multiple topic-partitions → commits transaction. Transaction coordinator writes the commit marker atomically to the __transaction_state topic (Kafka's own internal topic). Consumers set isolation.level=read_committed — only committed transactions visible. Uncommitted or aborted transactions are invisible.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Write-Ahead Log Partitions	Each Kafka partition is a write-ahead log: an ordered, immutable sequence of records on disk.	Sequential disk writes are 100x faster than random writes.
KRaft Consensus Nodes	KRaft controllers form a Raft quorum (typically 3 or 5 nodes) that stores cluster metadata — broker membership, partition assignments, topic configurations, ISR state — in a replicated log.	ZooKeeper was a separate system with its own operational burden.
Consumer Group Workers	Consumer group workers each own a disjoint subset of partitions.	Parallel consumption scales throughput linearly with partition count.
Object Storage for Tiered Storage	Kafka tiered storage uploads completed log segments to object storage while the broker keeps only the hot tail on local disk.	Without tiered storage, long retention requires proportionally large broker disks.

Key design decisions

If the interviewer asks to scale 10×: 10x the load — architectural moves that work. Identify the single bottleneck (usually the database write path) and address it first before horizontal scaling.

10× Target1000.0M RPSwhere your architecture must hold

What's next

Event-Driven Architecture

30 min read