Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Apache Kafka Deep Dive
§1Step 2 — High-Level Design
Process 10 trillion events per day. Log compaction, exactly-once semantics, and Kafka Streams.
Interactive diagram locked
Upgrade to Pro to build and run this system.
Interactive diagram locked
Upgrade to Pro to build and run this system.
Add write-ahead log brokers as the core Kafka partition storage. Each partition is an append-only WAL stored on disk and replicated across brokers.
Each Kafka partition is a write-ahead log: an ordered, immutable sequence of records on disk. Producers append to the leader partition; brokers replicate to followers. Consumers read by offset, never deleting records until log compaction or retention expiry.
Sequential disk writes are 100x faster than random writes. By treating storage as an append-only log, Kafka saturates disk throughput and enables zero-copy transfers via sendfile() — avoiding kernel-user space copies.
Append-only logs consume disk space proportional to retention period. Kafka uses retention policies (time or size) and log compaction (keep only latest value per key) to bound disk usage.
LinkedIn's Kafka handles 7 trillion events/day. Each partition achieves ~200MB/s write throughput. Zero-copy sendfile enables 700MB/s consumer throughput per broker.
Single partition throughput: 100–200 MB/s write. Per-broker storage: typically 2–10 TB. Retention: 7 days default = ~100 GB/day at 10 MB/s ingest.
Add KRaft (Kafka Raft) consensus nodes to replace ZooKeeper for controller election and cluster metadata management.
KRaft controllers form a Raft quorum (typically 3 or 5 nodes) that stores cluster metadata — broker membership, partition assignments, topic configurations, ISR state — in a replicated log. The active controller propagates metadata changes to all brokers.
ZooKeeper was a separate system with its own operational burden. KRaft eliminates ZooKeeper, reducing failure modes, improving partition scalability (ZK struggled beyond 200K partitions), and enabling fast controller failover.
KRaft controllers must be carefully separated from broker JVMs to avoid GC pauses affecting quorum latency. In large clusters, the metadata log can grow large and requires snapshotting.
Kafka 3.x defaults to KRaft mode. Confluent Cloud runs KRaft at multi-million partition scale across thousands of brokers.
KRaft metadata log: typically <1 GB even for large clusters. Controller election: <2 seconds. Metadata propagation to 1000 brokers: <500ms.
Add worker services representing consumer groups that read from partitions in parallel. Each consumer in a group is assigned a subset of partitions.
Consumer group workers each own a disjoint subset of partitions. The group coordinator (a broker) runs the group membership protocol: when a consumer joins or leaves, a rebalance reassigns partitions. Each worker tracks its own offset per partition.
Parallel consumption scales throughput linearly with partition count. Workers can checkpoint their offsets to Kafka's internal __consumer_offsets topic, enabling exactly-once processing semantics when combined with transactional producers.
Rebalances pause consumption. A consumer crashing during rebalance causes an 'assignment storm'. Kafka 3.x incremental cooperative rebalancing reduces pause duration by only reassigning affected partitions.
Uber's real-time pipelines run consumer groups at 10,000+ consumers across hundreds of services. LinkedIn uses consumer groups for feed generation and metrics aggregation at petabyte scale.
Consumer throughput: ~100–500 MB/s per consumer. Max consumers per group = partition count. Rebalance duration: 1–30 seconds depending on group size.
Add object storage (S3-compatible) for Kafka tiered storage — offloading cold partition segments from broker disk to cheap object storage.
Kafka tiered storage uploads completed log segments to object storage while the broker keeps only the hot tail on local disk. Consumers seeking old offsets trigger a transparent fetch from object storage.
Without tiered storage, long retention requires proportionally large broker disks. Tiered storage breaks this coupling — you can have 30-day retention without 30x the disk cost by storing cold data in S3 at 10x lower cost/GB.
Object storage fetch latency (10–100ms) is higher than local disk (1–5ms). Tiered storage adds complexity: segment upload race conditions, consistency between local and remote state, and partial segment handling.
Confluent introduced Infinite Storage (tiered storage) in 2021. Customers run 90-day retention pipelines at 1/10th the cost of equivalent all-local-disk deployments.
Object storage cost: ~$0.023/GB/month (S3). Local disk: ~$0.10/GB/month (EBS). For 1 PB 30-day retention: object storage saves ~$77K/month vs local disk.
§2Step 3 — Deep Dive
Each Kafka partition is a write-ahead log: an ordered, immutable sequence of records on disk. Producers append to the leader partition; brokers replicate to followers. Consumers read by offset, never deleting records until log compaction or retention expiry.
| System | Throughput | Message Retention | Consumer Groups | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Kafka (KRaft) | 1M+ msg/sec/broker | Configurable (log) | Yes (offset tracking) | Event sourcing, audit log, high throughput ✓ | Medium | High |
| Apache Pulsar | 1M+ msg/sec | Tiered (S3 offload) | Yes (subscription) | Multi-tenancy, geo-replication, cloud-native | High | High |
| RabbitMQ | 50K msg/sec | Until ACK | No (queues) | Complex routing, task queues, low latency | Medium | Medium |
| AWS Kinesis | 1MB/sec/shard | Up to 7 days | Yes (shard iterator) | AWS-native, serverless, simple setup | Medium | Low |
| Redis Streams | 500K msg/sec | Configurable trim | Yes (consumer group) | Low latency, small messages, ephemeral | Low | Low |
Message streaming platforms — Kafka wins for high-throughput durable event logs.
from kafka import KafkaProducer
from kafka.errors import KafkaError
# Idempotent producer: retries won't duplicate messages
producer = KafkaProducer(
bootstrap_servers=['kafka-1:9092', 'kafka-2:9092', 'kafka-3:9092'],
enable_idempotence=True, # PID + sequence number dedup
acks='all', # wait for all in-sync replicas
max_in_flight_requests_per_connection=5,
retries=2147483647,
transactional_id='order-producer-1',
)
# Transactional exactly-once (producer + consumer in one atomic batch)
producer.init_transactions()
def process_batch(records):
producer.begin_transaction()
try:
offsets = {}
for record in records:
result = transform(record)
producer.send('output-topic', value=result.encode())
offsets[record.partition] = record.offset + 1
# Commit offsets atomically with the produce -- exactly-once
producer.send_offsets_to_transaction(offsets, group_id='my-consumer-group')
producer.commit_transaction()
except KafkaError as e:
producer.abort_transaction()
raise
# Partition assignment: round-robin for throughput, sticky for latency
# Replication factor 3, min.insync.replicas=2 -> survive 1 broker failure| Component | Why Add It | Tradeoff |
|---|---|---|
| Write-Ahead Log Partitions | Sequential disk writes are 100x faster than random writes. | Append-only logs consume disk space proportional to retention period. |
| KRaft Consensus Nodes | ZooKeeper was a separate system with its own operational burden. | KRaft controllers must be carefully separated from broker JVMs to avoid GC pauses affecting quorum latency. |
| Consumer Group Workers | Parallel consumption scales throughput linearly with partition count. | Rebalances pause consumption. |
| Object Storage for Tiered Storage | Without tiered storage, long retention requires proportionally large broker disks. | Object storage fetch latency (10–100ms) is higher than local disk (1–5ms). |
Design decision tradeoffs
A broker becomes unreachable from the controller and other brokers. In-sync replicas (ISR) shrink as the broker cannot acknowledge writes. The leader for affected partitions remains, but min.insync.replicas may force write rejections if ISR falls below threshold. Consumers block until the broker rejoins or leadership transfers.
A single partition receives disproportionate traffic due to poor key distribution. That partition's broker becomes a bottleneck; replicas lag as they cannot keep pace with replication. ISR shrinks due to replica.lag.max.messages threshold violations. Consumers on this partition experience higher latency and potential read gaps.
A broker crashes, triggering a cascading rebalance across all consumer groups. If rebalance.delay.ms is short, the thundering herd of rebalance operations consumes broker CPU and network, delaying message delivery. New leader election for affected partitions is delayed. Demonstrate exactly-once semantics resilience: offset commits to __consumer_offsets topic must complete before rebalance settles.
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Write-Ahead Log Partitions | Each Kafka partition is a write-ahead log: an ordered, immutable sequence of records on disk. | Sequential disk writes are 100x faster than random writes. |
| KRaft Consensus Nodes | KRaft controllers form a Raft quorum (typically 3 or 5 nodes) that stores cluster metadata — broker membership, partition assignments, topic configurations, ISR state — in a replicated log. | ZooKeeper was a separate system with its own operational burden. |
| Consumer Group Workers | Consumer group workers each own a disjoint subset of partitions. | Parallel consumption scales throughput linearly with partition count. |
| Object Storage for Tiered Storage | Kafka tiered storage uploads completed log segments to object storage while the broker keeps only the hot tail on local disk. | Without tiered storage, long retention requires proportionally large broker disks. |
Key design decisions