Apache Cassandra Architecture
§1 Step 2: High-Level Design
A masterless distributed database. Gossip, token rings, read repair, and tunable consistency.
Add write-ahead log nodes representing Cassandra's commit log — every write is durably appended here before being written to the in-memory memtable.
Cassandra's commit log is a per-node WAL: every write is appended sequentially before the in-memory memtable is updated. When a memtable is flushed to disk as an SSTable, the corresponding commit log segment is discarded.
Without the commit log, a crash would lose all unflushed memtable data. The commit log adds very little latency because appends are sequential disk I/O and, in the default periodic mode, the write is acknowledged without waiting for fsync; the log is synced to disk on a timer instead.
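The commitlog_sync setting in cassandra.yaml controls the tradeoff between durability and throughput. 'periodic' sync (the default, with commitlog_sync_period_in_ms set to 10000) acknowledges writes before fsync and risks losing up to 10 seconds of writes on a node that crashes; 'batch' sync waits for fsync before acknowledging, which gives higher durability at lower throughput.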
Cassandra's write path achieves 100K+ writes/sec per node because commit log append + memtable insert are both sequential/in-memory operations. Netflix uses Cassandra for viewing history at 1M+ writes/sec cluster-wide.
Commit log size: typically 1 segment = 32 MB. At 100K writes/sec (avg 1 KB) a node appends roughly 100 MB/s, so a segment fills in about a third of a second. SSD-backed commit logs: <1ms p99 write latency.
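The segment fill time follows directly from segment size and write rate. A minimal sketch of the arithmetic, assuming each write adds exactly its payload size to the log (real mutations carry some per-entry overhead, so segments fill slightly faster); the function name is illustrative:

```python
SEGMENT_BYTES = 32 * 1024 * 1024  # one 32 MB commit log segment
WRITES_PER_SEC = 100_000          # per-node write rate quoted above
AVG_WRITE_BYTES = 1024            # 1 KB average write (assumption: no per-entry overhead)


def segment_fill_seconds(segment_bytes, writes_per_sec, avg_write_bytes):
    """Seconds needed to fill one segment at a steady write rate."""
    return segment_bytes / (writes_per_sec * avg_write_bytes)


fill_s = segment_fill_seconds(SEGMENT_BYTES, WRITES_PER_SEC, AVG_WRITE_BYTES)
print(f"segment fills in {fill_s:.2f} s")  # segment fills in 0.33 s
```

At this rate segments turn over several times per second, so the space the commit log holds on disk depends mostly on how quickly memtables flush and free old segments.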
Add worker services representing Cassandra's compaction process — merging SSTables on disk to reduce read amplification and reclaim space from tombstones.
Compaction workers merge multiple SSTables into one, deduplicating keys and dropping tombstones past their gc_grace_seconds. Cassandra supports multiple strategies: STCS (size-tiered, write-optimized), LCS (leveled, read-optimized), and TWCS (time-window, for time-series data).
Each memtable flush creates a new immutable SSTable. Without compaction, read queries must check all SSTables for the most recent version of a row — O(n) read amplification. Compaction keeps this bounded.
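The merge step can be sketched in a few lines. This is a simplified model rather than Cassandra's implementation: SSTables are plain dicts, a value of `None` stands in for a tombstone, and the names `Cell` and `compact` are illustrative.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: Optional[str]  # None marks a tombstone (a delete)
    write_ts: int         # write timestamp in seconds


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables, keep the newest cell per key, drop expired tombstones."""
    merged = {}
    for table in sstables:
        for key, cell in table.items():
            current = merged.get(key)
            if current is None or cell.write_ts > current.write_ts:
                merged[key] = cell
    return {
        key: cell
        for key, cell in merged.items()
        # A tombstone is only purged once it is older than gc_grace_seconds.
        if not (cell.value is None and now - cell.write_ts > gc_grace_seconds)
    }


older = {"a": Cell("1", 100), "b": Cell("x", 100)}
newer = {"a": Cell("2", 200), "b": Cell(None, 150)}

result = compact([older, newer], now=1000, gc_grace_seconds=500)
print(result)  # only key "a" survives, with value "2"
```

After the merge a read for either key touches one SSTable instead of two, and the deleted key `b` no longer occupies space.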
Compaction consumes disk I/O and CPU. STCS requires up to 2x peak disk space during compaction. LCS uses less space but more I/O. Operators must tune compaction throughput to avoid impacting foreground reads.
Discord moved from MongoDB to Cassandra for their message history. They use TWCS for messages (time-series data) — TWCS compacts only within time windows, minimizing cross-window I/O.
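Compaction throughput: throttled per node by compaction_throughput_mb_per_sec (16 MB/s by default in Cassandra 3.x and 4.0). STCS space amplification: up to 2x during compaction. TWCS space amplification: <10%. LCS space amplification: roughly 10%, paid for with higher write amplification because each level is 10x the size of the previous one.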
Add observability tracking Cassandra-specific metrics: SSTable count, compaction lag, tombstone ratio, and replica consistency.
Key Cassandra metrics: SSTable count per node/table, compaction throughput and lag, read/write latency (p50/p99), tombstone warnings in reads, dropped messages (overload), and pending hints (for down nodes).
Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag. These don't appear as errors until SLOs are breached.
Cassandra's metrics are JMX-native, requiring a metrics bridge (Prometheus JMX exporter) for modern observability stacks. Some teams use nodetool status scripts, trading real-time granularity for simplicity.
Apple runs 160K+ Cassandra nodes across multiple datacenters. Their SRE teams rely on compaction lag and SSTable count as primary health signals for proactive scaling.
Healthy SSTable count: <20 per table per node. Tombstone warning threshold: 1000 per read. Compaction lag acceptable: <1 hour for STCS, <15 min for TWCS.
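These thresholds translate naturally into an alert check. The sketch below assumes the metrics have already been scraped (for example through a JMX exporter) into a simple record; `TableStats` and `health_warnings` are hypothetical names, not part of any Cassandra tooling.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    sstable_count: int
    tombstones_per_read: int
    compaction_lag_minutes: float
    strategy: str  # "STCS" or "TWCS"


# Acceptable compaction lag per strategy, taken from the figures above.
LAG_LIMIT_MINUTES = {"STCS": 60, "TWCS": 15}


def health_warnings(stats):
    """Return a list of human-readable warnings for one table on one node."""
    warnings = []
    if stats.sstable_count >= 20:
        warnings.append(f"SSTable count {stats.sstable_count} >= 20")
    if stats.tombstones_per_read >= 1000:
        warnings.append(f"{stats.tombstones_per_read} tombstones per read >= 1000")
    limit = LAG_LIMIT_MINUTES.get(stats.strategy)
    if limit is not None and stats.compaction_lag_minutes >= limit:
        warnings.append(
            f"compaction lag {stats.compaction_lag_minutes} min >= {limit} min"
        )
    return warnings


healthy = TableStats(8, 40, 5, "TWCS")
degraded = TableStats(35, 1200, 20, "TWCS")
print(health_warnings(healthy))   # []
print(health_warnings(degraded))  # three warnings
```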
§2 Step 3: Deep Dive
| System | Write Availability | Consistency | Data Model | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Cassandra (RF=3, QUORUM) | Multi-master, always writeable | Tunable (ONE/QUORUM/ALL) | Wide-column, denormalized | Time-series, IoT, multi-DC writes ✓ | Medium | High |
| DynamoDB | Always writeable (sloppy quorum) | Eventual -> strong | KV + document | Serverless, AWS-native, simple ops | High | Low |
| HBase | Single-region master | Strong (ZooKeeper) | Wide-column (Hadoop) | Batch analytics, Hadoop ecosystem | Medium | High |
| ScyllaDB | Multi-master (Cassandra protocol) | Tunable | Wide-column (Cassandra compat) | Cassandra workloads, lower latency | Medium | High |
| Google Bigtable | Regional HA | Strong | Wide-column | GCP-native, sparse analytics | High | Low |
Distributed wide-column stores — Cassandra wins for multi-DC write availability.
from datetime import datetime, timezone
from uuid import uuid4

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=['cass-dc1-1', 'cass-dc1-2'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east'),
)
# Connect to an existing keyspace; 'app' is a placeholder name.
session = cluster.connect('app')

# CREATE TABLE designed for query pattern -- no joins in Cassandra
# Wide row: one partition per (user, month) -> ~30K rows max per partition
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user (
        user_id  UUID,
        month    TEXT,
        event_ts TIMESTAMP,
        event_id TIMEUUID,
        payload  TEXT,
        PRIMARY KEY ((user_id, month), event_ts, event_id)
    ) WITH CLUSTERING ORDER BY (event_ts DESC, event_id DESC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': 1
      }
      AND default_time_to_live = 2592000;
""")

# Prepared single-row insert. If you batch, use UNLOGGED BATCH within one
# partition only; never use LOGGED BATCH across partitions (coordinator bottleneck).
stmt = session.prepare(
    "INSERT INTO events_by_user (user_id, month, event_ts, event_id, payload) "
    "VALUES (?, ?, ?, now(), ?)"
)

# Placeholder values for illustration.
user_id = uuid4()
payload = '{"type": "login"}'
event_ts = datetime.now(timezone.utc)
# Derive the month bucket from the event timestamp so the two always agree.
session.execute(stmt, [user_id, event_ts.strftime('%Y-%m'), event_ts, payload])

| Component | Why Add It | Tradeoff |
|---|---|---|
| Write-Ahead Log (Commit Log) | Without the commit log, a crash would lose all unflushed memtable data. | The commitlog_sync mode (periodic or batch) controls the tradeoff between durability and throughput. |
| Compaction Workers | Each memtable flush creates a new immutable SSTable. | Compaction consumes disk I/O and CPU. |
| Monitoring | Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag. | Cassandra's metrics are JMX-native, requiring a metrics bridge (Prometheus JMX exporter) for modern observability stacks. |
Design decision tradeoffs
cass-1 goes down. Writes with consistency level QUORUM now require 2 of the 2 remaining nodes (cass-2, cass-3) to acknowledge. If one of those is slow, writes fail. How do you tune consistency levels, use hinted handoff to buffer writes, and repair the node when it comes back up?
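The arithmetic behind this scenario is small enough to check directly. A minimal sketch, assuming a single datacenter and a replication factor of 3; the helper names are illustrative:

```python
def quorum(rf):
    """Replicas that must acknowledge at QUORUM: a strict majority of RF."""
    return rf // 2 + 1


# Acknowledgements required per consistency level, as a function of RF.
REQUIRED_ACKS = {
    "ONE": lambda rf: 1,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}


def write_succeeds(consistency_level, rf, live_replicas):
    """True if enough replicas are up to meet the consistency level."""
    return live_replicas >= REQUIRED_ACKS[consistency_level](rf)


RF = 3
for cl in ("ONE", "QUORUM", "ALL"):
    print(cl, write_succeeds(cl, RF, live_replicas=2))
# ONE True
# QUORUM True
# ALL False
```

With one node down, QUORUM still succeeds but has no slack: both remaining replicas must answer in time. Dropping to ONE trades consistency for availability. A hint stored for the down node does not count toward ONE, QUORUM, or ALL; it only helps the node catch up when it returns.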
A network partition splits the Cassandra ring into two halves. Clients on each side can write to their local nodes. When the partition heals, both halves have divergent data for the same partition keys. How does last-write-wins reconciliation and read repair resolve the conflict?
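Last-write-wins reconciliation can be sketched as picking the cell with the highest write timestamp and noting which replicas returned something older. This is a simplified model: values are plain strings, ties are broken by comparing values, and tombstone handling is left out. The `reconcile` function and the replica names are illustrative.

```python
def reconcile(replica_cells):
    """Pick the winning (value, write_ts) and list replicas that need repair.

    replica_cells maps a replica name to the (value, write_ts) it returned.
    """
    # Highest timestamp wins; equal timestamps fall back to the larger value
    # so every coordinator picks the same winner.
    winner = max(replica_cells.values(), key=lambda cell: (cell[1], cell[0]))
    stale = sorted(name for name, cell in replica_cells.items() if cell != winner)
    return winner, stale


# After the partition heals, the two sides disagree about the same key.
responses = {
    "cass-1": ("blue", 105),
    "cass-2": ("green", 110),
    "cass-3": ("blue", 105),
}

winner, stale = reconcile(responses)
print(winner)  # ('green', 110)
print(stale)   # ['cass-1', 'cass-3']
```

Read repair then writes the winning cell back to the stale replicas. Because the decision rests entirely on timestamps, clock skew between the two sides of the partition can cause the logically later write to lose.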
A time-series workload writes all data for the current hour to the same Cassandra partition, creating a hot row that overwhelms cass-2. Reads and writes for that partition key saturate cass-2's CPU. How do you redesign the partition key with a bucket prefix to spread writes across multiple nodes?
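One way to bucket the key is to add a small hash-derived component so that a single hour maps to several partitions. A sketch under assumptions: the bucket count of 8, the `sensor_id` and `event_id` fields, and the helper names are all hypothetical, and CRC32 is just a convenient stable hash.

```python
import zlib

NUM_BUCKETS = 8  # assumption: tune to the write rate of the hottest key


def bucket_for(event_id, num_buckets=NUM_BUCKETS):
    """Stable bucket number derived from the event id."""
    return zlib.crc32(event_id.encode("utf-8")) % num_buckets


def write_partition_key(sensor_id, hour, event_id):
    """Partition key for a write: the hour is split across NUM_BUCKETS partitions."""
    return (sensor_id, hour, bucket_for(event_id))


def read_partition_keys(sensor_id, hour, num_buckets=NUM_BUCKETS):
    """A read for one hour must query every bucket and merge the results."""
    return [(sensor_id, hour, bucket) for bucket in range(num_buckets)]


key = write_partition_key("sensor-42", "2026-04-01T13", "evt-0001")
all_keys = read_partition_keys("sensor-42", "2026-04-01T13")
print(key in all_keys)  # True
```

Writes for the current hour now spread over up to eight partitions, which the token ring can place on different nodes. The cost moves to the read path: fetching one hour means eight partition reads instead of one.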
§3 Step 4: Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Write-Ahead Log (Commit Log) | Cassandra's commit log is a per-node WAL: every write is appended sequentially before the in-memory memtable is updated. | Without the commit log, a crash would lose all unflushed memtable data. |
| Compaction Workers | Compaction workers merge multiple SSTables into one, deduplicating keys and dropping tombstones past their gc_grace_seconds. | Each memtable flush creates a new immutable SSTable. |
| Monitoring | Key Cassandra metrics: SSTable count per node/table, compaction throughput and lag, read/write latency (p50/p99), tombstone warnings in reads, dropped messages (overload), and pending hints (for down nodes). | Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag. |
Key design decisions