Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Apache Cassandra Architecture

advancedDatabaseConsensus

Distributed Systems·70 min read

Apache Cassandra Architecture

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

DatabaseConsensus

§1Step 2 — High-Level Design

2High-Level Design

A masterless distributed database. Gossip, token rings, read repair, and tunable consistency.

System architecture overview

Interactive diagram locked

Upgrade to Pro to build and run this system.

Stage 1 of 4Starting state — the problem to solve

Progressive build — add each component step by step

Interactive diagram locked

Upgrade to Pro to build and run this system.

Add Write-Ahead Log (Commit Log)

Add write-ahead log nodes representing Cassandra's commit log — every write is durably appended here before being written to the in-memory memtable.

What it does

Cassandra's commit log is a per-node WAL: every write is appended sequentially before the in-memory memtable is updated. When a memtable is flushed to disk as an SSTable, the corresponding commit log segment is discarded.

Why it matters

Without the commit log, a crash would lose all unflushed memtable data. The commit log provides durability at near-zero latency cost because it's sequential disk I/O and is written asynchronously after acknowledgment (with periodic fsync).

Trade-off

CommitLogSyncBatchWindowInMs controls the tradeoff between durability and throughput. 'periodic' sync (default 10ms) risks losing 10ms of writes on crash; 'batch' sync waits for fsync before ack — higher durability, lower throughput.

Real world

Cassandra's write path achieves 100K+ writes/sec per node because commit log append + memtable insert are both sequential/in-memory operations. Netflix uses Cassandra for viewing history at 1M+ writes/sec cluster-wide.

Capacity math

Commit log size: typically 1 segment = 32 MB. At 100K writes/sec (avg 1KB), a segment fills in ~5 minutes. SSD-backed commit logs: <1ms p99 write latency.

In the real world: Cassandra's write path achieves 100K+ writes/sec per node because commit log append + memtable insert are both sequential/in-memory operations. Netflix uses Cassandra for viewing history at 1M+ writes/sec cluster-wide.

Add Compaction Workers

Add worker services representing Cassandra's compaction process — merging SSTables on disk to reduce read amplification and reclaim space from tombstones.

What it does

Compaction workers merge multiple SSTables into one, deduplicating keys and dropping tombstones past their gc_grace_seconds. Cassandra supports multiple strategies: STCS (size-tiered, write-optimized), LCS (leveled, read-optimized), and TWCS (time-window, for time-series data).

Why it matters

Each memtable flush creates a new immutable SSTable. Without compaction, read queries must check all SSTables for the most recent version of a row — O(n) read amplification. Compaction keeps this bounded.

Trade-off

Compaction consumes disk I/O and CPU. STCS requires up to 2x peak disk space during compaction. LCS uses less space but more I/O. Operators must tune compaction throughput to avoid impacting foreground reads.

Real world

Discord moved from MongoDB to Cassandra for their message history. They use TWCS for messages (time-series data) — TWCS compacts only within time windows, minimizing cross-window I/O.

Capacity math

Compaction throughput: 8–16 MB/s per node (default throttle). STCS space amplification: up to 2x. TWCS space amplification: <10%. LCS disk space: ~10x data size (10 levels).

In the real world: Discord moved from MongoDB to Cassandra for their message history. They use TWCS for messages (time-series data) — TWCS compacts only within time windows, minimizing cross-window I/O.

Add Monitoring

Add observability tracking Cassandra-specific metrics: SSTable count, compaction lag, tombstone ratio, and replica consistency.

What it does

Key Cassandra metrics: SSTable count per node/table, compaction throughput and lag, read/write latency (p50/p99), tombstone warnings in reads, dropped messages (overload), and pending hints (for down nodes).

Why it matters

Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag. These don't appear as errors until SLOs are breached.

Trade-off

Cassandra's metrics are JMX-native, requiring a metrics bridge (Prometheus JMX exporter) for modern observability stacks. Some teams use nodetool status scripts, trading real-time granularity for simplicity.

Real world

Apple runs 160K+ Cassandra nodes across multiple datacenters. Their SRE teams rely on compaction lag and SSTable count as primary health signals for proactive scaling.

Capacity math

Healthy SSTable count: <20 per table per node. Tombstone warning threshold: 1000 per read. Compaction lag acceptable: <1 hour for STCS, <15 min for TWCS.

In the real world: Apple runs 160K+ Cassandra nodes across multiple datacenters. Their SRE teams rely on compaction lag and SSTable count as primary health signals for proactive scaling.

Cassandra Node Failure: cass-1 goes down. Writes with consistency level QUORUM now require 2 of the 2 remaining nodes (cass-2, cass-3) to acknowledge. If one of those is slow, writes fail. How do you tune consistency levels, use hinted handoff to buffer writes, and repair the node when it comes back up?

§2Step 3 — Deep Dive

3Deep Dive

System	Write Availability	Consistency	Data Model	Best for	Cost	Ops burden
Cassandra (RF=3, QUORUM)	Multi-master, always writeable	Tunable (ONE/QUORUM/ALL)	Wide-column, denormalized	Time-series, IoT, multi-DC writes ✓	Medium	High
DynamoDB	Always writeable (sloppy quorum)	Eventual -> strong	KV + document	Serverless, AWS-native, simple ops	High	Low
HBase	Single-region master	Strong (ZooKeeper)	Wide-column (Hadoop)	Batch analytics, Hadoop ecosystem	Medium	High
ScyllaDB	Multi-master (Cassandra protocol)	Tunable	Wide-column (Cassandra compat)	Cassandra workloads, lower latency	Medium	High
Google Bigtable	Regional HA	Strong	Wide-column	GCP-native, sparse analytics	High	Low

Distributed wide-column stores — Cassandra wins for multi-DC write availability.

pythonCassandra data model — time-series with TTL + compaction strategy

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from datetime import datetime

cluster = Cluster(
    contact_points=['cass-dc1-1', 'cass-dc1-2'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='us-east'),
)
session = cluster.connect()

# CREATE TABLE designed for query pattern -- no joins in Cassandra
# Wide row: one partition per (user, month) -> ~30K rows max per partition
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user (
        user_id   UUID,
        month     TEXT,
        event_ts  TIMESTAMP,
        event_id  TIMEUUID,
        payload   TEXT,
        PRIMARY KEY ((user_id, month), event_ts, event_id)
    ) WITH CLUSTERING ORDER BY (event_ts DESC)
      AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
      }
      AND default_time_to_live = 2592000;
""")

# UNLOGGED BATCH -- same partition only
# Never use LOGGED BATCH across partitions (coordinator bottleneck)
stmt = session.prepare(
    "INSERT INTO events_by_user (user_id, month, event_ts, event_id, payload) VALUES (?, ?, ?, now(), ?)"
)
session.execute(stmt, [user_id, '2026-04', datetime.utcnow(), payload])

Component	Why Add It	Tradeoff
Write-Ahead Log (Commit Log)	Without the commit log, a crash would lose all unflushed memtable data.	CommitLogSyncBatchWindowInMs controls the tradeoff between durability and throughput.
Compaction Workers	Each memtable flush creates a new immutable SSTable.	Compaction consumes disk I/O and CPU.
Monitoring	Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag.	Cassandra's metrics are JMX-native, requiring a metrics bridge (Prometheus JMX exporter) for modern observability stacks.

Design decision tradeoffs

Cassandra Node Failure

cass-1 goes down. Writes with consistency level QUORUM now require 2 of the 2 remaining nodes (cass-2, cass-3) to acknowledge. If one of those is slow, writes fail. How do you tune consistency levels, use hinted handoff to buffer writes, and repair the node when it comes back up?

Ring Partition Split-Brain

A network partition splits the Cassandra ring into two halves. Clients on each side can write to their local nodes. When the partition heals, both halves have divergent data for the same partition keys. How does last-write-wins reconciliation and read repair resolve the conflict?

Hot Partition Key Overload

A time-series workload writes all data for the current hour to the same Cassandra partition, creating a hot row that overwhelms cass-2. Reads and writes for that partition key saturate cass-2's CPU. How do you redesign the partition key with a bucket prefix to spread writes across multiple nodes?

Model the write path explicitly: each Cassandra node should have a commit log in front of its LSM storage engine. A durable write lands in the commit log, updates the memtable, flushes to immutable SSTables, and is later merged by compaction workers.

Cassandra is masterless. Any node can coordinate a read or write, but gossip must connect the cluster so all nodes learn membership and failures without a leader. Build a dense peer network between the storage nodes instead of routing everything through one special node.

The background system matters too: compaction and anti-entropy repair are essential to keep LSM storage healthy and replicas converged. Observability should exist because Cassandra failures show up in compaction backlog, hinted handoff, and node-state drift before users notice.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Write-Ahead Log (Commit Log)	Cassandra's commit log is a per-node WAL: every write is appended sequentially before the in-memory memtable is updated.	Without the commit log, a crash would lose all unflushed memtable data.
Compaction Workers	Compaction workers merge multiple SSTables into one, deduplicating keys and dropping tombstones past their gc_grace_seconds.	Each memtable flush creates a new immutable SSTable.
Monitoring	Key Cassandra metrics: SSTable count per node/table, compaction throughput and lag, read/write latency (p50/p99), tombstone warnings in reads, dropped messages (overload), and pending hints (for down nodes).	Cassandra's LSM-tree structure means performance can degrade silently: too many SSTables raise read latency, excessive tombstones slow queries, and pending hints indicate replica lag.

Key design decisions

If the interviewer asks to scale 10×: 10x the load — architectural moves that work. Identify the single bottleneck (usually the database write path) and address it first before horizontal scaling.

10× Target50.0M RPSwhere your architecture must hold

What's next