Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Amazon DynamoDB Internals
§1Step 2 — High-Level Design
Understand the paper that defined NoSQL. Consistent hashing, vector clocks, sloppy quorum, and gossip.
Interactive diagram locked
Upgrade to Pro to build and run this system.
Interactive diagram locked
Upgrade to Pro to build and run this system.
Add a Vector Clock to track causal ordering of writes across distributed Dynamo nodes. Each node maintains its own version counter.
A vector clock is a data structure that assigns a logical timestamp per replica. Each write increments the local counter and includes the full vector, enabling causal ordering without global coordination. When a node receives a write, it increments its own entry and includes all entries, forming a per-key version history. Dynamo uses vector clocks to detect causality: if write B's vector dominates write A's, then B happened after A; if neither dominates, they are concurrent.
Dynamo uses eventually consistent replication with sloppy quorums. When N=3, W=2, R=2, two nodes may receive conflicting writes. Vector clocks capture the 'happens-before' relationship so the coordinator can detect conflicts and hand them to the application for reconciliation. Without vector clocks, last-write-wins would silently lose concurrent writes; with them, the application can merge divergent versions.
Vector clocks grow proportionally to the number of nodes that touched the key. Amazon bounded this by truncating stale entries — sacrificing perfect causality for bounded metadata size. Very old vector clock entries for inactive replicas are discarded, reducing memory overhead while maintaining recent-write ordering.
Amazon Shopping Cart used Dynamo's vector clock reconciliation. If two clients added items offline, both versions were surfaced for union-merge rather than last-write-wins. Each item update carried a vector clock; when versions diverged, the cart merged them client-side, ensuring no loses.
Vector clock metadata: 10-20 bytes per node in the preference list. For a 3-node preference list, ~60 bytes overhead per object version. With millions of keys, this adds GB-scale metadata overhead, motivating Amazon's truncation strategy.
Add worker services that run Merkle-tree anti-entropy to detect and repair diverged replicas in the background without affecting the read/write path.
Anti-entropy workers build Merkle trees over key ranges. Two replicas exchange tree hashes; only differing branches trigger full key-value transfer, limiting sync bandwidth to the diverged portion. A replica rebuilds its Merkle tree periodically (e.g., hourly), computing hashes bottom-up from leaf keys. When two replicas compare, they negotiate which subtrees differ and stream only those key-value pairs.
During failures, hinted handoff stores writes on standby nodes. When the original node recovers, anti-entropy ensures it receives all writes it missed — without a full replica scan. With millions of keys, a full key-value sync is expensive. Merkle trees reduce this to logarithmic bandwidth in the number of diverged keys.
Merkle tree construction is CPU-intensive. Dynamo rates-limits anti-entropy to avoid impacting foreground traffic, accepting that repair may lag by minutes to hours. A large Merkle tree (1M keys) consumes significant CPU and memory; operators must tune the frequency.
Amazon's internal S3-like services use the same anti-entropy pattern. Merkle tree anti-entropy is also used in Cassandra and Riak. Netflix runs anti-entropy on Cassandra every 24 hours for each replica, ensuring eventual consistency even with data corruption or byzantine failures.
A Merkle tree over 1M keys: 20MB in memory per node. Anti-entropy sync bandwidth: typically <1% of replication traffic in steady state. Computing a 1M-key Merkle tree: ~2-5 seconds on modern hardware.
Add observability to track quorum success rates, hinted handoff queue depth, and anti-entropy lag across all nodes.
Monitoring tracks: per-node write latency, quorum failure rate (W or R not satisfied), hinted handoff queue depth, anti-entropy sync rate, and read repair frequency. Each node exposes metrics like 'hinted_handoff_pending' (count of writes awaiting replay), 'quorum_failures_per_sec', and 'merkle_tree_build_latency_ms'. These are aggregated into dashboards and alert on thresholds.
Dynamo's correctness depends on all N replicas eventually converging. Monitoring hinted handoff queue depth and anti-entropy lag catches cases where a recovering node is falling further behind, risking data loss if a second failure occurs. A rising handoff queue + anti-entropy lag = converging slowly = risk of split-brain if network partitions again.
Each additional metric adds instrumentation overhead. Dynamo's operators focused on a small set of SLO-relevant metrics (p99 latency, quorum failure rate) and relied on anti-entropy lag as an indirect health signal. Detailed per-key monitoring is expensive; aggregate metrics per node/replica suffice.
Amazon's internal dashboards track Dynamo's 'consistency horizon' — the maximum age of any unresolved divergence across replicas — as the primary SLO metric. When horizon exceeds 5 minutes, alerts fire; teams investigate.
Monitoring overhead: <1% CPU per node. Metric cardinality: 50 time series per Dynamo cluster. Hinted handoff queue: typically <100 items; >1000 signals sustained failure.
§2Step 3 — Deep Dive
A vector clock is a data structure that assigns a logical timestamp per replica. Each write increments the local counter and includes the full vector, enabling causal ordering without global coordination. When a node receives a write, it increments its own entry and includes all entries, forming a per-key version history. Dynamo uses vector clocks to detect causality: if write B's vector dominates write A's, then B happened after A; if neither dominates, they are concurrent.
| System | Consistency Model | Availability | Conflict Resolution | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| DynamoDB (Dynamo-style) | Eventual (tunable) | Always writeable | Last-Write-Wins + versions | Shopping carts, user sessions ✓ | High | Low |
| Cassandra | Eventual (tunable CL) | Multi-master | LWW / TTL tombstones | Time-series, IoT, write-heavy | Medium | High |
| Google Bigtable | Strong (single-master tablet) | Regional HA | No conflict (single writer) | Analytics, sparse data, GCP native | High | Low |
| Google Spanner | Strong (TrueTime) | Global HA | No conflict (Paxos) | Financial transactions, global SQL | High | Low |
| HBase | Strong (single-master) | HA via ZooKeeper | No conflict (single writer) | Hadoop ecosystem, batch workloads | Medium | High |
Wide-column / KV stores — Dynamo wins for always-on availability at massive scale.
import time
from collections import defaultdict
class VectorClock:
def __init__(self, node_id: str, nodes: list):
self.node_id = node_id
self.clock = {n: 0 for n in nodes}
def increment(self):
self.clock[self.node_id] += 1
return dict(self.clock)
def merge(self, other: dict) -> dict:
"""On read-repair: take element-wise max."""
return {n: max(self.clock.get(n, 0), other.get(n, 0)) for n in set(self.clock) | set(other)}
def happens_before(self, a: dict, b: dict) -> bool:
"""True if a causally precedes b."""
return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b
# Hinted handoff: write to a stand-in when target node is down
class HintedHandoff:
def __init__(self):
self.hints: dict = defaultdict(list) # target_node -> [writes]
def store_hint(self, target_node: str, key: str, value: bytes, vc: dict):
self.hints[target_node].append({'key': key, 'value': value, 'vc': vc, 'ts': time.time()})
def replay(self, target_node: str, send_fn):
for hint in self.hints.pop(target_node, []):
send_fn(target_node, hint) # replay once node recovers
# W=2, R=2, N=3 -> always consistent without coordinator
# Sloppy quorum: if node A is down, write to A' and store hint| Component | Why Add It | Tradeoff |
|---|---|---|
| Vector Clock | Dynamo uses eventually consistent replication with sloppy quorums. | Vector clocks grow proportionally to the number of nodes that touched the key. |
| Anti-Entropy Worker Services | During failures, hinted handoff stores writes on standby nodes. | Merkle tree construction is CPU-intensive. |
| Monitoring | Dynamo's correctness depends on all N replicas eventually converging. | Each additional metric adds instrumentation overhead. |
Design decision tradeoffs
A home replica fails. The write should still succeed through sloppy quorum, with hinted handoff replaying data back after recovery.
Node 2 becomes unreachable from the preference list. Coordinator must route around the failed node and write to a handoff node. Vector clocks must detect divergence when node 2 rejoins and has stale data.
A frequently accessed key's preference list (dynamo-1, dynamo-2, dynamo-3) becomes unbalanced due to request hotspotting. Demonstrate how read replicas and vector clock reconciliation prevent silent data loss under conflicting writes to the hot key.
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Vector Clock | A vector clock is a data structure that assigns a logical timestamp per replica. | Dynamo uses eventually consistent replication with sloppy quorums. |
| Anti-Entropy Worker Services | Anti-entropy workers build Merkle trees over key ranges. | During failures, hinted handoff stores writes on standby nodes. |
| Monitoring | Monitoring tracks: per-node write latency, quorum failure rate (W or R not satisfied), hinted handoff queue depth, anti-entropy sync rate, and read repair frequency. | Dynamo's correctness depends on all N replicas eventually converging. |
Key design decisions