Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Amazon DynamoDB Internals

advancedDatabaseDistributed Systems

Distributed Systems·70 min read

Amazon DynamoDB Internals

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

DatabaseDistributed Systems

§1Step 2 — High-Level Design

2High-Level Design

Understand the paper that defined NoSQL. Consistent hashing, vector clocks, sloppy quorum, and gossip.

System architecture overview

Interactive diagram locked

Upgrade to Pro to build and run this system.

Stage 1 of 4Starting state — the problem to solve

Progressive build — add each component step by step

Interactive diagram locked

Upgrade to Pro to build and run this system.

Add a Vector Clock

Add a Vector Clock to track causal ordering of writes across distributed Dynamo nodes. Each node maintains its own version counter.

What it does

A vector clock is a data structure that assigns a logical timestamp per replica. Each write increments the local counter and includes the full vector, enabling causal ordering without global coordination. When a node receives a write, it increments its own entry and includes all entries, forming a per-key version history. Dynamo uses vector clocks to detect causality: if write B's vector dominates write A's, then B happened after A; if neither dominates, they are concurrent.

Why it matters

Dynamo uses eventually consistent replication with sloppy quorums. When N=3, W=2, R=2, two nodes may receive conflicting writes. Vector clocks capture the 'happens-before' relationship so the coordinator can detect conflicts and hand them to the application for reconciliation. Without vector clocks, last-write-wins would silently lose concurrent writes; with them, the application can merge divergent versions.

Trade-off

Vector clocks grow proportionally to the number of nodes that touched the key. Amazon bounded this by truncating stale entries — sacrificing perfect causality for bounded metadata size. Very old vector clock entries for inactive replicas are discarded, reducing memory overhead while maintaining recent-write ordering.

Real world

Amazon Shopping Cart used Dynamo's vector clock reconciliation. If two clients added items offline, both versions were surfaced for union-merge rather than last-write-wins. Each item update carried a vector clock; when versions diverged, the cart merged them client-side, ensuring no loses.

Capacity math

Vector clock metadata: 10-20 bytes per node in the preference list. For a 3-node preference list, ~60 bytes overhead per object version. With millions of keys, this adds GB-scale metadata overhead, motivating Amazon's truncation strategy.

In the real world: Amazon Shopping Cart used Dynamo's vector clock reconciliation. If two clients added items offline, both versions were surfaced for union-merge rather than last-write-wins. Each item update carried a vector clock; when versions diverged, the cart merged them client-side, ensuring no loses.

Add Anti-Entropy Worker Services

Add worker services that run Merkle-tree anti-entropy to detect and repair diverged replicas in the background without affecting the read/write path.

What it does

Anti-entropy workers build Merkle trees over key ranges. Two replicas exchange tree hashes; only differing branches trigger full key-value transfer, limiting sync bandwidth to the diverged portion. A replica rebuilds its Merkle tree periodically (e.g., hourly), computing hashes bottom-up from leaf keys. When two replicas compare, they negotiate which subtrees differ and stream only those key-value pairs.

Why it matters

During failures, hinted handoff stores writes on standby nodes. When the original node recovers, anti-entropy ensures it receives all writes it missed — without a full replica scan. With millions of keys, a full key-value sync is expensive. Merkle trees reduce this to logarithmic bandwidth in the number of diverged keys.

Trade-off

Merkle tree construction is CPU-intensive. Dynamo rates-limits anti-entropy to avoid impacting foreground traffic, accepting that repair may lag by minutes to hours. A large Merkle tree (1M keys) consumes significant CPU and memory; operators must tune the frequency.

Real world

Amazon's internal S3-like services use the same anti-entropy pattern. Merkle tree anti-entropy is also used in Cassandra and Riak. Netflix runs anti-entropy on Cassandra every 24 hours for each replica, ensuring eventual consistency even with data corruption or byzantine failures.

Capacity math

A Merkle tree over 1M keys: 20MB in memory per node. Anti-entropy sync bandwidth: typically <1% of replication traffic in steady state. Computing a 1M-key Merkle tree: ~2-5 seconds on modern hardware.

In the real world: Amazon's internal S3-like services use the same anti-entropy pattern. Merkle tree anti-entropy is also used in Cassandra and Riak. Netflix runs anti-entropy on Cassandra every 24 hours for each replica, ensuring eventual consistency even with data corruption or byzantine failures.

Add Monitoring

Add observability to track quorum success rates, hinted handoff queue depth, and anti-entropy lag across all nodes.

What it does

Monitoring tracks: per-node write latency, quorum failure rate (W or R not satisfied), hinted handoff queue depth, anti-entropy sync rate, and read repair frequency. Each node exposes metrics like 'hinted_handoff_pending' (count of writes awaiting replay), 'quorum_failures_per_sec', and 'merkle_tree_build_latency_ms'. These are aggregated into dashboards and alert on thresholds.

Why it matters

Dynamo's correctness depends on all N replicas eventually converging. Monitoring hinted handoff queue depth and anti-entropy lag catches cases where a recovering node is falling further behind, risking data loss if a second failure occurs. A rising handoff queue + anti-entropy lag = converging slowly = risk of split-brain if network partitions again.

Trade-off

Each additional metric adds instrumentation overhead. Dynamo's operators focused on a small set of SLO-relevant metrics (p99 latency, quorum failure rate) and relied on anti-entropy lag as an indirect health signal. Detailed per-key monitoring is expensive; aggregate metrics per node/replica suffice.

Real world

Amazon's internal dashboards track Dynamo's 'consistency horizon' — the maximum age of any unresolved divergence across replicas — as the primary SLO metric. When horizon exceeds 5 minutes, alerts fire; teams investigate.

Capacity math

Monitoring overhead: <1% CPU per node. Metric cardinality: 50 time series per Dynamo cluster. Hinted handoff queue: typically <100 items; >1000 signals sustained failure.

In the real world: Amazon's internal dashboards track Dynamo's 'consistency horizon' — the maximum age of any unresolved divergence across replicas — as the primary SLO metric. When horizon exceeds 5 minutes, alerts fire; teams investigate.

Node 2 Failure: A home replica fails. The write should still succeed through sloppy quorum, with hinted handoff replaying data back after recovery.

§2Step 3 — Deep Dive

3Deep Dive

System	Consistency Model	Availability	Conflict Resolution	Best for	Cost	Ops burden
DynamoDB (Dynamo-style)	Eventual (tunable)	Always writeable	Last-Write-Wins + versions	Shopping carts, user sessions ✓	High	Low
Cassandra	Eventual (tunable CL)	Multi-master	LWW / TTL tombstones	Time-series, IoT, write-heavy	Medium	High
Google Bigtable	Strong (single-master tablet)	Regional HA	No conflict (single writer)	Analytics, sparse data, GCP native	High	Low
Google Spanner	Strong (TrueTime)	Global HA	No conflict (Paxos)	Financial transactions, global SQL	High	Low
HBase	Strong (single-master)	HA via ZooKeeper	No conflict (single writer)	Hadoop ecosystem, batch workloads	Medium	High

Wide-column / KV stores — Dynamo wins for always-on availability at massive scale.

pythonVector clock conflict resolution + hinted handoff

import time
from collections import defaultdict

class VectorClock:
    def __init__(self, node_id: str, nodes: list):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def increment(self):
        self.clock[self.node_id] += 1
        return dict(self.clock)

    def merge(self, other: dict) -> dict:
        """On read-repair: take element-wise max."""
        return {n: max(self.clock.get(n, 0), other.get(n, 0)) for n in set(self.clock) | set(other)}

    def happens_before(self, a: dict, b: dict) -> bool:
        """True if a causally precedes b."""
        return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b

# Hinted handoff: write to a stand-in when target node is down
class HintedHandoff:
    def __init__(self):
        self.hints: dict = defaultdict(list)  # target_node -> [writes]

    def store_hint(self, target_node: str, key: str, value: bytes, vc: dict):
        self.hints[target_node].append({'key': key, 'value': value, 'vc': vc, 'ts': time.time()})

    def replay(self, target_node: str, send_fn):
        for hint in self.hints.pop(target_node, []):
            send_fn(target_node, hint)  # replay once node recovers

# W=2, R=2, N=3 -> always consistent without coordinator
# Sloppy quorum: if node A is down, write to A' and store hint

Component	Why Add It	Tradeoff
Vector Clock	Dynamo uses eventually consistent replication with sloppy quorums.	Vector clocks grow proportionally to the number of nodes that touched the key.
Anti-Entropy Worker Services	During failures, hinted handoff stores writes on standby nodes.	Merkle tree construction is CPU-intensive.
Monitoring	Dynamo's correctness depends on all N replicas eventually converging.	Each additional metric adds instrumentation overhead.

Design decision tradeoffs

Node 2 Failure

A home replica fails. The write should still succeed through sloppy quorum, with hinted handoff replaying data back after recovery.

Network Partition: Dynamo-2 Isolated

Node 2 becomes unreachable from the preference list. Coordinator must route around the failed node and write to a handoff node. Vector clocks must detect divergence when node 2 rejoins and has stale data.

Hot Key: Single Key Skew

A frequently accessed key's preference list (dynamo-1, dynamo-2, dynamo-3) becomes unbalanced due to request hotspotting. Demonstrate how read replicas and vector clock reconciliation prevent silent data loss under conflicting writes to the hot key.

Model Dynamo as a ring of equal peers rather than a leader-based cluster. Clients can hit any coordinator, which forwards the request to the preference list on the ring and writes to multiple replicas according to N=3, W=2.

Sloppy quorum needs more than peer connectivity. Add vector clocks for conflict tracking and a hinted handoff path so writes for a failed home replica can be accepted by another node and replayed later when that replica returns.

The background system is anti-entropy and repair. Dynamo stays highly available by tolerating temporary divergence, then converging later. Your architecture should therefore include repair capacity and observability, not just the foreground read/write path.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Vector Clock	A vector clock is a data structure that assigns a logical timestamp per replica.	Dynamo uses eventually consistent replication with sloppy quorums.
Anti-Entropy Worker Services	Anti-entropy workers build Merkle trees over key ranges.	During failures, hinted handoff stores writes on standby nodes.
Monitoring	Monitoring tracks: per-node write latency, quorum failure rate (W or R not satisfied), hinted handoff queue depth, anti-entropy sync rate, and read repair frequency.	Dynamo's correctness depends on all N replicas eventually converging.

Key design decisions

If the interviewer asks to scale 10×: 10x the load — architectural moves that work. Identify the single bottleneck (usually the database write path) and address it first before horizontal scaling.

10× Target200.0M RPSwhere your architecture must hold

What's next