Open on desktop
Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.
Facebook Messenger
§1Step 2 — High-Level Design
Deliver 100B messages per day. Multi-device sync, end-to-end encryption, and offline delivery.
Add more WebSocket gateway nodes beyond the starter. Messenger shards the 1B+ concurrent connection load horizontally across many gateway servers.
WebSocket gateways maintain persistent TCP connections from Messenger clients. Each gateway handles a slice of the user population (sharded by userID). A session registry maps userID+deviceID to the gateway currently holding that socket.
Horizontal sharding of connections prevents any single gateway from becoming a bottleneck. When a message arrives for user A, the fan-out service looks up A's gateway in the session registry and forwards only to that server.
Adding gateways increases the cost of global broadcasts (e.g., system messages). Messenger avoids broadcasting by design — almost all messages are point-to-point or to a group of defined members.
Meta's Iris system replaced MQTT with a custom protocol for Messenger connections. At 1.3B users, Meta runs O(10K) gateway processes across multiple datacenters.
~100K connections per gateway process (each connection: ~8 KB memory). 1B concurrent connections = ~10K gateway processes. Session registry entry: ~50 bytes/connection.
Add Kafka to decouple message receipt from delivery. The Chat API publishes messages to Kafka; the fan-out service consumes and delivers to all recipient devices.
The Chat API persists the message to the NoSQL store, then publishes to Kafka (keyed by conversationID). The fan-out service consumes from Kafka, expands group membership, and routes per-device delivery to WebSocket gateways or the push gateway.
Kafka decouples the write path from delivery. The Chat API acknowledges the sender as soon as the message is persisted — delivery happens asynchronously. This keeps write latency low regardless of how many recipients need to be notified.
Async delivery via Kafka adds 2–10ms delivery latency vs direct synchronous delivery. For Messenger's 100ms p99 target, this is acceptable. The tradeoff enables horizontal scaling of fan-out independent of the write path.
Meta's Iris messaging system uses a similar async fanout pattern. WhatsApp (also Meta) uses a different approach — direct XMPP delivery — but Messenger's scale requires the Kafka-backed model.
Kafka partitioned by conversationID. Fan-out message rate: billions/day. For a 500-member group chat: 500 delivery instructions per message. Kafka throughput target: 10M messages/second cluster-wide.
Add Redis clusters for session routing (userID → gatewayID mapping) and presence state (online/offline/last seen per user).
Session cache: maps (userID, deviceID) → (gatewayID, connectionTimestamp). Updated on connect/disconnect with CAS operations to handle reconnects. Presence cache: stores last-seen timestamp and online status per user with TTL — gateway heartbeats refresh it.
Without a session cache, delivering a message to user A requires broadcasting to all gateways and checking if A is connected — O(gateways) per delivery. With a session cache, it's O(1): look up A's gateway, send directly.
Session cache entries can become stale if a gateway crashes without cleanly evicting connections. Messenger uses short TTLs (30–60s) with gateway heartbeat refresh to bound staleness. Stale entries fall back to push notification.
Meta's session registry for Messenger is a distributed Redis cluster with multi-datacenter replication. At 1B users with ~2 devices each: 2B session entries × 50 bytes = 100 GB of session state.
Session entry: ~50 bytes. 2B entries = 100 GB. Presence entry: ~20 bytes. Heartbeat rate: 1/30s per connected device. At 1B online: 33M heartbeats/second to Redis.
Add a wide-column NoSQL database (HBase/Cassandra) partitioned by conversationID to durably store all messages and serve conversation history.
Each row key = conversationID. Within a row, messages are ordered by sequence number (monotonically increasing per conversation). Columns: senderID, content, type (text/media/reaction), timestamp, read receipts. Read receipt state is stored as a sparse column per member.
Messages are write-heavy and read in conversation-order chunks (pagination). Wide-column stores are optimized for this access pattern: sequential reads within a partition are fast, and writes are append-mostly (new message = new column).
Wide-column stores trade query flexibility for write/range-scan performance. Ad-hoc queries (e.g., 'find all messages from user X in any conversation') require secondary indexes or full scans. Messenger restricts queries to conversation-scoped access.
Meta built their own HBase-based message store called ZippyDB and later Apache Hive/RocksDB derivatives. WhatsApp stores messages client-side; Messenger stores server-side for cross-device sync.
100B messages/day. Avg message: 500 bytes. Daily ingest: 50 TB. With 30-day retention: 1.5 PB. Partitioned across thousands of HBase RegionServers.
§2Step 3 — Deep Dive
WebSocket gateways maintain persistent TCP connections from Messenger clients. Each gateway handles a slice of the user population (sharded by userID). A session registry maps userID+deviceID to the gateway currently holding that socket.
| Protocol | Overhead | Battery impact | Message ordering | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| MQTT over TLS | 2-byte fixed header | Very low (keep-alive ping) | Per-topic ordered | Mobile messaging, IoT, billions of devices ✓ | Low | Low |
| WebSocket | Variable frame header | Medium (persistent TCP) | Application-level | Web chat, real-time dashboards | Low | Medium |
| HTTP/2 push | HTTP headers (~50B) | Low (multiplexed) | Stream-level | Web push notifications, PWA | Low | Low |
| Long-polling | ~800B HTTP header | High (new conn per msg) | None (client-ordered) | Firewall-friendly legacy fallback | Low | Low |
| XMPP | XML overhead (~500B) | Medium | None built-in | Legacy enterprise IM (Jabber) | Low | Low |
Mobile messaging protocols — MQTT wins for battery-constrained always-on connections.
import time
import happybase
conn = happybase.Connection('hbase-thrift:9090')
table = conn.table('messages')
def store_message(conv_id: str, msg_id: str, sender_id: str, content: bytes):
ts = int(time.time() * 1000)
# Reverse timestamp: newer messages have smaller row keys -> efficient range scan
rk = f"{conv_id}#{(2**63 - ts):020d}#{msg_id}".encode()
table.put(rk, {
b'msg:sender_id': sender_id.encode(),
b'msg:content': content,
b'msg:type': b'text',
b'msg:status': b'delivered',
})
def get_recent_messages(conv_id: str, limit: int = 50):
prefix = f"{conv_id}#".encode()
return list(table.scan(row_prefix=prefix, limit=limit))
# Inbox fan-out: write to each participant's inbox table
# For group chats >1000 members: async fan-out via message queue
# Read receipts: separate HBase table, updated on delivery ACK from MQTT broker| Component | Why Add It | Tradeoff |
|---|---|---|
| Additional WebSocket Gateways | Horizontal sharding of connections prevents any single gateway from becoming a bottleneck. | Adding gateways increases the cost of global broadcasts (e. |
| Kafka for Message Routing | Kafka decouples the write path from delivery. | Async delivery via Kafka adds 2–10ms delivery latency vs direct synchronous delivery. |
| Redis Tiers | Without a session cache, delivering a message to user A requires broadcasting to all gateways and checking if A is connected — O(gateways) per delivery. | Session cache entries can become stale if a gateway crashes without cleanly evicting connections. |
| Wide-Column Message Store | Messages are write-heavy and read in conversation-order chunks (pagination). | Wide-column stores trade query flexibility for write/range-scan performance. |
Design decision tradeoffs
ws-gateway-a crashes. 100K users lose their long-lived WebSocket connections. All clients attempt to reconnect simultaneously, causing a thundering herd on remaining gateways. How do you implement backoff-with-jitter reconnect, connection state persistence in cache-1, and load shedding on the surviving gateways?
db-1 (message store) is partitioned from ws-gateway-a. Messages are sent by users but can't be persisted. How do you buffer messages in cache-1, deduplicate on recovery, and reconcile message ordering when db-1 reconnects?
A group with 250 members all send messages simultaneously during a live event. Each message requires fan-out to 249 other members' WebSocket connections. How do you implement server-side pub/sub, delivery receipts, and deferred fan-out for offline members?
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Additional WebSocket Gateways | WebSocket gateways maintain persistent TCP connections from Messenger clients. | Horizontal sharding of connections prevents any single gateway from becoming a bottleneck. |
| Kafka for Message Routing | The Chat API persists the message to the NoSQL store, then publishes to Kafka (keyed by conversationID). | Kafka decouples the write path from delivery. |
| Redis Tiers | Session cache: maps (userID, deviceID) → (gatewayID, connectionTimestamp). | Without a session cache, delivering a message to user A requires broadcasting to all gateways and checking if A is connected — O(gateways) per delivery. |
| Wide-Column Message Store | Each row key = conversationID. | Messages are write-heavy and read in conversation-order chunks (pagination). |
Key design decisions