Real-Time Chat System
§1 Step 2: High-Level Design
Build a chat system supporting 50M concurrent users. WebSockets, presence, and message ordering.
Connect Redis to both WebSocket gateways. Messages published to a room channel are delivered to all subscribers on any gateway.
Redis pub/sub is a lightweight messaging layer in which clients (here, the WebSocket gateways) subscribe to named channels (one per chat room) and publish messages to them. When WS Gateway 1 receives a user message, it publishes to the room's channel, for example 'room:<room-id>'. WS Gateway 1 and WS Gateway 2 are both subscribed to that channel, so both receive the message and fan it out to their own connected clients. This creates a shared message bus across all WebSocket servers without requiring the gateways to talk to each other directly. Redis keeps subscription state in memory, which makes pub/sub operations very fast (typically sub-millisecond), but messages are not persisted: only clients subscribed at publish time receive them.
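The delivery semantics described above can be sketched with a small in-memory stand-in for Redis. `InMemoryPubSub`, the inbox arrays, and the channel name are hypothetical and exist only for illustration; a real gateway would use a Redis client. The sketch shows one publish reaching every subscribed gateway, and a late subscriber missing the earlier message.

```typescript
// In-memory stand-in for Redis pub/sub (hypothetical; not a Redis client).
type Handler = (message: string) => void

class InMemoryPubSub {
  private channels = new Map<string, Set<Handler>>()

  subscribe(channel: string, handler: Handler): void {
    let handlers = this.channels.get(channel)
    if (!handlers) {
      handlers = new Set()
      this.channels.set(channel, handlers)
    }
    handlers.add(handler)
  }

  // Returns how many subscribers received the message,
  // mirroring the integer reply of Redis PUBLISH.
  publish(channel: string, message: string): number {
    const handlers = this.channels.get(channel)
    if (!handlers) return 0 // nobody subscribed: the message is dropped
    handlers.forEach((handler) => handler(message))
    return handlers.size
  }
}

const bus = new InMemoryPubSub()
const gateway1Inbox: string[] = []
const gateway2Inbox: string[] = []
const lateGatewayInbox: string[] = []

bus.subscribe('room:42', (m) => gateway1Inbox.push(m))
bus.subscribe('room:42', (m) => gateway2Inbox.push(m))

// Gateway 1 publishes a user's message; both gateways receive it.
const receivers = bus.publish('room:42', 'hello')

// A gateway that subscribes afterwards never sees 'hello'.
bus.subscribe('room:42', (m) => lateGatewayInbox.push(m))
```

Nothing is buffered on the bus, so the late gateway's inbox stays empty. This gap is what a separate history store has to cover.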
At scale, hundreds of WebSocket servers each hold a different subset of client connections. Without a shared message bus, users in the same room but on different servers would miss each other's messages: each gateway could only reach its local clients, breaking the illusion of a single chat room. Redis pub/sub solves this by acting as a central fan-out hub. One publish reaches every subscribed gateway, and each gateway delivers to its own connected users. Client connections can then scale out across servers while message latency stays low.
Redis pub/sub has no persistence: a message is lost for any subscriber that is not connected at publish time. A client that is offline when a message arrives will never receive it through pub/sub and can only recover it from message history, if that is stored separately. This is acceptable for live chat, where offline users do not expect real-time delivery, but problematic for critical notifications. Pub/sub also adds memory overhead on Redis (subscription state) and network overhead when a single room has users spread across 100+ gateways, because Redis sends every message to every subscribed gateway. For very large fan-outs (>10K subscribers per channel), a single channel can become a bottleneck.
The same fan-out pattern appears in large production chat systems. Discord applies it per guild (server), routing each message to the gateways whose users are connected, although its publicly described implementation is built on Elixir processes rather than Redis pub/sub. Slack reportedly uses a similar model with additional deduplication to handle retries. WhatsApp routes through a custom pub/sub system built on its Erlang infrastructure, handling billions of messages daily. The pattern holds up at 1M+ concurrent users as long as hotspots are partitioned.
As rough figures, a single Redis node can handle on the order of 1M+ pub/sub messages/second with sub-millisecond latency, and fan-out to 1,000 subscribers typically completes in under 1ms. Memory overhead is small (~100 bytes per subscription). The publisher's cost is constant, since it sends one PUBLISH command regardless of subscriber count, but Redis itself does work proportional to the number of subscribers on the channel. Channels with 10K+ subscribers should therefore be load-tested rather than assumed free.
Route messages through Kafka before Postgres. Kafka buffers write spikes, enables replay for offline users, and decouples WS gateways from the database.
Kafka acts as a buffering layer between the WebSocket gateways and the persistent message store. When a user sends a message, the WS gateway publishes it both to Redis pub/sub (for real-time delivery) and to a Kafka topic (for durability). When the producer requires acknowledgement from all in-sync replicas, Kafka replicates the message across multiple brokers before acknowledging the write, so it survives the failure of individual brokers. A separate set of consumer processes reads from the Kafka topic and writes to Postgres asynchronously. This dual-path architecture decouples the fast path (pub/sub for live delivery) from the slow path (database writes for history). Messages reach online users immediately via Redis, while persistence happens in the background.
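A minimal sketch of the dual-path flow, using in-memory arrays and a map as stand-ins for Redis, Kafka, and Postgres. Every name here (`handleIncoming`, `drainToStore`, `durableLog`, and so on) is hypothetical and chosen for illustration. It is meant to show the window between live delivery and persistence, not how a real producer or consumer is configured.

```typescript
// Dual-path write with in-memory stand-ins for Redis, Kafka, and Postgres.
// All names are hypothetical and exist only for illustration.
interface ChatMessage {
  id: string
  room: string
  text: string
}

const liveDeliveries: ChatMessage[] = [] // stand-in for Redis pub/sub
const durableLog: ChatMessage[] = [] // stand-in for a Kafka topic
const messageStore = new Map<string, ChatMessage>() // stand-in for Postgres

// The gateway triggers the fast path and the durable path together.
function handleIncoming(msg: ChatMessage): void {
  liveDeliveries.push(msg) // real-time fan-out to online users
  durableLog.push(msg) // appended for later persistence
}

// A consumer drains the log in batches, independently of the gateway.
// Returns the next offset so the caller can resume from it.
function drainToStore(fromOffset: number, batchSize: number): number {
  const batch = durableLog.slice(fromOffset, fromOffset + batchSize)
  batch.forEach((msg) => messageStore.set(msg.id, msg))
  return fromOffset + batch.length
}

handleIncoming({ id: 'm1', room: 'r1', text: 'hi' })
handleIncoming({ id: 'm2', room: 'r1', text: 'there' })
handleIncoming({ id: 'm3', room: 'r1', text: '!' })

// Already delivered live, but not yet stored: the consistency window.
const storedBeforeDrain = messageStore.size
let offset = drainToStore(0, 2)
const storedAfterFirstBatch = messageStore.size
offset = drainToStore(offset, 2)
```

The consumer controls its own batch size and pace, which is what lets the store absorb a burst at a steady write rate.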
Synchronous Postgres writes are comparatively slow (5-50ms depending on storage). When message volume spikes (e.g., 100K messages/second during an event), writing directly to Postgres causes contention and can push p99 latency into seconds. Kafka absorbs the burst: producing to Kafka is fast (1-5ms) because it relies on sequential disk I/O and batching. The database consumers then persist messages at a steady rate, preventing write storms. The retained log also enables replay: consumers can re-read Kafka history to rebuild a user's message history, and users joining a room can be served recent messages without querying the full database.
Kafka adds operational complexity (cluster management, topic and partition configuration) and introduces eventual consistency: there is a window in which a message has been delivered through Redis but is not yet in Postgres. If a consumer crashes before committing its offset, messages may be reprocessed and persisted twice, which requires idempotent writes or deduplication on read. Kafka also consumes disk space (messages are retained for X days) and network bandwidth (replication across brokers). For low-volume systems (<1K messages/second), Kafka's overhead exceeds its benefit; Postgres alone may be sufficient.
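One way to make reprocessing harmless is to have the consumer write idempotently, keyed by a unique message ID. The sketch below deduplicates at write time, which is an alternative to deduplicating on read. `persistOnce`, the map-backed store, and the assumption that the gateway assigns a unique `id` to each message are all hypothetical.

```typescript
// Idempotent persistence under at-least-once delivery (in-memory sketch).
// `persistOnce` and the Map-backed store are hypothetical stand-ins.
interface StoredMessage {
  id: string // assumed: a unique ID assigned when the message is first accepted
  text: string
}

const store = new Map<string, StoredMessage>()

// Returns true if the message was newly written, false if it was a duplicate.
function persistOnce(msg: StoredMessage): boolean {
  if (store.has(msg.id)) return false
  store.set(msg.id, msg)
  return true
}

// Simulate a consumer that crashed after writing m2 but before committing
// its offset, so m2 is delivered again after the restart.
const redelivered: StoredMessage[] = [
  { id: 'm1', text: 'hi' },
  { id: 'm2', text: 'there' },
  { id: 'm2', text: 'there' },
  { id: 'm3', text: '!' },
]

const written = redelivered.filter((msg) => persistOnce(msg)).length
```

In Postgres, a comparable effect could be had with a unique key on the message ID and `INSERT ... ON CONFLICT DO NOTHING`, assuming the schema carries such an ID.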
WhatsApp and Facebook Messenger are reported to use durable queues such as Kafka for message durability. Discord uses Cassandra (instead of Postgres), written through Kafka consumers, to handle its 4B messages/day volume. Slack uses a similar pattern, with message queuing in front of database persistence. LinkedIn's messaging platform processes billions of messages daily through Kafka. The pattern is widely used for chat systems at scale.
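A well-provisioned Kafka cluster can handle millions of messages/second with produce latency in the low milliseconds. Throughput scales with the number of partitions rather than coming from any single one. A single 3-broker cluster can sustain on the order of 500MB/s with replication, depending on hardware. Messages are typically retained on disk for 7-30 days, depending on configuration. Consumers can replay retained history as fast as disk and network allow, and multiple consumer groups can read the same topic independently, each tracking its own offsets.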
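The independence of consumer groups and the ability to replay both follow from each group keeping its own offset into an append-only log. A minimal in-memory sketch follows; `TopicLog` is a hypothetical stand-in for a single partition, not Kafka's client API.

```typescript
// Append-only log with independent per-group offsets (in-memory sketch).
// `TopicLog` is a hypothetical stand-in for a single Kafka partition.
class TopicLog {
  private records: string[] = []
  private offsets = new Map<string, number>()

  // Appends a record and returns its offset.
  append(record: string): number {
    this.records.push(record)
    return this.records.length - 1
  }

  // Reads up to `max` records for a group and advances only that group's offset.
  poll(group: string, max: number): string[] {
    const start = this.offsets.get(group) ?? 0
    const batch = this.records.slice(start, start + max)
    this.offsets.set(group, start + batch.length)
    return batch
  }

  // Rewinds a group so it can replay retained history.
  seek(group: string, offset: number): void {
    this.offsets.set(group, offset)
  }
}

const log = new TopicLog()
log.append('m1')
log.append('m2')
log.append('m3')

const dbWriterFirst = log.poll('db-writer', 2) // ['m1', 'm2']
const searchIndexerAll = log.poll('search-indexer', 10) // ['m1', 'm2', 'm3']
const dbWriterNext = log.poll('db-writer', 2) // ['m3']

log.seek('db-writer', 0)
const dbWriterReplay = log.poll('db-writer', 10) // ['m1', 'm2', 'm3']
```

Reading never removes records, so one group's progress has no effect on what another group sees.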
Add a presence service (Redis-backed) to track online/offline status and route notifications to push gateways for offline users.
A presence service maintains a real-time map from user ID to WebSocket gateway and connection state. When a user opens a WebSocket connection to WS Gateway 1, the gateway stores 'user:123 → ws-1:socket-456' in the presence service (typically a Redis hash). The service also tracks online/offline status and a last-seen timestamp. Before fan-out, the gateway checks presence: if the recipient is online (present in the map), it delivers via WebSocket through the recipient's gateway; if offline, it routes to a push notification service (APNs/FCM). This enables targeted delivery: real-time for online users, push notifications for offline users.
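The routing decision described above can be sketched as a lookup that returns either a WebSocket target or a push hand-off. The `Route` type, `routeFor`, and the map standing in for the Redis-backed presence store are hypothetical names used only for illustration.

```typescript
// Presence-aware routing (in-memory sketch; all names are hypothetical).
interface PresenceEntry {
  gateway: string // e.g. 'ws-1'
  socketId: string // e.g. 'socket-456'
}

type Route =
  | { kind: 'websocket'; gateway: string; socketId: string }
  | { kind: 'push' } // hand off to a push service such as APNs/FCM

// Stand-in for the Redis-backed presence map: user ID → connection.
const presence = new Map<string, PresenceEntry>()

function routeFor(userId: string): Route {
  const entry = presence.get(userId)
  if (!entry) return { kind: 'push' } // not in the map: treat as offline
  return { kind: 'websocket', gateway: entry.gateway, socketId: entry.socketId }
}

presence.set('user:123', { gateway: 'ws-1', socketId: 'socket-456' })

const onlineRoute = routeFor('user:123')
const offlineRoute = routeFor('user:999')
```

Returning a tagged union keeps the two delivery paths explicit at the call site, so the sender has to handle both cases.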
Fan-out to offline users is wasteful. If a message goes to a room with 1M members and 50% of them are offline, 500K delivery attempts target users who cannot receive it over a WebSocket. Presence-aware routing narrows the fan-out: deliver over WebSocket only to online users (immediately, high priority) and queue push notifications for offline users separately (lower priority, batched). It also addresses stale connections: if a user closes the app but the WebSocket does not disconnect cleanly, the gateway can check presence, see that the user is offline, and stop sending to that socket.
Presence state is eventually consistent. A user may still appear online after disconnecting, until the next expected heartbeat is missed. The heartbeat interval (typically 30-60 seconds) bounds that lag: shorter intervals reduce stale presence but increase Redis write load. Presence lookups before every message also add latency (1-2ms per lookup) unless presence is cached locally on each gateway. The service itself needs careful handling of connect and disconnect events to avoid stale entries.
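The staleness window can be made concrete with a tracker that treats a user as online until a time-to-live has elapsed since the last heartbeat. `PresenceTracker` and the 45-second TTL (a 30-second heartbeat plus margin) are assumptions for illustration; timestamps are passed in explicitly so the behaviour is easy to follow.

```typescript
// Heartbeat-based presence with a staleness window (in-memory sketch).
// `PresenceTracker` is hypothetical; times are passed in to keep it deterministic.
class PresenceTracker {
  private lastSeen = new Map<string, number>()
  private ttlMs: number

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs
  }

  heartbeat(userId: string, nowMs: number): void {
    this.lastSeen.set(userId, nowMs)
  }

  // A user counts as online until ttlMs has passed since the last heartbeat,
  // so a silent disconnect is only noticed once the TTL expires.
  isOnline(userId: string, nowMs: number): boolean {
    const seen = this.lastSeen.get(userId)
    return seen !== undefined && nowMs - seen < this.ttlMs
  }
}

const tracker = new PresenceTracker(45000) // assumed: 30 s heartbeat + margin
tracker.heartbeat('user:123', 0)

// Suppose the app is killed at t = 1 s without a clean disconnect.
const onlineAt30s = tracker.isOnline('user:123', 30000) // still reported online
const onlineAt44s = tracker.isOnline('user:123', 44000) // still reported online
const onlineAt45s = tracker.isOnline('user:123', 45000) // TTL elapsed: offline
```

With Redis, a similar effect is commonly obtained by giving each presence key an expiry that every heartbeat refreshes, so a missed heartbeat lets the key lapse on its own.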
WhatsApp's presence system uses Erlang processes tracking 2B+ users globally, with heartbeats every 30-60 seconds. Discord uses a distributed presence system with 8.5M concurrent voice/video connections tracked in real-time. Slack's presence service powers their status indicators and notification routing. Telegram uses presence for their 'last seen' timestamps. Presence is essential for modern chat systems.
Redis can track 1M active WebSocket sessions in hashes using roughly 100MB of memory (about 100 bytes per entry). Heartbeat writes at 30-second intervals produce ~33K writes/second per 1M users. Presence lookups are O(1) and take <1ms. Batching heartbeats (sending many users' heartbeats in a single pipelined request) reduces round-trips.
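The sizing figures above are simple arithmetic, worked through here for 1M users and for the 50M-user target of this design. `presenceEstimate` is a hypothetical helper; the 100-byte entry size and 30-second interval are the text's own estimates, not measurements.

```typescript
// Back-of-envelope presence sizing using the estimates quoted in the text.
// `presenceEstimate` is a hypothetical helper for illustration.
function presenceEstimate(users: number, bytesPerEntry: number, heartbeatSec: number) {
  return {
    memoryMB: (users * bytesPerEntry) / 1000000,
    writesPerSec: users / heartbeatSec,
  }
}

const perMillion = presenceEstimate(1000000, 100, 30)
// memoryMB: 100, writesPerSec: ~33,333

const atTarget = presenceEstimate(50000000, 100, 30)
// memoryMB: 5000 (about 5GB), writesPerSec: ~1,666,667
```

At 50M users the heartbeat write rate, not memory, is the figure that argues for batching heartbeats and spreading presence across several Redis nodes.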
§2 Step 3: Deep Dive
| Approach | Latency | Persistence | Fan-out | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Long polling | 500ms–2s | Yes (DB) | Poor | Legacy browsers, simple cases | Low | Low |
| Server-Sent Events | < 100ms | Yes | Server→Client only | Notifications, feeds | Low | Low |
| WebSocket + Redis pub/sub | < 50ms | Via Kafka | Excellent | Chat, live collaboration ✓ | Medium | Medium |
| WebSocket + Kafka direct | < 100ms | Yes | Good | High-durability chat | High | Medium |
| gRPC streaming | < 20ms | Manual | Good | Internal microservices | Low | Medium |
Real-time message delivery strategies — WebSocket + Redis pub/sub is standard.
```typescript
import { WebSocketServer, WebSocket } from 'ws'
import { createClient } from 'redis'

const wss = new WebSocketServer({ port: 8080 })
// Pub/sub needs two connections: a client in subscriber mode cannot publish
const publisher = createClient()
const subscriber = publisher.duplicate()
await publisher.connect()
await subscriber.connect()

// Map room → set of connected WebSocket clients (local to this gateway)
const rooms = new Map<string, Set<WebSocket>>()

// Remove a socket from a room; drop the Redis subscription with the last member
async function leave(room: string, ws: WebSocket) {
  const members = rooms.get(room)
  if (!members) return
  members.delete(ws)
  if (members.size === 0) {
    rooms.delete(room)
    await subscriber.unsubscribe(`room:${room}`)
  }
}

wss.on('connection', (ws) => {
  let currentRoom: string | null = null

  ws.on('message', async (data) => {
    const msg = JSON.parse(data.toString())

    if (msg.type === 'join') {
      if (currentRoom) await leave(currentRoom, ws)
      const room: string = msg.room
      currentRoom = room
      const isFirstLocalMember = !rooms.has(room)
      if (isFirstLocalMember) rooms.set(room, new Set())
      rooms.get(room)!.add(ws)
      // Subscribe once per room per gateway (not once per join); otherwise every
      // join adds another listener and local clients receive duplicates
      if (isFirstLocalMember) {
        await subscriber.subscribe(`room:${room}`, (message) => {
          rooms.get(room)?.forEach(client => {
            if (client.readyState === WebSocket.OPEN) client.send(message)
          })
        })
      }
    }

    if (msg.type === 'message' && currentRoom) {
      const payload = JSON.stringify({ room: currentRoom, text: msg.text, ts: Date.now() })
      // Publish to Redis: every gateway subscribed to this room fans out locally
      await publisher.publish(`room:${currentRoom}`, payload)
    }
  })

  ws.on('close', () => {
    if (currentRoom) void leave(currentRoom, ws)
  })
})
```

| Component | Why Add It | Tradeoff |
|---|---|---|
| Redis Pub/Sub for Cross-Gateway Fan-out | At scale, hundreds of WebSocket servers handle different clients. | Redis pub/sub has no persistence — messages are lost if no subscriber is connected at publish time. |
| Kafka for Durable Message Queue | Postgres writes are synchronous and slow (5-50ms depending on storage). | Kafka adds operational complexity (cluster management, topic/partition configuration) and introduces eventual consistency—there's a window where a message is in Redis but not yet in Postgres. |
| Presence Service | Fan-out to offline users is wasteful. | Presence state is eventually consistent. |
Design decision tradeoffs
WS Gateway 1 crashes. Clients on it are disconnected — how does the system reconnect them to WS Gateway 2 and resume delivery?
A popular room (1M users) receives 100K messages/second, creating a bottleneck at the Redis channel. All gateways backlog on Redis pub/sub — message delivery latency hits 5s. How do you shard rooms across Redis partitions?
WS Gateway 1 loses network access to Redis but remains connected to clients. Clients send messages but they don't fan-out. How do you detect partition, route traffic to healthy gateways, and queue unsent messages?
§3 Step 4: Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Redis Pub/Sub for Cross-Gateway Fan-out | Redis pub/sub is a lightweight messaging layer where clients (WebSocket gateways) subscribe to named channels (one per chat room) and publish messages to them. | At scale, hundreds of WebSocket servers handle different clients. |
| Kafka for Durable Message Queue | Kafka acts as a buffering layer between WebSocket gateways and the persistent message store. | Postgres writes are synchronous and slow (5-50ms depending on storage). |
| Presence Service | A presence service maintains a real-time map of user ID to WebSocket server assignment and connection state. | Fan-out to offline users is wasteful. |
Key design decisions