Open on desktop

Antimetal's interactive diagrams require a larger screen. Open this page on your laptop or desktop to continue.

Best on desktop

Back to lesson

Discord Real-Time

advancedRealtimeWebRTC

Large-Scale·75 min read

Discord Real-Time

1Understand the Problem & Establish Design Scope→2High-Level Design→3Deep Dive→4Wrap Up

RealtimeWebRTC

§1Step 2 — High-Level Design

2High-Level Design

Power 500M users across voice, video, and text. WebRTC at scale, server regions, and bot infrastructure.

System architecture overview

Stage 1 of 5Starting state — the problem to solve

Progressive build — add each component step by step

Add Additional WebSocket Gateway Shards

Add more WebSocket gateway shards beyond the starter. Discord shards connections by guild ID so each gateway only handles a subset of servers.

What it does

WebSocket gateway shards maintain persistent connections from Discord clients. Sharding by guild ID means all members of a guild connect to the same shard, simplifying event fan-out (presence updates, messages) within a guild.

Why it matters

A single WebSocket server cannot maintain millions of persistent connections. Sharding distributes the connection load. Guild-aware sharding also means the gateway can fan-out events to guild members without a distributed lookup — all their connections are local.

Trade-off

Guild-based sharding causes hotspots for very large guilds (Discord's largest have 800K members). Discord handles this with large-guild isolation: huge guilds get their own dedicated shard pool.

Real world

Discord runs 100K+ shard processes globally. When a shard crashes, all members of the affected guilds see temporary disconnects. Discord's client uses exponential backoff reconnect to avoid thundering herd.

Capacity math

~2500 guilds per shard (Discord's published sharding factor). At 19M active servers: ~7600 shards. Each shard: ~64K concurrent connections at peak.

In the real world: Discord runs 100K+ shard processes globally. When a shard crashes, all members of the affected guilds see temporary disconnects. Discord's client uses exponential backoff reconnect to avoid thundering herd.

Add Kafka Message Queue

Add Kafka to asynchronously route messages, presence events, and voice state updates between gateway shards and backend services.

What it does

Kafka partitions events by guild ID. Gateway shards publish message events, presence updates, and typing indicators to Kafka. Fan-out services consume from Kafka and push to the appropriate shard(s) for delivery.

Why it matters

Without Kafka, gateway shards would need to directly coordinate with each other — requiring all-to-all communication as shard count grows. Kafka provides a central bus that decouples producers from consumers and provides durability.

Trade-off

Kafka adds latency (~2–5ms) vs direct shard-to-shard messaging. Discord accepts this latency for durability and operational simplicity, keeping Kafka for message routing and using Redis Pub/Sub for low-latency presence.

Real world

Discord processes 4 billion minutes of video per day and millions of messages per second through Kafka pipelines. They use separate Kafka clusters for messages vs analytics.

Capacity math

Discord's Kafka: millions of events/second. Message latency via Kafka: <10ms p99. Kafka cluster: 60+ brokers at peak.

In the real world: Discord processes 4 billion minutes of video per day and millions of messages per second through Kafka pipelines. They use separate Kafka clusters for messages vs analytics.

Add Redis Tiers

Add separate Redis tiers for session routing (which shard holds a user's connection) and for fast presence state (online/idle/dnd/offline).

What it does

Session cache: maps userID → shardID so the fan-out service can route delivery to the correct WebSocket gateway. Presence cache: stores each user's online status with short TTL — gateways heartbeat to refresh; expiry signals disconnect.

Why it matters

When shard A receives a message for a guild, it must fan-out to all online guild members — who may be on any shard. Without a session cache, delivery requires broadcasting to all shards. With it, only the owning shard receives the event.

Trade-off

Presence updates are extremely high volume (heartbeat every 30s from every online user). Discord uses Redis Cluster with read replicas to handle the read fan-out from guild presence aggregation.

Real world

Discord's 'Friends List' presence runs on Redis. At peak, 10M+ users are online simultaneously, generating 350K presence updates/second across the Redis cluster.

Capacity math

Session cache: ~100 bytes/user × 10M online users = 1 GB. Presence TTL: 60 seconds. Redis Pub/Sub throughput: 1M+ messages/second per node.

In the real world: Discord's 'Friends List' presence runs on Redis. At peak, 10M+ users are online simultaneously, generating 350K presence updates/second across the Redis cluster.

Add Fan-Out Service

Add a dedicated fan-out service that expands a single message event into per-shard delivery instructions for all online guild members.

What it does

The fan-out service consumes from Kafka, looks up guild membership and online status from the presence cache, groups recipients by shard, and publishes per-shard delivery batches. Each shard receives only the events for its connected users.

Why it matters

Fan-out is O(guild size) work. For large guilds, doing this synchronously in the gateway would cause write latency to spike. An async fan-out service isolates this work and can scale horizontally independent of gateways.

Trade-off

Large guild fan-out is inherently expensive. Discord limits some event types (like full presence lists) for guilds >250 members, switching to on-demand presence fetching rather than push-based fan-out.

Real world

Discord's 'large guild' feature flag changes delivery semantics for servers >250 members. Events that would require 250K+ fan-out operations are rate-limited or replaced with summarized payloads.

Capacity math

Fan-out workers: horizontally scalable. For 800K member guild: 800K delivery instructions per message. Fan-out rate: ~100K events/second per worker process.

In the real world: Discord's 'large guild' feature flag changes delivery semantics for servers >250 members. Events that would require 250K+ fan-out operations are rate-limited or replaced with summarized payloads.

Gateway Shard Crash: gateway-shard-a crashes with 50K connected WebSocket clients. All those clients lose their connections and must reconnect. If all 50K reconnect simultaneously, they overwhelm edge-gateway and create a thundering herd. How do you implement staggered reconnect with jitter, session resumption, and connection draining before planned failures?

§2Step 3 — Deep Dive

3Deep Dive

Protocol	Direction	Latency	Connection overhead	Best for	Cost	Ops burden
WebSocket	Full-duplex	~5ms	One TCP conn per client	Chat, gaming, live collab ✓	Low	Medium
Server-Sent Events (SSE)	Server -> Client only	~10ms	One HTTP conn	Dashboards, notifications (read-only)	Low	Low
Long-polling	Request-response	~100ms	New HTTP req per message	Legacy fallback, firewall-friendly	Low	Low
MQTT	Pub/sub, full-duplex	~2ms	Lightweight 2-byte header	IoT, mobile (battery), billions of devices	Low	Low
gRPC streaming	Full-duplex	~3ms	HTTP/2 multiplexed	Internal services, typed proto contracts	Low	Medium

Real-time message delivery — WebSocket wins for full-duplex persistent connections.

typescriptGateway fan-out — pub/sub dispatch to WebSocket sessions

import WebSocket from 'ws'
import Redis from 'ioredis'

const pub = new Redis()
const sub = new Redis()

// In-memory map: channelId -> Set of live WebSocket connections
const subscribers = new Map<string, Set<WebSocket>>()

// Subscribe to Redis pub/sub for cross-server fan-out
sub.subscribe('discord:messages')
sub.on('message', (_channel, raw) => {
  const { channelId, payload } = JSON.parse(raw)
  const conns = subscribers.get(channelId)
  if (!conns) return

  const frame = JSON.stringify(payload)
  for (const ws of conns) {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(frame)
    } else {
      conns.delete(ws)   // lazy cleanup of dead connections
    }
  }
})

// When a user sends a message
async function dispatch(channelId: string, message: object) {
  const payload = { ...message, ts: Date.now() }
  // Publish to all gateway servers -- each fans out to its local subscribers
  await pub.publish('discord:messages', JSON.stringify({ channelId, payload }))
}

// ~500K WebSocket connections per gateway server
// Cassandra stores messages durably; Redis is ephemeral delivery only

Component	Why Add It	Tradeoff
Additional WebSocket Gateway Shards	A single WebSocket server cannot maintain millions of persistent connections.	Guild-based sharding causes hotspots for very large guilds (Discord's largest have 800K members).
Kafka Message Queue	Without Kafka, gateway shards would need to directly coordinate with each other — requiring all-to-all communication as shard count grows.	Kafka adds latency (~2–5ms) vs direct shard-to-shard messaging.
Redis Tiers	When shard A receives a message for a guild, it must fan-out to all online guild members — who may be on any shard.	Presence updates are extremely high volume (heartbeat every 30s from every online user).
Fan-Out Service	Fan-out is O(guild size) work.	Large guild fan-out is inherently expensive.

Design decision tradeoffs

Gateway Shard Crash

gateway-shard-a crashes with 50K connected WebSocket clients. All those clients lose their connections and must reconnect. If all 50K reconnect simultaneously, they overwhelm edge-gateway and create a thundering herd. How do you implement staggered reconnect with jitter, session resumption, and connection draining before planned failures?

Gateway-Cache Partition

A network partition isolates cache-1 from the gateway shards. Presence updates can't be written, so all users appear offline to others. How do you implement a local presence cache on each gateway shard, serve stale presence data, and reconcile when connectivity resumes?

Large Server Message Fan-Out

A Discord server with 500K members all online receives a single message. The gateway must fan out that event to all 500K concurrent WebSocket connections simultaneously. How do you implement pub/sub with efficient delivery trees to avoid saturating any single gateway shard?

Treat the gateway fleet as a sharded stateful tier, not a single generic API box. Clients enter through an edge gateway and are pinned onto multiple WebSocket shards. Keep session routing in Redis so guild events can be delivered only to the shards that currently host connected members.

Large-guild presence cannot be emitted as one event per user per subscriber. Publish guild events into Kafka, aggregate them in a dedicated presence service, and store compressed or batched presence summaries in Redis. Small guilds can be more detailed; very large guilds need coarse-grained presence updates.

Voice should use an SFU-style media router separate from the text gateway path. The voice API assigns users to a voice region or SFU cluster, and the SFU forwards streams without doing heavy transcoding. Keep message history in an append-friendly store, but keep guild membership and permission metadata in a relational system.

§3Step 4 — Wrap Up

4Wrap Up

Decision	Choice	Why
Additional WebSocket Gateway Shards	WebSocket gateway shards maintain persistent connections from Discord clients.	A single WebSocket server cannot maintain millions of persistent connections.
Kafka Message Queue	Kafka partitions events by guild ID.	Without Kafka, gateway shards would need to directly coordinate with each other — requiring all-to-all communication as shard count grows.
Redis Tiers	Session cache: maps userID → shardID so the fan-out service can route delivery to the correct WebSocket gateway.	When shard A receives a message for a guild, it must fan-out to all online guild members — who may be on any shard.
Fan-Out Service	The fan-out service consumes from Kafka, looks up guild membership and online status from the presence cache, groups recipients by shard, and publishes per-shard delivery batches.	Fan-out is O(guild size) work.

Key design decisions

If the interviewer asks to scale 10×: 10x the load — architectural moves that work. Identify the single bottleneck (usually the database write path) and address it first before horizontal scaling.

10× Target20.0M RPSwhere your architecture must hold

What's next

Realtime

Real-Time Leaderboard