Netflix Architecture
§1 Step 2 — High-Level Design
Stream to 250M subscribers globally. Chaos engineering, recommendation at scale, and regional failover.
Add Netflix Open Connect CDN to serve video content from appliances deployed directly in ISP networks.
Open Connect Appliances (OCAs) are Netflix-owned servers deployed in ISP data centers and Internet Exchange Points. During off-peak hours, OCAs proactively cache the most-watched titles in each region. During peak hours, 95%+ of Netflix traffic is served directly from OCAs — never touching Netflix's AWS infrastructure for the video payload.
Netflix accounts for ~15% of global internet traffic. Without ISP-embedded appliances, this traffic would traverse backbone networks, increasing costs for Netflix and ISPs. Open Connect reduces transit costs and improves latency by serving from within the user's ISP.
Open Connect requires Netflix to manage hardware deployments in 1000+ ISP locations globally. It's a significant operational investment but pays off: OCA-served traffic has <1% rebuffering vs ~5% for non-OCA delivery.
Netflix has 1000+ ISP partners with Open Connect deployments. An OCA appliance holds 100–280 TB of content (SSD + HDD). On a typical evening, a large OCA serves 100 Gbps+ of video to local subscribers.
OCA storage: 100–280 TB. OCA throughput: 100 Gbps peak. Open Connect hit rate: 95%+ of Netflix traffic. Proactive caching: fills OCAs with next-day popular titles during off-peak.
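The off-peak fill described above can be sketched as a capacity-constrained selection problem. The following is a minimal illustration, not Netflix's actual fill algorithm: the `OcaFillPlanner` class, the `Title` record, and the "predicted plays per TB" ranking are all assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical fill planner: picks titles for one appliance's off-peak fill window. */
final class OcaFillPlanner {

    /** A title with its predicted next-day plays in this region and its on-disk size. */
    record Title(String id, long predictedPlays, double sizeTb) {}

    /**
     * Greedy plan: walk titles in descending order of predicted plays per TB
     * and keep each one that still fits in the remaining capacity.
     */
    static List<Title> plan(List<Title> catalog, double capacityTb) {
        List<Title> ranked = new ArrayList<>(catalog);
        ranked.sort(Comparator
                .comparingDouble((Title t) -> t.predictedPlays() / t.sizeTb())
                .reversed());

        List<Title> chosen = new ArrayList<>();
        double usedTb = 0.0;
        for (Title t : ranked) {
            if (usedTb + t.sizeTb() <= capacityTb) {
                chosen.add(t);
                usedTb += t.sizeTb();
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Title> catalog = List.of(
                new Title("new-season", 900_000, 0.5),
                new Title("classic-film", 50_000, 0.25),
                new Title("niche-doc", 1_000, 0.5));
        // With 0.75 TB free, the two densest titles fit and the documentary is skipped.
        System.out.println(plan(catalog, 0.75));
    }
}
```

A greedy ranking is not an optimal knapsack solution, but it is simple to reason about, and it makes the tradeoff visible: a large title with few predicted plays is the first thing to lose its slot.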
Add Netflix's Zuul API gateway for dynamic routing, rate limiting, and authentication of all API traffic.
Zuul 2.0 is Netflix's non-blocking API gateway running on Netty. It handles SSL termination, authentication (device certificates, user tokens), routing to backend microservices (200+ services), rate limiting, canary deployments (A/B routing), and chaos engineering (Chaos Monkey integration).
Netflix has 200+ microservices. Without a unified gateway, each service would need auth, rate limiting, and monitoring. Zuul centralizes these cross-cutting concerns and provides a single point for traffic management and observability.
A centralized gateway is a potential bottleneck. Netflix runs Zuul in multiple AWS regions with auto-scaling. Each Zuul instance handles 100K+ concurrent requests. The non-blocking Netty model allows high concurrency with low thread count.
Netflix open-sourced Zuul in 2013. Netflix runs hundreds of Zuul instances across multiple AWS regions. At peak, Netflix handles 2M+ API requests/second through Zuul.
Zuul concurrency: 100K+ concurrent requests per instance. Request processing overhead: <5 ms. Netflix peak API traffic: 2M+ RPS. Auto-scales across hundreds of instances during peak streaming hours.
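Of the cross-cutting concerns the gateway centralizes, rate limiting is the easiest to show in isolation. Below is a minimal token-bucket sketch of the kind of check a gateway filter might run per client. It is not Zuul's API: the `TokenBucket` class and its parameters are hypothetical, and the caller supplies the clock so the behavior is deterministic.

```java
/** Hypothetical per-client limiter of the kind a gateway filter might consult. */
final class TokenBucket {
    private final long capacity;        // maximum burst size, in requests
    private final double refillPerNano; // steady-state refill rate
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity; // start full so a new client gets its burst allowance
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if the request may proceed, consuming one token. */
    synchronized boolean tryAcquire(long nowNanos) {
        long elapsed = Math.max(0, nowNanos - lastRefillNanos);
        tokens = Math.min(capacity, tokens + elapsed * refillPerNano);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(2, 1.0, 0L);
        System.out.println(bucket.tryAcquire(0L));             // burst allowance
        System.out.println(bucket.tryAcquire(0L));             // burst allowance
        System.out.println(bucket.tryAcquire(0L));             // bucket empty, rejected
        System.out.println(bucket.tryAcquire(1_500_000_000L)); // refilled after 1.5 s
    }
}
```

In a real gateway the bucket would be looked up by client or device identity, and a rejected request would typically map to an HTTP 429 response.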
Add Kafka to carry viewing events, recommendation signals, and A/B test metrics from playback clients to backend processing pipelines.
Kafka at Netflix (the pipeline is called 'Keystone') carries play events (start, stop, pause, and seek per video), viewing completion events, rating changes, search queries, and A/B assignment events. These feed into Spark Streaming jobs that update recommendation features.
Netflix's recommendation algorithm drives 80% of watched content. The algorithm improves with more engagement signals. Kafka enables real-time signal collection at scale without coupling the playback API to recommendation infrastructure.
Kafka adds 2–5ms to signal propagation. Netflix's recommendation model tolerates this — it runs on batch-updated features (hourly/daily) with real-time signals refreshed every few minutes from Kafka consumer jobs.
Netflix's Keystone Kafka cluster processes 2+ trillion events per day. Their open-source Mantis streaming platform processes real-time operational events from Kafka. Kafka is also used for cross-region replication of metadata.
Keystone Kafka: 2T events/day. Peak ingest: 30M events/second. 100+ Kafka brokers. Event size: avg 200 bytes. Retention: 2 days for real-time consumers, 7 days for batch replay.
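The Keystone figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the `KeystoneSizing` class is hypothetical, and the storage estimate ignores replication and compression, which the text does not specify.

```java
/** Back-of-envelope check of the Keystone figures quoted in the text. */
final class KeystoneSizing {
    static final double EVENTS_PER_DAY = 2e12;      // from the text
    static final double PEAK_EVENTS_PER_SEC = 30e6; // from the text
    static final double AVG_EVENT_BYTES = 200;      // from the text
    static final double SECONDS_PER_DAY = 86_400;

    static double averageEventsPerSecond() {
        return EVENTS_PER_DAY / SECONDS_PER_DAY;
    }

    static double peakIngestBytesPerSecond() {
        return PEAK_EVENTS_PER_SEC * AVG_EVENT_BYTES;
    }

    /** Raw bytes retained for the given number of days, before replication or compression. */
    static double retainedBytes(int days) {
        return EVENTS_PER_DAY * AVG_EVENT_BYTES * days;
    }

    public static void main(String[] args) {
        System.out.printf("average rate: %.1fM events/s%n", averageEventsPerSecond() / 1e6);
        System.out.printf("peak ingest:  %.1f GB/s%n", peakIngestBytesPerSecond() / 1e9);
        System.out.printf("2-day log:    %.0f TB%n", retainedBytes(2) / 1e12);
        System.out.printf("7-day log:    %.1f PB%n", retainedBytes(7) / 1e15);
    }
}
```

Two trillion events per day averages out to roughly 23 million events per second, so the quoted 30M/s peak is only about 1.3x the mean. At 200 bytes per event, the peak is about 6 GB/s of raw ingest, and the 2-day and 7-day retention windows hold roughly 800 TB and 2.8 PB of raw data respectively.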
§2 Step 3 — Deep Dive
| Pattern | Availability | Complexity | Data consistency | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Active-active (Netflix) | 99.99%+ (any region serves) | Very high | Eventual (async replication) | Global streaming, stateless workloads ✓ | High | High |
| Active-passive (hot standby) | 99.9% (failover ~30s) | Medium | Strong (sync replication) | Financial systems, low write volume | High | High |
| Active-passive (cold standby) | 99.5% (failover ~5min) | Low | Point-in-time backup | DR only, cost-sensitive | Medium | Medium |
| Multi-master (CockroachDB) | 99.99% | High | Strong (Paxos) | Global SQL, financial transactions | High | High |
| Chaos-tested active-active | 99.999% (battle-hardened) | Very high | Eventual | Netflix-scale, Chaos Monkey validated ✓ | High | High |
Multi-region availability patterns — active-active wins for zero-downtime regional failures.
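The availability percentages in the table are easier to compare as a yearly downtime budget. The conversion is plain arithmetic; the `DowntimeBudget` class below is a hypothetical helper written for this illustration.

```java
/** Converts an availability percentage into allowed downtime per year. */
final class DowntimeBudget {
    static final double MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600, ignoring leap years

    static double minutesPerYear(double availabilityPercent) {
        return (1.0 - availabilityPercent / 100.0) * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        double[] tiers = {99.5, 99.9, 99.99, 99.999};
        for (double tier : tiers) {
            System.out.printf("%.3f%% -> %.2f minutes/year%n", tier, minutesPerYear(tier));
        }
    }
}
```

By this measure, 99.5% allows about 43.8 hours of downtime per year, 99.9% about 8.8 hours, 99.99% about 53 minutes, and 99.999% about 5 minutes. A single 5-minute cold-standby failover would consume nearly the entire yearly budget at five nines.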
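```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// recommendationService, userId, Video, and getCachedPopularTitles() are application-defined.
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // open at a 50% failure rate
    .slowCallRateThreshold(80)                         // or when 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))  // "slow" means longer than 2 s
    .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open 30 s before probing
    .permittedNumberOfCallsInHalfOpenState(5)          // probe with 5 trial calls
    .slidingWindowSize(20)                             // rates computed over the last 20 calls
    .build();
CircuitBreaker cb = CircuitBreaker.of("recommendation-service", config);

// Decorate the call: if the breaker is OPEN, it throws CallNotPermittedException.
Supplier<List<Video>> decorated = CircuitBreaker.decorateSupplier(cb,
    () -> recommendationService.getPersonalizedFeed(userId));

// Fallback: serve cached popular titles when recommendations are down.
// The breaker only records slow calls; a TimeoutException must come from the client itself.
List<Video> feed = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class, e -> getCachedPopularTitles())
    .recover(TimeoutException.class, e -> getCachedPopularTitles())
    .get();

// Chaos engineering: Chaos Monkey randomly terminates instances in production.
// If the system keeps serving, the circuit breakers, fallbacks, and retries are working.
```

| Component | Why Add It | Tradeoff |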
|---|---|---|
| CDN (Open Connect) | Netflix accounts for ~15% of global internet traffic. | Open Connect requires Netflix to manage hardware deployments in 1000+ ISP locations globally. |
| API Gateway (Zuul) | Netflix has 200+ microservices. | A centralized gateway is a potential bottleneck. |
| Kafka for Event Streaming | Netflix's recommendation algorithm drives 80% of watched content. | Kafka adds 2–5ms to signal propagation. |
Design decision tradeoffs
One regional API fleet goes down. Traffic should continue through the surviving region while video playback stays on CDN.
A major show drops at midnight, causing 10x normal traffic to cdn-1 for that content. CDN cache miss rate spikes as edge nodes evict other content to cache the new show. How do you pre-position content at edge nodes before release and implement an origin shield to protect playback-api-a?
control-api loses connectivity to playback-api-a. New playback sessions can't be authorized. However, existing sessions continue streaming from cdn-1. How do you implement playback tokens with TTL so in-flight streams continue, and queue session requests for retry when connectivity restores?
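For the last scenario, one way to let in-flight streams continue is a signed playback token that carries its own expiry, so the edge can validate it without calling back to the control plane. The sketch below is an assumption-laden illustration, not Netflix's token format: the `PlaybackToken` class, the `sessionId.expiry.signature` layout, and the shared HMAC key are all hypothetical. It covers only the token half of the question, not the retry queue.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical self-validating playback token: "sessionId.expiresAtEpochSec.signature". */
final class PlaybackToken {
    private static final String HMAC = "HmacSHA256";

    /** Issues a token valid for ttlSec seconds. Session ids must not contain '.'. */
    static String issue(String sessionId, long nowEpochSec, long ttlSec, byte[] key) {
        String payload = sessionId + "." + (nowEpochSec + ttlSec);
        return payload + "." + sign(payload, key);
    }

    /** True if the signature matches and the token has not expired; no call to the issuer. */
    static boolean isValid(String token, long nowEpochSec, byte[] key) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) {
            return false;
        }
        String payload = token.substring(0, lastDot);
        String signature = token.substring(lastDot + 1);
        // Constant-time comparison avoids leaking how many signature bytes matched.
        boolean signatureOk = MessageDigest.isEqual(
                sign(payload, key).getBytes(StandardCharsets.UTF_8),
                signature.getBytes(StandardCharsets.UTF_8));
        if (!signatureOk) {
            return false;
        }
        // Safe to parse: a valid signature means the issuer built this payload.
        long expiresAt = Long.parseLong(payload.substring(payload.lastIndexOf('.') + 1));
        return nowEpochSec < expiresAt;
    }

    private static String sign(String payload, byte[] key) {
        try {
            Mac mac = Mac.getInstance(HMAC);
            mac.init(new SecretKeySpec(key, HMAC));
            byte[] raw = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable or key invalid", e);
        }
    }

    public static void main(String[] args) {
        byte[] key = "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8);
        String token = issue("session-42", 1_000, 300, key);
        System.out.println(isValid(token, 1_100, key)); // true: within the 300 s TTL
        System.out.println(isValid(token, 1_400, key)); // false: past expiry
    }
}
```

The TTL is the tuning knob: a longer TTL lets streams ride out a longer control-plane outage, while a shorter one limits how long a revoked session can keep playing.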
§3 Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| CDN (Open Connect) | Open Connect Appliances (OCAs) are Netflix-owned servers deployed in ISP data centers and Internet Exchange Points. | Netflix accounts for ~15% of global internet traffic. |
| API Gateway (Zuul) | Zuul 2.0 is Netflix's non-blocking API gateway running on Netty. | Netflix has 200+ microservices. |
| Kafka for Event Streaming | Kafka at Netflix (called 'Keystone') handles: play events (start, stop, pause, seek per video), viewing completion events, rating changes, search queries, and A/B assignment events. | Netflix's recommendation algorithm drives 80% of watched content. |
Key design decisions