Circuit Breaker
§1Step 2 — High-Level Design
Prevent cascading failures with the circuit breaker pattern. States: closed, open, half-open.
Place a Circuit Breaker between the API server and downstream payment/order services to stop cascading failures.
A Circuit Breaker is a state machine with three states: Closed (passing requests), Open (returning immediate failures), and Half-Open (testing if the service recovered).
Without a circuit breaker, when payment-svc slows down, api-1 threads pile up waiting for responses. Thread exhaustion causes api-1 to fail too — a cascading failure that takes down the entire system.
You must define fallback behavior for when the circuit is open. Some requests (like payments) have no safe fallback and must be queued or returned as errors.
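One way to express the "queue instead of guessing" fallback for payments is sketched below. The names `CircuitOpenError`, `pending_payments`, and `charge_with_fallback` are hypothetical, and the in-memory deque only stands in for a durable queue.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises while the circuit is open."""


# Assumption: an in-memory deque stands in for a durable queue.
pending_payments = deque()


def charge_with_fallback(charge_fn, payment):
    """Try the payment call; if the circuit is open, queue it for later replay."""
    try:
        return {"status": "charged", "result": charge_fn(payment)}
    except CircuitOpenError:
        pending_payments.append(payment)  # replay once the circuit closes
        return {"status": "queued", "result": None}


def failing_charge(payment):
    # Simulates a call rejected by an open circuit.
    raise CircuitOpenError("payment-svc circuit is open")


response = charge_with_fallback(failing_charge, {"order_id": 1, "amount_cents": 500})
```

The caller receives an explicit `queued` status rather than a fabricated success, which keeps the no-safe-fallback rule visible to clients.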
Netflix Hystrix (now in maintenance mode, with Resilience4j as its recommended successor), AWS App Mesh, and Istio all implement circuit breaking. Stripe uses circuit breakers between their API and payment processors.
A circuit breaker adds < 1ms overhead per request. Typically configured with: error threshold 50%, window 10s, sleep duration 30s.
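The typical configuration above (50% error threshold, 10s window, 30s sleep) can be sketched as an error-rate breaker over a sliding window. This is a minimal illustration, not any library's implementation. The class name `ErrorRateBreaker`, the `min_calls` guard, and the injectable `clock` are assumptions added for the example.

```python
import time
from collections import deque


class ErrorRateBreaker:
    def __init__(self, error_threshold=0.5, window_s=10.0, sleep_s=30.0,
                 min_calls=10, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.sleep_s = sleep_s
        self.min_calls = min_calls  # assumption: do not trip on tiny samples
        self.clock = clock
        self.calls = deque()        # (timestamp, succeeded) pairs
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """Closed: always allow. Open: allow a probe once sleep_s has elapsed."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.sleep_s

    def record(self, succeeded):
        now = self.clock()
        if self.opened_at is not None:
            # A result recorded while open comes from a probe request.
            if succeeded:
                self.opened_at = None
                self.calls.clear()
            else:
                self.opened_at = now  # restart the sleep period
            return
        self.calls.append((now, succeeded))
        # Drop results that have fallen out of the sliding window.
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        errors = sum(1 for _, ok in self.calls if not ok)
        if (len(self.calls) >= self.min_calls
                and errors / len(self.calls) >= self.error_threshold):
            self.opened_at = now
```

Callers would check `allow()` before each downstream request and pass the outcome to `record()`.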
At high traffic, the circuit breaker needs fallback data when tripping open. A cache provides stale-but-valid responses.
A Redis cache stores recent successful responses that can be served when the circuit is open.
Returning errors when downstream is degraded is worse than returning slightly stale data. A cache makes the circuit breaker gracefully degrade.
Stale data may be shown to users. Set a reasonable max-age (e.g., 60s) and indicate staleness in the response.
Hystrix and Resilience4j both support fallbacks that can return cached results.
Cache hit rate should be 90%+ for hot paths. A 4GB Redis instance holds on the order of millions of cached responses, assuming entries of roughly 1KB each.
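The stale-fallback behavior described above, with a 60s max-age and an explicit staleness flag, might look like the following sketch. A plain dict stands in for Redis, and `CircuitOpenError`, `fetch_with_stale_fallback`, and the response fields `stale` and `age_s` are hypothetical names chosen for illustration.

```python
import time

MAX_AGE_S = 60  # max-age suggested in the text


class CircuitOpenError(Exception):
    """Hypothetical error raised when the breaker is open."""


# Assumption: a dict stands in for Redis. Maps key -> (stored_at, value).
_cache = {}


def fetch_with_stale_fallback(key, fetch_fn, now=time.time):
    """Return fresh data when possible, else a cached copy no older than MAX_AGE_S."""
    try:
        value = fetch_fn(key)
        _cache[key] = (now(), value)
        return {"data": value, "stale": False, "age_s": 0}
    except CircuitOpenError:
        entry = _cache.get(key)
        if entry is None:
            raise  # nothing cached: surface the error
        stored_at, value = entry
        age = now() - stored_at
        if age > MAX_AGE_S:
            raise  # cached copy is too old to serve
        return {"data": value, "stale": True, "age_s": round(age)}
```

Returning `stale` and `age_s` lets the client decide how to present degraded data instead of hiding the degradation.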
At peak load, run multiple circuit-breaker-enabled API instances behind a load balancer.
A load balancer distributes traffic across multiple API server instances, each with its own circuit breaker.
At peak, a single instance can't handle the throughput. Horizontal scaling increases capacity roughly linearly, as long as the load balancer and downstream services are not the bottleneck.
Each instance has independent circuit state — one may be open while others are closed. This is usually fine (partial degradation).
Microservices at Uber and Lyft run hundreds of instances, each with per-instance circuit breakers.
Each API server handles 5-10K RPS. Three instances = 15-30K RPS of aggregate capacity.
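The capacity arithmetic can be checked with a short calculation. The `headroom` parameter (spare capacity kept in reserve) and the function name `instances_needed` are assumptions for illustration, not figures from the text.

```python
import math


def instances_needed(peak_rps, per_instance_rps, headroom=0.25):
    """Instances required to serve peak_rps while keeping some capacity in reserve."""
    usable = per_instance_rps * (1 - headroom)
    return math.ceil(peak_rps / usable)


# Aggregate capacity of three instances at 5K and 10K RPS each.
low, high = 3 * 5_000, 3 * 10_000  # 15,000 and 30,000 RPS

# Assumed example: a 20K RPS peak on 5K RPS instances with 25% headroom.
needed = instances_needed(20_000, 5_000)
```

With no headroom the same peak needs four instances. Reserving a quarter of each instance's capacity raises that to six.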
§2Step 3 — Deep Dive
| Pattern | Stops cascade? | Recovery | Complexity | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Timeout only | No — threads still block | Manual | Low | Simple services, low QPS | Low | Low |
| Retry with backoff | No — amplifies load | Automatic | Low | Transient network errors | Low | Low |
| Circuit Breaker | Yes — fast-fails immediately | Automatic (half-open) | Medium | Service-to-service calls ✓ | Low | Medium |
| Bulkhead | Partial — limits blast radius | Automatic | Medium | Thread/connection pool isolation | Low | Medium |
| Circuit Breaker + Bulkhead | Yes — full isolation | Automatic | High | Critical payment/auth paths | Low | High |
Failure isolation patterns — Circuit Breaker is the production standard.
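The Bulkhead row in the table can be illustrated with a semaphore that caps concurrent calls to one dependency. This is a minimal sketch. `Bulkhead` and `BulkheadFullError` are hypothetical names, and rejecting immediately (rather than waiting for a slot) is a design choice assumed here.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` means a slow payment-svc can occupy only its own slots, which is the "limits blast radius" behavior the table describes.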
```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal: requests flow through
    OPEN = "open"            # tripped: fast-fail all requests
    HALF_OPEN = "half_open"  # testing: let a probe request through


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.state = State.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.last_failure_time = None

    def call(self, fn, *args, **kwargs):
        if self.state == State.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = State.HALF_OPEN  # probe
            else:
                raise RuntimeError("Circuit OPEN: fast failing")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise  # re-raise with the original traceback
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a half-open probe, closes the circuit.
        self.failure_count = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # A failed half-open probe reopens at once; otherwise wait for the threshold.
        if self.state == State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN
```

| Component | Why Add It | Tradeoff |
|---|---|---|
| Circuit Breaker | Without a circuit breaker, when payment-svc slows down, api-1 threads pile up waiting for responses. | You must define fallback behavior for when the circuit is open. |
| Cache for Fallback Responses | Returning errors when downstream is degraded is worse than returning slightly stale data. | Stale data may be shown to users. |
| Load Balancer | At peak, a single instance can't handle the throughput. | Each instance has independent circuit state — one may be open while others are closed. |
Design decision tradeoffs
Payment Service becomes unavailable. Circuit breaker should open and Order Service should continue.
payment-svc response time spikes from 20ms to 8 seconds. Without a circuit breaker, all api-1 threads block waiting for payment-svc, exhausting the thread pool and making api-1 unavailable for all traffic. How does the circuit breaker open to fast-fail requests and protect api-1?
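The thread exhaustion in this scenario follows from Little's law: average requests in flight equal arrival rate times latency. The request rate and pool size below are assumed values for illustration. Only the 20ms and 8 second latencies come from the scenario.

```python
def threads_in_flight(rps, latency_ms):
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return rps * latency_ms / 1000


RPS = 100        # assumed request rate to payment-svc
POOL_SIZE = 200  # assumed api-1 thread pool size

healthy = threads_in_flight(RPS, 20)      # about 2 threads busy
degraded = threads_in_flight(RPS, 8_000)  # about 800 threads needed

pool_exhausted = degraded > POOL_SIZE
```

Under these assumptions the latency spike demands four times the pool, so api-1 stalls for all traffic unless the breaker opens and calls fail fast.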
payment-svc becomes slow. Clients time out and retry. Each retry adds more load to the already-struggling service, creating a feedback loop that takes it down entirely. How do you add jitter to retry backoff and enforce circuit breaker open state to prevent retry storms?
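A possible shape for jittered retries that respect the breaker is sketched below, using exponential backoff with "full jitter" (each delay is drawn uniformly between zero and an exponentially growing ceiling). The function names, the `breaker_allows` callable, and the default delays are assumptions for illustration.

```python
import random


def backoff_delays(max_retries=4, base_s=0.1, cap_s=5.0, rng=random.random):
    """Yield full-jitter delays: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        yield rng() * ceiling


def call_with_retries(fn, breaker_allows, sleep, **backoff):
    """Retry fn with jittered backoff, but stop at once if the breaker is open."""
    last_error = None
    # First attempt has no delay; each retry waits a jittered interval.
    for delay in [0.0, *backoff_delays(**backoff)]:
        if not breaker_allows():
            raise RuntimeError("circuit open: not retrying")
        sleep(delay)
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    raise last_error
```

In production `sleep` would be `time.sleep` and `breaker_allows` would consult the circuit breaker. Randomizing the delay spreads client retries out so they do not arrive in synchronized waves, and the breaker check ends the loop once the circuit opens.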
§3Step 4 — Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Circuit Breaker | A Circuit Breaker is a state machine with three states: Closed (passing requests), Open (returning immediate failures), and Half-Open (testing if the service recovered). | Without a circuit breaker, when payment-svc slows down, api-1 threads pile up waiting for responses. |
| Cache for Fallback Responses | A Redis cache stores recent successful responses that can be served when the circuit is open. | Returning errors when downstream is degraded is worse than returning slightly stale data. |
| Load Balancer | A load balancer distributes traffic across multiple API server instances, each with its own circuit breaker. | At peak, a single instance can't handle the throughput. |
Key design decisions