Service Discovery
§1 Step 2: High-Level Design
Design client-side and server-side service discovery. Compare Consul, Eureka, and Kubernetes DNS.
Place a service registry (Consul or Zookeeper) that all services register with and query for peer locations.
A Service Registry is a database of service instances: their names, network addresses, health status, and metadata. Services register on startup and query it to discover peers.
In a dynamic environment, services scale up and down constantly and their IPs change. Hardcoding IP addresses breaks on every deployment. The registry provides a consistent name-to-address mapping that updates automatically as instances come and go.
The service registry becomes a critical dependency: if it is unavailable, services can't discover each other. Run it with 3 or 5 replicas (an odd number, so Raft consensus always has a clear majority) for high availability.
Netflix uses Eureka. Kubernetes uses etcd + CoreDNS. HashiCorp Consul is used by GitHub, Twitter, and Expedia for service mesh and discovery.
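A Consul cluster can track 10,000+ service registrations. With 1-second TTL heartbeats, 10K instances produce 10K heartbeats/second in aggregate, but each heartbeat goes to the instance's local agent, and agents forward only status changes to the servers, so the load on the cluster itself stays small.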
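The register, heartbeat, and expire cycle described above can be sketched as a toy in-memory registry. This is a minimal illustration, not how Consul or Eureka store state; the class name `InMemoryRegistry`, its methods, and the 3-second TTL are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class Instance:
    address: str
    port: int
    last_heartbeat: float


class InMemoryRegistry:
    """Toy registry: an instance counts as healthy only while it keeps heartbeating."""

    def __init__(self, ttl: float = 3.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._services: dict[str, dict[str, Instance]] = {}

    def register(self, name: str, instance_id: str, address: str, port: int) -> None:
        # Registration doubles as the first heartbeat.
        instances = self._services.setdefault(name, {})
        instances[instance_id] = Instance(address, port, self.clock())

    def heartbeat(self, name: str, instance_id: str) -> bool:
        inst = self._services.get(name, {}).get(instance_id)
        if inst is None:
            return False  # unknown instance: the caller should re-register
        inst.last_heartbeat = self.clock()
        return True

    def lookup(self, name: str) -> list[tuple[str, int]]:
        # Return only instances whose last heartbeat is within the TTL.
        now = self.clock()
        return [
            (inst.address, inst.port)
            for inst in self._services.get(name, {}).values()
            if now - inst.last_heartbeat <= self.ttl
        ]
```

A real registry would also replicate this state across nodes and purge expired entries; the sketch only shows why a missed heartbeat removes an instance from lookups.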
At high traffic, cache service registry lookups so clients don't hammer the discovery service on every RPC call.
A Redis cache stores the current healthy endpoints for each service, updated on change events from the registry.
Without caching, every microservice call requires a registry lookup — at high RPS this creates massive registry load.
Stale cache entries may route to dead endpoints. Combine with health check retries and short TTLs (10-30s).
Consul uses local agent caches on every host. Kubernetes uses kube-proxy's iptables rules as a cache layer.
A cached registry lookup takes <0.1ms vs 2-5ms for a live query, roughly a 20-50x latency reduction.
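One way to picture the caching layer is a small client-side wrapper that serves lookups from memory until a TTL expires and falls back to the stale entry if the registry is unreachable. This is a sketch under assumptions: `CachedDiscovery` and its `fetch` callback are hypothetical names, and the 15-second default is simply a value inside the 10-30s range mentioned above.

```python
import time
from typing import Callable

Endpoint = tuple[str, int]


class CachedDiscovery:
    """Wraps a registry lookup function with a per-service TTL cache."""

    def __init__(
        self,
        fetch: Callable[[str], list[Endpoint]],
        ttl: float = 15.0,
        clock=time.monotonic,
    ):
        self.fetch = fetch  # hypothetical callback that queries the real registry
        self.ttl = ttl
        self.clock = clock
        self._cache: dict[str, tuple[float, list[Endpoint]]] = {}

    def lookup(self, name: str) -> list[Endpoint]:
        now = self.clock()
        entry = self._cache.get(name)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: no registry traffic
        try:
            endpoints = self.fetch(name)
        except Exception:
            if entry is not None:
                return entry[1]  # registry unreachable: serve the stale entry
            raise  # nothing cached yet, so the failure must surface
        self._cache[name] = (now, endpoints)
        return endpoints
```

Serving stale data on failure trades correctness for availability, so callers should still retry against another endpoint when a cached address turns out to be dead.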
At peak, run multiple service registry nodes behind a load balancer for high-availability service discovery.
A load balancer distributes service discovery requests across multiple registry nodes running in consensus.
The service registry is critical infrastructure. At peak, it receives registration and health check traffic from every service instance.
Registry nodes must agree on service state (Raft consensus). This limits write throughput — optimize read path with caching.
Consul runs as a cluster (3 or 5 servers for quorum). etcd underpins Kubernetes service discovery the same way.
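A 3-node Consul cluster can serve 100K+ reads/second when most reads are answered by the cache layer. Cached reads can be slightly stale; reads that must be strongly consistent still go through the Raft leader.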
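The preference for 3 or 5 servers follows from majority-quorum arithmetic, which a few lines make concrete. The function names here are illustrative only.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains reachable."""
    return n - quorum(n)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 4-node cluster tolerates one failure, the same as 3 nodes, while requiring a larger quorum for every write, which is why even-sized clusters are usually avoided.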
§2 Step 3: Deep Dive
| Model | Who resolves? | Load balancing | Client complexity | Best for | Cost | Ops burden |
|---|---|---|---|---|---|---|
| Client-side (Consul + Ribbon) | Client reads registry | Client-controlled | High (needs SDK) | Microservices with smart clients ✓ | Medium | Medium |
| Server-side (Nginx/AWS ALB) | Load balancer | Centralized | None | Simple services, legacy clients | Medium | Medium |
| DNS-based (Route53) | DNS resolver | Round-robin via DNS | None | Cross-cloud, language-agnostic | Medium | Low |
| Service mesh (Envoy/Istio) | Sidecar proxy | Per-request, L7 | None (sidecar) | mTLS, observability, traffic shaping | High | High |
| K8s kube-proxy | iptables/IPVS | Random (iptables) or round-robin (IPVS default) | None | Kubernetes-native workloads | Medium | Medium |
Service discovery models — client-side for flexibility, server-side for simplicity.
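In the client-side model, "client-controlled" load balancing means the caller itself chooses an instance from the list the registry returns. A minimal round-robin sketch, with the hypothetical class name `RoundRobinPicker`, might look like this:

```python
Endpoint = tuple[str, int]


class RoundRobinPicker:
    """Client-side load balancing: rotate through the endpoints of each service."""

    def __init__(self) -> None:
        self._counters: dict[str, int] = {}

    def pick(self, service: str, endpoints: list[Endpoint]) -> Endpoint:
        if not endpoints:
            raise LookupError(f"no healthy instances for {service}")
        i = self._counters.get(service, 0)
        self._counters[service] = i + 1
        # Modulo keeps the index valid even if the endpoint list changes size
        # between calls, which happens whenever instances come and go.
        return endpoints[i % len(endpoints)]
```

The server-side and mesh models move this same choice out of the application and into a load balancer or sidecar, which is why their client complexity is listed as none.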
```python
import socket

import consul  # python-consul client

# Assumes a Consul agent is reachable at consul-1:8500.
c = consul.Consul(host='consul-1', port=8500)


def register_service(service_name: str, port: int) -> None:
    service_id = f"{service_name}-{socket.gethostname()}"
    address = socket.gethostbyname(socket.gethostname())
    c.agent.service.register(
        name=service_name,
        service_id=service_id,
        address=address,
        port=port,
        check=consul.Check.http(
            # The agent at consul-1 runs this check, so "localhost" would
            # point at the agent's host; use this host's address instead.
            url=f"http://{address}:{port}/health",
            interval="10s",   # Consul polls every 10s
            timeout="2s",     # each poll times out after 2s
            deregister="1m",  # auto-deregister after 1m critical (Consul's minimum)
        ),
        tags=["v2", "us-east-1"],
    )
    print(f"Registered {service_id} with Consul")


def discover_service(service_name: str) -> list[tuple[str, int]]:
    # passing=True makes Consul return only instances whose checks pass.
    _, services = c.health.service(service_name, passing=True)
    return [
        # Service.Address is empty if none was registered; fall back to the node's.
        (s['Service']['Address'] or s['Node']['Address'], s['Service']['Port'])
        for s in services
    ]


# On startup:
register_service('order-service', port=8080)

# On each outbound call:
instances = discover_service('payment-service')
# e.g. [('10.0.1.5', 8080), ('10.0.1.6', 8080)]
```

| Component | Why Add It | Tradeoff |
|---|---|---|
| Service Registry | In a dynamic environment, services scale up and down constantly and their IPs change. | The service registry becomes a critical dependency — if it's unavailable, services can't discover each other. |
| Cache for Service Registry | Without caching, every microservice call requires a registry lookup — at high RPS this creates massive registry load. | Stale cache entries may route to dead endpoints. |
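| Load Balancer | The service registry is critical infrastructure; at peak it receives registration and health check traffic from every service instance. | Registry nodes must agree on service state (Raft consensus), which limits write throughput. |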
Design decision tradeoffs
The service registry crashes. svc-a and other services can no longer discover healthy instances of svc-b. How do you implement client-side caching of registry data, health checks, and circuit breakers to maintain availability?
A network partition isolates parts of the service registry cluster. Some services see the registry reporting svc-b as up while others see it as down. How do you handle split-brain and ensure consistent service discovery?
Under heavy load, registry lookups slow from 1ms to 500ms. Services that block on discovery add 500ms to every request. How do you implement asynchronous discovery, local DNS caching, and circuit breaking to prevent cascading timeouts?
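For the slow-registry and crashed-registry scenarios, one common ingredient is a circuit breaker around the discovery call, so that callers fail fast (and can fall back to cached endpoints) instead of queuing behind a 500ms lookup. The sketch below is one possible shape, not a prescribed answer; `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping registry call")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In practice the discovery call would also carry a short timeout, and a `CircuitOpenError` would be handled by using the last known endpoints rather than failing the request.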
§3 Step 4: Wrap Up
| Decision | Choice | Why |
|---|---|---|
| Service Registry | A Consul or Zookeeper registry that all services register with and query for peer locations. | In a dynamic environment, services scale up and down constantly and their IPs change, so hardcoded addresses break on every deployment. |
| Cache for Service Registry | A Redis cache holding the current healthy endpoints for each service, updated on change events from the registry. | Without caching, every microservice call requires a registry lookup; at high RPS this creates massive registry load. |
| Load Balancer | A load balancer that distributes service discovery requests across multiple registry nodes running in consensus. | The service registry is critical infrastructure and, at peak, receives registration and health check traffic from every service instance. |
Key design decisions