What is the primary difference between client-side and server-side service discovery?
Service discovery is the mechanism by which services in a distributed system locate each other's network addresses. It replaces hardcoded IP addresses with dynamic, health-aware resolution -- essential in cloud-native environments where instances are ephemeral and addresses change frequently.
In a microservices architecture, a single user request may traverse 5-20 services. Each service runs multiple instances across different machines, containers, or serverless functions, with IP addresses that change on every deployment, scaling event, or failure recovery. Service discovery answers one fundamental question: when Service A needs to call Service B, how does A find B's current network address? The obvious alternatives all fall short. Hardcoded addresses break as soon as an ephemeral instance is replaced; DNS with long TTLs is too slow to reflect real-time changes; and manually updating configuration files does not scale.
There are two primary patterns. Client-side discovery means the calling service queries a service registry (e.g., Consul, Eureka, etcd) to get a list of healthy instances, then selects one using a load-balancing strategy (round-robin, least connections, consistent hashing). The client library handles registration, health checking, and address caching. Server-side discovery means the calling service sends requests to a load balancer or DNS endpoint that resolves to healthy instances. The caller does not know individual instance addresses. Kubernetes Services (backed by kube-proxy or Cilium) are the most common example -- a service name resolves to a virtual IP that is load-balanced to healthy pods.
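The client-side pattern can be sketched in a few lines. This is a minimal illustration, not the API of Consul, Eureka, or etcd: `InMemoryRegistry`, `RoundRobinClient`, and the third pod address (10.244.3.9) are hypothetical names and values introduced for the example.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    port: int
    healthy: bool = True


class InMemoryRegistry:
    """Hypothetical stand-in for a registry such as Consul, Eureka, or etcd."""

    def __init__(self):
        self._services = {}

    def register(self, service, instance):
        self._services.setdefault(service, []).append(instance)

    def healthy_instances(self, service):
        return [i for i in self._services.get(service, []) if i.healthy]


class RoundRobinClient:
    """Client-side discovery: the caller fetches instances and picks one itself."""

    def __init__(self, registry):
        self._registry = registry
        self._counters = {}

    def choose(self, service):
        instances = self._registry.healthy_instances(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        counter = self._counters.setdefault(service, itertools.count())
        return instances[next(counter) % len(instances)]


registry = InMemoryRegistry()
registry.register("payment-service", Instance("10.244.1.5", 8080))
registry.register("payment-service", Instance("10.244.2.8", 8080))
# Assumed unhealthy instance, to show that it is never selected.
registry.register("payment-service", Instance("10.244.3.9", 8080, healthy=False))

client = RoundRobinClient(registry)
picks = [client.choose("payment-service").host for _ in range(4)]
print(picks)  # ['10.244.1.5', '10.244.2.8', '10.244.1.5', '10.244.2.8']
```

In the server-side pattern, the `choose` step moves out of the caller and into the load balancer or virtual IP, so the caller only ever sees the service name.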
Health checking is integral to service discovery. A registry must distinguish between healthy instances (can serve traffic), unhealthy instances (running but failing health checks), and dead instances (no longer running). Consul uses a combination of agent-level health checks (HTTP, TCP, TTL, script) and gossip-based failure detection. Kubernetes uses readiness probes (HTTP, TCP, exec) to add/remove pod IPs from the Endpoints list. Without health checking, a service registry becomes a liability -- directing traffic to unhealthy instances causes cascading failures.
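The three states above can be modeled as a small classification function. This is a sketch under stated assumptions: the `classify` and `routable` helpers are hypothetical, and the 90-second `dead_after` cutoff is an illustrative default rather than a setting of any particular registry.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"      # running and passing checks
    UNHEALTHY = "unhealthy"  # running but failing checks
    DEAD = "dead"            # no recent sign of life


def classify(last_heartbeat, last_check_passed, now, dead_after=90.0):
    """Classify an instance from its last heartbeat time and last check result."""
    if now - last_heartbeat > dead_after:
        return Status.DEAD
    return Status.HEALTHY if last_check_passed else Status.UNHEALTHY


def routable(instances, now):
    """Return the names of instances that should receive traffic."""
    return [
        name
        for name, (heartbeat, passed) in instances.items()
        if classify(heartbeat, passed, now) is Status.HEALTHY
    ]


# name -> (last heartbeat timestamp in seconds, last health check passed?)
instances = {
    "a": (100.0, True),   # recent heartbeat, check passing
    "b": (95.0, False),   # recent heartbeat, check failing
    "c": (0.0, True),     # no heartbeat for 110 seconds
}
print(routable(instances, now=110.0))  # ['a']
```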
Modern service meshes (Istio, Linkerd, Cilium) push service discovery into the infrastructure layer. Each service instance gets a sidecar proxy (Envoy in Istio, linkerd2-proxy in Linkerd) or, in Cilium's case, an eBPF program that intercepts all network traffic. The proxy receives endpoint updates from a control plane (via the xDS API for Envoy-based meshes) and handles routing, load balancing, retries, and circuit breaking transparently. The application code simply calls a hostname; the mesh handles everything else. This eliminates the need for client-side discovery libraries in each language and provides a uniform traffic management layer across all services.
Finding the Payment Service
The order service needs to call the payment service. With service discovery, the order service calls 'payment-service:8080' (a logical name). In Kubernetes, CoreDNS resolves this to a ClusterIP (e.g., 10.96.0.15). kube-proxy routes traffic to one of the healthy payment pods (e.g., 10.244.1.5:8080, 10.244.2.8:8080). If a payment pod fails its readiness probe, Kubernetes removes its IP from the Endpoints list within seconds, and no traffic is routed to it. When a new pod starts and passes readiness, it is automatically added. The order service's code never changes.
Kubernetes (CoreDNS + kube-proxy)
Kubernetes provides built-in service discovery via CoreDNS and Service objects. Each Service gets a stable DNS name (e.g., payment-service.default.svc.cluster.local) that resolves to a ClusterIP. kube-proxy (or Cilium's eBPF) load-balances connections to healthy pod IPs listed in the Endpoints object. Pod readiness probes control inclusion in the Endpoints list. This server-side discovery model means application code only needs to know the service name, not individual pod addresses.
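From the caller's side, this model reduces to building a DNS name and resolving it. The sketch below assumes the default cluster domain `cluster.local`; `service_fqdn` and `resolve` are hypothetical helper names, and the resolver is injectable so the code can be exercised outside a cluster.

```python
import socket


def service_fqdn(name, namespace="default", cluster_domain="cluster.local"):
    """Build the DNS name Kubernetes assigns to a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def resolve(hostname, port, getaddrinfo=socket.getaddrinfo):
    """Return the distinct IPs a hostname resolves to (the ClusterIP for a normal Service)."""
    infos = getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


fqdn = service_fqdn("payment-service")
print(fqdn)  # payment-service.default.svc.cluster.local
```

Inside a pod, `resolve(fqdn, 8080)` would return the Service's ClusterIP; the individual pod IPs stay hidden behind it.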
HashiCorp Consul
Consul provides a full-featured service discovery platform with DNS and HTTP APIs, health checking (HTTP, TCP, TTL, gRPC, script), and multi-datacenter federation. Services register via a local Consul agent, which runs their health checks and syncs the results to the Consul servers; the gossip protocol handles cluster membership and node-failure detection. Consul's prepared queries enable geographic failover: a query for 'payment-service' first tries the local DC, then fails over to the nearest healthy DC. Consul Connect adds mTLS and authorization policies, evolving the registry into a service mesh.
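The failover behavior of a prepared query can be illustrated with a plain function. This is only a model of the local-first, nearest-next ordering; it does not use Consul's API, and `resolve_with_failover`, the datacenter names, and the addresses are all hypothetical.

```python
def resolve_with_failover(service, local_dc, dcs_by_distance, healthy_by_dc):
    """Return (datacenter, instances) from the local DC, else the nearest DC with healthy instances."""
    candidates = [local_dc] + [dc for dc in dcs_by_distance if dc != local_dc]
    for dc in candidates:
        instances = healthy_by_dc.get(dc, {}).get(service, [])
        if instances:
            return dc, instances
    raise LookupError(f"no healthy instances of {service} in any datacenter")


# Assumed state: the local datacenter has no healthy payment instances left.
healthy_by_dc = {
    "dc-east": {"payment-service": []},
    "dc-central": {"payment-service": ["10.1.0.4:8080"]},
    "dc-west": {"payment-service": ["10.2.0.7:8080"]},
}

dc, instances = resolve_with_failover(
    "payment-service",
    local_dc="dc-east",
    dcs_by_distance=["dc-central", "dc-west"],  # nearest first
    healthy_by_dc=healthy_by_dc,
)
print(dc, instances)  # dc-central ['10.1.0.4:8080']
```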
Netflix Eureka
Eureka is Netflix's client-side service discovery system. Each service instance registers with a Eureka server and sends heartbeats every 30 seconds. Client services fetch the registry (typically every 30 seconds) and use Ribbon (client-side load balancer) to select an instance. Eureka is AP by design -- during a network partition, it enters 'self-preservation mode' and stops evicting instances, preferring to serve stale data over removing healthy instances. This aligns with Netflix's availability-first philosophy.
| Aspect | Description |
|---|---|
| Client-Side vs Server-Side Discovery | Client-side discovery (caller picks instance) gives applications full control over load-balancing strategy but requires a client library in every language and framework. Server-side discovery (DNS/LB picks instance) is language-agnostic and simpler for callers but offers less flexibility and adds a network hop through the load balancer. |
| Consistency vs Availability of Registry | A CP registry (etcd, ZooKeeper) guarantees accurate instance lists but cannot serve nodes cut off from the quorum during a partition. An AP registry (Eureka) stays available but may serve stale data -- directing traffic to dead instances or missing newly registered ones. Availability-first designs such as Eureka accept this trade-off because stale data is preferable to no data when a partition is brief. |
| Push vs Pull Updates | Push-based updates (watches in etcd, blocking queries in Consul) notify clients immediately when endpoints change. Pull-based updates (periodic polling in Eureka) are simpler but introduce a staleness window equal to the polling interval. Push reduces stale-routing duration but requires persistent connections and more server resources. |
| DNS-Based vs Registry-Based | DNS is universal -- every language and framework supports it. But DNS caching (TTL) causes stale resolution, DNS does not support metadata or health status, and DNS round-robin cannot do least-connections or consistent hashing. Registry-based approaches offer richer features but require a dedicated client library or sidecar proxy. |
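The push vs pull trade-off comes down to simple arithmetic. The sketch below uses a deliberately simplified model in which a client's worst-case stale window is the registry's failure-detection delay plus the client's polling interval; `worst_case_staleness` is a hypothetical helper, and the 90-second and 30-second values are the Eureka-style figures quoted elsewhere in this article. Real systems add further caching and propagation delays that are ignored here.

```python
def worst_case_staleness(detection_delay, poll_interval=0):
    """Seconds a client may keep routing to a failed instance (simplified model).

    detection_delay: time the registry needs to notice the failure.
    poll_interval: extra wait until a pull-based client refreshes;
                   0 approximates a push-based client notified immediately.
    """
    return detection_delay + poll_interval


pull = worst_case_staleness(detection_delay=90, poll_interval=30)
push = worst_case_staleness(detection_delay=90)
print(pull, push)  # 120 90
```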
Netflix's Service Discovery at Scale with Eureka
Scenario
Netflix operates thousands of microservices across multiple AWS regions, with instances constantly launching, terminating, and failing. During peak streaming hours, the system serves 250+ million subscribers, requiring sub-second service discovery to route requests to healthy instances. Any delay in detecting a failed instance causes error spikes visible to users as playback failures.
Solution
Netflix built Eureka, an AP service registry designed for their AWS environment. Each service instance registers with the local Eureka server and sends heartbeats every 30 seconds. Client services (using the Eureka client library + Ribbon load balancer) fetch the full registry every 30 seconds and cache it locally. Eureka servers replicate registrations to their peers. When an instance fails to heartbeat for 90 seconds, it is evicted -- unless Eureka enters 'self-preservation mode' (triggered when more than 15% of instances miss heartbeats simultaneously, indicating a network issue rather than mass failure).
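The eviction rule and its self-preservation override can be sketched as follows. This is a simplified model and not Eureka's implementation: it compares the fraction of expired instances against the 15% figure directly, and `instances_to_evict` and `sample_registry` are hypothetical names.

```python
EVICTION_TIMEOUT = 90.0         # seconds without a heartbeat before eviction
SELF_PRESERVATION_RATIO = 0.15  # fraction of missed heartbeats that suspends eviction


def instances_to_evict(last_heartbeat, now):
    """Return the instances to evict, or an empty set in self-preservation mode."""
    if not last_heartbeat:
        return set()
    expired = {
        name for name, seen in last_heartbeat.items()
        if now - seen > EVICTION_TIMEOUT
    }
    # Too many simultaneous misses looks like a network problem, not mass failure.
    if len(expired) / len(last_heartbeat) > SELF_PRESERVATION_RATIO:
        return set()
    return expired


def sample_registry(stale_count, total=10):
    """Build a registry where the first stale_count instances stopped heartbeating at t=0."""
    return {f"i{n}": (0.0 if n < stale_count else 100.0) for n in range(total)}


single_failure = instances_to_evict(sample_registry(1), now=110.0)
partition = instances_to_evict(sample_registry(4), now=110.0)
print(single_failure, partition)  # {'i0'} set()
```

With one of ten instances silent (10%), the instance is evicted normally; with four of ten silent (40%), the registry keeps every entry and serves possibly stale data instead.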
Outcome
Eureka has served Netflix's discovery needs for over a decade, handling millions of instance registrations and billions of discovery queries. The AP design means that during AWS regional failures, Eureka continues serving cached data rather than going offline. Self-preservation mode prevented mass instance evictions during network partitions that would have caused cascading failures. Eureka's track record at Netflix is a practical argument for the AP approach to service discovery.