Service Discovery

Service discovery is the mechanism by which services in a distributed system locate each other's network addresses. It replaces hardcoded IP addresses with dynamic, health-aware resolution -- essential in cloud-native environments where instances are ephemeral and addresses change frequently.

Overview

In a microservices architecture, a single user request may traverse 5-20 services. Each service runs multiple instances across different machines, containers, or serverless functions, with IP addresses that change on every deployment, scaling event, or failure recovery. Service discovery answers a fundamental question: when Service A needs to call Service B, how does A find B's current network address? Hardcoding addresses breaks as soon as an instance is replaced; DNS with long TTLs is too slow to reflect real-time changes; and manually updating configuration files does not scale.

There are two primary patterns. Client-side discovery means the calling service queries a service registry (e.g., Consul, Eureka, etcd) to get a list of healthy instances, then selects one using a load-balancing strategy (round-robin, least connections, consistent hashing). The client library handles registration, health checking, and address caching. Server-side discovery means the calling service sends requests to a load balancer or DNS endpoint that resolves to healthy instances. The caller does not know individual instance addresses. Kubernetes Services (backed by kube-proxy or Cilium) are the most common example -- a service name resolves to a virtual IP that is load-balanced to healthy pods.
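The client-side pattern can be sketched in a few lines of Python. This is a minimal illustration, not the API of any real registry: `InMemoryRegistry`, `RoundRobinClient`, and the pod addresses are hypothetical stand-ins, and a production client would also refresh and cache the instance list.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Instance:
    host: str
    port: int
    healthy: bool = True


class InMemoryRegistry:
    """Hypothetical stand-in for a registry such as Consul or Eureka."""

    def __init__(self) -> None:
        self._services: dict[str, list[Instance]] = {}

    def register(self, service: str, instance: Instance) -> None:
        self._services.setdefault(service, []).append(instance)

    def healthy_instances(self, service: str) -> list[Instance]:
        return [i for i in self._services.get(service, []) if i.healthy]


class RoundRobinClient:
    """Client-side discovery: the caller queries the registry and picks an instance."""

    def __init__(self, registry: InMemoryRegistry) -> None:
        self._registry = registry
        self._counter = count()

    def choose(self, service: str) -> Instance:
        candidates = self._registry.healthy_instances(service)
        if not candidates:
            raise LookupError(f"no healthy instances for {service!r}")
        # Round-robin: rotate through the healthy candidates on each call.
        return candidates[next(self._counter) % len(candidates)]


registry = InMemoryRegistry()
registry.register("payment-service", Instance("10.244.1.5", 8080))
registry.register("payment-service", Instance("10.244.2.8", 8080))
registry.register("payment-service", Instance("10.244.3.9", 8080, healthy=False))

client = RoundRobinClient(registry)
picks = [client.choose("payment-service").host for _ in range(4)]
print(picks)  # alternates between the two healthy hosts
```

In server-side discovery, the `choose` step would live in the load balancer instead, and the caller would only ever see one stable address.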

Health checking is integral to service discovery. A registry must distinguish between healthy instances (can serve traffic), unhealthy instances (running but failing health checks), and dead instances (no longer running). Consul uses a combination of agent-level health checks (HTTP, TCP, TTL, script) and gossip-based failure detection. Kubernetes uses readiness probes (HTTP, TCP, exec) to add/remove pod IPs from the Endpoints list. Without health checking, a service registry becomes a liability -- directing traffic to unhealthy instances causes cascading failures.
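The three instance states described above can be modeled as a small classifier. The sketch below assumes a registry that tracks a last-heartbeat timestamp and the latest health-check result per instance; the `Record` type, the 90-second timeout default, and the addresses are illustrative choices, not taken from Consul or Kubernetes.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"      # can serve traffic
    UNHEALTHY = "unhealthy"  # running but failing its health check
    DEAD = "dead"            # no longer reporting at all


@dataclass
class Record:
    address: str
    last_heartbeat: float    # seconds, on the registry's clock
    last_check_passed: bool


def classify(record: Record, now: float, heartbeat_timeout: float = 90.0) -> Status:
    # An instance that has been silent for too long is treated as dead,
    # regardless of what its last health check reported.
    if now - record.last_heartbeat > heartbeat_timeout:
        return Status.DEAD
    return Status.HEALTHY if record.last_check_passed else Status.UNHEALTHY


def routable(records: list[Record], now: float) -> list[str]:
    """Only healthy instances are handed out to callers."""
    return [r.address for r in records if classify(r, now) is Status.HEALTHY]


now = 1_000.0
records = [
    Record("10.244.1.5:8080", last_heartbeat=990.0, last_check_passed=True),
    Record("10.244.2.8:8080", last_heartbeat=995.0, last_check_passed=False),
    Record("10.244.3.9:8080", last_heartbeat=800.0, last_check_passed=True),
]
statuses = [classify(r, now) for r in records]
print(statuses, routable(records, now))
```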

Modern service meshes (Istio, Linkerd, Cilium) push service discovery into the infrastructure layer. Each service instance gets a sidecar proxy (Envoy in Istio, linkerd2-proxy in Linkerd) or, in Cilium, an eBPF program that intercepts all network traffic. The proxy receives endpoint updates from a control plane (via the xDS API in Envoy-based meshes) and handles routing, load balancing, retries, and circuit breaking transparently. The application code simply calls a hostname; the mesh handles everything else. This eliminates the need for client-side discovery libraries in each language and provides a uniform traffic management layer across all services.

Key Points

1. Service discovery replaces static IP configuration with dynamic, health-aware resolution. Without it, deploying new instances, scaling up/down, or recovering from failures requires manual address updates across all callers.
2. Client-side discovery (caller queries registry) gives the application full control over load balancing but requires a discovery library in every language. Server-side discovery (DNS/LB resolves) is simpler for callers but less flexible.
3. Health checking is mandatory. A registry without health checks is worse than no registry -- it directs traffic to dead or unhealthy instances. Readiness probes (HTTP 200, TCP connect, command execution) are the standard mechanism.
4. DNS-based discovery is the simplest approach (SRV records, A records) but has TTL caching issues -- stale DNS caches direct traffic to deregistered instances. CoreDNS in Kubernetes mitigates this with short TTLs and cluster-local resolution to a stable ClusterIP.
5. Consul, Eureka, etcd, and ZooKeeper are dedicated service registries. Consul provides DNS and HTTP APIs, multi-datacenter support, and integrated health checking. Eureka (Netflix) is simpler but lacks Consul-style multi-datacenter federation.
6. Service meshes (Istio/Envoy, Linkerd, Cilium) embed service discovery into the network layer, making it transparent to application code. The sidecar proxy maintains an up-to-date endpoint list and handles all routing decisions.

Simple Example

Finding the Payment Service

The order service needs to call the payment service. With service discovery, the order service calls 'payment-service:8080' (a logical name). In Kubernetes, CoreDNS resolves this to a ClusterIP (e.g., 10.96.0.15). kube-proxy routes traffic to one of the healthy payment pods (e.g., 10.244.1.5:8080, 10.244.2.8:8080). If a payment pod fails its readiness probe, Kubernetes removes its IP from the Endpoints list within seconds, and no traffic is routed to it. When a new pod starts and passes readiness, it is automatically added. The order service's code never changes.
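The flow in this example can be mimicked with a toy model. The sketch below is not Kubernetes code: `ServiceProxy` is a hypothetical class standing in for the ClusterIP plus kube-proxy, the `dns` dictionary stands in for CoreDNS, and random selection is a simplification of how traffic is spread across pods.

```python
import random


class ServiceProxy:
    """Toy model of server-side discovery: a stable virtual IP in front of ready pods."""

    def __init__(self, cluster_ip: str) -> None:
        self.cluster_ip = cluster_ip
        self._ready: set[str] = set()

    def set_ready(self, pod_addr: str, ready: bool) -> None:
        # Mirrors a readiness probe result adding or removing an endpoint.
        if ready:
            self._ready.add(pod_addr)
        else:
            self._ready.discard(pod_addr)

    def route(self, rng: random.Random) -> str:
        if not self._ready:
            raise ConnectionError("no ready endpoints")
        return rng.choice(sorted(self._ready))


# Logical name -> Service, standing in for the DNS lookup.
dns = {"payment-service": ServiceProxy("10.96.0.15")}

svc = dns["payment-service"]
svc.set_ready("10.244.1.5:8080", True)
svc.set_ready("10.244.2.8:8080", True)

# One pod fails its readiness probe and is removed from the endpoint set.
svc.set_ready("10.244.1.5:8080", False)

rng = random.Random(0)
targets = {svc.route(rng) for _ in range(10)}
print(svc.cluster_ip, targets)
```

The caller only ever holds the name and the virtual IP; the set of backing pods changes underneath it.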

Real-World Examples

Kubernetes (CoreDNS + kube-proxy)

Kubernetes provides built-in service discovery via CoreDNS and Service objects. Each Service gets a stable DNS name (e.g., payment-service.default.svc.cluster.local) that resolves to a ClusterIP. kube-proxy (or Cilium's eBPF) load-balances connections to healthy pod IPs listed in the Endpoints object. Pod readiness probes control inclusion in the Endpoints list. This server-side discovery model means application code only needs to know the service name, not individual pod addresses.

HashiCorp Consul

Consul provides a full-featured service discovery platform with DNS and HTTP APIs, health checking (HTTP, TCP, TTL, gRPC, script), and multi-datacenter federation. Services register via a local Consul agent, which runs the health checks and syncs their results to the Consul servers; gossip handles cluster membership and node-level failure detection. Consul's prepared queries enable geographic failover: a query for 'payment-service' first tries the local DC, then fails over to the nearest healthy DC. Consul Connect adds mTLS and authorization policies, evolving the registry into a service mesh.
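The failover behavior described for prepared queries can be approximated as "local datacenter first, then remote datacenters by increasing round-trip time". The sketch below illustrates that idea only; it does not use Consul's API, and the function name, datacenter names, addresses, and RTT values are all invented for the example.

```python
def resolve_with_failover(
    service: str,
    local_dc: str,
    catalog: dict[str, dict[str, list[str]]],
    rtt_ms: dict[str, float],
) -> tuple[str, list[str]]:
    """Return (datacenter, healthy addresses), preferring the local DC."""
    remote_dcs = sorted(
        (dc for dc in catalog if dc != local_dc),
        key=lambda dc: rtt_ms[dc],
    )
    for dc in [local_dc, *remote_dcs]:
        healthy = catalog.get(dc, {}).get(service, [])
        if healthy:
            return dc, healthy
    raise LookupError(f"no healthy instances of {service!r} in any datacenter")


# Hypothetical catalog of healthy instances per datacenter.
catalog = {
    "us-east": {"payment-service": []},                 # local DC, nothing healthy
    "us-west": {"payment-service": ["10.2.0.7:8080"]},
    "eu-west": {"payment-service": ["10.3.0.4:8080"]},
}
rtt_ms = {"us-west": 65.0, "eu-west": 80.0}             # measured from us-east

dc, addrs = resolve_with_failover("payment-service", "us-east", catalog, rtt_ms)
print(dc, addrs)
```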

Netflix Eureka

Eureka is Netflix's client-side service discovery system. Each service instance registers with a Eureka server and sends heartbeats every 30 seconds. Client services fetch the registry (typically every 30 seconds) and use Ribbon (client-side load balancer) to select an instance. Eureka is AP by design -- during a network partition, it enters 'self-preservation mode' and stops evicting instances, preferring to serve stale data over removing healthy instances. This aligns with Netflix's availability-first philosophy.

Trade-Offs

Aspect	Description
Client-Side vs Server-Side Discovery	Client-side discovery (caller picks instance) gives applications full control over load-balancing strategy but requires a client library in every language and framework. Server-side discovery (DNS/LB picks instance) is language-agnostic and simpler for callers but offers less flexibility and adds a network hop through the load balancer.
Consistency vs Availability of Registry	A CP registry (etcd, ZooKeeper) guarantees accurate instance lists but cannot accept writes or serve consistent reads on the minority side of a partition. An AP registry (Eureka) stays available but may serve stale data -- directing traffic to dead instances or missing newly registered ones. Systems that choose AP do so because stale data is usually preferable to no data when a partition is brief.
Push vs Pull Updates	Push-based updates (watches in etcd, blocking queries in Consul) notify clients immediately when endpoints change. Pull-based updates (periodic polling in Eureka) are simpler but introduce a staleness window equal to the polling interval. Push reduces stale-routing duration but requires persistent connections and more server resources.
DNS-Based vs Registry-Based	DNS is universal -- every language and framework supports it. But DNS caching (TTL) causes stale resolution, DNS does not support metadata or health status, and DNS round-robin cannot do least-connections or consistent hashing. Registry-based approaches offer richer features but require a dedicated client library or sidecar proxy.
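The push vs pull row can be made concrete with rough worst-case arithmetic for a crashed instance (one that stops heartbeating without deregistering). The sketch below is a simplified model: the 90-second eviction timeout and 30-second polling interval are the Eureka figures quoted in this article, while the 10-second check interval and 1-second propagation delay for the push case are assumed values chosen only for illustration. Real systems add further caching layers that this ignores.

```python
def pull_staleness_s(eviction_timeout_s: float, poll_interval_s: float) -> float:
    """Worst case for pull: the registry waits out the eviction timeout,
    then the client may take a full polling interval to notice."""
    return eviction_timeout_s + poll_interval_s


def push_staleness_s(check_interval_s: float, propagation_s: float) -> float:
    """Worst case for push: one health-check interval to detect the failure,
    plus the time to deliver the update to watching clients."""
    return check_interval_s + propagation_s


pull = pull_staleness_s(eviction_timeout_s=90.0, poll_interval_s=30.0)
push = push_staleness_s(check_interval_s=10.0, propagation_s=1.0)  # assumed values
print(f"pull: up to {pull:.0f}s stale, push: up to {push:.0f}s stale")
```

When an instance shuts down gracefully and deregisters itself, the eviction term disappears and the pull case is bounded by the polling interval alone.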

Case Study

Netflix's Service Discovery at Scale with Eureka

Scenario

Netflix operates thousands of microservices across multiple AWS regions, with instances constantly launching, terminating, and failing. During peak streaming hours, the system serves 250+ million subscribers, requiring low-latency, highly available service discovery to route requests to healthy instances. Any delay in detecting a failed instance causes error spikes visible to users as playback failures.

Solution

Netflix built Eureka, an AP service registry designed for their AWS environment. Each service instance registers with the local Eureka server and sends heartbeats every 30 seconds. Client services (using the Eureka client library + Ribbon load balancer) fetch the full registry every 30 seconds and cache it locally. Eureka servers replicate registrations to their peers. When an instance fails to heartbeat for 90 seconds, it is evicted -- unless Eureka enters 'self-preservation mode' (triggered when more than 15% of instances miss heartbeats at the same time, which indicates a network issue rather than mass failure).
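The eviction rule with self-preservation can be sketched as follows. This is a simplified model of the behavior described above, not Eureka's actual implementation (which tracks expected renewals per minute); the `Lease` type, function name, and instance IDs are hypothetical, and the constants restate the 90-second timeout and 15% threshold from the text.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_S = 90.0   # evict after this long without a heartbeat
RENEWAL_THRESHOLD = 0.85     # below 85% live leases, assume a network problem


@dataclass
class Lease:
    instance_id: str
    last_heartbeat: float


def evict_expired(leases: list[Lease], now: float) -> tuple[list[Lease], bool]:
    """Return (surviving leases, self_preservation_active)."""
    live = [lease for lease in leases if now - lease.last_heartbeat <= HEARTBEAT_TIMEOUT_S]
    if leases and len(live) < RENEWAL_THRESHOLD * len(leases):
        # Too many instances went quiet at once: keep everything rather than
        # evict instances that are probably healthy but unreachable.
        return leases, True
    return live, False


now = 1_000.0
fresh = [Lease(f"i-{n}", last_heartbeat=990.0) for n in range(19)]

# One of 20 instances is silent: a normal failure, so it is evicted.
survivors_a, preserved_a = evict_expired(fresh + [Lease("i-19", 800.0)], now)

# Five of 20 are silent (25% > 15%): self-preservation keeps all of them.
stale = [Lease(f"i-{n}", last_heartbeat=800.0) for n in range(15, 20)]
survivors_b, preserved_b = evict_expired(fresh[:15] + stale, now)

print(len(survivors_a), preserved_a, len(survivors_b), preserved_b)
```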

Outcome

Eureka has served Netflix's discovery needs for over a decade, handling millions of instance registrations and billions of discovery queries. The AP design means that during AWS regional failures, Eureka continues serving cached data rather than going offline. Self-preservation mode prevented mass instance evictions during network partitions that would have caused cascading failures. Eureka's longevity supports the case for an AP approach to service discovery.

Common Mistakes

⚠ Relying on DNS caching for service discovery in dynamic environments. DNS TTL caching at the OS, JVM, or library level can hold stale entries for minutes after an instance is deregistered. Use a registry with push-based updates or a service mesh for real-time endpoint changes.
⚠ Not implementing health checks. Registering instances without checking their health means the registry directs traffic to instances that are running but not ready to serve (e.g., still loading data, experiencing dependency failures). Always use readiness probes or equivalent health checks.
⚠ Using service discovery without a client-side cache or fallback. If the registry is temporarily unreachable, clients that query it synchronously on every request will fail. Cache the latest known endpoints and use them as a fallback during registry outages.
⚠ Ignoring cross-datacenter discovery. Many teams set up service discovery within a single data center but forget that failover to another DC also requires discovering services there. Consul's multi-DC federation and Kubernetes multi-cluster service mesh address this.
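The cache-and-fallback advice in the list above might look like the following minimal sketch. `CachingResolver` and `fetch_from_registry` are hypothetical names, the registry outage is simulated with a flag, and a real client would also bound how old a cached entry may get before it refuses to use it.

```python
from typing import Callable


class CachingResolver:
    """Remembers the last good answer and serves it if the registry is unreachable."""

    def __init__(self, fetch: Callable[[str], list[str]]) -> None:
        self._fetch = fetch
        self._cache: dict[str, list[str]] = {}

    def resolve(self, service: str) -> list[str]:
        try:
            endpoints = self._fetch(service)
        except ConnectionError:
            if service in self._cache:
                return self._cache[service]   # stale but usable
            raise                             # nothing cached: surface the outage
        self._cache[service] = endpoints
        return endpoints


# Simulated registry whose availability can be toggled for the demo.
state = {"registry_up": True}


def fetch_from_registry(service: str) -> list[str]:
    if not state["registry_up"]:
        raise ConnectionError("registry unreachable")
    return ["10.244.1.5:8080", "10.244.2.8:8080"]


resolver = CachingResolver(fetch_from_registry)
fresh_endpoints = resolver.resolve("payment-service")

state["registry_up"] = False
fallback_endpoints = resolver.resolve("payment-service")  # served from cache
print(fallback_endpoints)
```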

Related Concepts

Gossip Protocol, Leader Election, DNS, CAP Theorem, Circuit Breaker

Test Your Understanding

1. What is the primary difference between client-side and server-side service discovery?

2. Why did Netflix design Eureka as an AP system rather than a CP system?

Deeper Reading