
Service Discovery

Service discovery enables services in a distributed system to find and communicate with each other without hardcoded addresses. As containers and pods are created and destroyed dynamically, a discovery mechanism maintains an up-to-date registry of available service instances and their locations.

Overview

In traditional monolithic architectures, service locations are static: the database is at 10.0.1.5:5432, the cache is at 10.0.1.6:6379. These addresses are hardcoded in configuration files and rarely change. In cloud-native architectures with containers and auto-scaling, service instances are ephemeral -- they are created, destroyed, and rescheduled across hosts continuously. A service that had 3 instances at 10.0.1.{5,6,7} five minutes ago might now have 5 instances at completely different IPs. Service discovery is the mechanism that maintains an accurate, real-time map of service names to instance locations.

There are two fundamental patterns: client-side discovery and server-side discovery. In client-side discovery, each service queries a service registry (Consul, etcd, ZooKeeper) to get the list of available instances for a target service, then selects one using a client-side load-balancing algorithm (round-robin, least-connections, consistent hashing). Netflix Eureka and Ribbon popularized this pattern. The advantage is full control over load balancing; the disadvantage is coupling every service to the registry client library.
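
The client-side pattern described above can be sketched in a few lines. This is a minimal illustration, not the API of any real registry: `InMemoryRegistry`, `Instance`, and `RoundRobinClient` are hypothetical names, and the in-memory dictionary stands in for a networked registry such as Consul or Eureka.

```python
import itertools


class Instance:
    """One registered service instance (hypothetical type for this sketch)."""

    def __init__(self, host: str, port: int) -> None:
        self.host = host
        self.port = port


class InMemoryRegistry:
    """Hypothetical stand-in for a registry such as Consul or Eureka."""

    def __init__(self) -> None:
        self._services = {}

    def register(self, service: str, instance: Instance) -> None:
        self._services.setdefault(service, []).append(instance)

    def lookup(self, service: str) -> list:
        # Return a copy so callers cannot mutate registry state.
        return list(self._services.get(service, []))


class RoundRobinClient:
    """Asks the registry for instances, then rotates through them locally."""

    def __init__(self, registry: InMemoryRegistry) -> None:
        self._registry = registry
        self._counters = {}

    def choose(self, service: str) -> Instance:
        instances = self._registry.lookup(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        counter = self._counters.setdefault(service, itertools.count())
        return instances[next(counter) % len(instances)]
```

Because the selection logic lives in the caller, swapping round-robin for least-connections or consistent hashing only changes `choose`. That is the "full control" the pattern offers, and also the reason every service ends up depending on this client code.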

In server-side discovery, the calling service sends requests to a stable endpoint (a load balancer, reverse proxy, or Kubernetes Service), which routes to healthy backend instances. The caller does not need to know about the registry or load-balancing logic. Kubernetes Services (ClusterIP backed by kube-proxy iptables/IPVS rules), AWS ALB target groups, and sidecar-based service meshes (Istio, which uses Envoy, and Linkerd, which uses its own proxy) implement server-side discovery. This is simpler for service developers but adds a network hop and centralizes the routing logic.

DNS-based discovery is a universal approach that works with both patterns. Kubernetes provides automatic DNS entries (my-service.my-namespace.svc.cluster.local) that resolve to the Service's ClusterIP. Consul provides a DNS interface where my-service.service.consul resolves to healthy instance IPs. The challenge with DNS is TTL caching, which matters most when records resolve directly to instance IPs rather than to a stable virtual IP: clients, OS resolvers, and intermediary caches may serve stale records for seconds to minutes after an instance dies, causing connection failures. Low TTLs (5-10 seconds) mitigate this but increase DNS query load.
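
The staleness window can be made concrete with a small simulation. This sketch does not perform real DNS queries: `TtlCachingResolver` is a hypothetical class, the upstream "authoritative" answers come from a plain dictionary, and the clock is injected so the example is deterministic.

```python
from typing import Callable


class TtlCachingResolver:
    """Caches upstream answers for ttl_seconds, like a resolver cache would."""

    def __init__(self, upstream: Callable[[str], list], ttl_seconds: float,
                 clock: Callable[[], float]) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # name -> (expires_at, addresses)

    def resolve(self, name: str) -> list:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[0]:
            return cached[1]  # possibly stale until the TTL runs out
        addresses = self._upstream(name)
        self._cache[name] = (now + self._ttl, addresses)
        return addresses


# Simulated authoritative records and a fake clock (both assumptions).
NAME = "my-service.service.consul"
records = {NAME: ["10.0.1.5", "10.0.1.6"]}
now = [0.0]
resolver = TtlCachingResolver(lambda n: list(records[n]), ttl_seconds=10,
                              clock=lambda: now[0])

first = resolver.resolve(NAME)   # cache filled at t=0
records[NAME] = ["10.0.1.6"]     # 10.0.1.5 dies right after
now[0] = 9.0
stale = resolver.resolve(NAME)   # still served from cache
now[0] = 10.0
fresh = resolver.resolve(NAME)   # TTL expired, upstream is queried again
```

In this run the dead address 10.0.1.5 is still handed out at t=9 and disappears only once the 10-second TTL expires, which is why a lower TTL shortens the failure window at the cost of more upstream queries.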

Key Points
  1. Client-side discovery: service queries a registry (Consul, Eureka, etcd) for instance addresses and load-balances locally. Gives full control over routing (weighted, canary, session-affinity) but couples services to the registry SDK.
  2. Server-side discovery: service calls a stable virtual IP or DNS name (Kubernetes Service, ALB), and a proxy/load-balancer routes to healthy instances. Simpler for services but adds a hop and centralizes routing.
  3. Health checking is essential: the registry must continuously verify that registered instances are alive (TCP, HTTP, gRPC health checks). Stale entries route traffic to dead instances, causing errors and latency spikes.
  4. DNS-based discovery is universally compatible but suffers from TTL caching: after an instance dies, cached DNS records may route traffic to the dead instance for seconds to minutes. DNS is best used alongside active health checking.
  5. Kubernetes CoreDNS provides built-in service discovery: every Service gets a DNS entry. Headless Services (clusterIP: None) return individual pod IPs, enabling client-side load balancing for stateful workloads.
  6. Service mesh (Istio, Linkerd) combines service discovery with traffic management, retries, circuit breaking, mTLS, and observability in a sidecar proxy (Envoy in Istio, linkerd2-proxy in Linkerd), so application code does not need to know about discovery mechanics.
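
The health-checking point can be illustrated with a failure-threshold sketch. The names here (`HealthCheckedPool`, `probe`, `unhealthy_threshold`) are hypothetical, and the probe is an injected function so the example runs without a network; a real probe would open a TCP connection or call an HTTP or gRPC health endpoint.

```python
from typing import Callable


class HealthCheckedPool:
    """Drops an instance after N consecutive failed probes, restores on success."""

    def __init__(self, instances: list, probe: Callable[[str], bool],
                 unhealthy_threshold: int = 2) -> None:
        self._probe = probe
        self._threshold = unhealthy_threshold
        self._failures = {instance: 0 for instance in instances}

    def run_checks(self) -> None:
        """One check round; a scheduler would call this every few seconds."""
        for instance in self._failures:
            if self._probe(instance):
                self._failures[instance] = 0
            else:
                self._failures[instance] += 1

    def healthy(self) -> list:
        return [instance for instance, failures in self._failures.items()
                if failures < self._threshold]
```

Requiring more than one consecutive failure is a common way to avoid evicting an instance over a single dropped probe; the trade-off is that a truly dead instance stays in rotation for `unhealthy_threshold` check intervals.
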
Simple Example

Phone Directory Analogy

Service discovery works like a company phone directory. In the old world (monoliths), everyone had a fixed desk phone with a number that never changed -- you memorized it or wrote it on a sticky note (hardcoded config). In the modern world (microservices), employees are mobile -- they work from different desks, offices, or remotely every day. A phone directory service (the registry) tracks everyone's current number and location. Client-side discovery is like looking up a colleague in the directory app and calling them directly. Server-side discovery is like calling the company switchboard (load balancer), which connects you to the right person. Both need the directory to be accurate -- a stale entry means your call goes to an empty desk (dead instance).

Real-World Examples

Netflix

Netflix built Eureka, an open-source service discovery system, for their 1,000+ microservices on AWS. Each service registers with Eureka on startup and sends heartbeats every 30 seconds. Clients use the Ribbon library for client-side load balancing with Eureka as the registry. Eureka's AP design (availability over consistency) ensures services can still discover each other during network partitions, at the cost of potentially stale registrations.
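
Heartbeat-based registration of this kind is often modeled as a lease that each heartbeat renews. The sketch below is not Eureka's implementation or API: `LeaseRegistry` is a hypothetical class, and the 90-second lease (three missed 30-second heartbeats) is an assumption chosen for illustration.

```python
from typing import Callable

LEASE_SECONDS = 90  # assumption: an instance expires after three missed heartbeats


class LeaseRegistry:
    """Keeps an instance listed only while its lease is being renewed."""

    def __init__(self, clock: Callable[[], float],
                 lease_seconds: float = LEASE_SECONDS) -> None:
        self._clock = clock
        self._lease = lease_seconds
        self._expires_at = {}  # (service, instance) -> expiry timestamp

    def heartbeat(self, service: str, instance: str) -> None:
        """Register on first call, renew the lease on every later call."""
        self._expires_at[(service, instance)] = self._clock() + self._lease

    def instances(self, service: str) -> list:
        now = self._clock()
        return sorted(
            instance
            for (name, instance), expires_at in self._expires_at.items()
            if name == service and expires_at > now
        )
```

An instance that crashes without deregistering stays visible until its lease runs out, which is one source of the stale registrations mentioned above.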

HashiCorp / Consul Users

HashiCorp Consul is used by organizations like Stripe, Criteo, and Ticketmaster for multi-datacenter service discovery. Consul combines a service registry with health checking (TCP, HTTP, gRPC, script-based), DNS and HTTP discovery interfaces, and a key-value store for configuration. Its gossip-based protocol (Serf) detects node failures in seconds, and Raft-based consensus keeps registry state consistent among the servers within each datacenter, while separate datacenters are federated over WAN gossip.
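
As a rough illustration of the HTTP discovery interface, the sketch below extracts endpoints from a response shaped like the one Consul's `/v1/health/service/<name>?passing` endpoint returns. The sample payload is hand-written and heavily trimmed, the function name `healthy_endpoints` is hypothetical, and no request is actually sent.

```python
import json

# Hand-written, trimmed sample; a real response carries many more fields.
SAMPLE_RESPONSE = """
[
  {"Node": {"Node": "node-a", "Address": "10.0.1.5"},
   "Service": {"ID": "web-1", "Service": "web", "Address": "", "Port": 8080}},
  {"Node": {"Node": "node-b", "Address": "10.0.1.6"},
   "Service": {"ID": "web-2", "Service": "web", "Address": "10.0.2.9", "Port": 8081}}
]
"""


def healthy_endpoints(payload: str) -> list:
    """Return (address, port) pairs from a health-service style response."""
    endpoints = []
    for entry in json.loads(payload):
        service = entry["Service"]
        # An empty service address means the node's address should be used.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


endpoints = healthy_endpoints(SAMPLE_RESPONSE)
```

Compared with a DNS lookup, the API response carries per-instance ports and metadata, at the cost of the caller having to speak HTTP and parse JSON.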

Uber

Uber built Hyperbahn, a custom service discovery and routing layer using TChannel (their RPC protocol). Each service registers with Hyperbahn on startup, and all inter-service calls route through the Hyperbahn mesh, which handles discovery, load balancing, and circuit breaking. This server-side discovery model means individual services have zero knowledge of the service topology, simplifying service development at the cost of additional network hops.

Trade-Offs
Aspect | Description
Client-Side vs. Server-Side Discovery | Client-side discovery (Eureka, Consul SDK) gives services full control over load balancing, routing, and failover, but requires every service to include the discovery client library, creating language-specific SDKs and coupling. Server-side discovery (K8s Services, ALB, Envoy) is language-agnostic and simpler for services but adds a network hop, centralizes routing logic, and makes custom load-balancing harder.
Consistency vs. Availability in Registry | A CP registry (ZooKeeper, etcd) provides consistent views but may become unavailable during partitions, so new services cannot register. An AP registry (Eureka) stays available during partitions but may serve stale entries (a deregistered service is still returned). For discovery, availability is often the better trade: serving slightly stale data is usually preferable to returning no data at all.
DNS vs. API-Based Discovery | DNS is universal (every language and framework supports it) but limited: no health metadata, TTL caching delays propagation, and round-robin is the only native load-balancing strategy. API-based discovery (Consul HTTP API, Eureka REST) provides rich metadata (health status, zone, version, weight) but requires a client library.
Self-Registration vs. Third-Party Registration | Self-registration (service registers itself on startup) is simple but means the service must know the registry address and handle re-registration on restart. Third-party registration (a separate registrar watches for new containers and registers them) decouples services from the registry but adds a component that must be monitored and maintained.
Case Study

Airbnb's Migration from DNS to Envoy-Based Service Discovery

Scenario

Airbnb initially used DNS-based service discovery with Route53 for their microservices on AWS. As the number of services grew to 1,000+, DNS TTL caching caused recurring issues: when a service instance was replaced (auto-scaling, deployment), clients held stale DNS records for up to 60 seconds, sending requests to terminated instances. This resulted in ~0.5% error rate during deployments and scaling events.

Solution

Airbnb migrated to an Envoy-based service mesh with a custom control plane. Each service runs an Envoy sidecar that receives real-time endpoint updates from the control plane via xDS protocol (push-based, not DNS). Health checking runs at the Envoy level: unhealthy instances are removed from the load-balancing pool within 5 seconds. The control plane integrates with their Kubernetes clusters, EC2 auto-scaling groups, and legacy services, providing a unified discovery layer.
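
The difference between push-based updates and DNS polling can be sketched with a simple subscriber model. This is not the xDS protocol (which streams typed resources over gRPC) and not Airbnb's control plane: `ControlPlane` and `Sidecar` are hypothetical classes that only show the direction of data flow.

```python
class ControlPlane:
    """Holds the endpoint table and pushes every change to subscribers."""

    def __init__(self) -> None:
        self._endpoints = {}
        self._subscribers = []

    def subscribe(self, callback) -> None:
        self._subscribers.append(callback)
        # Send the current state so a new subscriber starts in sync.
        for service, addresses in self._endpoints.items():
            callback(service, list(addresses))

    def set_endpoints(self, service: str, addresses: list) -> None:
        self._endpoints[service] = list(addresses)
        for callback in self._subscribers:
            callback(service, list(addresses))


class Sidecar:
    """Keeps a local endpoint table that is updated by pushes, never by polling."""

    def __init__(self, control_plane: ControlPlane) -> None:
        self.endpoints = {}
        control_plane.subscribe(self._on_update)

    def _on_update(self, service: str, addresses: list) -> None:
        self.endpoints[service] = addresses
```

With this shape, the delay between an endpoint change and the sidecar seeing it is bounded by delivery time rather than by a cache TTL.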

Outcome

Deployment-related error rates dropped from ~0.5% to near zero. Endpoint propagation time decreased from 60 seconds (DNS TTL) to under 5 seconds (real-time xDS push). The Envoy sidecar also enabled traffic shifting for canary deployments (1% -> 5% -> 25% -> 100%), circuit breaking for failing dependencies, and automatic retries with exponential backoff -- capabilities that previously required application-level code in each service.
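
The staged canary rollout (1% -> 5% -> 25% -> 100%) can be sketched as a percentage-based routing decision. This is a simplified illustration, not how Envoy assigns traffic: the functions `bucket` and `route` are hypothetical, and hashing a request key into 100 buckets is one assumed way to keep each key's assignment stable as the percentage grows.

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 100]  # canary percentages from the text above


def bucket(request_key: str) -> int:
    """Map a request key (for example a user ID) to a stable bucket in 0..99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(request_key: str, canary_percent: int) -> str:
    """Send the key to the canary if its bucket falls below the percentage."""
    return "canary" if bucket(request_key) < canary_percent else "stable"
```

Because a key's bucket never changes, any key routed to the canary at 5% is still routed there at 25%, so raising the percentage only adds traffic to the canary and never moves a key back to the stable version mid-rollout.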

Common Mistakes
  • Relying solely on DNS without health checking. DNS records may point to instances that are running but not healthy (deadlocked, out of memory, failing health checks). Always pair DNS discovery with active health checking that removes unhealthy instances from DNS responses within seconds.
  • Setting DNS TTL too high. A 300-second TTL means clients may route traffic to a dead instance for 5 minutes after it terminates. Use TTLs of 5-15 seconds for service discovery DNS records, and implement client-side connection retry logic for the remaining propagation delay.
  • Not implementing graceful shutdown. When a service instance receives a termination signal, it should deregister from the service registry, stop accepting new connections, drain existing requests (with a timeout), and then exit. Without graceful shutdown, in-flight requests are dropped.
  • Hardcoding service addresses in configuration. Even in environments without dynamic scaling, hardcoded IPs create operational fragility. Use service discovery or at minimum DNS names so that infrastructure changes (IP reassignment, AZ migration) do not require application redeployment.
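
The graceful shutdown sequence from the list above (deregister, stop accepting, drain with a timeout, exit) can be sketched as follows. `GracefulWorker` and `install_sigterm_handler` are hypothetical names, and the `deregister` callback stands in for whatever call a real registry client would make.

```python
import signal
import threading


class GracefulWorker:
    """Tracks in-flight requests so shutdown can drain them before exiting."""

    def __init__(self, deregister) -> None:
        self._deregister = deregister
        self._accepting = True
        self._in_flight = 0
        self._cond = threading.Condition()

    def try_begin_request(self) -> bool:
        with self._cond:
            if not self._accepting:
                return False  # shutting down: refuse new work
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, drain_timeout: float) -> bool:
        """Return True if all in-flight requests finished within the timeout."""
        self._deregister()               # 1. leave the registry first
        with self._cond:
            self._accepting = False      # 2. stop accepting new requests
            return self._cond.wait_for(  # 3. drain, bounded by the timeout
                lambda: self._in_flight == 0, timeout=drain_timeout)


def install_sigterm_handler(worker: GracefulWorker, drain_timeout: float = 10.0) -> None:
    """Run the shutdown sequence on SIGTERM, then exit (call from the main thread)."""
    def handle(signum, frame):
        worker.shutdown(drain_timeout)
        raise SystemExit(0)              # 4. exit after draining or timing out
    signal.signal(signal.SIGTERM, handle)
```

Deregistering before refusing connections gives callers a chance to stop routing to the instance; the drain timeout keeps a stuck request from blocking the exit indefinitely.
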
Test Your Understanding

1. What is the primary difference between client-side and server-side service discovery?

2. Why is DNS-based service discovery problematic for rapidly changing environments?
