1Why does the least-connections algorithm perform better than round-robin for services with variable request durations?
Load balancing algorithms determine how traffic is distributed across backend servers. From simple round-robin to sophisticated power-of-two-choices and EWMA, the algorithm choice directly impacts latency, throughput, and fault tolerance.
Load balancing algorithms sit at the heart of every scalable system, determining how incoming requests are distributed across a pool of backend servers. The choice of algorithm directly impacts tail latency, throughput, and fault tolerance. A poor algorithm can create hot spots (overloading some servers while others are idle), amplify failures (sending traffic to unhealthy servers), or cause unnecessary cache misses (routing the same user to different servers, fragmenting cached state). Understanding the trade-offs between algorithms is essential for designing systems that perform well under both normal load and failure conditions.
The simplest algorithm is round-robin: requests are distributed to servers in sequential order (server 1, server 2, server 3, server 1, ...). Round-robin is easy to implement and provides perfectly even distribution when all servers are identical and all requests take the same time. In practice, neither assumption holds: servers may have different CPU/memory capacities, and request processing times vary dramatically (a cache hit takes 1ms, a cache miss with database query takes 100ms). Weighted round-robin addresses capacity differences by assigning each server a weight proportional to its capacity (a 16-core server gets weight 4, a 4-core server gets weight 1), but it still does not account for varying request complexity or transient server slowness.
Least connections is one of the most effective algorithms for handling heterogeneous request processing times. It routes each new request to the server with the fewest currently active (in-flight) connections. A server that is processing a slow request accumulates connections, so the algorithm naturally routes away from it. This makes least connections self-correcting: slow servers receive less traffic, fast servers receive more. IP hash produces sticky sessions by hashing the client IP address to a specific server, ensuring the same client always reaches the same server. This is useful for preserving session state and cache locality but can cause uneven distribution if some client IPs generate disproportionate traffic. Consistent hashing extends this idea with minimal disruption: when a server is added or removed, only a fraction of requests are redistributed (approximately 1/N for N servers), making it ideal for caching layers where redistributing requests means cache misses.
Modern algorithms like power-of-two-choices (P2C) and EWMA represent the state of the art. P2C randomly selects two servers from the pool and routes the request to the one with fewer active connections. Despite its simplicity, P2C achieves exponentially better load distribution than random selection -- the maximum load on any server is O(log log N) instead of O(log N / log log N). P2C is used by HAProxy, Envoy, and many internal load balancers at Google. EWMA (Exponentially Weighted Moving Average) tracks each server's recent response latency using a weighted average that emphasizes recent measurements. Requests are routed to the server with the lowest EWMA latency, automatically avoiding servers experiencing transient slowness (GC pauses, disk I/O spikes). The combination of P2C with EWMA -- picking two random servers and choosing the one with lower EWMA latency -- is particularly effective and is the default algorithm in Linkerd.
The Grocery Store Checkout Analogy
Imagine arriving at a grocery store with 5 checkout lanes. Round-robin is like a greeter directing each new customer to the next lane in sequence, regardless of how long each lane's line is. Least connections is like choosing the lane with the fewest people currently in line. IP hash is like always going to 'your' lane based on your loyalty card number -- good for building a relationship with your cashier but bad if your lane is slow. Power-of-two-choices is like looking at two random lanes and picking the shorter one -- surprisingly effective because you almost never end up in the longest line. EWMA is like tracking how fast each cashier has been scanning items in the last few minutes and choosing the fastest one.
Nginx
Nginx supports multiple load balancing algorithms: round-robin (default), least_conn (least connections), ip_hash (sticky sessions by client IP), and hash (consistent hashing on arbitrary keys). The 'least_conn' directive is recommended for applications with varying request durations (e.g., long-polling, WebSocket mixed with short HTTP requests). Nginx also supports server weights and slow-start (gradually increasing traffic to a newly added server).
HAProxy
HAProxy is a high-performance load balancer supporting sophisticated algorithms including roundrobin, leastconn, source (IP hash), and 'random(2)' which implements power-of-two-choices. HAProxy's health checking is particularly advanced: it supports HTTP health checks with status code matching, TCP checks, custom scripts, and 'observed' passive health tracking that removes servers based on real traffic error rates. HAProxy can handle millions of concurrent connections with sub-millisecond decision latency.
AWS ALB
AWS Application Load Balancer uses round-robin by default but supports Least Outstanding Requests (LOR) routing, which routes to the target with the fewest in-flight requests (similar to least connections but at the request level rather than TCP connection level). ALB also implements slow start mode, gradually ramping traffic to new targets over a configurable period (30-900 seconds) to allow warm-up of caches and JIT compilation before receiving full load.
| Aspect | Description |
|---|---|
| Simplicity vs Adaptability | Round-robin is trivial to implement with zero per-request computation but does not adapt to varying server load or request complexity. Least connections adapts automatically but requires tracking active connection counts for each server. P2C provides near-optimal adaptation with minimal state, striking the best balance for most use cases. |
| Session Affinity vs Load Distribution | IP hash and consistent hashing provide session affinity (the same client hits the same server), enabling server-side caching and session state. However, affinity can cause uneven distribution if some clients generate disproportionate traffic. The trade-off is between cache/session locality and even load distribution. |
| Global State vs Local Decisions | Least connections requires global state: the load balancer must know the active connection count for every server. In a distributed load balancing architecture (multiple LB instances), keeping this state synchronized adds complexity. P2C and random algorithms require only local state, making them more suitable for distributed LB architectures. |
| Active vs Passive Health Checks | Active health checks (periodic probes) detect failures proactively but add load to backend servers and may not reflect real request processing health. Passive checks (tracking actual request failures) detect failures based on real traffic but react more slowly -- several requests must fail before the server is removed. Production systems use both for comprehensive health monitoring. |
Envoy's Load Balancing at Lyft -- P2C with Zone-Aware Routing
Scenario
Lyft's microservice architecture had hundreds of services communicating through Envoy sidecar proxies. Simple round-robin caused problems: services in different availability zones had different latencies (cross-AZ traffic added 1-2ms), and some service instances were slower due to noisy neighbors, GC pauses, or cold JIT caches. Round-robin routed to these slow instances at the same rate as healthy ones, degrading tail latency for the entire service mesh.
Solution
Envoy implemented power-of-two-choices (P2C) with zone-aware routing and outlier detection. P2C picks two random backends and routes to the one with fewer active requests, naturally avoiding slow instances. Zone-aware routing preferentially selects backends in the same availability zone, reducing cross-AZ latency. Outlier detection (passive health checking) tracks error rates and latencies per backend, ejecting backends that exceed thresholds (e.g., >5% error rate or >3x median latency). Ejected backends are gradually reintroduced with a slow-start period.
Outcome
P2C reduced p99 latency by 25-40% compared to round-robin because slow instances received proportionally fewer requests. Zone-aware routing reduced cross-AZ traffic by 60%, saving latency and AWS data transfer costs. Outlier detection automatically removed problematic instances within seconds, without requiring manual intervention. The combination of P2C, zone awareness, and outlier detection became Envoy's recommended default and was adopted across the service mesh.
See Load Balancing Algorithms in action
Explore system design templates that use load balancing algorithms and run traffic simulations to see how these concepts perform under real load.
Browse Templates1Why does the least-connections algorithm perform better than round-robin for services with variable request durations?
2What makes power-of-two-choices (P2C) effective despite its simplicity?
3Why is consistent hashing preferred for load balancing a caching layer?