Vetora logo
⚖️Networking & Protocols

Load Balancing Algorithms

Load balancing algorithms determine how traffic is distributed across backend servers. From simple round-robin to sophisticated power-of-two-choices and EWMA, the algorithm choice directly impacts latency, throughput, and fault tolerance.

Overview

Load balancing algorithms sit at the heart of every scalable system, determining how incoming requests are distributed across a pool of backend servers. The choice of algorithm directly impacts tail latency, throughput, and fault tolerance. A poor algorithm can create hot spots (overloading some servers while others are idle), amplify failures (sending traffic to unhealthy servers), or cause unnecessary cache misses (routing the same user to different servers, fragmenting cached state). Understanding the trade-offs between algorithms is essential for designing systems that perform well under both normal load and failure conditions.

The simplest algorithm is round-robin: requests are distributed to servers in sequential order (server 1, server 2, server 3, server 1, ...). Round-robin is easy to implement and provides perfectly even distribution when all servers are identical and all requests take the same time. In practice, neither assumption holds: servers may have different CPU/memory capacities, and request processing times vary dramatically (a cache hit takes 1ms, a cache miss with database query takes 100ms). Weighted round-robin addresses capacity differences by assigning each server a weight proportional to its capacity (a 16-core server gets weight 4, a 4-core server gets weight 1), but it still does not account for varying request complexity or transient server slowness.

Least connections is one of the most effective algorithms for handling heterogeneous request processing times. It routes each new request to the server with the fewest currently active (in-flight) connections. A server that is processing a slow request accumulates connections, so the algorithm naturally routes away from it. This makes least connections self-correcting: slow servers receive less traffic, fast servers receive more. IP hash produces sticky sessions by hashing the client IP address to a specific server, ensuring the same client always reaches the same server. This is useful for preserving session state and cache locality but can cause uneven distribution if some client IPs generate disproportionate traffic. Consistent hashing extends this idea with minimal disruption: when a server is added or removed, only a fraction of requests are redistributed (approximately 1/N for N servers), making it ideal for caching layers where redistributing requests means cache misses.

Modern algorithms like power-of-two-choices (P2C) and EWMA represent the state of the art. P2C randomly selects two servers from the pool and routes the request to the one with fewer active connections. Despite its simplicity, P2C achieves exponentially better load distribution than random selection -- the maximum load on any server is O(log log N) instead of O(log N / log log N). P2C is used by HAProxy, Envoy, and many internal load balancers at Google. EWMA (Exponentially Weighted Moving Average) tracks each server's recent response latency using a weighted average that emphasizes recent measurements. Requests are routed to the server with the lowest EWMA latency, automatically avoiding servers experiencing transient slowness (GC pauses, disk I/O spikes). The combination of P2C with EWMA -- picking two random servers and choosing the one with lower EWMA latency -- is particularly effective and is the default algorithm in Linkerd.

Key Points
  • 1Round-robin distributes requests evenly by cycling through servers sequentially. It works well when servers are homogeneous and request processing times are uniform, but it ignores actual server load and capacity differences.
  • 2Least connections routes to the server with the fewest active connections, naturally adapting to servers with varying processing speeds. A slow server accumulates connections and receives fewer new requests, making the algorithm self-correcting.
  • 3Consistent hashing maps both servers and requests to a hash ring, minimizing redistribution when servers are added or removed (only ~1/N of requests remap). This is critical for caching layers where remapping means cache misses.
  • 4Power-of-two-choices (P2C) picks two random servers and routes to the less loaded one. This achieves exponentially better load distribution than pure random (O(log log N) max load) while requiring only O(1) state -- no global connection counts needed.
  • 5EWMA tracks each server's recent response latency with exponential decay, automatically routing away from servers experiencing transient performance degradation. Combining P2C with EWMA (Linkerd's default) is the current state of the art.
  • 6Health checks are essential regardless of algorithm. Active health checks probe servers periodically (HTTP GET /health). Passive health checks track request failures and remove servers that exceed an error threshold. Both are needed for production reliability.
Simple Example

The Grocery Store Checkout Analogy

Imagine arriving at a grocery store with 5 checkout lanes. Round-robin is like a greeter directing each new customer to the next lane in sequence, regardless of how long each lane's line is. Least connections is like choosing the lane with the fewest people currently in line. IP hash is like always going to 'your' lane based on your loyalty card number -- good for building a relationship with your cashier but bad if your lane is slow. Power-of-two-choices is like looking at two random lanes and picking the shorter one -- surprisingly effective because you almost never end up in the longest line. EWMA is like tracking how fast each cashier has been scanning items in the last few minutes and choosing the fastest one.

Real-World Examples

Nginx

Nginx supports multiple load balancing algorithms: round-robin (default), least_conn (least connections), ip_hash (sticky sessions by client IP), and hash (consistent hashing on arbitrary keys). The 'least_conn' directive is recommended for applications with varying request durations (e.g., long-polling, WebSocket mixed with short HTTP requests). Nginx also supports server weights and slow-start (gradually increasing traffic to a newly added server).

HAProxy

HAProxy is a high-performance load balancer supporting sophisticated algorithms including roundrobin, leastconn, source (IP hash), and 'random(2)' which implements power-of-two-choices. HAProxy's health checking is particularly advanced: it supports HTTP health checks with status code matching, TCP checks, custom scripts, and 'observed' passive health tracking that removes servers based on real traffic error rates. HAProxy can handle millions of concurrent connections with sub-millisecond decision latency.

AWS ALB

AWS Application Load Balancer uses round-robin by default but supports Least Outstanding Requests (LOR) routing, which routes to the target with the fewest in-flight requests (similar to least connections but at the request level rather than TCP connection level). ALB also implements slow start mode, gradually ramping traffic to new targets over a configurable period (30-900 seconds) to allow warm-up of caches and JIT compilation before receiving full load.

Trade-Offs
AspectDescription
Simplicity vs AdaptabilityRound-robin is trivial to implement with zero per-request computation but does not adapt to varying server load or request complexity. Least connections adapts automatically but requires tracking active connection counts for each server. P2C provides near-optimal adaptation with minimal state, striking the best balance for most use cases.
Session Affinity vs Load DistributionIP hash and consistent hashing provide session affinity (the same client hits the same server), enabling server-side caching and session state. However, affinity can cause uneven distribution if some clients generate disproportionate traffic. The trade-off is between cache/session locality and even load distribution.
Global State vs Local DecisionsLeast connections requires global state: the load balancer must know the active connection count for every server. In a distributed load balancing architecture (multiple LB instances), keeping this state synchronized adds complexity. P2C and random algorithms require only local state, making them more suitable for distributed LB architectures.
Active vs Passive Health ChecksActive health checks (periodic probes) detect failures proactively but add load to backend servers and may not reflect real request processing health. Passive checks (tracking actual request failures) detect failures based on real traffic but react more slowly -- several requests must fail before the server is removed. Production systems use both for comprehensive health monitoring.
Case Study

Envoy's Load Balancing at Lyft -- P2C with Zone-Aware Routing

Scenario

Lyft's microservice architecture had hundreds of services communicating through Envoy sidecar proxies. Simple round-robin caused problems: services in different availability zones had different latencies (cross-AZ traffic added 1-2ms), and some service instances were slower due to noisy neighbors, GC pauses, or cold JIT caches. Round-robin routed to these slow instances at the same rate as healthy ones, degrading tail latency for the entire service mesh.

Solution

Envoy implemented power-of-two-choices (P2C) with zone-aware routing and outlier detection. P2C picks two random backends and routes to the one with fewer active requests, naturally avoiding slow instances. Zone-aware routing preferentially selects backends in the same availability zone, reducing cross-AZ latency. Outlier detection (passive health checking) tracks error rates and latencies per backend, ejecting backends that exceed thresholds (e.g., >5% error rate or >3x median latency). Ejected backends are gradually reintroduced with a slow-start period.

Outcome

P2C reduced p99 latency by 25-40% compared to round-robin because slow instances received proportionally fewer requests. Zone-aware routing reduced cross-AZ traffic by 60%, saving latency and AWS data transfer costs. Outlier detection automatically removed problematic instances within seconds, without requiring manual intervention. The combination of P2C, zone awareness, and outlier detection became Envoy's recommended default and was adopted across the service mesh.

Common Mistakes
  • Using round-robin for services with highly variable request processing times. If some requests take 1ms and others take 500ms, round-robin sends the same number of requests to a server processing a slow request, causing connection queue buildup. Least connections or P2C adapts automatically.
  • Not implementing slow start for newly added servers. A new server with cold caches and un-JITted code processes requests slower than warmed-up servers. Routing full traffic to it immediately causes latency spikes. Gradually ramp traffic over 30-120 seconds.
  • Relying solely on active health checks with infrequent intervals (e.g., every 30 seconds). A server can become unhealthy between checks, and all requests during that interval fail. Combine active checks with passive failure tracking for faster detection.
  • Using IP hash for load balancing behind a NAT or proxy. If many clients share the same source IP (corporate NAT, VPN exit node), all traffic goes to a single backend, creating a hot spot. Use cookie-based affinity or a more granular hash key instead.
Related Concepts

See Load Balancing Algorithms in action

Explore system design templates that use load balancing algorithms and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Compare round-robin vs least-connections under bursty traffic

Metrics to watch
p99_latency_msserver_utilization_pctthroughput_rpserror_rate_pct
Run Simulation
Test Your Understanding

1Why does the least-connections algorithm perform better than round-robin for services with variable request durations?

2What makes power-of-two-choices (P2C) effective despite its simplicity?

3Why is consistent hashing preferred for load balancing a caching layer?

Deeper Reading