Vetora logo
👑Consensus & Coordination

Leader Election

Leader election is the process of designating a single node in a distributed system as the coordinator for a specific task. It is fundamental to many distributed algorithms -- from database replication to job scheduling -- and can be implemented via consensus protocols, lease-based mechanisms, or distributed lock services.

Overview

Leader election is one of the most common coordination tasks in distributed systems. A leader (also called master, primary, or coordinator) is a single node designated to perform a specific role -- accepting writes in a replicated database, scheduling tasks in a job system, or owning a partition in a stream processor. The leader pattern simplifies distributed protocols by converting a distributed consensus problem into a centralized decision-making one, at the cost of a single point of failure that must be quickly replaced.

There are several approaches to leader election. Consensus-based election uses Raft or Paxos, where the election mechanism is built into the protocol -- Raft's term-based voting and Paxos's ballot numbers both produce exactly one leader per round. External coordination services like ZooKeeper (ephemeral sequential znodes), etcd (lease-based), and Consul (session-based) provide leader election as a higher-level primitive: clients create ephemeral nodes or leases, and the client with the lowest sequence number or active lease is the leader. Simpler algorithms like the bully algorithm work in synchronous networks where timeout bounds are known.

The critical safety requirement is avoiding split-brain -- a situation where two nodes simultaneously believe they are the leader. Split-brain can cause data corruption (two leaders accepting conflicting writes), duplicate processing (two leaders executing the same job), and resource contention. Split-brain typically occurs during network partitions: the old leader is alive but unreachable, and a new leader is elected on the other side of the partition. Preventing split-brain requires either consensus (majority agreement) or fencing (the new leader invalidates the old leader's access to shared resources).

Liveness is equally important: after a leader failure, a new leader must be elected quickly to minimize downtime. The detection-to-election time depends on the heartbeat interval, election timeout, and the number of election rounds. In Raft, a typical failover takes 1-5 seconds (one election timeout plus one round of voting). ZooKeeper-based election is faster if the next-in-line client is watching the leader's ephemeral znode -- it detects deletion immediately and acquires leadership within the session timeout. Balancing fast detection (short timeouts) with stability (avoiding false positives) is the key operational trade-off.

Key Points
  • 1Leader election reduces distributed problems to centralized ones. Instead of N nodes deciding what to do, one leader decides and the others follow. This simplifies write ordering, scheduling, and coordination at the cost of a single point of failure.
  • 2Split-brain is the primary safety hazard. Two nodes acting as leader simultaneously can corrupt data. Preventing split-brain requires majority-based consensus, fencing tokens, or lease-based expiration that is strictly shorter than the detection timeout.
  • 3Raft and Paxos provide built-in leader election via their consensus mechanisms. External systems (ZooKeeper, etcd, Consul) offer leader election as a service, abstracting the consensus protocol.
  • 4Leader failure detection depends on heartbeat intervals and timeouts. A 10-second election timeout means up to 10 seconds of unavailability after a leader crash. Shorter timeouts risk false positives during network hiccups.
  • 5Lease-based election ties leadership to a time-limited lease that must be renewed. If the leader fails to renew (crash or partition), the lease expires and another node can acquire it. This provides automatic failover without explicit 'step-down' messages.
  • 6The choice between embedded election (Raft in the application) and external election (ZooKeeper/etcd sidecar) depends on operational preferences. Embedded election reduces dependencies; external election provides a proven, tested implementation.
Simple Example

Three Database Replicas Electing a Primary

Replicas A, B, and C form a cluster. A is the current leader, accepting all writes. A's heartbeat fails (A crashes or is partitioned). B and C detect the missing heartbeat after 5 seconds. B starts an election, increments the term, and asks C for a vote. C votes for B (B's log is up-to-date). B wins with 2 votes (itself + C) out of 3 -- a majority. B becomes the new leader and begins accepting writes. If A recovers, it discovers B's higher term and steps down to follower. At no point do both A and B act as leader in the same term.

Real-World Examples

Apache Kafka (Partition Leaders)

Kafka uses leader election for every topic partition. One broker is elected as the partition leader, handling all reads and writes for that partition. Followers replicate the leader's log. If the leader fails, the Kafka controller (itself elected via ZooKeeper or KRaft) selects a new leader from the in-sync replicas (ISR). Kafka's KRaft mode (replacing ZooKeeper) uses Raft for both controller election and metadata management, providing faster failover and eliminating the ZooKeeper dependency.

Kubernetes (Controller Leader Election)

Kubernetes controllers (scheduler, controller-manager) use leader election to ensure exactly one active instance. Multiple replicas run for high availability, but only the leader actively reconciles state. Leader election uses etcd leases: the leader creates a Lease object and renews it periodically (default: every 10s with a 15s TTL). If the leader pod crashes, the lease expires and another replica acquires it. The election is visible via the 'leases' API resource.

Google Chubby

Chubby provides leader election as a primary use case. A group of processes compete to acquire a Chubby lock (backed by Paxos consensus). The winner holds the lock and acts as leader. Other processes watch the lock and attempt acquisition when it is released. Chubby's design guarantees that the lock is held by exactly one client at a time, and the sequencer mechanism (similar to fencing tokens) prevents stale leaders from acting on shared state.

Trade-Offs
AspectDescription
Failover Speed vs False PositivesShorter heartbeat intervals and election timeouts enable faster failover but increase the likelihood of unnecessary elections during transient network issues. An unnecessary election stalls writes for the election duration and may trigger cascading effects (e.g., client reconnections). Raft recommends broadcastTime << electionTimeout << MTBF.
Single Leader vs Multi-LeaderA single leader simplifies write ordering but creates a throughput bottleneck and a single point of failure. Multi-leader (or leaderless) architectures eliminate the bottleneck but require conflict resolution for concurrent writes. The choice depends on whether write throughput or consistency is the primary constraint.
Embedded vs External ElectionEmbedding Raft into the application provides low-latency election without external dependencies but increases application complexity and requires careful testing. Using ZooKeeper or etcd as an external coordinator provides a proven implementation but adds an operational dependency and network hop.
Lease DurationA short lease (5s) minimizes downtime after leader failure but requires frequent renewals and is sensitive to network jitter. A long lease (60s) is more stable but blocks progress for up to 60 seconds when the leader genuinely fails. Most systems use 10-30s leases as a compromise.
Case Study

Kafka's KRaft: From ZooKeeper to Self-Managed Leader Election

Scenario

Apache Kafka historically depended on ZooKeeper for controller election, broker registration, and topic metadata management. This external dependency added operational complexity, limited scalability to ~200,000 partitions per cluster, and created a bottleneck where all metadata changes had to pass through ZooKeeper's write path. The Kafka community needed a way to self-manage leader election and metadata.

Solution

KRaft (Kafka Raft) replaced ZooKeeper with an embedded Raft consensus protocol for Kafka's controller quorum. A set of controller nodes (typically 3 or 5) run Raft to replicate a metadata log. The Raft leader serves as the active controller, managing partition assignments, broker registration, and topic creation. Follower controllers maintain a complete copy of the metadata log and can take over immediately if the leader fails.

Outcome

KRaft reduced controller failover time from ~30 seconds (ZooKeeper session timeout) to ~5 seconds (Raft election timeout). It eliminated the ZooKeeper operational dependency, simplifying Kafka deployments. Metadata scalability improved to millions of partitions per cluster because metadata operations no longer bottlenecked through ZooKeeper's relatively slow consensus. KRaft became the default in Kafka 3.6 (2024), with ZooKeeper support deprecated.

Common Mistakes
  • Assuming the leader is always available. A leader can crash, partition, or slow down at any time. Every component that talks to a leader must handle leader changes gracefully -- retrying requests, refreshing leader addresses, and handling 'not leader' errors.
  • Not implementing fencing. During a leader transition, the old leader may still be processing requests while the new leader starts accepting writes. Without fencing (epoch numbers, lease invalidation), both can modify the same data concurrently.
  • Using a timestamp-based election instead of consensus. Choosing the leader as 'the node with the latest timestamp' fails because clock skew can cause two nodes to both believe they have the latest timestamp. Always use majority-based voting or a coordination service.
  • Placing the election mechanism in a single point of failure. If you use a single Redis instance for leader election, the Redis instance becomes the actual single point of failure. Use a consensus-backed service (etcd, ZooKeeper) or an embedded protocol (Raft).
Related Concepts

See Leader Election in action

Explore system design templates that use leader election and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Trigger leader failures and observe re-election

Metrics to watch
failover_time_mssplit_brain_eventswrite_availability_pctelection_count
Run Simulation
Test Your Understanding

1What is the primary safety concern during a leader election transition?

2In Raft, what prevents a node with an incomplete log from being elected leader?

Deeper Reading