1What is the primary safety concern during a leader election transition?
Leader election is the process of designating a single node in a distributed system as the coordinator for a specific task. It is fundamental to many distributed algorithms -- from database replication to job scheduling -- and can be implemented via consensus protocols, lease-based mechanisms, or distributed lock services.
Leader election is one of the most common coordination tasks in distributed systems. A leader (also called master, primary, or coordinator) is a single node designated to perform a specific role -- accepting writes in a replicated database, scheduling tasks in a job system, or owning a partition in a stream processor. The leader pattern simplifies distributed protocols by converting a distributed consensus problem into a centralized decision-making one, at the cost of a single point of failure that must be quickly replaced.
There are several approaches to leader election. Consensus-based election uses Raft or Paxos, where the election mechanism is built into the protocol -- Raft's term-based voting and Paxos's ballot numbers both produce exactly one leader per round. External coordination services like ZooKeeper (ephemeral sequential znodes), etcd (lease-based), and Consul (session-based) provide leader election as a higher-level primitive: clients create ephemeral nodes or leases, and the client with the lowest sequence number or active lease is the leader. Simpler algorithms like the bully algorithm work in synchronous networks where timeout bounds are known.
The critical safety requirement is avoiding split-brain -- a situation where two nodes simultaneously believe they are the leader. Split-brain can cause data corruption (two leaders accepting conflicting writes), duplicate processing (two leaders executing the same job), and resource contention. Split-brain typically occurs during network partitions: the old leader is alive but unreachable, and a new leader is elected on the other side of the partition. Preventing split-brain requires either consensus (majority agreement) or fencing (the new leader invalidates the old leader's access to shared resources).
Liveness is equally important: after a leader failure, a new leader must be elected quickly to minimize downtime. The detection-to-election time depends on the heartbeat interval, election timeout, and the number of election rounds. In Raft, a typical failover takes 1-5 seconds (one election timeout plus one round of voting). ZooKeeper-based election is faster if the next-in-line client is watching the leader's ephemeral znode -- it detects deletion immediately and acquires leadership within the session timeout. Balancing fast detection (short timeouts) with stability (avoiding false positives) is the key operational trade-off.
Three Database Replicas Electing a Primary
Replicas A, B, and C form a cluster. A is the current leader, accepting all writes. A's heartbeat fails (A crashes or is partitioned). B and C detect the missing heartbeat after 5 seconds. B starts an election, increments the term, and asks C for a vote. C votes for B (B's log is up-to-date). B wins with 2 votes (itself + C) out of 3 -- a majority. B becomes the new leader and begins accepting writes. If A recovers, it discovers B's higher term and steps down to follower. At no point do both A and B act as leader in the same term.
Apache Kafka (Partition Leaders)
Kafka uses leader election for every topic partition. One broker is elected as the partition leader, handling all reads and writes for that partition. Followers replicate the leader's log. If the leader fails, the Kafka controller (itself elected via ZooKeeper or KRaft) selects a new leader from the in-sync replicas (ISR). Kafka's KRaft mode (replacing ZooKeeper) uses Raft for both controller election and metadata management, providing faster failover and eliminating the ZooKeeper dependency.
Kubernetes (Controller Leader Election)
Kubernetes controllers (scheduler, controller-manager) use leader election to ensure exactly one active instance. Multiple replicas run for high availability, but only the leader actively reconciles state. Leader election uses etcd leases: the leader creates a Lease object and renews it periodically (default: every 10s with a 15s TTL). If the leader pod crashes, the lease expires and another replica acquires it. The election is visible via the 'leases' API resource.
Google Chubby
Chubby provides leader election as a primary use case. A group of processes compete to acquire a Chubby lock (backed by Paxos consensus). The winner holds the lock and acts as leader. Other processes watch the lock and attempt acquisition when it is released. Chubby's design guarantees that the lock is held by exactly one client at a time, and the sequencer mechanism (similar to fencing tokens) prevents stale leaders from acting on shared state.
| Aspect | Description |
|---|---|
| Failover Speed vs False Positives | Shorter heartbeat intervals and election timeouts enable faster failover but increase the likelihood of unnecessary elections during transient network issues. An unnecessary election stalls writes for the election duration and may trigger cascading effects (e.g., client reconnections). Raft recommends broadcastTime << electionTimeout << MTBF. |
| Single Leader vs Multi-Leader | A single leader simplifies write ordering but creates a throughput bottleneck and a single point of failure. Multi-leader (or leaderless) architectures eliminate the bottleneck but require conflict resolution for concurrent writes. The choice depends on whether write throughput or consistency is the primary constraint. |
| Embedded vs External Election | Embedding Raft into the application provides low-latency election without external dependencies but increases application complexity and requires careful testing. Using ZooKeeper or etcd as an external coordinator provides a proven implementation but adds an operational dependency and network hop. |
| Lease Duration | A short lease (5s) minimizes downtime after leader failure but requires frequent renewals and is sensitive to network jitter. A long lease (60s) is more stable but blocks progress for up to 60 seconds when the leader genuinely fails. Most systems use 10-30s leases as a compromise. |
Kafka's KRaft: From ZooKeeper to Self-Managed Leader Election
Scenario
Apache Kafka historically depended on ZooKeeper for controller election, broker registration, and topic metadata management. This external dependency added operational complexity, limited scalability to ~200,000 partitions per cluster, and created a bottleneck where all metadata changes had to pass through ZooKeeper's write path. The Kafka community needed a way to self-manage leader election and metadata.
Solution
KRaft (Kafka Raft) replaced ZooKeeper with an embedded Raft consensus protocol for Kafka's controller quorum. A set of controller nodes (typically 3 or 5) run Raft to replicate a metadata log. The Raft leader serves as the active controller, managing partition assignments, broker registration, and topic creation. Follower controllers maintain a complete copy of the metadata log and can take over immediately if the leader fails.
Outcome
KRaft reduced controller failover time from ~30 seconds (ZooKeeper session timeout) to ~5 seconds (Raft election timeout). It eliminated the ZooKeeper operational dependency, simplifying Kafka deployments. Metadata scalability improved to millions of partitions per cluster because metadata operations no longer bottlenecked through ZooKeeper's relatively slow consensus. KRaft became the default in Kafka 3.6 (2024), with ZooKeeper support deprecated.
See Leader Election in action
Explore system design templates that use leader election and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is the primary safety concern during a leader election transition?
2In Raft, what prevents a node with an incomplete log from being elected leader?