1What prevents a node with an incomplete log from becoming the Raft leader?
Raft is a consensus algorithm designed for understandability. It provides the same safety guarantees as Multi-Paxos but decomposes the problem into leader election, log replication, and safety -- making it significantly easier to implement correctly. Raft powers etcd, CockroachDB, TiKV, and Consul.
Raft was designed by Diego Ongaro and John Ousterhout at Stanford and published in their 2014 USENIX ATC paper 'In Search of an Understandable Consensus Algorithm.' The motivation was simple: Paxos is notoriously difficult to understand and implement, yet consensus is a foundational building block for distributed systems. Ongaro's user studies showed that students learned Raft significantly faster than Paxos and produced more correct implementations. Raft provides the same formal safety guarantees as Multi-Paxos -- agreement, validity, and termination under partial synchrony -- but achieves this through a more structured decomposition.
Raft breaks consensus into three orthogonal sub-problems. First, leader election: time is divided into terms (monotonically increasing integers), and each term has at most one leader. If a follower's election timeout expires without hearing from a leader, it increments the term, votes for itself, and requests votes from peers. A candidate wins by receiving votes from a majority. Second, log replication: the leader accepts client commands, appends them to its log, and sends AppendEntries RPCs to all followers. Once a majority acknowledge the entry, the leader commits it and responds to the client. Third, safety: Raft's key safety property ensures that if a log entry is committed in a given term, that entry will be present in the logs of all future leaders. This is enforced by the election restriction -- a candidate cannot win an election unless its log is at least as up-to-date as a majority of voters.
Log matching is central to Raft's correctness. Each log entry contains the term in which it was created and its index position. The leader includes the index and term of the immediately preceding entry in every AppendEntries RPC. If a follower's log does not match at that point, it rejects the RPC, and the leader decrements its nextIndex for that follower and retries. This mechanism ensures that all committed entries are identical across all servers, building a consistent replicated state machine from a sequence of commands.
Raft handles cluster membership changes through joint consensus -- a two-phase approach where the cluster transitions through a configuration that requires majorities from both the old and new configurations before committing to the new one alone. This prevents split-brain scenarios during reconfiguration. In practice, most implementations use single-server membership changes (adding or removing one node at a time), which is simpler and sufficient for most operational needs. Raft also specifies log compaction via snapshots: once the log grows large, the state machine takes a snapshot and discards all preceding log entries, reducing storage and speeding up new follower catch-up.
Three Servers Electing a Leader
Servers A, B, and C start as followers. Server A's election timer fires first (timers are randomized to avoid ties). A increments the term to 1, votes for itself, and sends RequestVote to B and C. Both B and C have not voted yet in term 1 and A's log is at least as up-to-date as theirs, so they grant their votes. A receives 3 votes (including its own self-vote) out of 3 -- a majority -- and becomes leader for term 1. A immediately sends an empty AppendEntries heartbeat to B and C, resetting their election timers. As long as heartbeats arrive before the timeout, A remains leader. If A crashes, B or C will time out and start a new election for term 2.
etcd (Kubernetes)
etcd is a distributed key-value store that uses Raft for consensus and serves as the brain of every Kubernetes cluster. All cluster state -- pods, services, deployments, config maps -- is stored in etcd and replicated via Raft. etcd typically runs as a 3- or 5-node cluster. The Raft leader handles all writes, and followers can serve linearizable reads by confirming with the leader that their state is up-to-date (ReadIndex). etcd's implementation of Raft is one of the most battle-tested in production.
CockroachDB
CockroachDB uses Raft to replicate individual ranges (data partitions) across nodes. Each range has its own Raft group, typically with 3 replicas. CockroachDB runs thousands of Raft groups per node using a multi-raft optimization that batches heartbeats and log entries across groups. Write operations are proposed to the range's Raft leader, replicated to a majority, and then applied to the local storage engine (Pebble). This per-range Raft approach allows CockroachDB to scale horizontally while maintaining serializable isolation.
HashiCorp Consul
Consul uses Raft to replicate its service catalog, health check state, and KV store across servers in a data center. The Raft leader handles all writes and can optionally serve stale reads from followers for lower latency. Consul's Raft implementation includes automatic snapshots for log compaction and supports online cluster resizing. Consul typically runs 3 or 5 server nodes, with thousands of agent nodes forwarding requests to the Raft cluster.
| Aspect | Description |
|---|---|
| Understandability vs Flexibility | Raft trades protocol flexibility for clarity. By mandating a single leader and sequential log replication, it eliminates the ambiguity that makes Paxos implementations diverge. However, this structure means Raft has fewer optimization paths -- multi-leader or leaderless variants are outside Raft's design space. EPaxos, for example, can commit non-conflicting commands without a designated leader, achieving lower latency in geo-distributed deployments. |
| Leader Bottleneck | All writes must flow through the leader, making it a throughput bottleneck and a single point of failure (until re-election). In CockroachDB, this is mitigated by having a separate Raft group per range, distributing leadership across the cluster. For single Raft groups (like etcd), the leader's network bandwidth and disk I/O limit overall write throughput. |
| Election Timeout Tuning | The election timeout must be significantly longer than the heartbeat interval (typically 10x) but short enough to detect leader failures quickly. Too short: frequent unnecessary elections causing write stalls. Too long: extended downtime after a real leader failure. The optimal value depends on network latency, disk speed, and operational tolerance for downtime. |
| Linearizable Reads Cost | Serving linearizable reads from followers requires either contacting the leader (ReadIndex) or using a lease-based approach. ReadIndex adds one network round trip to every read. Lease-based reads avoid the network hop but depend on clock synchronization -- if clocks drift beyond the lease duration, stale reads become possible. Most systems offer configurable read consistency levels. |
etcd and the Kubernetes Control Plane
Scenario
Kubernetes stores all cluster state in etcd, including pod schedules, service endpoints, secrets, and config maps. A 5,000-node Kubernetes cluster generates thousands of watch events per second and requires fast, consistent reads for the API server. etcd must handle this workload while surviving node failures, network partitions, and data center maintenance without losing any committed state.
Solution
etcd runs a 5-node Raft cluster spread across failure domains (racks or zones). Writes go to the Raft leader, are replicated to a quorum (3 of 5), and committed. The leader's commit index advances, and all followers apply entries in order. For reads, etcd supports both linearizable reads (via ReadIndex, confirming leadership before responding) and serializable reads (served directly from any node, potentially stale). Kubernetes API servers typically use linearizable reads for critical state and serializable reads for caching. etcd compacts the Raft log via periodic snapshots, keeping disk usage bounded.
Outcome
etcd reliably serves Kubernetes clusters with thousands of nodes, handling sustained write throughputs of 10,000+ transactions per second. Leader elections complete within 1-3 seconds, causing brief write pauses but no data loss. The combination of Raft's simplicity and etcd's mature implementation makes it the de facto standard for Kubernetes state storage. etcd's Raft library has been extracted and reused in TiKV, Dgraph, and other systems.
See Raft Consensus Algorithm in action
Explore system design templates that use raft consensus algorithm and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What prevents a node with an incomplete log from becoming the Raft leader?
2Why should Raft clusters use an odd number of nodes?