What is important about Raft Consensus Algorithm regarding "Raft divides time into terms. Each term has at most one lead..."?

Raft divides time into terms. Each term has at most one leader. Terms act as logical clocks -- if a node receives a message with a higher term, it immediately steps down to follower and updates its term.

What is important about Raft Consensus Algorithm regarding "Leader election requires a majority vote. A candidate's Requ..."?

Leader election requires a majority vote. A candidate's RequestVote RPC includes its last log index and term; voters only grant their vote if the candidate's log is at least as up-to-date as their own. This election restriction is the key to Raft's safety.

What is important about Raft Consensus Algorithm regarding "Log replication is leader-driven. The leader appends entries..."?

Log replication is leader-driven. The leader appends entries to its log, sends AppendEntries to followers, and commits once a majority acknowledge. Followers never initiate log changes -- they only apply entries the leader sends.

What is important about Raft Consensus Algorithm regarding "Committed means durable. Once the leader confirms a log entr..."?

Committed means durable. Once the leader confirms a log entry is replicated to a majority, it is committed and will survive any subsequent leader election. The new leader is guaranteed to contain all committed entries.

What is important about Raft Consensus Algorithm regarding "Raft's Log Matching Property states that if two logs contain..."?

Raft's Log Matching Property states that if two logs contain an entry with the same index and term, all preceding entries are identical. This is maintained by the AppendEntries consistency check and simplifies reasoning about log divergence.

What is important about Raft Consensus Algorithm regarding "Cluster membership changes use joint consensus or single-ser..."?

Cluster membership changes use joint consensus or single-server changes. Both approaches prevent the possibility of two independent majorities (split-brain) during the transition between configurations.

Vetora

🛶Consensus & Coordination

Raft Consensus Algorithm

Raft is a consensus algorithm designed for understandability. It provides the same safety guarantees as Multi-Paxos but decomposes the problem into leader election, log replication, and safety -- making it significantly easier to implement correctly. Raft powers etcd, CockroachDB, TiKV, and Consul.

Overview

Raft was designed by Diego Ongaro and John Ousterhout at Stanford and published in their 2014 USENIX ATC paper 'In Search of an Understandable Consensus Algorithm.' The motivation was simple: Paxos is notoriously difficult to understand and implement, yet consensus is a foundational building block for distributed systems. Ongaro's user studies showed that students learned Raft significantly faster than Paxos and produced more correct implementations. Raft provides the same formal safety guarantees as Multi-Paxos -- agreement, validity, and termination under partial synchrony -- but achieves this through a more structured decomposition.

Raft breaks consensus into three orthogonal sub-problems. First, leader election: time is divided into terms (monotonically increasing integers), and each term has at most one leader. If a follower's election timeout expires without hearing from a leader, it increments the term, votes for itself, and requests votes from peers. A candidate wins by receiving votes from a majority. Second, log replication: the leader accepts client commands, appends them to its log, and sends AppendEntries RPCs to all followers. Once a majority acknowledge the entry, the leader commits it and responds to the client. Third, safety: Raft's key safety property ensures that if a log entry is committed in a given term, that entry will be present in the logs of all future leaders. This is enforced by the election restriction -- a candidate cannot win an election unless its log is at least as up-to-date as a majority of voters.

Log matching is central to Raft's correctness. Each log entry contains the term in which it was created and its index position. The leader includes the index and term of the immediately preceding entry in every AppendEntries RPC. If a follower's log does not match at that point, it rejects the RPC, and the leader decrements its nextIndex for that follower and retries. This mechanism ensures that all committed entries are identical across all servers, building a consistent replicated state machine from a sequence of commands.

Raft handles cluster membership changes through joint consensus -- a two-phase approach where the cluster transitions through a configuration that requires majorities from both the old and new configurations before committing to the new one alone. This prevents split-brain scenarios during reconfiguration. In practice, most implementations use single-server membership changes (adding or removing one node at a time), which is simpler and sufficient for most operational needs. Raft also specifies log compaction via snapshots: once the log grows large, the state machine takes a snapshot and discards all preceding log entries, reducing storage and speeding up new follower catch-up.

Key Points

1Raft divides time into terms. Each term has at most one leader. Terms act as logical clocks -- if a node receives a message with a higher term, it immediately steps down to follower and updates its term.
2Leader election requires a majority vote. A candidate's RequestVote RPC includes its last log index and term; voters only grant their vote if the candidate's log is at least as up-to-date as their own. This election restriction is the key to Raft's safety.
3Log replication is leader-driven. The leader appends entries to its log, sends AppendEntries to followers, and commits once a majority acknowledge. Followers never initiate log changes -- they only apply entries the leader sends.
4Committed means durable. Once the leader confirms a log entry is replicated to a majority, it is committed and will survive any subsequent leader election. The new leader is guaranteed to contain all committed entries.
5Raft's Log Matching Property states that if two logs contain an entry with the same index and term, all preceding entries are identical. This is maintained by the AppendEntries consistency check and simplifies reasoning about log divergence.
6Cluster membership changes use joint consensus or single-server changes. Both approaches prevent the possibility of two independent majorities (split-brain) during the transition between configurations.

Simple Example

Three Servers Electing a Leader

Servers A, B, and C start as followers. Server A's election timer fires first (timers are randomized to avoid ties). A increments the term to 1, votes for itself, and sends RequestVote to B and C. Both B and C have not voted yet in term 1 and A's log is at least as up-to-date as theirs, so they grant their votes. A receives 3 votes (including its own self-vote) out of 3 -- a majority -- and becomes leader for term 1. A immediately sends an empty AppendEntries heartbeat to B and C, resetting their election timers. As long as heartbeats arrive before the timeout, A remains leader. If A crashes, B or C will time out and start a new election for term 2.

Real-World Examples

etcd (Kubernetes)

etcd is a distributed key-value store that uses Raft for consensus and serves as the brain of every Kubernetes cluster. All cluster state -- pods, services, deployments, config maps -- is stored in etcd and replicated via Raft. etcd typically runs as a 3- or 5-node cluster. The Raft leader handles all writes, and followers can serve linearizable reads by confirming with the leader that their state is up-to-date (ReadIndex). etcd's implementation of Raft is one of the most battle-tested in production.

CockroachDB

CockroachDB uses Raft to replicate individual ranges (data partitions) across nodes. Each range has its own Raft group, typically with 3 replicas. CockroachDB runs thousands of Raft groups per node using a multi-raft optimization that batches heartbeats and log entries across groups. Write operations are proposed to the range's Raft leader, replicated to a majority, and then applied to the local storage engine (Pebble). This per-range Raft approach allows CockroachDB to scale horizontally while maintaining serializable isolation.

HashiCorp Consul

Consul uses Raft to replicate its service catalog, health check state, and KV store across servers in a data center. The Raft leader handles all writes and can optionally serve stale reads from followers for lower latency. Consul's Raft implementation includes automatic snapshots for log compaction and supports online cluster resizing. Consul typically runs 3 or 5 server nodes, with thousands of agent nodes forwarding requests to the Raft cluster.

Trade-Offs

Aspect	Description
Understandability vs Flexibility	Raft trades protocol flexibility for clarity. By mandating a single leader and sequential log replication, it eliminates the ambiguity that makes Paxos implementations diverge. However, this structure means Raft has fewer optimization paths -- multi-leader or leaderless variants are outside Raft's design space. EPaxos, for example, can commit non-conflicting commands without a designated leader, achieving lower latency in geo-distributed deployments.
Leader Bottleneck	All writes must flow through the leader, making it a throughput bottleneck and a single point of failure (until re-election). In CockroachDB, this is mitigated by having a separate Raft group per range, distributing leadership across the cluster. For single Raft groups (like etcd), the leader's network bandwidth and disk I/O limit overall write throughput.
Election Timeout Tuning	The election timeout must be significantly longer than the heartbeat interval (typically 10x) but short enough to detect leader failures quickly. Too short: frequent unnecessary elections causing write stalls. Too long: extended downtime after a real leader failure. The optimal value depends on network latency, disk speed, and operational tolerance for downtime.
Linearizable Reads Cost	Serving linearizable reads from followers requires either contacting the leader (ReadIndex) or using a lease-based approach. ReadIndex adds one network round trip to every read. Lease-based reads avoid the network hop but depend on clock synchronization -- if clocks drift beyond the lease duration, stale reads become possible. Most systems offer configurable read consistency levels.

Case Study

etcd and the Kubernetes Control Plane

Scenario

Kubernetes stores all cluster state in etcd, including pod schedules, service endpoints, secrets, and config maps. A 5,000-node Kubernetes cluster generates thousands of watch events per second and requires fast, consistent reads for the API server. etcd must handle this workload while surviving node failures, network partitions, and data center maintenance without losing any committed state.

Solution

etcd runs a 5-node Raft cluster spread across failure domains (racks or zones). Writes go to the Raft leader, are replicated to a quorum (3 of 5), and committed. The leader's commit index advances, and all followers apply entries in order. For reads, etcd supports both linearizable reads (via ReadIndex, confirming leadership before responding) and serializable reads (served directly from any node, potentially stale). Kubernetes API servers typically use linearizable reads for critical state and serializable reads for caching. etcd compacts the Raft log via periodic snapshots, keeping disk usage bounded.

Outcome

etcd reliably serves Kubernetes clusters with thousands of nodes, handling sustained write throughputs of 10,000+ transactions per second. Leader elections complete within 1-3 seconds, causing brief write pauses but no data loss. The combination of Raft's simplicity and etcd's mature implementation makes it the de facto standard for Kubernetes state storage. etcd's Raft library has been extracted and reused in TiKV, Dgraph, and other systems.

Common Mistakes

⚠Running even-numbered Raft clusters (e.g., 4 nodes). A 4-node cluster tolerates only 1 failure (same as 3 nodes) because a majority of 4 is 3. You pay the cost of replicating to a 4th node without improving fault tolerance. Always use odd numbers: 3, 5, or 7.
⚠Setting election timeouts too low. If the timeout is close to the network round-trip time, transient packet delays cause unnecessary leader elections, which stall writes for the entire election duration. A common guideline is: broadcastTime << electionTimeout << MTBF.
⚠Assuming all Raft reads are linearizable by default. Many implementations (including etcd) offer stale/serializable reads for performance. If your application requires linearizable reads, you must explicitly configure them.
⚠Ignoring log compaction. Without snapshotting, the Raft log grows unboundedly, consuming disk space and slowing follower catch-up. Production systems must implement periodic snapshots and log truncation.

Related Concepts

Paxos Leader Election Linearizability Quorums Leader-Follower Replication Service Discovery

See Raft Consensus Algorithm in action

Explore system design templates that use raft consensus algorithm and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Watch Raft leader election maintain counter consistency

Metrics to watch

election_time_mslog_replication_lag_mswrite_throughput_rpsavailability_pct

Run Simulation

Test Your Understanding

1What prevents a node with an incomplete log from becoming the Raft leader?

2Why should Raft clusters use an odd number of nodes?

Deeper Reading