Vetora logo
📜Consensus & Coordination

Paxos

Paxos is a family of protocols for achieving consensus among a group of unreliable processes. Originally described by Leslie Lamport, Paxos guarantees safety (agreement on a single value) under any non-Byzantine failure and remains the theoretical foundation for most production consensus systems.

Overview

Paxos, first described by Leslie Lamport in his 1989 paper 'The Part-Time Parliament' and later clarified in 'Paxos Made Simple' (2001), is the canonical algorithm for achieving consensus in asynchronous distributed systems subject to crash failures. The core problem Paxos solves is deceptively simple: how do N processes agree on a single value when any process can crash at any time and messages can be delayed, duplicated, or lost (but not corrupted)? Paxos guarantees that at most one value is chosen (safety) and that a value is eventually chosen if a majority of processes remain alive (liveness, under partial synchrony).

Basic Paxos operates in two phases. In Phase 1 (Prepare), a proposer selects a ballot number (a globally unique, monotonically increasing identifier) and sends a Prepare(n) message to a majority of acceptors. Each acceptor responds with a Promise not to accept any proposal with a ballot number less than n, and includes the highest-numbered proposal it has already accepted (if any). In Phase 2 (Accept), if the proposer receives promises from a majority, it sends an Accept(n, v) message where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no prior proposals exist. An acceptor accepts the proposal if it has not promised to a higher ballot. Once a majority of acceptors accept a value, consensus is achieved.

Basic Paxos decides a single value, which is rarely useful on its own. Production systems use Multi-Paxos, which elects a stable leader and runs consecutive Paxos instances to build a replicated state machine (a replicated log of commands). The leader skips Phase 1 for subsequent slots because it already holds the highest ballot, reducing each consensus round to a single Phase 2 exchange. This optimization transforms Paxos from a theoretical curiosity into a practical replication protocol with message complexity of 2 messages per decision (leader to acceptors, acceptors acknowledge).

Paxos is notoriously difficult to understand and implement correctly. Google's Chubby paper (2006) noted that implementing Paxos 'proved to be non-trivial' despite the team's deep expertise. The subtleties include handling leader changes, managing the gap between chosen and learned values, snapshotting and log compaction, and reconfiguring the cluster membership. These implementation challenges led Diego Ongaro to design Raft as a more understandable alternative. Nevertheless, Paxos remains the theoretical benchmark -- most consensus proofs reference Paxos, and understanding it is prerequisite knowledge for evaluating any consensus system.

Key Points
  • 1Paxos guarantees safety (at most one value is chosen) under all non-Byzantine failures. It provides liveness (eventually a value is chosen) under partial synchrony when a majority of nodes are alive.
  • 2The two-phase protocol -- Prepare (claim a ballot) and Accept (propose a value) -- ensures that any proposer learns about previously accepted values before proposing, preventing conflicts.
  • 3A quorum (majority) is required at every step. With 2f+1 nodes, Paxos tolerates f simultaneous crash failures. Any two majorities overlap by at least one node, ensuring consistency.
  • 4Multi-Paxos optimizes basic Paxos by electing a stable leader who skips Phase 1, reducing per-decision cost from 4 messages to 2. This is how production systems (Chubby, Spanner, CockroachDB) actually work.
  • 5The FLP impossibility theorem proves that no deterministic consensus algorithm can guarantee liveness in a fully asynchronous system. Paxos circumvents this by relying on partial synchrony -- eventually, messages arrive within some unknown time bound.
  • 6Paxos is 'live-lock' susceptible: two competing proposers can repeatedly preempt each other with higher ballot numbers, preventing progress. Leader election (or randomized backoff) resolves this in practice.
Simple Example

Choosing a Meeting Room

Three office managers (acceptors) need to agree on which room to book for an all-hands meeting. Manager A proposes Room 101 with ticket #1. Before the vote completes, Manager B proposes Room 202 with higher ticket #2. The acceptors who already promised ticket #1 now receive a higher ticket and must honor ticket #2 instead. Manager B hears that no prior value was accepted, so Room 202 is chosen. If Manager A tries again with ticket #3, they first learn that Room 202 was already accepted at ticket #2, so they must re-propose Room 202 (not Room 101) -- this is how Paxos preserves the previously chosen value.

Real-World Examples

Google Chubby

Chubby is Google's distributed lock service, used internally for leader election and configuration management across thousands of services. It uses Multi-Paxos to replicate a small file system across 5 nodes in a cell. The Chubby paper (2006) details the engineering challenges of turning Paxos into a production system, including multi-database support, session management, and the decision to use a single master for reads/writes rather than allowing any replica to serve reads.

Google Spanner

Spanner uses Paxos-based replication within each split (data partition) to achieve CP behavior with 99.999% availability. Each Paxos group has 5 replicas across data centers. Spanner extends Multi-Paxos with TrueTime-based external consistency, allowing globally consistent reads at any replica without contacting the leader. The Paxos leader assigns timestamps to writes using the TrueTime API, and replicas can serve reads at any timestamp that has passed the TrueTime uncertainty window.

Apache ZooKeeper (ZAB)

ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a protocol closely related to Multi-Paxos with additional ordering guarantees. ZAB ensures FIFO ordering of all client operations and atomic broadcast of state changes. While not exactly Paxos, ZAB shares the same quorum-based replication structure and was designed to address the same consensus problem with stricter ordering semantics needed for ZooKeeper's sequential consistency model.

Trade-Offs
AspectDescription
Safety vs LivenessPaxos always guarantees safety -- at most one value is chosen -- but can sacrifice liveness if proposers continuously preempt each other (live-lock). The FLP impossibility theorem proves this trade-off is fundamental: no protocol can guarantee both in a fully asynchronous system. Production systems use leader election with timeouts to recover liveness.
Message ComplexityBasic Paxos requires 4 message delays per decision (2 round trips). Multi-Paxos with a stable leader reduces this to 2 message delays, but leader failover temporarily reverts to the 4-message path. Faster variants like Fast Paxos reduce to 3 messages in the optimistic case but have larger quorum requirements (3f+1 nodes).
Understandability vs OptimalityPaxos is provably correct but notoriously hard to understand and implement. Diego Ongaro's user studies showed that students took significantly longer to understand Paxos than Raft. The implementation gap between the paper and a production system is large -- Chubby, Spanner, and CockroachDB all required years of engineering beyond the basic protocol.
Latency vs Fault ToleranceMore replicas increase fault tolerance (5 nodes tolerate 2 failures) but increase quorum latency because the leader must wait for more acknowledgments. Cross-data-center Paxos adds 50-200ms per write for geographic fault tolerance. Flexible Paxos relaxes this by allowing different quorum sizes for Phase 1 and Phase 2, trading read availability for write performance.
Case Study

Google's Megastore -- Paxos at Planet Scale

Scenario

Google needed a globally distributed database for applications like Gmail and Google Drive that required strong consistency across data centers. Traditional single-leader replication could not provide both low-latency writes (requiring local leaders) and strong consistency (requiring global agreement). The system needed to handle billions of operations per day across data centers spanning multiple continents.

Solution

Megastore used Paxos to replicate entity groups across data centers, running an independent Paxos instance per entity group. Each write became a Paxos round, with replicas in 3+ data centers acting as acceptors. Megastore introduced 'coordinator' servers that tracked which entity groups had pending writes, allowing fast local reads when the coordinator confirmed the local replica was up-to-date. Writes paid the full cross-DC Paxos latency (~100-200ms), but reads were local when the coordinator was reachable.

Outcome

Megastore served Google's most critical applications at scale, proving that Paxos-based replication could work across continents. However, the per-entity-group throughput was limited to a few writes per second due to cross-DC Paxos latency. This limitation motivated the development of Spanner, which combined Paxos with TrueTime to achieve higher throughput and stronger guarantees. Megastore's experience directly informed the design of every subsequent Google storage system.

Common Mistakes
  • Confusing basic Paxos with Multi-Paxos. Basic Paxos decides a single value; Multi-Paxos chains instances to build a replicated log. No production system runs basic Paxos as described in the paper.
  • Assuming Paxos requires a leader. Basic Paxos is leaderless -- any process can propose. Multi-Paxos adds a leader as an optimization to skip Phase 1, but the protocol is correct without one.
  • Forgetting that acceptors must persist their state to stable storage before responding. If an acceptor crashes and recovers with amnesia, it could violate its promises and break safety.
  • Implementing Paxos from the paper alone. The gap between Lamport's description and a working system includes log compaction, snapshotting, membership changes, batching, and pipelining -- all critical for production use.
Related Concepts

See Paxos in action

Explore system design templates that use paxos and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate Paxos consensus rounds for distributed agreement

Metrics to watch
consensus_latency_msproposal_rejection_ratethroughput_rpsavailability_pct
Run Simulation
Test Your Understanding

1In basic Paxos, what happens if a proposer discovers during Phase 1 that an acceptor has already accepted a value?

2What is the minimum number of nodes required for a Paxos cluster to tolerate 2 simultaneous failures?

3How does Multi-Paxos improve on basic Paxos?

Deeper Reading