1In basic Paxos, what happens if a proposer discovers during Phase 1 that an acceptor has already accepted a value?
Paxos is a family of protocols for achieving consensus among a group of unreliable processes. Originally described by Leslie Lamport, Paxos guarantees safety (agreement on a single value) under any non-Byzantine failure and remains the theoretical foundation for most production consensus systems.
Paxos, first described by Leslie Lamport in his 1989 paper 'The Part-Time Parliament' and later clarified in 'Paxos Made Simple' (2001), is the canonical algorithm for achieving consensus in asynchronous distributed systems subject to crash failures. The core problem Paxos solves is deceptively simple: how do N processes agree on a single value when any process can crash at any time and messages can be delayed, duplicated, or lost (but not corrupted)? Paxos guarantees that at most one value is chosen (safety) and that a value is eventually chosen if a majority of processes remain alive (liveness, under partial synchrony).
Basic Paxos operates in two phases. In Phase 1 (Prepare), a proposer selects a ballot number (a globally unique, monotonically increasing identifier) and sends a Prepare(n) message to a majority of acceptors. Each acceptor responds with a Promise not to accept any proposal with a ballot number less than n, and includes the highest-numbered proposal it has already accepted (if any). In Phase 2 (Accept), if the proposer receives promises from a majority, it sends an Accept(n, v) message where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no prior proposals exist. An acceptor accepts the proposal if it has not promised to a higher ballot. Once a majority of acceptors accept a value, consensus is achieved.
Basic Paxos decides a single value, which is rarely useful on its own. Production systems use Multi-Paxos, which elects a stable leader and runs consecutive Paxos instances to build a replicated state machine (a replicated log of commands). The leader skips Phase 1 for subsequent slots because it already holds the highest ballot, reducing each consensus round to a single Phase 2 exchange. This optimization transforms Paxos from a theoretical curiosity into a practical replication protocol with message complexity of 2 messages per decision (leader to acceptors, acceptors acknowledge).
Paxos is notoriously difficult to understand and implement correctly. Google's Chubby paper (2006) noted that implementing Paxos 'proved to be non-trivial' despite the team's deep expertise. The subtleties include handling leader changes, managing the gap between chosen and learned values, snapshotting and log compaction, and reconfiguring the cluster membership. These implementation challenges led Diego Ongaro to design Raft as a more understandable alternative. Nevertheless, Paxos remains the theoretical benchmark -- most consensus proofs reference Paxos, and understanding it is prerequisite knowledge for evaluating any consensus system.
Choosing a Meeting Room
Three office managers (acceptors) need to agree on which room to book for an all-hands meeting. Manager A proposes Room 101 with ticket #1. Before the vote completes, Manager B proposes Room 202 with higher ticket #2. The acceptors who already promised ticket #1 now receive a higher ticket and must honor ticket #2 instead. Manager B hears that no prior value was accepted, so Room 202 is chosen. If Manager A tries again with ticket #3, they first learn that Room 202 was already accepted at ticket #2, so they must re-propose Room 202 (not Room 101) -- this is how Paxos preserves the previously chosen value.
Google Chubby
Chubby is Google's distributed lock service, used internally for leader election and configuration management across thousands of services. It uses Multi-Paxos to replicate a small file system across 5 nodes in a cell. The Chubby paper (2006) details the engineering challenges of turning Paxos into a production system, including multi-database support, session management, and the decision to use a single master for reads/writes rather than allowing any replica to serve reads.
Google Spanner
Spanner uses Paxos-based replication within each split (data partition) to achieve CP behavior with 99.999% availability. Each Paxos group has 5 replicas across data centers. Spanner extends Multi-Paxos with TrueTime-based external consistency, allowing globally consistent reads at any replica without contacting the leader. The Paxos leader assigns timestamps to writes using the TrueTime API, and replicas can serve reads at any timestamp that has passed the TrueTime uncertainty window.
Apache ZooKeeper (ZAB)
ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a protocol closely related to Multi-Paxos with additional ordering guarantees. ZAB ensures FIFO ordering of all client operations and atomic broadcast of state changes. While not exactly Paxos, ZAB shares the same quorum-based replication structure and was designed to address the same consensus problem with stricter ordering semantics needed for ZooKeeper's sequential consistency model.
| Aspect | Description |
|---|---|
| Safety vs Liveness | Paxos always guarantees safety -- at most one value is chosen -- but can sacrifice liveness if proposers continuously preempt each other (live-lock). The FLP impossibility theorem proves this trade-off is fundamental: no protocol can guarantee both in a fully asynchronous system. Production systems use leader election with timeouts to recover liveness. |
| Message Complexity | Basic Paxos requires 4 message delays per decision (2 round trips). Multi-Paxos with a stable leader reduces this to 2 message delays, but leader failover temporarily reverts to the 4-message path. Faster variants like Fast Paxos reduce to 3 messages in the optimistic case but have larger quorum requirements (3f+1 nodes). |
| Understandability vs Optimality | Paxos is provably correct but notoriously hard to understand and implement. Diego Ongaro's user studies showed that students took significantly longer to understand Paxos than Raft. The implementation gap between the paper and a production system is large -- Chubby, Spanner, and CockroachDB all required years of engineering beyond the basic protocol. |
| Latency vs Fault Tolerance | More replicas increase fault tolerance (5 nodes tolerate 2 failures) but increase quorum latency because the leader must wait for more acknowledgments. Cross-data-center Paxos adds 50-200ms per write for geographic fault tolerance. Flexible Paxos relaxes this by allowing different quorum sizes for Phase 1 and Phase 2, trading read availability for write performance. |
Google's Megastore -- Paxos at Planet Scale
Scenario
Google needed a globally distributed database for applications like Gmail and Google Drive that required strong consistency across data centers. Traditional single-leader replication could not provide both low-latency writes (requiring local leaders) and strong consistency (requiring global agreement). The system needed to handle billions of operations per day across data centers spanning multiple continents.
Solution
Megastore used Paxos to replicate entity groups across data centers, running an independent Paxos instance per entity group. Each write became a Paxos round, with replicas in 3+ data centers acting as acceptors. Megastore introduced 'coordinator' servers that tracked which entity groups had pending writes, allowing fast local reads when the coordinator confirmed the local replica was up-to-date. Writes paid the full cross-DC Paxos latency (~100-200ms), but reads were local when the coordinator was reachable.
Outcome
Megastore served Google's most critical applications at scale, proving that Paxos-based replication could work across continents. However, the per-entity-group throughput was limited to a few writes per second due to cross-DC Paxos latency. This limitation motivated the development of Spanner, which combined Paxos with TrueTime to achieve higher throughput and stronger guarantees. Megastore's experience directly informed the design of every subsequent Google storage system.
See Paxos in action
Explore system design templates that use paxos and run traffic simulations to see how these concepts perform under real load.
Browse Templates1In basic Paxos, what happens if a proposer discovers during Phase 1 that an acceptor has already accepted a value?
2What is the minimum number of nodes required for a Paxos cluster to tolerate 2 simultaneous failures?
3How does Multi-Paxos improve on basic Paxos?