Distributed Systems
Paxos
Цели урока
- Understand the consensus task and its four properties (validity, agreement, termination, fault tolerance)
- Walk through both phases of Basic Paxos: Prepare/Promise and Accept/Accepted
- Know the livelock problem and how Multi-Paxos resolves it via leader election
- Recognize production deployments: Chubby, Spanner, ZooKeeper, Raft
Предварительные знания
- Understanding of distributed systems: partial failures, network partitions
- CAP theorem and CP vs AP trade-off
- Basic knowledge of replication (leader/follower)
Google Chubby coordinates GFS, Bigtable, and Spanner - thousands of servers, petabytes of data - through an algorithm invented as a fable about a Greek parliament and rejected by academic journals twice.
- **Google Chubby (1998, Mike Burrows)** - Multi-Paxos in production 3 years before Lamport's paper was published. Used for distributed locks in GFS and Bigtable
- **Google Spanner** - Paxos groups per shard. Global strong consistency via atomic clocks + Paxos
- **Apache ZooKeeper** - Zab protocol (a Paxos variant) coordinates HBase and Hadoop; Kafka up to 4.0 (March 2025; now KRaft)
- **AWS DynamoDB** - Paxos for metadata consensus in the routing layer at hundreds of thousands of operations per second
- **etcd (Kubernetes)** - Raft (Paxos successor algorithm) stores all Kubernetes cluster state
The Fable of the Parliament of Paxos
Lamport described the algorithm as a story about the legislative parliament of the island of Paxos (Greece), where legislators voted on laws even when some of them had left for a break. The paper was rejected twice - by TOCS in 1990 and 1996. It was published in 1998 at the direct request of Butler Lampson (Turing Award 1992). By that time Lampson was personally telling colleagues about Paxos at conferences - the algorithm spread by word of mouth. The Turing Award came to Lamport in 2013, partly for Paxos and for logical clocks.
The Consensus Problem
**1989. Leslie Lamport writes a paper about the parliament of the island of Paxos (Greece) - a fable about how legislators vote when some of them are asleep or have left.** The journal rejected it. Resubmitted in 1998 - rejected again. Published in 2001. By that time the algorithm had been running in production at Google Chubby for three years - the distributed lock service for Bigtable, GFS, and Spanner. Lamport received the Turing Award in 2013, partly for Paxos. It is a concrete algorithm for the consensus problem.
The consensus task: N nodes must agree on one value, even if some nodes fail. Four properties must hold simultaneously:
| Property | What it means |
|---|---|
| **Validity** | The chosen value is one of the proposed values, not invented |
| **Agreement** | All live nodes choose the same value |
| **Termination** | The algorithm completes in finite time |
| **Fault tolerance** | Works when a minority of nodes fail (fewer than N/2) |
**Quorum:** Paxos requires a majority - N/2 + 1. With 5 nodes, 3 votes are needed. This survives failure of 2 nodes and guarantees that any two quorums share at least one node - the key property of the algorithm.
Why this is hard: consider two nodes simultaneously proposing different values. A third node receives both proposals in different orders (the network is non-deterministic). A mechanism is needed to guarantee that in the end all nodes see the same value - even if several nodes dropped out in the middle.
Consensus is just voting: whoever gets more votes wins
Consensus requires preserving already-accepted decisions across rounds, otherwise two different values can be 'chosen' in different rounds
Simple voting has no memory. In Paxos an acceptor promises (PROMISE) not to accept proposals with a lower number - and reports the last accepted value. Without this mechanism a second proposer could 'win' with a different value in the next round.
Why does Paxos require a quorum of N/2 + 1 rather than any majority?
Basic Paxos Protocol
Paxos has three roles: **Proposer** (proposes values), **Acceptor** (votes), **Learner** (learns the result). One node can play all three roles. The algorithm has two phases.
Phase 1: Prepare / Promise
Phase 2: Accept / Accepted
Example: P1 proposes X, P2 arrives late
P1 sends PREPARE(1) - all three acceptors reply PROMISE(1, null). P1 sends ACCEPT(1, "X") to A1 and A2 - both reply ACCEPTED. Quorum reached, "X" is chosen. P2 arrives late with PREPARE(2) - A1 and A2 reply PROMISE(2, "X", 1). P2 sees the already-accepted "X" and MUST propose exactly it. Result: all nodes agreed on "X" despite two competing proposers.
**Key invariant:** if a proposer sees an already-accepted value in a PROMISE, it must use that value. This guarantees agreement: two quorums share at least one node, which will report the accepted value.
A proposer received PROMISE from 3 of 5 acceptors. One PROMISE contains an accepted value 'Y' (accepted_N=5). What must the proposer do?
Multi-Paxos and Livelock
Basic Paxos solves single-value consensus. Real systems (replicated logs, distributed databases) need consensus for a sequence of commands. Basic Paxos also has a theoretical problem - livelock.
Livelock: two proposers blocking each other
The solution is **Multi-Paxos**: elect one stable leader. The leader is the only proposer, so there is no competition. After the first successful Paxos round (Phase 1), the leader can skip Phase 1 for all subsequent values.
| Mode | Round-trips per value | 100 values |
|---|---|---|
| Basic Paxos | 2 RTT (Prepare + Accept) | 200 RTT |
| Multi-Paxos (established leader) | 1 RTT (Accept only) | 2 + 99 = 101 RTT |
| Multi-Paxos (leader change) | 2 RTT for first value | recounts from new leader |
**FLP Impossibility (Fischer, Lynch, Paterson, 1985):** in an asynchronous system where even one node can fail, it is impossible to guarantee consensus in finite time. Paxos works around this by not guaranteeing termination in the worst case (livelock is theoretically possible) - but in practice randomized leader election and timeouts solve the problem.
Livelock in Paxos is a critical problem that makes the algorithm unsuitable in practice
Livelock is theoretically possible with competing proposers, but is solved in practice by electing a single leader
Multi-Paxos with leader election (see leader election) has been running in Google Chubby and Spanner since 1998. With a stable leader, livelock is impossible. Leader failure is handled via timeout and new election - this takes seconds, not an infinite loop.
Why does Multi-Paxos with an established leader require only 1 RTT instead of 2?
Paxos in Production
Paxos is rarely used as a vanilla algorithm - each system adapts it to its own needs. Google Chubby (2006) was the first widely documented production use and inspired ZooKeeper.
| System | Paxos Variant | Usage |
|---|---|---|
| Google Chubby | Multi-Paxos | Distributed lock service - coordinates GFS, Bigtable, Spanner |
| Google Spanner | Paxos groups | Per-shard consensus for a globally consistent DB |
| Apache ZooKeeper | Zab (Zookeeper Atomic Broadcast) | Similar to Paxos, optimized for leader-driven replication |
| AWS DynamoDB | Paxos | Metadata consensus for the routing layer |
| CockroachDB | Raft | Chose Raft as a more understandable Paxos-like algorithm |
Paxos vs Raft: choosing in practice
- Paxos — Theoretically optimal, flexible (many optimization variants), battle-tested since 1989. Downside: hard to understand and implement, recovery on leader failure requires careful engineering.
- Raft (2014) — Designed explicitly for understandability. Strong leader simplifies logic, many quality open-source implementations (etcd, TiKV). Downside: less flexible, leader can become a bottleneck.
**Practical advice:** for new projects - use Raft (etcd, HashiCorp Raft library, TiKV). Choose Paxos when a specific optimization is needed or the team already has deep expertise. Google uses Paxos because they started in 1998, before Raft existed.
Apache ZooKeeper uses Zab instead of vanilla Paxos. What is the main difference in Zab?
Вопросы для размышления
- Raft was designed explicitly for understandability and has surpassed Paxos in popularity for new projects. What does this say about the role of engineering readability - how important is theoretical optimality compared to simplicity of implementation and debugging?
Связанные уроки
- ds-03-consensus — Paxos is the canonical consensus algorithm
- ds-06-leader-election — Paxos includes leader-like proposer selection
- dist-07-2pc — 2PC is a special-case precursor to Paxos
- dist-09-raft — Raft is a more understandable alternative to Paxos
- dist-10-byzantine — BFT Paxos variants handle Byzantine failures
- alg-12-bfs