Distributed Systems
Consensus: Paxos and Raft
Цели урока
- Understand what consensus is and its three formal requirements
- Know the FLP impossibility result and three practical workarounds
- Follow the two phases of Paxos and the value-selection invariant
- Understand the Raft state machine and log matching property
- Choose between Paxos and Raft for specific use cases
Предварительные знания
- CAP theorem: CP vs AP trade-off
- Concepts of quorum and replication
- Basic failure modes (crash, network partition)
etcd is the brain of every Kubernetes cluster. Without consensus in etcd, the cluster has no idea where pods live, who is leader, or which configuration is current. Consensus is not an academic curiosity - it is the heartbeat of production infrastructure.
- **etcd (Kubernetes)** - stores all cluster state on Raft, 3-5 nodes, quorum mandatory
- **Google Chubby** - distributed lock service on Paxos, coordinates thousands of internal Google services
- **CockroachDB** - Raft per data range, globally distributed SQL database
- **Apache ZooKeeper** - ZAB (ZooKeeper Atomic Broadcast) - a Paxos variant. It coordinated Kafka up to 4.0 (March 2025; now KRaft) and still coordinates HBase and Hadoop
- **Consul** - Raft for service discovery and configuration, standard in the HashiCorp stack
Lamport and the history of Paxos
In 1989 Lamport described Paxos through a metaphor of a Greek parliament on the island of Paxos. TOCS rejected the paper as "too whimsical". Lamport removed the metaphor and resubmitted - rejected again. Published only in 1998 after Google and others had already been running it in production and citing the technical report. For Paxos and other work Lamport received the Turing Award in 2013.
The Consensus Problem
**etcd stores the configuration of every Kubernetes cluster on the planet. On April 1, 2020, Cloudflare lost connectivity between data centers: 180 seconds without consensus - and a third of pods across the cluster froze in Unknown state.** Consensus is the mechanism by which several independent nodes lock in a single agreed decision despite a subset of them failing. The failure model behind "a subset of them failing" lives in the distributed-systems intro.
**Formal requirements for consensus (Lamport, 1982):** 1. **Agreement** - all correct nodes decide on the same value. 2. **Validity** - the decided value was proposed by at least one node. 3. **Termination** - every correct node eventually decides.
Where consensus is used in production
| Task | Why consensus is needed | What breaks without it |
|---|---|---|
| Cluster leader election | Only one node must be primary | Split-brain: two concurrent primaries accepting conflicting writes |
| Transaction log replication | All replicas must see the same operations in the same order | Data divergence between replicas |
| Distributed locks | Only one client holds the lock at any moment | Two clients concurrently modifying the same resource |
| Configuration agreement | All nodes see the same config | Half the cluster running on a stale configuration |
Split-brain without consensus
Two PostgreSQL nodes lose connectivity for 30 seconds. Without consensus both treat themselves as primary and accept writes. The network recovers - the two nodes have diverged data. Which dataset is correct? There is no answer. PostgreSQL with Patroni uses etcd (Raft) precisely to prevent this: when quorum is lost the primary switches to read-only.
Consensus is only needed against Byzantine (malicious) failures
Consensus is needed for any independent failures including ordinary crashes and network partitions
Crash-fault tolerant consensus (Paxos, Raft) handles the most common failures - node crashes and network splits. Byzantine fault tolerant (PBFT, CometBFT - formerly Tendermint) is expensive and designed for untrusted participants like blockchains. 99% of production systems use crash-FT.
A 5-node cluster: one node crashed, two lost connectivity to the others (network partition). How many nodes can make a consensus decision?
FLP Impossibility and Workarounds
**1985. Fischer, Lynch, and Paterson publish a result that shook CS to its core: in a fully asynchronous system with even one potentially faulty node, no deterministic consensus algorithm is guaranteed to terminate.** Not an implementation limitation - a mathematical proof.
**FLP (Fischer-Lynch-Paterson, 1985):** In an asynchronous system with at least one potentially faulty node, no deterministic algorithm can guarantee both Safety (agreement) and Liveness (termination) simultaneously.
The core issue: in an asynchronous network the algorithm cannot tell a slow node apart from a crashed one. Nothing in the protocol says "wait one more second for the response" or "declare the node dead and move on". The conflict has no clean resolution.
Three ways to work around FLP in practice
| Method | Idea | Usage |
|---|---|---|
| Partial synchrony (timeouts) | System may be asynchronous but behaves synchronously during stable periods. Timeout = decision that a node has crashed. | Paxos, Raft - election timeout 150-300ms |
| Randomization | Algorithm takes random steps. Termination guaranteed with probability 1 but not deterministically. | Ben-Or 1983, randomized consensus protocols |
| Failure detectors | Separate mechanism (Oracle) that tells the algorithm who has crashed. Detector may be wrong but has eventual accuracy. | Chandra-Toueg 1996, heartbeat-based detectors |
**Why Raft and Paxos work in practice:** they trade Liveness (may stall) for Safety. During a network partition they stop and wait rather than commit conflicting decisions. Availability drops; data never diverges. That is CP in the CAP model.
A Raft cluster of 5 nodes cannot complete a leader election for 10 seconds. Is this a bug or expected behavior?
Paxos: The Two-Phase Algorithm
**Leslie Lamport invented Paxos in 1989, dressing it up as a story about a Greek parliament. TOCS rejected the first submission as "not serious enough". Published only in 1998 - by which point Google and others were already running it in production.** ZooKeeper, Chubby (Google), Cassandra - all built on Paxos or its variants. The wider Paxos family is unpacked in the Paxos lesson.
Participant roles
| Role | Function | Who in a real system |
|---|---|---|
| Proposer | Proposes a value, coordinates the vote | Client or leader node |
| Acceptor | Votes on proposals, stores promises | Cluster nodes (quorum) |
| Learner | Learns about the decided value and applies it | Replicas, clients |
Phase 1: Prepare / Promise
Phase 2: Accept / Accepted
**Paxos's load-bearing invariant:** if value v was accepted by a quorum in round n, any future Proposer with n' > n collects a Promise quorum in which at least one Acceptor reports the accepted (n, v). The Proposer is forced to pick v. That is exactly how Paxos guarantees Agreement.
In Paxos the Proposer always proposes its own value
The Proposer must propose the value with the highest previously accepted round number found in the Promise responses
This is the key safety mechanism. If another Proposer reached consensus in round n=5 with value 'X', a new Proposer with n=7 learns this from the Promises and continues with 'X'. Without this rule two Proposers could reach different consensus values.
A Proposer received Promises from 2 of 3 acceptors. One Promise contains highestAccepted = {n: 5, value: 'X'}. Which value must the Proposer propose in the Accept phase?
Raft: Consensus for Humans
**2014. Diego Ongaro and John Ousterhout publish Raft: "In Search of an Understandable Consensus Algorithm". Goal: not a new algorithm, the same consensus made readable. etcd (Kubernetes), CockroachDB, TiKV, Consul - all picked Raft.** The difference from Paxos is not in guarantees but in decomposition: Raft explicitly splits leader election, log replication, and safety. Production tuning of Raft lives in the Raft lesson.
Three node states
Paxos vs Raft: which to choose
| Aspect | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Understandability | High complexity, many implicit details | Explicit decomposition, easier to implement correctly |
| Leader | Optional (leaderless variants exist) | Required, all writes go through leader |
| Replication | One value per round | Continuous operations log |
| Throughput | Can parallelize (Multi-Paxos) | Sequential through leader |
| Adoption | ZooKeeper (ZAB ~= Paxos), Chubby | etcd, Consul, CockroachDB, TiKV |
**Raft's log matching property:** if two entries in the logs of different nodes share the same index and term, every preceding entry is identical as well. The leader never overwrites its own log. Followers get reconciled to the leader on divergence. That is the load-bearing safety property.
Raft is safer than Paxos because it is simpler
Raft and Paxos provide equivalent safety guarantees; Raft's simplicity is about implementation, not greater reliability
Both algorithms solve the same problem with the same constraints (FLP). Raft is easier to understand and implement correctly, which in practice reduces implementation bugs. But formally their guarantees are identical under the crash-fault model.
In a 5-node Raft cluster the leader crashes right after appending a command to its own log but before sending it to any follower. What happens to that command?
Key Takeaways
- **Consensus** - Lamport's three requirements: Agreement (all correct nodes decide the same value), Validity (the value was proposed), Termination (every correct node eventually decides)
- **FLP impossibility** - in an asynchronous system with even one faulty node, no deterministic algorithm can guarantee both Safety and Liveness simultaneously
- **FLP workarounds** - partial synchrony (timeouts in Paxos/Raft), randomization, failure detectors; Raft and Paxos trade Liveness for Safety
- **Paxos (two phases)** - Prepare/Promise locks out older rounds; Accept/Accepted commits the value with the highest previously accepted round number
- **Raft** - explicit decomposition: leader election (randomized timeout), log replication (AppendEntries), safety (log matching property)
- **Paxos vs Raft** - identical safety guarantees; Raft is easier to implement correctly - that is why etcd, CockroachDB, and Consul chose it
- **Quorum** - for a cluster of N nodes, majority is floor(N/2)+1; odd cluster sizes give a clear majority without tie-breaking
Вопросы для размышления
- etcd recommends 3 or 5 nodes but not 4 or 6. Why is an odd number of nodes more practical for quorum than an even number?
Связанные уроки
- ds-04-clocks — Logical ordering underpins agreement on event sequence
- dist-08-paxos — Multi-Paxos and Fast Paxos build on the two-phase protocol
- dist-09-raft — Raft variants like Joint Consensus extend the leader-based core
- dist-10-byzantine — PBFT and CometBFT (formerly Tendermint) generalise consensus to malicious nodes
- ds-02-cap-theorem — Quorum and CP behaviour come from the CAP trade-off
- st-12-organizations — Voting in democracies mirrors quorum-based agreement
- ibd-21-docker-k8s-interview
- alg-12-bfs