Distributed Systems

Consensus: Paxos and Raft

Цели урока

Understand what consensus is and its three formal requirements
Know the FLP impossibility result and three practical workarounds
Follow the two phases of Paxos and the value-selection invariant
Understand the Raft state machine and log matching property
Choose between Paxos and Raft for specific use cases

Предварительные знания

CAP theorem: CP vs AP trade-off
Concepts of quorum and replication
Basic failure modes (crash, network partition)

etcd is the brain of every Kubernetes cluster. Without consensus in etcd, the cluster has no idea where pods live, who is leader, or which configuration is current. Consensus is not an academic curiosity - it is the heartbeat of production infrastructure.

**etcd (Kubernetes)** - stores all cluster state on Raft, 3-5 nodes, quorum mandatory
**Google Chubby** - distributed lock service on Paxos, coordinates thousands of internal Google services
**CockroachDB** - Raft per data range, globally distributed SQL database
**Apache ZooKeeper** - ZAB (ZooKeeper Atomic Broadcast) - a Paxos variant. It coordinated Kafka up to 4.0 (March 2025; now KRaft) and still coordinates HBase and Hadoop
**Consul** - Raft for service discovery and configuration, standard in the HashiCorp stack

Lamport and the history of Paxos

In 1989 Lamport described Paxos through a metaphor of a Greek parliament on the island of Paxos. TOCS rejected the paper as "too whimsical". Lamport removed the metaphor and resubmitted - rejected again. Published only in 1998 after Google and others had already been running it in production and citing the technical report. For Paxos and other work Lamport received the Turing Award in 2013.

The Consensus Problem

**etcd stores the configuration of every Kubernetes cluster on the planet. On April 1, 2020, Cloudflare lost connectivity between data centers: 180 seconds without consensus - and a third of pods across the cluster froze in Unknown state.** Consensus is the mechanism by which several independent nodes lock in a single agreed decision despite a subset of them failing. The failure model behind "a subset of them failing" lives in the distributed-systems intro.

**Formal requirements for consensus (Lamport, 1982):** 1. **Agreement** - all correct nodes decide on the same value. 2. **Validity** - the decided value was proposed by at least one node. 3. **Termination** - every correct node eventually decides.

Where consensus is used in production

Task	Why consensus is needed	What breaks without it
Cluster leader election	Only one node must be primary	Split-brain: two concurrent primaries accepting conflicting writes
Transaction log replication	All replicas must see the same operations in the same order	Data divergence between replicas
Distributed locks	Only one client holds the lock at any moment	Two clients concurrently modifying the same resource
Configuration agreement	All nodes see the same config	Half the cluster running on a stale configuration

Split-brain without consensus

Two PostgreSQL nodes lose connectivity for 30 seconds. Without consensus both treat themselves as primary and accept writes. The network recovers - the two nodes have diverged data. Which dataset is correct? There is no answer. PostgreSQL with Patroni uses etcd (Raft) precisely to prevent this: when quorum is lost the primary switches to read-only.

Consensus is only needed against Byzantine (malicious) failures

Consensus is needed for any independent failures including ordinary crashes and network partitions

Crash-fault tolerant consensus (Paxos, Raft) handles the most common failures - node crashes and network splits. Byzantine fault tolerant (PBFT, CometBFT - formerly Tendermint) is expensive and designed for untrusted participants like blockchains. 99% of production systems use crash-FT.

A 5-node cluster: one node crashed, two lost connectivity to the others (network partition). How many nodes can make a consensus decision?

FLP Impossibility and Workarounds

**1985. Fischer, Lynch, and Paterson publish a result that shook CS to its core: in a fully asynchronous system with even one potentially faulty node, no deterministic consensus algorithm is guaranteed to terminate.** Not an implementation limitation - a mathematical proof.

**FLP (Fischer-Lynch-Paterson, 1985):** In an asynchronous system with at least one potentially faulty node, no deterministic algorithm can guarantee both Safety (agreement) and Liveness (termination) simultaneously.

The core issue: in an asynchronous network the algorithm cannot tell a slow node apart from a crashed one. Nothing in the protocol says "wait one more second for the response" or "declare the node dead and move on". The conflict has no clean resolution.

Three ways to work around FLP in practice

Method	Idea	Usage
Partial synchrony (timeouts)	System may be asynchronous but behaves synchronously during stable periods. Timeout = decision that a node has crashed.	Paxos, Raft - election timeout 150-300ms
Randomization	Algorithm takes random steps. Termination guaranteed with probability 1 but not deterministically.	Ben-Or 1983, randomized consensus protocols
Failure detectors	Separate mechanism (Oracle) that tells the algorithm who has crashed. Detector may be wrong but has eventual accuracy.	Chandra-Toueg 1996, heartbeat-based detectors

**Why Raft and Paxos work in practice:** they trade Liveness (may stall) for Safety. During a network partition they stop and wait rather than commit conflicting decisions. Availability drops; data never diverges. That is CP in the CAP model.

A Raft cluster of 5 nodes cannot complete a leader election for 10 seconds. Is this a bug or expected behavior?

Paxos: The Two-Phase Algorithm

**Leslie Lamport invented Paxos in 1989, dressing it up as a story about a Greek parliament. TOCS rejected the first submission as "not serious enough". Published only in 1998 - by which point Google and others were already running it in production.** ZooKeeper, Chubby (Google), Cassandra - all built on Paxos or its variants. The wider Paxos family is unpacked in the Paxos lesson.

Participant roles

Role	Function	Who in a real system
Proposer	Proposes a value, coordinates the vote	Client or leader node
Acceptor	Votes on proposals, stores promises	Cluster nodes (quorum)
Learner	Learns about the decided value and applies it	Replicas, clients

Phase 1: Prepare / Promise

Phase 2: Accept / Accepted

**Paxos's load-bearing invariant:** if value v was accepted by a quorum in round n, any future Proposer with n' > n collects a Promise quorum in which at least one Acceptor reports the accepted (n, v). The Proposer is forced to pick v. That is exactly how Paxos guarantees Agreement.

In Paxos the Proposer always proposes its own value

The Proposer must propose the value with the highest previously accepted round number found in the Promise responses

This is the key safety mechanism. If another Proposer reached consensus in round n=5 with value 'X', a new Proposer with n=7 learns this from the Promises and continues with 'X'. Without this rule two Proposers could reach different consensus values.

A Proposer received Promises from 2 of 3 acceptors. One Promise contains highestAccepted = {n: 5, value: 'X'}. Which value must the Proposer propose in the Accept phase?

Raft: Consensus for Humans

**2014. Diego Ongaro and John Ousterhout publish Raft: "In Search of an Understandable Consensus Algorithm". Goal: not a new algorithm, the same consensus made readable. etcd (Kubernetes), CockroachDB, TiKV, Consul - all picked Raft.** The difference from Paxos is not in guarantees but in decomposition: Raft explicitly splits leader election, log replication, and safety. Production tuning of Raft lives in the Raft lesson.

Three node states

Paxos vs Raft: which to choose

Aspect	Paxos (Multi-Paxos)	Raft
Understandability	High complexity, many implicit details	Explicit decomposition, easier to implement correctly
Leader	Optional (leaderless variants exist)	Required, all writes go through leader
Replication	One value per round	Continuous operations log
Throughput	Can parallelize (Multi-Paxos)	Sequential through leader
Adoption	ZooKeeper (ZAB ~= Paxos), Chubby	etcd, Consul, CockroachDB, TiKV

**Raft's log matching property:** if two entries in the logs of different nodes share the same index and term, every preceding entry is identical as well. The leader never overwrites its own log. Followers get reconciled to the leader on divergence. That is the load-bearing safety property.

Raft is safer than Paxos because it is simpler

Raft and Paxos provide equivalent safety guarantees; Raft's simplicity is about implementation, not greater reliability

Both algorithms solve the same problem with the same constraints (FLP). Raft is easier to understand and implement correctly, which in practice reduces implementation bugs. But formally their guarantees are identical under the crash-fault model.

In a 5-node Raft cluster the leader crashes right after appending a command to its own log but before sending it to any follower. What happens to that command?

Key Takeaways

**Consensus** - Lamport's three requirements: Agreement (all correct nodes decide the same value), Validity (the value was proposed), Termination (every correct node eventually decides)
**FLP impossibility** - in an asynchronous system with even one faulty node, no deterministic algorithm can guarantee both Safety and Liveness simultaneously
**FLP workarounds** - partial synchrony (timeouts in Paxos/Raft), randomization, failure detectors; Raft and Paxos trade Liveness for Safety
**Paxos (two phases)** - Prepare/Promise locks out older rounds; Accept/Accepted commits the value with the highest previously accepted round number
**Raft** - explicit decomposition: leader election (randomized timeout), log replication (AppendEntries), safety (log matching property)
**Paxos vs Raft** - identical safety guarantees; Raft is easier to implement correctly - that is why etcd, CockroachDB, and Consul chose it
**Quorum** - for a cluster of N nodes, majority is floor(N/2)+1; odd cluster sizes give a clear majority without tie-breaking

Вопросы для размышления

etcd recommends 3 or 5 nodes but not 4 or 6. Why is an odd number of nodes more practical for quorum than an even number?

Связанные уроки

ds-04-clocks — Logical ordering underpins agreement on event sequence
dist-08-paxos — Multi-Paxos and Fast Paxos build on the two-phase protocol
dist-09-raft — Raft variants like Joint Consensus extend the leader-based core
dist-10-byzantine — PBFT and CometBFT (formerly Tendermint) generalise consensus to malicious nodes
ds-02-cap-theorem — Quorum and CP behaviour come from the CAP trade-off
st-12-organizations — Voting in democracies mirrors quorum-based agreement
ibd-21-docker-k8s-interview
alg-12-bfs