Distributed Systems
Raft
Цели урока
- Understand the three node roles and transitions between them
- Explain the leader election algorithm and the role of randomized timeout
- Describe log replication via AppendEntries
- Know the five Safety properties and why entries from foreign terms cannot be committed directly
- Navigate real implementations (etcd, hashicorp/raft)
Предварительные знания
- Understanding of the consensus problem in distributed systems
- Basic knowledge of leader election (heartbeats, quorum)
- Familiarity with replicated state machine
5 of 8 independent Paxos implementations contained safety bugs (CMU, 2014). Raft (a Paxos successor) was designed with one goal: make consensus understandable enough to be implemented correctly the first time.
- **Kubernetes** uses etcd/raft to store all control plane state
- **CockroachDB** replicates data via Raft per-range (each key range is its own Raft cluster)
- **Consul** elects the service cluster leader via hashicorp/raft
- **TiKV** (TiDB storage) uses tikv/raft-rs in Rust for ACID transactions
- **Vault** (HashiCorp) stores secrets in a Raft cluster with encryption
Diego Ongaro and the Birth of Raft
Diego Ongaro defended his dissertation at Stanford in 2014 under John Ousterhout. The central thesis: understandability should be a first-class design goal, not a side effect. The authors conducted a user study: after instruction, students were tested on both algorithms. Raft results were significantly better (p < 0.05). The name Raft stands for "Reliable, Replicated, Redundant, Fault-Tolerant" - and is also the raft used to cross the Paxos river.
Node Roles and the Term Concept
**2014. Diego Ongaro, Stanford. Dissertation: "In Search of an Understandable Consensus Algorithm".** Paxos had existed since 1989, but was so complex that most implementations contained bugs - a CMU study found that 5 of 8 independent Paxos implementations had safety errors. Raft was the answer: same guarantees, half the concepts. Today Kubernetes, Consul, CockroachDB, and TiKV all run on Raft - the consensus problem solved practically.
Every node in a Raft cluster is always in one of three states: **Leader**, **Follower**, or **Candidate**. At any given moment there is at most one leader in the cluster.
| State | Role | What it does |
|---|---|---|
| Leader | Manages the cluster | Accepts client requests, replicates the log, sends heartbeats |
| Follower | Passive node | Responds to leader and candidate RPCs, never initiates anything |
| Candidate | Contender | Temporary state during elections - attempts to become Leader |
**Term** - a logical time period in Raft. Each term starts with an election. Terms increase monotonically and never decrease. If a node sees a term greater than its own - it immediately updates and reverts to Follower. At most one Leader per term.
If a Follower receives no heartbeat within the election timeout (default 150-300 ms) - it assumes the leader has died and transitions to Candidate, starting a new term.
Term in Raft is physical time, like a second counter
Term is a logical counter that increments monotonically with each new election. It has no connection to wall-clock time.
Raft deliberately avoids depending on clock synchronization. Terms serve to determine information freshness: higher term = more up-to-date data.
A Raft node receives AppendEntries with term=5, but its own currentTerm=7. What happens?
Leader Election
When the election timeout expires, a Follower becomes a Candidate: increments currentTerm, votes for itself, and broadcasts RequestVote to all nodes. If the candidate receives votes from the majority (N/2 + 1) - it becomes Leader and immediately sends heartbeats to stop other elections.
Key voting condition: the candidate's log must be **at least as up-to-date** as the voter's log. This guarantees the leader will have the most recent data. Comparison: first by lastLogTerm (higher = fresher), then by lastLogIndex (longer = fresher).
**Split vote** - when multiple candidates start simultaneously and nobody reaches majority. Solution: **randomized election timeout** (each node waits a random 150-300 ms). The first to wake up collects votes before others start.
Partition: what happens to the old leader
Cluster of 5 nodes. A network split isolates the leader (Leader-A) from 3 nodes. The majority (3/5) elects a new Leader-B with term=6. Leader-A continues accepting requests but cannot commit any - no quorum. Clients get timeouts. When the network heals - Leader-A sees term=6 > its own term, instantly reverts to Follower. All uncommitted entries from Leader-A get overwritten.
In a 5-node cluster, a split vote occurred: 2 candidates each got 2 votes. What happens next?
Log Replication
The Replicated Log is the heart of Raft. Each entry contains: **index** (position), **term** (which term it was created in), and **command** (for the state machine). The leader accepts a command from a client, appends it to its own log, and sends AppendEntries to all followers. An entry is considered **committed** when replicated to the majority of nodes.
| AppendEntries field | Purpose |
|---|---|
| prevLogIndex / prevLogTerm | Consistency check: follower must have an entry at this index with this term |
| entries[] | New entries to append (empty array = heartbeat) |
| leaderCommit | Leader tells follower up to which index it can apply entries to state machine |
**Critical commit rule:** the leader only commits entries from **its own term**. Entries from previous terms are committed implicitly alongside the current term's entry. This protects against the Figure 8 scenario: without this rule, a committed entry could be overwritten on leader change.
Figure 8: why old-term entries cannot be committed directly
Term 2: S1 (leader) replicates entry to S2. Logs: S1=[1,2], S2=[1,2], S3=[1]. Term 3: S5 becomes leader, appends term-3 entry. S5=[1,3]. Term 4: S1 is leader again, replicates term-2 entry to S3. Now [1,2] is on majority (S1,S2,S3). If S1 crashes now - S5 can become leader (its term 3 > term 2) and overwrites the term-2 entry! Fix: S1 must first append a term-4 entry - only when [1,2,4] is on majority is everything safe.
An entry is committed as soon as the leader appends it to its own log
An entry is committed only when replicated to a majority (N/2 + 1) of cluster nodes
If the leader considered entries committed immediately, a leader crash could cause data loss that the client already received confirmation for. Quorum guarantees that data survives the death of any minority.
A follower receives AppendEntries with prevLogIndex=5, prevLogTerm=3, but its entry at index=5 has term=2. What happens?
Safety Properties and Real Implementations
Raft proves five Safety properties from which algorithm correctness follows. The key guarantee: under any failure scenario, a committed entry is never lost or overwritten.
| Property | Guarantee |
|---|---|
| Election Safety | At most one leader per term |
| Leader Append-Only | A leader never deletes or overwrites its own log |
| Log Matching | If two logs agree on (index, term) - all preceding entries are identical |
| Leader Completeness | If an entry is committed in term T - it is in the log of all leaders with term > T |
| State Machine Safety | All nodes apply the same command at each log index |
**Joint Consensus** - mechanism for safe membership changes (adding/removing nodes). A transitional C_old,new configuration is created, requiring majority in both sets. Simpler approach: **single-server changes** - change one node at a time, no transitional config needed.
| Implementation | Language | Used in |
|---|---|---|
| etcd/raft | Go | Kubernetes, CockroachDB |
| hashicorp/raft | Go | Consul, Nomad, Vault |
| Ratis | Java | Apache Ozone |
| tikv/raft-rs | Rust | TiKV |
| NuRaft | C++ | eBay NuKV |
Production optimizations: **Batching** (multiple commands in one AppendEntries), **Pipelining** (send next batch before receiving ACK), **Pre-vote** (preliminary round before elections, prevents disruption from partitioned nodes), **Learners** (non-voting members for catch-up before joining), **Snapshotting** (log compaction into a snapshot as it grows).
Raft is a simplified version of Paxos with weaker guarantees
Raft provides the same safety guarantees as Multi-Paxos, but with a fundamentally different decomposition of the problem
Raft trades some theoretical optimality (e.g., Paxos can accept entries out of order) for understandability. The safety properties are formally equivalent, as proven in Ongaro's dissertation.
Why does the Log Matching property allow the leader to find the divergence point with a lagging follower?
Вопросы для размышления
- etcd in Kubernetes stores all cluster state and is a bottleneck at scale. How does the Raft quorum affect write latency per operation, and why does Kubernetes recommend an odd number of etcd nodes?
Связанные уроки
- dist-08-paxos — Raft is understood by contrast with Paxos
- ds-06-leader-election — Raft integrates leader election as a core sub-protocol
- dist-06-ordering — Raft log implements total order broadcast
- dist-10-byzantine — Raft assumes crash faults; BFT requires stronger protocols
- ds-11-distributed-locks — Lock services like etcd use Raft internally
- alg-12-bfs