Distributed Systems

Raft

Цели урока

Understand the three node roles and transitions between them
Explain the leader election algorithm and the role of randomized timeout
Describe log replication via AppendEntries
Know the five Safety properties and why entries from foreign terms cannot be committed directly
Navigate real implementations (etcd, hashicorp/raft)

Предварительные знания

Understanding of the consensus problem in distributed systems
Basic knowledge of leader election (heartbeats, quorum)
Familiarity with replicated state machine

5 of 8 independent Paxos implementations contained safety bugs (CMU, 2014). Raft (a Paxos successor) was designed with one goal: make consensus understandable enough to be implemented correctly the first time.

**Kubernetes** uses etcd/raft to store all control plane state
**CockroachDB** replicates data via Raft per-range (each key range is its own Raft cluster)
**Consul** elects the service cluster leader via hashicorp/raft
**TiKV** (TiDB storage) uses tikv/raft-rs in Rust for ACID transactions
**Vault** (HashiCorp) stores secrets in a Raft cluster with encryption

Diego Ongaro and the Birth of Raft

Diego Ongaro defended his dissertation at Stanford in 2014 under John Ousterhout. The central thesis: understandability should be a first-class design goal, not a side effect. The authors conducted a user study: after instruction, students were tested on both algorithms. Raft results were significantly better (p < 0.05). The name Raft stands for "Reliable, Replicated, Redundant, Fault-Tolerant" - and is also the raft used to cross the Paxos river.

Node Roles and the Term Concept

**2014. Diego Ongaro, Stanford. Dissertation: "In Search of an Understandable Consensus Algorithm".** Paxos had existed since 1989, but was so complex that most implementations contained bugs - a CMU study found that 5 of 8 independent Paxos implementations had safety errors. Raft was the answer: same guarantees, half the concepts. Today Kubernetes, Consul, CockroachDB, and TiKV all run on Raft - the consensus problem solved practically.

Every node in a Raft cluster is always in one of three states: **Leader**, **Follower**, or **Candidate**. At any given moment there is at most one leader in the cluster.

State	Role	What it does
Leader	Manages the cluster	Accepts client requests, replicates the log, sends heartbeats
Follower	Passive node	Responds to leader and candidate RPCs, never initiates anything
Candidate	Contender	Temporary state during elections - attempts to become Leader

**Term** - a logical time period in Raft. Each term starts with an election. Terms increase monotonically and never decrease. If a node sees a term greater than its own - it immediately updates and reverts to Follower. At most one Leader per term.

If a Follower receives no heartbeat within the election timeout (default 150-300 ms) - it assumes the leader has died and transitions to Candidate, starting a new term.

Term in Raft is physical time, like a second counter

Term is a logical counter that increments monotonically with each new election. It has no connection to wall-clock time.

Raft deliberately avoids depending on clock synchronization. Terms serve to determine information freshness: higher term = more up-to-date data.

A Raft node receives AppendEntries with term=5, but its own currentTerm=7. What happens?

Leader Election

When the election timeout expires, a Follower becomes a Candidate: increments currentTerm, votes for itself, and broadcasts RequestVote to all nodes. If the candidate receives votes from the majority (N/2 + 1) - it becomes Leader and immediately sends heartbeats to stop other elections.

Key voting condition: the candidate's log must be **at least as up-to-date** as the voter's log. This guarantees the leader will have the most recent data. Comparison: first by lastLogTerm (higher = fresher), then by lastLogIndex (longer = fresher).

**Split vote** - when multiple candidates start simultaneously and nobody reaches majority. Solution: **randomized election timeout** (each node waits a random 150-300 ms). The first to wake up collects votes before others start.

Partition: what happens to the old leader

Cluster of 5 nodes. A network split isolates the leader (Leader-A) from 3 nodes. The majority (3/5) elects a new Leader-B with term=6. Leader-A continues accepting requests but cannot commit any - no quorum. Clients get timeouts. When the network heals - Leader-A sees term=6 > its own term, instantly reverts to Follower. All uncommitted entries from Leader-A get overwritten.

In a 5-node cluster, a split vote occurred: 2 candidates each got 2 votes. What happens next?

Log Replication

The Replicated Log is the heart of Raft. Each entry contains: **index** (position), **term** (which term it was created in), and **command** (for the state machine). The leader accepts a command from a client, appends it to its own log, and sends AppendEntries to all followers. An entry is considered **committed** when replicated to the majority of nodes.

AppendEntries field	Purpose
prevLogIndex / prevLogTerm	Consistency check: follower must have an entry at this index with this term
entries[]	New entries to append (empty array = heartbeat)
leaderCommit	Leader tells follower up to which index it can apply entries to state machine

**Critical commit rule:** the leader only commits entries from **its own term**. Entries from previous terms are committed implicitly alongside the current term's entry. This protects against the Figure 8 scenario: without this rule, a committed entry could be overwritten on leader change.

Figure 8: why old-term entries cannot be committed directly

Term 2: S1 (leader) replicates entry to S2. Logs: S1=[1,2], S2=[1,2], S3=[1]. Term 3: S5 becomes leader, appends term-3 entry. S5=[1,3]. Term 4: S1 is leader again, replicates term-2 entry to S3. Now [1,2] is on majority (S1,S2,S3). If S1 crashes now - S5 can become leader (its term 3 > term 2) and overwrites the term-2 entry! Fix: S1 must first append a term-4 entry - only when [1,2,4] is on majority is everything safe.

An entry is committed as soon as the leader appends it to its own log

An entry is committed only when replicated to a majority (N/2 + 1) of cluster nodes

If the leader considered entries committed immediately, a leader crash could cause data loss that the client already received confirmation for. Quorum guarantees that data survives the death of any minority.

A follower receives AppendEntries with prevLogIndex=5, prevLogTerm=3, but its entry at index=5 has term=2. What happens?

Safety Properties and Real Implementations

Raft proves five Safety properties from which algorithm correctness follows. The key guarantee: under any failure scenario, a committed entry is never lost or overwritten.

Property	Guarantee
Election Safety	At most one leader per term
Leader Append-Only	A leader never deletes or overwrites its own log
Log Matching	If two logs agree on (index, term) - all preceding entries are identical
Leader Completeness	If an entry is committed in term T - it is in the log of all leaders with term > T
State Machine Safety	All nodes apply the same command at each log index

**Joint Consensus** - mechanism for safe membership changes (adding/removing nodes). A transitional C_old,new configuration is created, requiring majority in both sets. Simpler approach: **single-server changes** - change one node at a time, no transitional config needed.

Implementation	Language	Used in
etcd/raft	Go	Kubernetes, CockroachDB
hashicorp/raft	Go	Consul, Nomad, Vault
Ratis	Java	Apache Ozone
tikv/raft-rs	Rust	TiKV
NuRaft	C++	eBay NuKV

Production optimizations: **Batching** (multiple commands in one AppendEntries), **Pipelining** (send next batch before receiving ACK), **Pre-vote** (preliminary round before elections, prevents disruption from partitioned nodes), **Learners** (non-voting members for catch-up before joining), **Snapshotting** (log compaction into a snapshot as it grows).

Raft is a simplified version of Paxos with weaker guarantees

Raft provides the same safety guarantees as Multi-Paxos, but with a fundamentally different decomposition of the problem

Raft trades some theoretical optimality (e.g., Paxos can accept entries out of order) for understandability. The safety properties are formally equivalent, as proven in Ongaro's dissertation.

Why does the Log Matching property allow the leader to find the divergence point with a lagging follower?

Вопросы для размышления

etcd in Kubernetes stores all cluster state and is a bottleneck at scale. How does the Raft quorum affect write latency per operation, and why does Kubernetes recommend an odd number of etcd nodes?

Связанные уроки

dist-08-paxos — Raft is understood by contrast with Paxos
ds-06-leader-election — Raft integrates leader election as a core sub-protocol
dist-06-ordering — Raft log implements total order broadcast
dist-10-byzantine — Raft assumes crash faults; BFT requires stronger protocols
ds-11-distributed-locks — Lock services like etcd use Raft internally
alg-12-bfs