Distributed Systems
Leader Election
Цели урока
- Understand why a cluster needs a single leader and what problems it solves
- Know the mechanics of the Bully Algorithm and its limitations
- Explain lease-based election and the role of the Fencing Token
- Understand what split-brain is and how Quorum and Fencing Token prevent it
Предварительные знания
- Distributed systems concepts: nodes, failures, partition
- Basic understanding of consensus and replication
- CAP theorem and the CP vs AP trade-off
Kubernetes manages millions of containers - and surviving the death of a master node in 30 seconds is made possible by leader election in etcd. Without this algorithm, failover would require hours of manual work. See consensus and service discovery.
- **etcd (Kubernetes)** - Raft leader election: etcd master crashes, a new one is elected in 150-300ms, cluster keeps running
- **Kafka 4.0 (March 2025)** - ZooKeeper fully removed, KRaft is the default for all clusters; migration goes through the Kafka 3.9 bridge release
- **Patroni (PostgreSQL HA)** - lease in etcd/Consul: automatic PostgreSQL primary failover in 10-30 seconds
- **MongoDB replica set** - elections on primary failure: driver automatically switches to the new primary
- **GitHub 2013** - split-brain in PostgreSQL cluster: 6 hours of data loss due to incorrect Pacemaker configuration
Garcia-Molina and the Bully Algorithm (1982)
Hector Garcia-Molina (Stanford) published the Bully Algorithm in 1982 in the paper 'Elections in a Distributed Computing System'. The idea is elegantly simple: the highest ID always wins, so no voting is needed - only announcement. The algorithm retains educational value to this day, although production systems moved to Raft and lease-based election long ago. Garcia-Molina is also known for creating WebBase - one of the first large web crawlers, a predecessor to Google.
Why a Cluster Needs a Single Leader
**2020. A Kubernetes cluster at Netflix is running 200,000 pods. The etcd master crashes. Within 30 seconds, a new master is coordinating the cluster - with zero manual intervention.** This is not magic - it is Raft leader election built into etcd. Without a leader election algorithm, a cluster failure turns into anarchy: every node makes conflicting decisions.
In a distributed system without a leader, two nodes may concurrently update the same record to different values. Who is right? Without a coordinating node, nobody knows. The leader solves this: all writes go through it, and it defines the order.
- **Consensus without a leader** - O(n^2) messages per decision. With a leader - O(n).
- **Concurrent operations** - the leader serializes them into a single linearizable order.
- **Cluster metadata** - who is alive, who crashed, how to redistribute data - all through the leader.
| System | What the Leader Does | Election Algorithm |
|---|---|---|
| etcd | Accepts all writes, replicates to followers | Raft |
| Kafka 4.0+ | Partition leader accepts producer writes | KRaft (Raft; ZooKeeper removed in 2025) |
| PostgreSQL (Patroni) | Primary accepts write requests | etcd/Consul lease |
| Redis Sentinel | Master accepts writes, sentinel votes for failover | Quorum vote |
| Elasticsearch | Master-eligible node manages shards and mappings | Zen discovery / Raft |
**Formal requirements for a leader election algorithm:** Safety - at most one leader at any time (otherwise split-brain). Liveness - the system will eventually elect a leader (otherwise the cluster is stuck). Fairness (optional) - any node can win, not just a predetermined favorite.
A leader is only needed for performance - to avoid a write bottleneck
A leader is needed primarily for correctness - serializing concurrent operations and maintaining consistent cluster state
Without a leader, two nodes can simultaneously decide to reassign the same shard to different servers. The result is data loss. Correctness comes first, performance is secondary.
Why is the Safety guarantee (at most one leader) critical in leader election?
Bully Algorithm: The Simplest Election
The Bully Algorithm (Garcia-Molina, Harvard, 1982) elects the node with the highest ID. The name reflects the mechanics - the highest-ranked node 'bullies' the others. It is used in MongoDB replica set elections as one factor, and in early versions of Hadoop YARN.
Election Protocol
| Message | Sender | Meaning |
|---|---|---|
| ELECTION | Node with lower ID | Starting election - is anyone higher alive? |
| OK | Node with higher ID | Alive, joining the election - the lower ID is overruled |
| COORDINATOR | Winner | I am the new leader - take note |
**Bully complexity:** O(n^2) messages in the worst case - every node sends ELECTION to all higher nodes. With 1000 nodes that is one million messages. This is why production systems use Raft or lease-based election instead of Bully.
A cluster has 5 nodes with IDs 1-5. The leader (ID=5) crashes. Node 2 detects this first. Who becomes the leader under the Bully Algorithm?
Lease-based Election: The Production Approach
**etcd (Kubernetes), Consul (HashiCorp), ZooKeeper (HBase, Kafka before 4.0)** - all use lease-based election. The idea: leadership is a distributed lock with a TTL. The first node to acquire the lock becomes the leader. When the TTL expires without renewal, leadership is automatically released. Kubernetes has been built on this mechanism since 2016.
**Fencing Token** - a monotonically increasing number embedded in the lease. The storage layer rejects writes with a token lower than the last seen value. This protects against zombie leaders: a crashed leader that recovers and still believes it holds the lease will have its writes rejected, because the new leader already issued a higher token.
If the leader hangs for 5 seconds, the cluster has no leader for exactly those 5 seconds
With a correctly tuned TTL and renewal interval, a new leader is elected before the old lease expires
Typical settings: TTL=10s, renewal every 3.3s. The leader misses 3 renewals (10s) and the lease expires. A new candidate acquires the lease in ~100ms. Total downtime is around 10 seconds, not 5 minutes of manual intervention.
Why does lease-based election use a Fencing Token on writes?
Split-Brain: When Two Leaders Appear
**2013, GitHub. A PostgreSQL cluster managed by Pacemaker. The network splits into two segments. Pacemaker in each segment decides the other primary is dead - and promotes its own. Two primaries start accepting writes. 6 hours of data loss before detection.** This is split-brain - when two nodes simultaneously believe themselves to be the leader.
| Cause of Split-Brain | Scenario | Consequence |
|---|---|---|
| Network partition | Datacenter A and B lose connectivity | Each DC elects its own leader |
| GC pause | Leader freezes for 15s (stop-the-world GC) | New leader elected, old one wakes up - two leaders |
| Clock skew | NTP desynchronizes cluster clocks | Lease timeout fires incorrectly |
| Incorrect TTL | TTL too short for a loaded node | False failovers while leader is healthy |
Protection 1: Quorum
The leader can write only after receiving acknowledgment from a quorum (majority) of nodes. During a network partition, the smaller segment lacks quorum and cannot elect a leader. Raft and Paxos are built on this principle: with 5 nodes, 3 confirmations are required. If the cluster splits 2+3, only the side with 3 nodes can elect a leader.
Protection 2: Fencing Token in Storage
Quorum protects against split-brain during elections. Fencing Token protects against a zombie leader that is no longer leader but still attempts to write. Together they provide robust protection: even if the election algorithm makes a mistake, the storage layer will reject the stale write.
Split-brain can be prevented simply by detecting leader failure faster
Split-brain is a consequence of network partition, not slow detection. Quorum is the only reliable defense
Fast detection speeds up the election of a new leader, but does not solve the problem of two network segments. If both segments detect the problem quickly, they will both quickly elect their own leaders. Only quorum guarantees that only the majority side can elect a leader.
A 5-node cluster loses network connectivity and splits into groups of 2 and 3. How does quorum protect against split-brain?
Key Takeaways
- **A single leader** is needed to serialize operations: without it, consistent state in a distributed system is impossible
- **Bully Algorithm** - elects by maximum ID; simple, but generates O(n²) messages and is not split-brain tolerant
- **Lease-based Election** - leader holds a lease with TTL and renews periodically; missed renewal triggers failover
- **Split-Brain** - two nodes believe they are leader; resolved via fencing token, STONITH, or quorum
- **Lease TTL** - balance: too short = frequent false failovers, too long = slow recovery
- **Production approach** (etcd, ZooKeeper, Consul) - uses Raft/Paxos on top of leases for reliable guarantees
Вопросы для размышления
- Choosing a TTL for a lease is a non-trivial problem. A TTL that is too short causes frequent false failovers (the leader is alive but failed to renew in time). A TTL that is too long causes long downtime on a real failure. How would one determine the optimal TTL for a specific cluster?
Связанные уроки
- ds-03-consensus — Leader election is a special case of consensus
- ds-02-cap-theorem — CAP constraints apply to election protocols
- dist-08-paxos — Paxos uses leader election as a sub-protocol
- dist-09-raft — Raft integrates leader election with log replication
- dist-10-byzantine — Byzantine election must handle lying nodes
- alg-12-bfs