Distributed Systems

Leader Election

Цели урока

Understand why a cluster needs a single leader and what problems it solves
Know the mechanics of the Bully Algorithm and its limitations
Explain lease-based election and the role of the Fencing Token
Understand what split-brain is and how Quorum and Fencing Token prevent it

Предварительные знания

Distributed systems concepts: nodes, failures, partition
Basic understanding of consensus and replication
CAP theorem and the CP vs AP trade-off

Kubernetes manages millions of containers - and surviving the death of a master node in 30 seconds is made possible by leader election in etcd. Without this algorithm, failover would require hours of manual work. See consensus and service discovery.

**etcd (Kubernetes)** - Raft leader election: etcd master crashes, a new one is elected in 150-300ms, cluster keeps running
**Kafka 4.0 (March 2025)** - ZooKeeper fully removed, KRaft is the default for all clusters; migration goes through the Kafka 3.9 bridge release
**Patroni (PostgreSQL HA)** - lease in etcd/Consul: automatic PostgreSQL primary failover in 10-30 seconds
**MongoDB replica set** - elections on primary failure: driver automatically switches to the new primary
**GitHub 2013** - split-brain in PostgreSQL cluster: 6 hours of data loss due to incorrect Pacemaker configuration

Garcia-Molina and the Bully Algorithm (1982)

Hector Garcia-Molina (Stanford) published the Bully Algorithm in 1982 in the paper 'Elections in a Distributed Computing System'. The idea is elegantly simple: the highest ID always wins, so no voting is needed - only announcement. The algorithm retains educational value to this day, although production systems moved to Raft and lease-based election long ago. Garcia-Molina is also known for creating WebBase - one of the first large web crawlers, a predecessor to Google.

Why a Cluster Needs a Single Leader

**2020. A Kubernetes cluster at Netflix is running 200,000 pods. The etcd master crashes. Within 30 seconds, a new master is coordinating the cluster - with zero manual intervention.** This is not magic - it is Raft leader election built into etcd. Without a leader election algorithm, a cluster failure turns into anarchy: every node makes conflicting decisions.

In a distributed system without a leader, two nodes may concurrently update the same record to different values. Who is right? Without a coordinating node, nobody knows. The leader solves this: all writes go through it, and it defines the order.

**Consensus without a leader** - O(n^2) messages per decision. With a leader - O(n).
**Concurrent operations** - the leader serializes them into a single linearizable order.
**Cluster metadata** - who is alive, who crashed, how to redistribute data - all through the leader.

System	What the Leader Does	Election Algorithm
etcd	Accepts all writes, replicates to followers	Raft
Kafka 4.0+	Partition leader accepts producer writes	KRaft (Raft; ZooKeeper removed in 2025)
PostgreSQL (Patroni)	Primary accepts write requests	etcd/Consul lease
Redis Sentinel	Master accepts writes, sentinel votes for failover	Quorum vote
Elasticsearch	Master-eligible node manages shards and mappings	Zen discovery / Raft

**Formal requirements for a leader election algorithm:** Safety - at most one leader at any time (otherwise split-brain). Liveness - the system will eventually elect a leader (otherwise the cluster is stuck). Fairness (optional) - any node can win, not just a predetermined favorite.

A leader is only needed for performance - to avoid a write bottleneck

A leader is needed primarily for correctness - serializing concurrent operations and maintaining consistent cluster state

Without a leader, two nodes can simultaneously decide to reassign the same shard to different servers. The result is data loss. Correctness comes first, performance is secondary.

Why is the Safety guarantee (at most one leader) critical in leader election?

Bully Algorithm: The Simplest Election

The Bully Algorithm (Garcia-Molina, Harvard, 1982) elects the node with the highest ID. The name reflects the mechanics - the highest-ranked node 'bullies' the others. It is used in MongoDB replica set elections as one factor, and in early versions of Hadoop YARN.

Election Protocol

Message	Sender	Meaning
ELECTION	Node with lower ID	Starting election - is anyone higher alive?
OK	Node with higher ID	Alive, joining the election - the lower ID is overruled
COORDINATOR	Winner	I am the new leader - take note

**Bully complexity:** O(n^2) messages in the worst case - every node sends ELECTION to all higher nodes. With 1000 nodes that is one million messages. This is why production systems use Raft or lease-based election instead of Bully.

A cluster has 5 nodes with IDs 1-5. The leader (ID=5) crashes. Node 2 detects this first. Who becomes the leader under the Bully Algorithm?

Lease-based Election: The Production Approach

**etcd (Kubernetes), Consul (HashiCorp), ZooKeeper (HBase, Kafka before 4.0)** - all use lease-based election. The idea: leadership is a distributed lock with a TTL. The first node to acquire the lock becomes the leader. When the TTL expires without renewal, leadership is automatically released. Kubernetes has been built on this mechanism since 2016.

**Fencing Token** - a monotonically increasing number embedded in the lease. The storage layer rejects writes with a token lower than the last seen value. This protects against zombie leaders: a crashed leader that recovers and still believes it holds the lease will have its writes rejected, because the new leader already issued a higher token.

If the leader hangs for 5 seconds, the cluster has no leader for exactly those 5 seconds

With a correctly tuned TTL and renewal interval, a new leader is elected before the old lease expires

Typical settings: TTL=10s, renewal every 3.3s. The leader misses 3 renewals (10s) and the lease expires. A new candidate acquires the lease in ~100ms. Total downtime is around 10 seconds, not 5 minutes of manual intervention.

Why does lease-based election use a Fencing Token on writes?

Split-Brain: When Two Leaders Appear

**2013, GitHub. A PostgreSQL cluster managed by Pacemaker. The network splits into two segments. Pacemaker in each segment decides the other primary is dead - and promotes its own. Two primaries start accepting writes. 6 hours of data loss before detection.** This is split-brain - when two nodes simultaneously believe themselves to be the leader.

Cause of Split-Brain	Scenario	Consequence
Network partition	Datacenter A and B lose connectivity	Each DC elects its own leader
GC pause	Leader freezes for 15s (stop-the-world GC)	New leader elected, old one wakes up - two leaders
Clock skew	NTP desynchronizes cluster clocks	Lease timeout fires incorrectly
Incorrect TTL	TTL too short for a loaded node	False failovers while leader is healthy

Protection 1: Quorum

The leader can write only after receiving acknowledgment from a quorum (majority) of nodes. During a network partition, the smaller segment lacks quorum and cannot elect a leader. Raft and Paxos are built on this principle: with 5 nodes, 3 confirmations are required. If the cluster splits 2+3, only the side with 3 nodes can elect a leader.

Protection 2: Fencing Token in Storage

Quorum protects against split-brain during elections. Fencing Token protects against a zombie leader that is no longer leader but still attempts to write. Together they provide robust protection: even if the election algorithm makes a mistake, the storage layer will reject the stale write.

Split-brain can be prevented simply by detecting leader failure faster

Split-brain is a consequence of network partition, not slow detection. Quorum is the only reliable defense

Fast detection speeds up the election of a new leader, but does not solve the problem of two network segments. If both segments detect the problem quickly, they will both quickly elect their own leaders. Only quorum guarantees that only the majority side can elect a leader.

A 5-node cluster loses network connectivity and splits into groups of 2 and 3. How does quorum protect against split-brain?

Key Takeaways

**A single leader** is needed to serialize operations: without it, consistent state in a distributed system is impossible
**Bully Algorithm** - elects by maximum ID; simple, but generates O(n²) messages and is not split-brain tolerant
**Lease-based Election** - leader holds a lease with TTL and renews periodically; missed renewal triggers failover
**Split-Brain** - two nodes believe they are leader; resolved via fencing token, STONITH, or quorum
**Lease TTL** - balance: too short = frequent false failovers, too long = slow recovery
**Production approach** (etcd, ZooKeeper, Consul) - uses Raft/Paxos on top of leases for reliable guarantees

Вопросы для размышления

Choosing a TTL for a lease is a non-trivial problem. A TTL that is too short causes frequent false failovers (the leader is alive but failed to renew in time). A TTL that is too long causes long downtime on a real failure. How would one determine the optimal TTL for a specific cluster?

Связанные уроки

ds-03-consensus — Leader election is a special case of consensus
ds-02-cap-theorem — CAP constraints apply to election protocols
dist-08-paxos — Paxos uses leader election as a sub-protocol
dist-09-raft — Raft integrates leader election with log replication
dist-10-byzantine — Byzantine election must handle lying nodes
alg-12-bfs