Distributed Systems
Gossip Protocols
Цели урока
- Explain the epidemic propagation principle and O(log N) gossip convergence
- Distinguish Push, Pull, and Push-Pull modes and their trade-offs
- Understand the SWIM failure detection algorithm with indirect pinging
- Know the role of anti-entropy and Merkle trees for long-term replica consistency
Предварительные знания
- Concept of eventual consistency and replication
- Basic familiarity with the CAP theorem
- Understanding of hash functions
A Cassandra cluster of 100 nodes: a table schema change propagates to all nodes in 30 seconds with no coordinator and no broadcast storm. Epidemic mathematics in production.
- **Cassandra** - gossip for membership and schema changes, industry standard
- **Consul** - SWIM on top of gossip for service discovery at millions of services
- **Redis Cluster** - gossip for propagating topology changes between nodes
- **HashiCorp Serf** - gossip as a standalone embedded membership library
- **CockroachDB** - liveness detection via SWIM for distributed SQL
Xerox PARC, 1987
Gossip protocols were described by Alan Demers at Xerox PARC in 1987 in the paper "Epidemic Algorithms for Replicated Database Maintenance". The authors noticed that propagating updates in databases is mathematically equivalent to epidemic spreading in a population - and applied the SIR model (susceptible-infected-recovered) to analyze convergence. SWIM appeared in 2002 at Cornell as a practical production-ready implementation.
Epidemic Information Spreading
**Cassandra, production cluster, 100 nodes. An engineer applies a schema change to one table. Within 30 seconds all 100 nodes know about it - no central server, no broadcast storm, no coordinator.** This is gossip: every second each node picks 3 random peers and exchanges state. The mathematics of epidemics just works.
Gossip convergence speed: with fanout=3 and N nodes, information reaches everyone in O(log N) rounds. At 1,000 nodes - about 10 rounds = 10 seconds. At 1,000,000 nodes - about 20 rounds. This is exponential decay of ignorance.
| Propagation approach | Load per node | Fault tolerance | Examples |
|---|---|---|---|
| Broadcast (all at once) | O(N) messages | Low - single point of failure | UDP multicast, early systems |
| Master-slave replication | O(slaves) on master | Medium - master is a bottleneck | MySQL replication, Redis Sentinel |
| Gossip (epidemic) | O(fanout) per node | High - no coordinator | Cassandra, Consul, Redis Cluster |
Gossip is unreliable - some nodes may never receive the information
The probability of non-delivery decays exponentially. With fanout=3 and k=O(log N) rounds, the chance that a node did NOT receive the data approaches zero.
Each round a node independently picks random peers. Even if a node missed a message in round k, it will receive it from one of its three new random peers in round k+1. The math: P(did not receive) = (1 - 1/N)^(fanout*rounds) -> 0.
A gossip cluster of 1024 nodes with fanout=3. Roughly how many rounds does it take for news to reach everyone?
Push, Pull, and Push-Pull Modes
Gossip exchange comes in three flavors: Push (send own data), Pull (request from a peer), Push-Pull (bidirectional exchange). Cassandra uses Push-Pull - the most efficient mode for fast convergence.
| Mode | Messages per round | Convergence | When to use |
|---|---|---|---|
| Push | 1 (send digest) | Slower | Broadcasting event notifications |
| Pull | 2 (request + response) | Medium | Periodic state synchronization |
| Push-Pull | 3-4 (digest + bidirectional diff) | Fastest | Membership state, schema changes in Cassandra |
A digest is a compact summary of state: only keys and version numbers, no actual data. It typically fits in a few kilobytes even for thousands of keys. Full data is requested only when versions diverge.
Why does gossip send a digest (keys + versions) rather than the full data?
SWIM: Failure Detection
**SWIM** (Scalable Weakly-consistent Infection-style Membership, 2002) is the standard failure detection algorithm for gossip clusters. Used in Cassandra, Consul, and HashiCorp Serf. Key property: avoids false positives through indirect pinging.
The indirect ping solves the false positive problem: if there is a temporary network split between detector and target but the target is alive, one of the 3 peers will reach it. This dramatically reduces the false positive rate without increasing detection time for real failures.
If a node does not reply to a ping it can be declared dead immediately
An indirect ping through peers is required first; otherwise temporary network partitions cause false positives.
In real networks, brief connectivity loss between specific node pairs is common (BGP flap, switch overload). If the detector immediately declares the target dead, the cluster kicks off unnecessary rebalance operations. SWIM waits for confirmation from several independent paths before drawing conclusions.
Why does SWIM use indirect pinging through peers instead of simply retrying the direct ping?
Anti-Entropy and Merkle Trees
Gossip propagates new data quickly, but does not guarantee that replicas stay identical over time - hardware failures, GC pauses, and network partitions can create divergence. Anti-entropy is a background process for replica synchronization that periodically compares replicas and synchronizes the differences.
A Merkle tree is a hash tree: each leaf node holds the hash of a key's value; each intermediate node holds the hash of its children. Comparing two replicas requires O(log N) comparisons: descend the tree, pruning branches whose hashes match. Cassandra uses Merkle trees for repair operations.
| System | Gossip usage |
|---|---|
| Cassandra | Membership, schema changes, repair coordination |
| Consul | Service discovery, health checks, KV store |
| Redis Cluster | Node state, cluster topology propagation |
| CockroachDB | Liveness detection, range lease negotiation |
| HashiCorp Serf | Membership protocol as a standalone library |
Gossip protocols guarantee consistency like Paxos or Raft, only through a distributed algorithm
Gossip provides eventual consistency and probabilistic delivery, not consensus: there is no linearizability, no total order, only probabilistic convergence in O(log N) rounds
Gossip is AP in CAP terms, not CP. No node knows for certain whether a message reached everyone (only with high probability). That makes gossip an excellent choice for membership, failure detection, and metrics - and unsuitable for primitives like transactions or distributed locks. Cassandra uses Paxos for lightweight transactions on top of gossip membership exactly for this reason.
Why does anti-entropy use a Merkle tree instead of comparing all keys directly?
Connection to previous lessons
Consistent hashing and sharding say nothing about how nodes learn current membership. Gossip provides epidemic convergence in O(log N) without a central coordinator.
- Consistent hashing — defines ring-based data placement, not the source of cluster-membership knowledge
- Sharding — assumed a routing table exists somewhere and stays current
- Strongly-consistent registry — ZooKeeper-style alternative to gossip - requires centralisation and coordination
Summary
- Gossip protocols spread information through random peer-to-peer exchange: each round a node picks k random neighbors and exchanges state; convergence to the full cluster takes O(log N) rounds
- Push/pull/push-pull variants offer different trade-offs: pull is efficient under rare updates, push under frequent ones, push-pull combines both and is the default in most production systems
- SWIM (2002) augments gossip with explicit failure detection via ping/indirect-ping and a suspicion timer, separating transient network glitches from real failures
- Anti-entropy via Merkle trees lets two nodes locate data divergences using O(log N) hashes instead of shipping the whole dataset - used in Cassandra and Dynamo for reconciliation
- Gossip provides eventual consistency and probabilistic delivery, not consensus: it fits membership, metrics, and schema gossip, but not transactions or linearizable operations
Вопросы для размышления
- In a gossip cluster, some nodes deliberately falsify information about other nodes (Byzantine failure). How could the protocol be strengthened to detect such nodes?
Связанные уроки
- ds-09-gossip-protocols — Basic gossip; this lesson covers SWIM and anti-entropy
- dist-11-replication — Anti-entropy syncs replicas via Merkle trees
- ds-05-replication — Eventual repair of replicas runs over gossip
- dist-12-consistency — Gossip delivers eventual consistency of state
- ds-10-crdts — CRDTs frequently sync over gossip channels
- dist-10-byzantine — Byzantine-resilient gossip variants exist
- alg-12-bfs