Distributed Systems

Gossip Protocols

Цели урока

Understand gossip mechanics: exponential spread in O(log N) rounds
Distinguish push, pull, and push-pull modes and know when to apply each
Explain SWIM protocol and why indirect ping via helpers is needed
Understand the role of Merkle Tree in anti-entropy replica synchronization

Предварительные знания

Replication and eventual consistency (ds-05-replication)
Failure detection and heartbeat patterns
Basic data structures: hash map, binary tree

A Cassandra cluster of 1000 nodes - every 7 seconds the entire cluster knows about any state change. No coordinator. No broadcast. Just gossip.

**Cassandra** - gossip every 1000 ms for membership and schema propagation, SWIM for failure detection
**Consul (HashiCorp)** - SWIM-based membership, service discovery across clusters of thousands of services
**Redis Cluster** - gossip for shard state propagation and failure detection
**CockroachDB** - gossip for liveness detection and replication metadata
**Amazon DynamoDB (2007)** - Merkle Tree anti-entropy for repairing replicas after partitions

Gossip protocols: from biology to distributed systems

In 1987, Alan Demers and his team at Xerox PARC published "Epidemic algorithms for replicated database maintenance" - a paper that brought the epidemic spread model into computer science. The idea: if a virus infects each person who infects k random others, all N people are infected in log(N/log(N)) steps. The same math applies to data. The protocols in Cassandra and Consul today are direct descendants of that paper.

Gossip mechanics: epidemic spread in O(log N) rounds

**Cassandra cluster, 6 nodes, production 2019. Two nodes failed overnight. By morning the entire cluster knows - no broadcast was sent, no central coordinator exists, no shared state.** This is gossip protocol: every 1000 ms each node picks 3 random peers and exchanges state. In log2(N) rounds the information covers the entire cluster. This is a base pattern of eventual consistency.

**Propagation math:** with fanout=3 and N=1000 nodes, log3(1000) ≈ 7 rounds are enough for full coverage. Each round: 3^k nodes reached. At k=7 that is 2187 - more than 1000, so 100% coverage. Time: 7 seconds at 1-second interval.

Property	Value	Why it matters
Complexity	O(log N) rounds	1000 nodes = 7 rounds; 1 000 000 = 20 rounds
Fault tolerance	Works under p% failures	No single point of failure, no coordinator
Scalability	Constant load per node	Each node makes exactly fanout requests regardless of N
Eventual consistency	Seconds, not milliseconds	Good for membership, not for transactions

Gossip is broadcast: one source sends data to everyone

Gossip is decentralized: each node forwards to random peers. There is no single source after round 1.

In broadcast the source creates O(N) connections - a bottleneck at N=1000. In gossip each node makes exactly fanout (3-5) requests regardless of cluster size. Load is evenly distributed.

Cluster of 1000 nodes, gossip interval 1 sec, fanout 3. How long does it take to propagate an update?

Push, Pull, Push-Pull: three exchange modes

**The exchange mode determines who initiates data transfer.** This is not a stylistic choice: push converges slower, pull overloads when versions differ widely, push-pull is optimal for most production systems.

Mode	How it works	When to use
Push	Node A sends its data to node B	Fast initial propagation of new data
Pull	Node A requests data from node B	Discovering missed updates
Push-Pull	A sends digest, B responds with diff, both sync	Production: minimal traffic + full sync

**Why send a digest, not the data itself?** A node can store gigabytes. On each gossip round only a digest is sent: a map of keys and version numbers. That is kilobytes instead of gigabytes. Full data is transferred only on a specific request after version comparison.

Node A starts a gossip round with node B. A sends digest: {user:5, config:3}. B replies: I have {user:4, config:5}. What happens next?

SWIM: failure detection via gossip

**HashiCorp Consul, 2014. SWIM protocol is added for membership (see service discovery). Result: failure detection time drops from 30 seconds (heartbeat timeout) to 5 seconds - with zero load on any coordinator.** SWIM (Scalable Weakly-consistent Infection-style Membership) solves the false positive problem: a slow node is not a dead node.

**Why indirect ping?** Network partitions can be partial: A cannot reach B but C can reach B fine. If only A makes the verdict - false positive is likely. Indirect ping through 3 different helpers reduces error probability to statistically negligible.

System	SWIM / gossip usage
Cassandra	Membership + schema changes, gossip every 1000 ms
Consul (HashiCorp)	SWIM-based membership, service discovery at thousands of services
Redis Cluster	Node state propagation across shards
CockroachDB	Liveness detection and replication metadata
HashiCorp Serf	Membership protocol, the foundation of Consul

If a node does not respond to ping it should immediately be marked dead

A slow response, network partition, GC pause all cause timeout. Indirect ping distinguishes a temporary hiccup from a real failure.

JVM GC stop-the-world can last 500+ ms. PostgreSQL autovacuum sometimes freezes IO. In both cases direct ping gives timeout while indirect ping through helpers shows the node is alive. SWIM was specifically designed to minimize false positives in such scenarios.

Why does SWIM use indirect ping via helpers instead of simply retrying the direct ping?

Anti-Entropy: Merkle Trees for replica synchronization

**Amazon DynamoDB 2007, the Dynamo whitepaper. Problem: after a network partition 2 replicas diverge. How to find differences between 10 million keys without comparing all of them?** Anti-entropy via Merkle Tree: compare hashes of subtrees, not the data itself. O(log N) comparisons instead of O(N). Dynamo still uses this approach.

**Merkle Tree** is a binary hash tree. Leaves = hashes of individual keys. Parents = hash of concatenated child hashes. Root = fingerprint of the entire dataset. If roots match - data is identical. If not - recursively descend the branch with a different hash until the specific keys are found.

Approach	Diff search complexity	Traffic at N=1M keys
Full comparison	O(N)	Transfer all 1M keys
Merkle Tree	O(log N + diff)	log2(1M) = 20 requests, then only the diff
Bloom filter	O(N) false positives	More compact but with errors

**Anti-entropy via gossip works for eventual consistency, not strong consistency.** If the system requires linearizable reads (etcd, Spanner) - gossip is too slow: data can diverge for seconds. Gossip is ideal for membership, config propagation, CRDT-based data - where eventual consistency is sufficient.

Gossip protocols are suitable for any data in a distributed system

Gossip is an eventual consistency tool. Financial transactions, leader elections, cluster configuration require consensus protocols (Raft, Paxos).

Gossip guarantees that data will eventually propagate - but not when. Cassandra uses gossip for membership (who is alive) and schema changes. For writes with consistency guarantees - quorum writes with W+R>N (see replication), not gossip.

After a partition, 2 Cassandra replicas diverge on 500 keys out of 10 million. How does Merkle Tree help during anti-entropy sync?

Вопросы для размышления

Gossip provides eventual consistency with a delay of seconds. For which data in a production system is that sufficient, and for which is it categorically not? What criterion allows making that determination?

Связанные уроки

ds-10-crdts — Gossip + CRDT gives coordination-free eventual consistency
ds-12-service-discovery — Consul SWIM protocol uses gossip for membership
dist-12-consistency — Gossip achieves only eventual consistency
alg-12-bfs