Distributed Systems
Gossip Protocols
Цели урока
- Understand gossip mechanics: exponential spread in O(log N) rounds
- Distinguish push, pull, and push-pull modes and know when to apply each
- Explain SWIM protocol and why indirect ping via helpers is needed
- Understand the role of Merkle Tree in anti-entropy replica synchronization
Предварительные знания
- Replication and eventual consistency (ds-05-replication)
- Failure detection and heartbeat patterns
- Basic data structures: hash map, binary tree
A Cassandra cluster of 1000 nodes - every 7 seconds the entire cluster knows about any state change. No coordinator. No broadcast. Just gossip.
- **Cassandra** - gossip every 1000 ms for membership and schema propagation, SWIM for failure detection
- **Consul (HashiCorp)** - SWIM-based membership, service discovery across clusters of thousands of services
- **Redis Cluster** - gossip for shard state propagation and failure detection
- **CockroachDB** - gossip for liveness detection and replication metadata
- **Amazon DynamoDB (2007)** - Merkle Tree anti-entropy for repairing replicas after partitions
Gossip protocols: from biology to distributed systems
In 1987, Alan Demers and his team at Xerox PARC published "Epidemic algorithms for replicated database maintenance" - a paper that brought the epidemic spread model into computer science. The idea: if a virus infects each person who infects k random others, all N people are infected in log(N/log(N)) steps. The same math applies to data. The protocols in Cassandra and Consul today are direct descendants of that paper.
Gossip mechanics: epidemic spread in O(log N) rounds
**Cassandra cluster, 6 nodes, production 2019. Two nodes failed overnight. By morning the entire cluster knows - no broadcast was sent, no central coordinator exists, no shared state.** This is gossip protocol: every 1000 ms each node picks 3 random peers and exchanges state. In log2(N) rounds the information covers the entire cluster. This is a base pattern of eventual consistency.
**Propagation math:** with fanout=3 and N=1000 nodes, log3(1000) ≈ 7 rounds are enough for full coverage. Each round: 3^k nodes reached. At k=7 that is 2187 - more than 1000, so 100% coverage. Time: 7 seconds at 1-second interval.
| Property | Value | Why it matters |
|---|---|---|
| Complexity | O(log N) rounds | 1000 nodes = 7 rounds; 1 000 000 = 20 rounds |
| Fault tolerance | Works under p% failures | No single point of failure, no coordinator |
| Scalability | Constant load per node | Each node makes exactly fanout requests regardless of N |
| Eventual consistency | Seconds, not milliseconds | Good for membership, not for transactions |
Gossip is broadcast: one source sends data to everyone
Gossip is decentralized: each node forwards to random peers. There is no single source after round 1.
In broadcast the source creates O(N) connections - a bottleneck at N=1000. In gossip each node makes exactly fanout (3-5) requests regardless of cluster size. Load is evenly distributed.
Cluster of 1000 nodes, gossip interval 1 sec, fanout 3. How long does it take to propagate an update?
Push, Pull, Push-Pull: three exchange modes
**The exchange mode determines who initiates data transfer.** This is not a stylistic choice: push converges slower, pull overloads when versions differ widely, push-pull is optimal for most production systems.
| Mode | How it works | When to use |
|---|---|---|
| Push | Node A sends its data to node B | Fast initial propagation of new data |
| Pull | Node A requests data from node B | Discovering missed updates |
| Push-Pull | A sends digest, B responds with diff, both sync | Production: minimal traffic + full sync |
**Why send a digest, not the data itself?** A node can store gigabytes. On each gossip round only a digest is sent: a map of keys and version numbers. That is kilobytes instead of gigabytes. Full data is transferred only on a specific request after version comparison.
Node A starts a gossip round with node B. A sends digest: {user:5, config:3}. B replies: I have {user:4, config:5}. What happens next?
SWIM: failure detection via gossip
**HashiCorp Consul, 2014. SWIM protocol is added for membership (see service discovery). Result: failure detection time drops from 30 seconds (heartbeat timeout) to 5 seconds - with zero load on any coordinator.** SWIM (Scalable Weakly-consistent Infection-style Membership) solves the false positive problem: a slow node is not a dead node.
**Why indirect ping?** Network partitions can be partial: A cannot reach B but C can reach B fine. If only A makes the verdict - false positive is likely. Indirect ping through 3 different helpers reduces error probability to statistically negligible.
| System | SWIM / gossip usage |
|---|---|
| Cassandra | Membership + schema changes, gossip every 1000 ms |
| Consul (HashiCorp) | SWIM-based membership, service discovery at thousands of services |
| Redis Cluster | Node state propagation across shards |
| CockroachDB | Liveness detection and replication metadata |
| HashiCorp Serf | Membership protocol, the foundation of Consul |
If a node does not respond to ping it should immediately be marked dead
A slow response, network partition, GC pause all cause timeout. Indirect ping distinguishes a temporary hiccup from a real failure.
JVM GC stop-the-world can last 500+ ms. PostgreSQL autovacuum sometimes freezes IO. In both cases direct ping gives timeout while indirect ping through helpers shows the node is alive. SWIM was specifically designed to minimize false positives in such scenarios.
Why does SWIM use indirect ping via helpers instead of simply retrying the direct ping?
Anti-Entropy: Merkle Trees for replica synchronization
**Amazon DynamoDB 2007, the Dynamo whitepaper. Problem: after a network partition 2 replicas diverge. How to find differences between 10 million keys without comparing all of them?** Anti-entropy via Merkle Tree: compare hashes of subtrees, not the data itself. O(log N) comparisons instead of O(N). Dynamo still uses this approach.
**Merkle Tree** is a binary hash tree. Leaves = hashes of individual keys. Parents = hash of concatenated child hashes. Root = fingerprint of the entire dataset. If roots match - data is identical. If not - recursively descend the branch with a different hash until the specific keys are found.
| Approach | Diff search complexity | Traffic at N=1M keys |
|---|---|---|
| Full comparison | O(N) | Transfer all 1M keys |
| Merkle Tree | O(log N + diff) | log2(1M) = 20 requests, then only the diff |
| Bloom filter | O(N) false positives | More compact but with errors |
**Anti-entropy via gossip works for eventual consistency, not strong consistency.** If the system requires linearizable reads (etcd, Spanner) - gossip is too slow: data can diverge for seconds. Gossip is ideal for membership, config propagation, CRDT-based data - where eventual consistency is sufficient.
Gossip protocols are suitable for any data in a distributed system
Gossip is an eventual consistency tool. Financial transactions, leader elections, cluster configuration require consensus protocols (Raft, Paxos).
Gossip guarantees that data will eventually propagate - but not when. Cassandra uses gossip for membership (who is alive) and schema changes. For writes with consistency guarantees - quorum writes with W+R>N (see replication), not gossip.
After a partition, 2 Cassandra replicas diverge on 500 keys out of 10 million. How does Merkle Tree help during anti-entropy sync?
Вопросы для размышления
- Gossip provides eventual consistency with a delay of seconds. For which data in a production system is that sufficient, and for which is it categorically not? What criterion allows making that determination?
Связанные уроки
- ds-10-crdts — Gossip + CRDT gives coordination-free eventual consistency
- ds-12-service-discovery — Consul SWIM protocol uses gossip for membership
- dist-12-consistency — Gossip achieves only eventual consistency
- alg-12-bfs