Distributed Systems

Consistent Hashing

Цели урока

  • Explain why hash % N destroys the cache on any cluster size change
  • Describe the hash-ring mechanism and O(log N) key lookup
  • Understand the role of virtual nodes in balancing load across heterogeneous servers
  • Know how replication layers on top of consistent hashing
  • Distinguish when to apply Consistent Hashing, Jump Hash, or Maglev

Предварительные знания

  • Hash functions: hashing, collisions, uniform distribution
  • Basic data structures: TreeMap / SortedSet, ceiling operation
  • Cluster basics: nodes, replication, fault tolerance

Adding one server to a 99-node cache cluster invalidates 99 percent of the cache - if the routing rule is hash mod N. Karger's 1997 fix was disarmingly simple: put the keys and the servers on the same circle, route each key clockwise to the nearest server, and only 1/N of the keys ever move. That single idea now runs Cassandra, DynamoDB, Discord's chat, and every CDN.

  • **Cassandra**: 256 vnodes per node by default; adding a node moves ~1/N of the data in parallel from every existing node.
  • **Discord**: 4 billion messages per day routed through a consistent-hashing cluster of Cassandra and ScyllaDB nodes; node replacements are non-events for users.
  • **Akamai CDN**: the original consistent-hashing customer in 1998 - the same algorithm now routes traffic across hundreds of thousands of edge servers worldwide.

Historical context

In 1997, David Karger and colleagues at MIT published 'Consistent Hashing and Random Trees' at STOC. The paper arose from a practical crisis: Akamai was building one of the first large-scale CDNs and modulo hashing was unusable - every server addition required rehashing the entire cache. Karger's insight was to map both servers and keys onto a shared integer ring so that adding a server moves only the keys between it and its predecessor. Amazon adopted this for Dynamo in 2007; Cassandra, Discord, and Redis Cluster use variants today.

The problem: hash % N breaks the cache

The naive approach to distributing keys across N servers is `hash(key) % N`. This works fine until the cluster size changes. Adding or removing even one server changes N, and nearly all keys map to different servers. For a cache cluster, this means a mass cache miss storm - every evicted key hits the database at once.

**Real cost of modulo remapping**: Discord switched from a Redis cluster with modulo routing to consistent hashing after a scale-out event caused a cascade of cache misses that propagated into database overload. Cassandra, DynamoDB, and Redis Cluster all use consistent hashing specifically to avoid this failure mode.

hash % N is a fine default; the cache just reloads for a few seconds.

In a production cluster of 100+ nodes this becomes minutes of degradation with 10x load on the database on every cluster change.

A node crash is also a change in N. When one server out of 100 dies, all 100 nodes start receiving remapped keys at the same time. It happens at the worst possible moment, when the system is already under stress from the failure.

A cache cluster of 99 servers is expanded to 100. How many keys change their assigned server when using hash % N?

Hash ring: keys go to the nearest node

Consistent hashing maps both servers and keys onto a circular space of integers (the 'ring'). Each key is assigned to the first server clockwise from its position. When a server is added, only the keys between it and its predecessor move. When a server is removed, only its keys move - to the next server clockwise. On average, only `1/N` of all keys move for any single cluster change.

**MurmurHash3** is the hash function of choice for consistent hashing implementations (Cassandra, Guava ConsistentHashFunction). It distributes keys uniformly across the ring and is fast - no cryptographic overhead. MD5 or SHA-1 would also distribute well but are slower than necessary for non-security purposes.

A cluster has 5 nodes on a hash ring. One node fails. Which keys are affected?

Virtual nodes: balance without luck

With a small number of physical nodes on the ring, random placement leads to **uneven load distribution** - one node might own 40% of keys while another owns 10%. Virtual nodes (vnodes) solve this: each physical node occupies **multiple** positions on the ring. With 150-256 vnodes per server, load variance drops from ~40% to ~2-3%. Vnodes also enable **weighted distribution** - a server with twice the RAM gets twice the vnodes.

**Cassandra uses 256 vnodes per node** by default (`num_tokens=256`). This means a 3-node cluster has 768 points on the ring. When a 4th node joins, it claims 256 positions spread evenly across the ring, so each existing node donates roughly 1/4 of its vnodes - and with them, 1/4 of its data. Migration is parallel and balanced across all donors.

More vnodes is always better; dial it up to 1000+.

256 vnodes is the practical ceiling. At 1000 vnodes across 100 nodes the ring holds 100,000 points, lookup slows down, and memory overhead grows without a meaningful improvement in balance.

Imbalance decays as ~1/sqrt(vnodes). At 100 vnodes it is already ~2%. At 1000 it drops to ~0.6%. The gain is negligible, the cost is 10x more memory and slower ring operations on every join, leave, and gossip exchange.

Why use 150-256 vnodes per node instead of 1 position, even though lookup becomes slightly slower?

Replication on the ring and alternatives

Consistent hashing also drives **replication**: Cassandra stores each key on the next `RF` (replication factor) distinct nodes clockwise from the key's position. With RF=3, losing one node means the data survives on 2 remaining replicas. The coordinator reads or writes to a quorum of replicas (`QUORUM = RF/2 + 1 = 2`) for strong consistency.

**Alternatives to consistent hashing**: Rendezvous hashing (highest random weight) achieves the same 1/N movement guarantee with simpler code and no ring data structure - each client independently computes `HRW(key, server_i)` and picks the maximum. Redis Cluster uses a different approach: 16384 fixed hash slots assigned to nodes, with explicit slot maps. Simpler to reason about but requires coordination to move slots.

With replication factor=3 on a ring of 10 nodes, one node fails. What happens to the data it held?

Key ideas

  • **Modulo routing fails on resize**: changing N reshuffles ~(N-1)/N of all keys; one new server invalidates almost the entire cache.
  • **Hash ring**: map keys and servers to a 32-bit integer space; each key goes to the next server clockwise. Adding a server moves only its slice (~1/N).
  • **Virtual nodes**: 150-256 vnodes per server cuts load imbalance from ~40% to ~2-3% and supports weighted distribution for heterogeneous hardware.
  • **Replication**: store each key on the next RF distinct nodes clockwise. A node failure leaves RF-1 replicas; the cluster repairs in the background.
  • **Alternatives**: Rendezvous Hashing (HRW) gives the same guarantees with less code; Redis Cluster uses 16384 fixed hash slots.

Вопросы для размышления

  • Cassandra defaults to 256 vnodes; some operators reduce this to 16. What problems arise from very large vnode counts (gossip overhead, repair time) that argue against the default?
  • Rendezvous Hashing requires no shared ring data structure - each client computes the placement independently. Why has consistent hashing won mindshare despite Rendezvous being algorithmically simpler?
  • Redis Cluster uses 16384 fixed slots rather than a hash ring. What operational properties does this fixed-slot approach trade for the dynamic flexibility of pure consistent hashing?

Связанные уроки

  • sd-04-database — Cassandra and DynamoDB use consistent hashing for partition routing across nodes
  • ds-04-clocks — Cassandra combines consistent hashing with Lamport timestamps for LWW conflict resolution
  • cloud-04 — EC2 Auto Scaling changes the node count - consistent hashing minimizes cache invalidation during scale-out
  • devops-04 — Container orchestration uses consistent hashing to route service requests across replica pods
  • se-04 — The hash ring is Open/Closed: adding or removing nodes without touching the routing logic
  • ds-01-intro
  • ds-05-replication
  • ds-09-gossip-protocols
  • opt-04
  • db-23-sharding
Consistent Hashing

0

1

Sign In