Distributed Systems

Sharding

Цели урока

Explain why sharding exists and how it differs from replication
Compare hash, range, and directory strategies and name the trade-off of each
Choose the right sharding key for a given access pattern
Describe rebalancing strategies and why fixed partitions outperform naive hash % N

Предварительные знания

Understanding of replication (leader-follower, eventual consistency)
Basic familiarity with hash functions (MD5, the concept of consistent hashing)
CAP theorem: why partition tolerance forces a choice between consistency and availability

Instagram in 2012: 30 million users, 150 million photos, 13 engineers - and PostgreSQL sharded by hand. Sharding was not a luxury; it was a survival requirement.

**Cassandra (Netflix)** - consistent hash sharding, each shard replicated x3, stores petabytes of user data
**Kafka** - partitions (shards) are the unit of parallelism; each consumer group reads each partition independently
**Redis Cluster** - 16384 hash slots distributed across nodes, manual resharding via CLI
**CockroachDB** - auto-split on exceeding 512 MB, Raft per range, global ACID transactions
**X (formerly Twitter)** - fan-out-on-write: on publish a tweet is written into every follower's timeline shard (denormalization instead of scatter-gather)

Consistent Hashing and the birth of distributed storage

In 1997 Karger, Lehman, Leighton, Panigrahy, Levine, and Lewin published 'Consistent Hashing and Random Trees' at STOC. The problem was straightforward: minimize data movement when a server is added or removed from a distributed cache. The hash-ring solution turned out to be so universal that 27 years later it powers Cassandra, DynamoDB, Riak, Akamai CDN, and thousands of other systems with virtually no changes.

What sharding is and why it exists

**2012. Instagram sells to Facebook for USD 1 billion. At the time of the deal: 30 million users, 150 million photos - all on PostgreSQL sharded by hand by a team of 13 engineers.** A single database instance could no longer absorb the write load. Sharding was the only path forward.

**Sharding** (partitioning) - splitting data into subsets (shards), each stored on a separate server. Sharding and partitioning are synonyms. MongoDB calls it sharding, Kafka calls it partitioning, PostgreSQL has table partitioning.

Scenario	Single server	10 shards
Data volume	Capped by hardware (10-50 TB)	10x capacity with no upgrade
Queries per second	Bottleneck ~10-50K QPS	Linear growth with shard count
Fault tolerance	Single point of failure	Shard failure = partial data loss
Geo-latency	One location	Data closer to the user

Sharding does not replace replication - each shard is typically replicated for durability. These are two independent scaling dimensions: replication handles availability and read throughput, sharding handles data volume and write throughput.

Component	Role	Data
Client	Sends request	-
Router / Coordinator	Routes by user_id range	-
Shard 1	Users A-F	1/3 load
Shard 2	Users G-M	1/3 load
Shard 3	Users N-Z	1/3 load

**Do not shard prematurely.** Sharding is a serious architectural shift: cross-shard joins are gone, transactions become distributed (2PC), monitoring grows complex. Exhaust replicas, caching, and indexes first - then consider sharding.

Sharding and replication are the same thing with different names

These are two distinct mechanisms: replication duplicates data for durability and read scale, sharding splits data for write scale and capacity

Production systems use both: each shard typically has 2-3 replicas. Cassandra, MongoDB, and HBase all work this way.

Which of the following problems is solved by sharding but NOT by replication?

Sharding strategies: Hash, Range, Directory

**Three sharding strategies - three different answers to the question: which shard holds this data?** The choice determines everything: query performance, operational complexity, and the cost of rebalancing when the shard count changes.

Strategy	How it works	Pros	Cons
Hash	shard = hash(key) % N	Even distribution	Resharding when N changes - almost all data moves
Range	A-M on shard1, N-Z on shard2	Efficient range queries	Hot spots with skewed distribution
Directory	A separate lookup service knows where everything is	Maximum flexibility	SPOF, extra latency per lookup

**Consistent Hashing** solves the resharding problem: adding one shard moves only 1/N of data instead of (N-1)/N. That is why Cassandra, DynamoDB, and Redis Cluster all use consistent hashing. Covered in a dedicated lesson.

Range sharding: time-series logs

ClickHouse stores logs with date-based range partitioning: shard_2024_01 holds January, shard_2024_02 holds February, etc. Queries like 'logs from the last week' hit 1-2 shards. Old shards turn read-only and can be moved to cheaper HDDs. New writes always go to the current shard - creating a hot spot that is mitigated by keeping the current shard on the fastest hardware.

**Directory-based sharding** is flexible, but the lookup service becomes a SPOF and latency bottleneck. If the lookup layer is unavailable, the entire system stalls. Facebook used this approach for MySQL and addressed the failure risk by replicating the lookup layer itself.

A team adds a 4th shard to a system with 3 shards using simple hash % N. What percentage of data must be migrated?

Choosing a sharding key and cross-shard operations

**The sharding key is the most important architectural decision in a sharded system.** The wrong choice leads to hot spots (one shard handles 90% of load while others idle) or scatter-gather (every query fans out to all shards). Both outcomes eliminate the benefits of sharding.

Criterion	What it means	Example
High cardinality	Many unique values for even distribution	user_id good, status (3 values) bad
Access pattern	Queries must include the sharding key, otherwise scatter-gather	If searching by email - key is email, not user_id
Avoid hot spots	No skew in load distribution	Timestamp is bad - all new writes hit one shard
Immutable	Changing the key means moving data between shards	user_id better than username (username changes)

Scatter-Gather: the cost of a query without a sharding key

Query: SELECT * FROM orders WHERE total > 50000 - does not include the sharding key (user_id). Without the key the router sends the query to all N shards (scatter), waits for all responses, merges results (gather). Latency = max(latency of all shards). Load = N x cost of a single-shard query. With sharding key: WHERE user_id = 42 AND total > 50000 - goes to one shard. Latency is minimal.

Data	Good key	Bad key
Users	user_id	country (skew - US holds 40% of users)
Orders	user_id	status (3 values = 3 shards)
Messages	conversation_id	timestamp (hot spot on current)
Logs	timestamp + source	level (3-5 values)

**Cross-shard transactions** require 2PC (two-phase commit) - expensive and fragile. Design data so transactions stay within a single shard. Rule: entities that change together should live on the same shard (co-location).

**Secondary indexes with sharding**: Local Index - index lives on a single shard, writes are atomic, but a query without the sharding key triggers scatter-gather. Global Index - one index knows where everything is, queries are efficient, but writes must update the global index (eventual consistency). DynamoDB Global Secondary Index works exactly this way.

Any attribute can be the sharding key as long as other queries have indexes

Rebalancing: adding and removing shards

**Data grows, shards fill up, servers fail - data must be redistributed without downtime.** Rebalancing is the most painful operation in a sharded system. Instagram executed several such migrations live under 5 million requests per minute.

Strategy	Principle	Advantage
Fixed partitions	Many small partitions (e.g. 1000) distributed across servers. New server - move some partitions to it.	Minimum data movement (~10% when adding a server)
Dynamic partitioning	Overloaded shard splits in half. Underloaded shard merges with a neighbor.	Automatic balance, no manual management. HBase, MongoDB
Partition by node	Fixed number of partitions per node. New node steals partitions from existing ones.	Predictable partition size. Cassandra

**Consistent Hashing** implements fixed partitions via virtual nodes on a hash ring. Cassandra: each physical node owns multiple ring ranges (vnodes). Adding a node transfers a subset of vnodes.

Sharding in real systems

MongoDB: automatic rebalancer moves chunks (64 MB default) between shards when the imbalance exceeds 8 chunks. Cassandra: consistent hash + vnodes, auto-rebalance on node add/remove. Redis Cluster: 16384 hash slots, manual resharding via redis-cli --cluster. CockroachDB: range-based, automatic split when a range exceeds 512 MB, Raft consensus per range.

**Anti-pattern: sharding too early** - overhead without necessity. Replicas and caching first.
**Anti-pattern: poor sharding key** - leads to hot spots or scatter-gather on every request.
**Anti-pattern: cross-shard joins** - slow and fragile. Denormalize or redesign the access pattern.
**Anti-pattern: ignoring locality** - related data on different shards means many network round-trips.
**Anti-pattern: fixed shard count** - hard to scale. Use consistent hashing or many small partitions.

Sharding by hash(user_id) automatically produces uniform load

Hash sharding balances data volume, not request load: hot keys (top influencers, viral posts) overwhelm a single shard even with an ideal hash function

Real activity distributions follow a power law (Zipf): 1% of users generate 50%+ of requests. The fix is a composite key (user_id, bucket_id) with hot-key splitting, or a separate path through caches/replicas for hot tenants. Production systems (Discord, Slack) explicitly split large servers from small servers into different sharding strategies.

MongoDB uses fixed partitions (chunks) instead of naive hash % N. The primary reason:

Connection to previous lessons

Replication solves availability and read throughput but not the write bottleneck. Once a working set exceeds one server's RAM, horizontal partitioning is the only remaining axis.

Replication — scales reads and provides HA, yet writes still pass through a single master
Partitioning strategies — range, hash, directory - three ways to route a request to the correct shard in O(1)
Consistent hashing — next lesson - adding a node without migrating the full dataset

Summary

The choice of sharding key determines everything: a good key yields uniform distribution and locality for typical queries, a bad one produces hot shards and scatter-gather on every request
Range partitions support efficient range queries but require adaptive splitting (HBase/CockroachDB), otherwise hot ranges appear at the end of the range under monotonic keys
Hash partitions distribute uniformly but kill range queries: scatter-gather across all shards becomes the default read pattern
Directory-based sharding (lookup tables) offers maximum flexibility but introduces a SPOF in the directory service and requires client-side caching
Fixed-partitions schemes (e.g., 16384 slots in Redis Cluster) decouple the number of shards from the number of nodes: rebalancing moves only the needed chunks instead of recomputing every hash

Вопросы для размышления

Consider a messaging system. Which sharding key makes more sense: sender user_id, recipient user_id, or conversation_id? Which queries become efficient, and which would require scatter-gather?

Связанные уроки

dist-15-consistent-hashing — Consistent hashing is the basis of hash sharding
ds-04-consistent-hashing — Minimizes key movement on rebalance
dist-11-replication — Each shard is typically replicated independently
ds-05-replication — Replication and sharding are orthogonal scale axes
dist-07-2pc — Cross-shard transactions require 2PC
dist-12-consistency — Sharding changes the consistency model of queries
db-23-sharding