Distributed Systems
Sharding
Цели урока
- Explain why sharding exists and how it differs from replication
- Compare hash, range, and directory strategies and name the trade-off of each
- Choose the right sharding key for a given access pattern
- Describe rebalancing strategies and why fixed partitions outperform naive hash % N
Предварительные знания
- Understanding of replication (leader-follower, eventual consistency)
- Basic familiarity with hash functions (MD5, the concept of consistent hashing)
- CAP theorem: why partition tolerance forces a choice between consistency and availability
Instagram in 2012: 30 million users, 150 million photos, 13 engineers - and PostgreSQL sharded by hand. Sharding was not a luxury; it was a survival requirement.
- **Cassandra (Netflix)** - consistent hash sharding, each shard replicated x3, stores petabytes of user data
- **Kafka** - partitions (shards) are the unit of parallelism; each consumer group reads each partition independently
- **Redis Cluster** - 16384 hash slots distributed across nodes, manual resharding via CLI
- **CockroachDB** - auto-split on exceeding 512 MB, Raft per range, global ACID transactions
- **X (formerly Twitter)** - fan-out-on-write: on publish a tweet is written into every follower's timeline shard (denormalization instead of scatter-gather)
Consistent Hashing and the birth of distributed storage
In 1997 Karger, Lehman, Leighton, Panigrahy, Levine, and Lewin published 'Consistent Hashing and Random Trees' at STOC. The problem was straightforward: minimize data movement when a server is added or removed from a distributed cache. The hash-ring solution turned out to be so universal that 27 years later it powers Cassandra, DynamoDB, Riak, Akamai CDN, and thousands of other systems with virtually no changes.
What sharding is and why it exists
**2012. Instagram sells to Facebook for USD 1 billion. At the time of the deal: 30 million users, 150 million photos - all on PostgreSQL sharded by hand by a team of 13 engineers.** A single database instance could no longer absorb the write load. Sharding was the only path forward.
**Sharding** (partitioning) - splitting data into subsets (shards), each stored on a separate server. Sharding and partitioning are synonyms. MongoDB calls it sharding, Kafka calls it partitioning, PostgreSQL has table partitioning.
| Scenario | Single server | 10 shards |
|---|---|---|
| Data volume | Capped by hardware (10-50 TB) | 10x capacity with no upgrade |
| Queries per second | Bottleneck ~10-50K QPS | Linear growth with shard count |
| Fault tolerance | Single point of failure | Shard failure = partial data loss |
| Geo-latency | One location | Data closer to the user |
Sharding does not replace replication - each shard is typically replicated for durability. These are two independent scaling dimensions: replication handles availability and read throughput, sharding handles data volume and write throughput.
| Component | Role | Data |
|---|---|---|
| Client | Sends request | - |
| Router / Coordinator | Routes by user_id range | - |
| Shard 1 | Users A-F | 1/3 load |
| Shard 2 | Users G-M | 1/3 load |
| Shard 3 | Users N-Z | 1/3 load |
**Do not shard prematurely.** Sharding is a serious architectural shift: cross-shard joins are gone, transactions become distributed (2PC), monitoring grows complex. Exhaust replicas, caching, and indexes first - then consider sharding.
Sharding and replication are the same thing with different names
These are two distinct mechanisms: replication duplicates data for durability and read scale, sharding splits data for write scale and capacity
Production systems use both: each shard typically has 2-3 replicas. Cassandra, MongoDB, and HBase all work this way.
Which of the following problems is solved by sharding but NOT by replication?
Sharding strategies: Hash, Range, Directory
**Three sharding strategies - three different answers to the question: which shard holds this data?** The choice determines everything: query performance, operational complexity, and the cost of rebalancing when the shard count changes.
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Hash | shard = hash(key) % N | Even distribution | Resharding when N changes - almost all data moves |
| Range | A-M on shard1, N-Z on shard2 | Efficient range queries | Hot spots with skewed distribution |
| Directory | A separate lookup service knows where everything is | Maximum flexibility | SPOF, extra latency per lookup |
**Consistent Hashing** solves the resharding problem: adding one shard moves only 1/N of data instead of (N-1)/N. That is why Cassandra, DynamoDB, and Redis Cluster all use consistent hashing. Covered in a dedicated lesson.
Range sharding: time-series logs
ClickHouse stores logs with date-based range partitioning: shard_2024_01 holds January, shard_2024_02 holds February, etc. Queries like 'logs from the last week' hit 1-2 shards. Old shards turn read-only and can be moved to cheaper HDDs. New writes always go to the current shard - creating a hot spot that is mitigated by keeping the current shard on the fastest hardware.
**Directory-based sharding** is flexible, but the lookup service becomes a SPOF and latency bottleneck. If the lookup layer is unavailable, the entire system stalls. Facebook used this approach for MySQL and addressed the failure risk by replicating the lookup layer itself.
A team adds a 4th shard to a system with 3 shards using simple hash % N. What percentage of data must be migrated?
Choosing a sharding key and cross-shard operations
**The sharding key is the most important architectural decision in a sharded system.** The wrong choice leads to hot spots (one shard handles 90% of load while others idle) or scatter-gather (every query fans out to all shards). Both outcomes eliminate the benefits of sharding.
| Criterion | What it means | Example |
|---|---|---|
| High cardinality | Many unique values for even distribution | user_id good, status (3 values) bad |
| Access pattern | Queries must include the sharding key, otherwise scatter-gather | If searching by email - key is email, not user_id |
| Avoid hot spots | No skew in load distribution | Timestamp is bad - all new writes hit one shard |
| Immutable | Changing the key means moving data between shards | user_id better than username (username changes) |
Scatter-Gather: the cost of a query without a sharding key
Query: SELECT * FROM orders WHERE total > 50000 - does not include the sharding key (user_id). Without the key the router sends the query to all N shards (scatter), waits for all responses, merges results (gather). Latency = max(latency of all shards). Load = N x cost of a single-shard query. With sharding key: WHERE user_id = 42 AND total > 50000 - goes to one shard. Latency is minimal.
| Data | Good key | Bad key |
|---|---|---|
| Users | user_id | country (skew - US holds 40% of users) |
| Orders | user_id | status (3 values = 3 shards) |
| Messages | conversation_id | timestamp (hot spot on current) |
| Logs | timestamp + source | level (3-5 values) |
**Cross-shard transactions** require 2PC (two-phase commit) - expensive and fragile. Design data so transactions stay within a single shard. Rule: entities that change together should live on the same shard (co-location).
**Secondary indexes with sharding**: Local Index - index lives on a single shard, writes are atomic, but a query without the sharding key triggers scatter-gather. Global Index - one index knows where everything is, queries are efficient, but writes must update the global index (eventual consistency). DynamoDB Global Secondary Index works exactly this way.
Any attribute can be the sharding key as long as other queries have indexes
Rebalancing: adding and removing shards
**Data grows, shards fill up, servers fail - data must be redistributed without downtime.** Rebalancing is the most painful operation in a sharded system. Instagram executed several such migrations live under 5 million requests per minute.
| Strategy | Principle | Advantage |
|---|---|---|
| Fixed partitions | Many small partitions (e.g. 1000) distributed across servers. New server - move some partitions to it. | Minimum data movement (~10% when adding a server) |
| Dynamic partitioning | Overloaded shard splits in half. Underloaded shard merges with a neighbor. | Automatic balance, no manual management. HBase, MongoDB |
| Partition by node | Fixed number of partitions per node. New node steals partitions from existing ones. | Predictable partition size. Cassandra |
**Consistent Hashing** implements fixed partitions via virtual nodes on a hash ring. Cassandra: each physical node owns multiple ring ranges (vnodes). Adding a node transfers a subset of vnodes.
Sharding in real systems
MongoDB: automatic rebalancer moves chunks (64 MB default) between shards when the imbalance exceeds 8 chunks. Cassandra: consistent hash + vnodes, auto-rebalance on node add/remove. Redis Cluster: 16384 hash slots, manual resharding via redis-cli --cluster. CockroachDB: range-based, automatic split when a range exceeds 512 MB, Raft consensus per range.
- **Anti-pattern: sharding too early** - overhead without necessity. Replicas and caching first.
- **Anti-pattern: poor sharding key** - leads to hot spots or scatter-gather on every request.
- **Anti-pattern: cross-shard joins** - slow and fragile. Denormalize or redesign the access pattern.
- **Anti-pattern: ignoring locality** - related data on different shards means many network round-trips.
- **Anti-pattern: fixed shard count** - hard to scale. Use consistent hashing or many small partitions.
Sharding by hash(user_id) automatically produces uniform load
Hash sharding balances data volume, not request load: hot keys (top influencers, viral posts) overwhelm a single shard even with an ideal hash function
Real activity distributions follow a power law (Zipf): 1% of users generate 50%+ of requests. The fix is a composite key (user_id, bucket_id) with hot-key splitting, or a separate path through caches/replicas for hot tenants. Production systems (Discord, Slack) explicitly split large servers from small servers into different sharding strategies.
MongoDB uses fixed partitions (chunks) instead of naive hash % N. The primary reason:
Connection to previous lessons
Replication solves availability and read throughput but not the write bottleneck. Once a working set exceeds one server's RAM, horizontal partitioning is the only remaining axis.
- Replication — scales reads and provides HA, yet writes still pass through a single master
- Partitioning strategies — range, hash, directory - three ways to route a request to the correct shard in O(1)
- Consistent hashing — next lesson - adding a node without migrating the full dataset
Summary
- The choice of sharding key determines everything: a good key yields uniform distribution and locality for typical queries, a bad one produces hot shards and scatter-gather on every request
- Range partitions support efficient range queries but require adaptive splitting (HBase/CockroachDB), otherwise hot ranges appear at the end of the range under monotonic keys
- Hash partitions distribute uniformly but kill range queries: scatter-gather across all shards becomes the default read pattern
- Directory-based sharding (lookup tables) offers maximum flexibility but introduces a SPOF in the directory service and requires client-side caching
- Fixed-partitions schemes (e.g., 16384 slots in Redis Cluster) decouple the number of shards from the number of nodes: rebalancing moves only the needed chunks instead of recomputing every hash
Вопросы для размышления
- Consider a messaging system. Which sharding key makes more sense: sender user_id, recipient user_id, or conversation_id? Which queries become efficient, and which would require scatter-gather?
Связанные уроки
- dist-15-consistent-hashing — Consistent hashing is the basis of hash sharding
- ds-04-consistent-hashing — Minimizes key movement on rebalance
- dist-11-replication — Each shard is typically replicated independently
- ds-05-replication — Replication and sharding are orthogonal scale axes
- dist-07-2pc — Cross-shard transactions require 2PC
- dist-12-consistency — Sharding changes the consistency model of queries
- db-23-sharding