Distributed Systems
Data Replication
Цели урока
- Explain the difference between replication and backup and why one does not replace the other
- Compare Leader-Follower, Multi-Leader and Leaderless by latency, consistency and complexity
- Calculate quorum configurations N/W/R for given requirements
- Choose a conflict resolution strategy for a specific use case
Предварительные знания
- CAP theorem: understanding the trade-off between consistency and availability
- Basic understanding of network latency and partition tolerance
- Familiarity with transactions and ACID guarantees
GitLab lost 300 GB of production data in 5 minutes. Backup existed but did not restore - nobody checked it. Replication is not a magic bullet but a discipline with trade-offs at every step. See CAP theorem and Raft for context.
- **PostgreSQL streaming replication** - standard for high load: primary + 2 hot standby, failover in seconds via Patroni or repmgr
- **Amazon DynamoDB** - leaderless with tunable consistency: eventual by default, strongly consistent reads at 2x latency cost
- **Cassandra at Netflix** - RF=3 with LOCAL_QUORUM: survived an entire AWS region outage without downtime thanks to multi-DC replication
- **Google Spanner** - synchronous replication via Paxos with TrueTime API: strong consistency globally at the cost of 7-14 ms commit wait
- **MySQL semi-sync at Facebook** - at most one lost transaction guaranteed on leader failure (at least one replica always confirmed)
Amazon Dynamo and the birth of NoSQL
In 2007 Amazon published the Dynamo paper - the system managing shopping carts at peak Black Friday load. The key decision: abandon single leader in favor of availability. Conflicts are resolved by the client (application-level). The paper sparked a wave of NoSQL databases: Cassandra (Facebook, 2008), Riak, Voldemort. Quorum ideas and consistent hashing from Dynamo are used in every modern distributed database.
Why store the same data in three places
**GitLab, January 31, 2017. 23:23 UTC. An engineer deleted 300 GB of production database - wrong rsync direction. Backup scripts had been silently failing for 6 months - nobody verified them. Result: 5,000 projects lost, 6 hours of downtime.** The problem was not unreliable hardware - it was a single copy of data.
**Replication** - storing identical copies of data on multiple independent nodes. Three goals: fault tolerance (node fails - data survives), low latency (read from the nearest node), read scaling (10 replicas = 10x read throughput).
| System | Replicas | What they gain |
|---|---|---|
| Amazon S3 | 3+ in one region, 3+ across regions | 11 nines durability - losing a file is less likely than winning the lottery |
| PostgreSQL streaming | 1 primary + N standby | Failover in seconds, read scaling via hot standby |
| Cassandra | RF=3 by default | Survive an entire datacenter failure without data loss |
| Kafka | replication.factor=3 | Consumer keeps reading when a broker fails |
The main challenge in replication is not storing copies - it is **synchronizing changes**. A client writes to Leader in DC West. How and when does that appear on Follower in DC East? The strategy chosen determines everything: latency, consistency and availability, and behavior during network partition.
Replication = backup. With 3 replicas the data is safe.
Replication and backup are different things. Replicas hold the current state and will also apply DELETE or corruption. Backup is a separate point-in-time copy.
Running DROP TABLE causes all 3 replicas to apply that operation within seconds. Replication protects against hardware failure, not against operator error or bugs. Both replicas and point-in-time backup in separate storage are needed.
GitLab lost production data even though backup scripts were configured. Which replication principle was violated?
Leader-Follower: one writes, all read
**The most common model:** one node (Leader, Primary, Master) accepts all writes. Others (Followers, Replicas, Standby) receive changes through a replication log and serve read requests. PostgreSQL, MySQL, MongoDB replica set, Redis Sentinel - all use this pattern.
Synchronous vs Asynchronous replication
| Mode | Write latency | Guarantee | Loss on failover | Example |
|---|---|---|---|---|
| Synchronous | High (+roundtrip to replica) | Strong: follower confirmed before client ACK | 0 transactions | PostgreSQL synchronous_commit=on |
| Asynchronous | Low (local commit) | Eventual: replica lags by N operations | Up to several seconds of data | MySQL async, Redis AOF async |
| Semi-synchronous | Medium | 1 replica confirmed before ACK | Minimal | MySQL semi-sync, PostgreSQL remote_apply |
**Replication lag** is a fundamental problem of asynchronous replication. The client writes to leader, immediately reads from follower - sees stale data. Solutions: sticky sessions (read from the same replica that received the write), `read-own-writes` consistency (the client reads recent writes back from the leader), monitoring lag via `pg_stat_replication.write_lag`.
PostgreSQL is configured with asynchronous replication. The leader accepted a transaction and crashed 200 ms later. What happens to the data?
Multi-Leader and conflicts: when one leader is the bottleneck
**The single leader problem:** all writes go through one node. With datacenters in Tokyo, Dublin, and Virginia, a user in Tokyo waits for a roundtrip to Virginia on every write (170+ ms). Multi-Leader solves this: each DC has its own leader, clients write to the nearest one.
Conflicts: the unavoidable cost of Multi-Leader
Classic write conflict
DC West: user.name = "Alice" (timestamp 10:00:00.100). DC East: user.name = "Bob" (timestamp 10:00:00.150). Both DCs applied the write locally. When replicas synchronize - whose data is correct? Timestamps do not help: clocks across servers drift by tens of milliseconds.
| Strategy | Mechanism | When to use |
|---|---|---|
| Last Write Wins (LWW) | Write with the highest timestamp wins | Cassandra default. Simple but risks data loss |
| First Write Wins | First write wins, subsequent are rejected | Unique username registration - first one matters |
| Merge / CRDT | Values merged by data type rules | Counters, sets, documents like Google Docs |
| Custom application logic | Conflict passed to business logic | Shopping cart (Amazon Dynamo), versioned documents |
**In practice:** Multi-Leader is rarely used within a single DC - single leader is simpler and faster there. Main uses: geographic distribution (multiple regions), offline clients (Google Docs, CouchDB - each client acts as leader while offline), collaborative editing.
Multi-Leader is better than Single Leader - more leaders means more throughput
Multi-Leader adds conflict complexity. It is suitable only for geographic distribution or offline clients.
Within a single datacenter, Single Leader is simpler and gives predictable consistency. Multi-Leader is needed when latency to a single leader is unacceptable (50+ ms cross-region roundtrip) or when the system must work offline.
Google Docs lets two users edit a document simultaneously without locks. What conflict resolution approach is used?
Leaderless and quorums: Amazon Dynamo against convention
**Amazon, 2007. Werner Vogels publishes the Dynamo paper.** No leader at all - the client writes to N nodes simultaneously and waits for confirmation from W of them. On reads, it queries R nodes and picks the freshest value. When W+R>N the sets overlap: at least one of the R nodes saw the last write. Cassandra, Riak, and DynamoDB are all built on this idea - they place replicas on a hash ring.
| Node | Action | Result |
|---|---|---|
| Node 1 | Receives write request | ACK |
| Node 2 | Receives write request | ACK |
| Node 3 | Down | Write fails |
| Total | Quorum W=2 reached (2 of 3 ACK) | Write succeeds |
**Quorum equation:** N = number of replicas, W = write quorum, R = read quorum. Guarantee: when W + R > N, at least one node in R has seen the latest write. Typical configurations: N=3, W=2, R=2 (balanced); N=3, W=3, R=1 (maximum write durability); N=3, W=1, R=3 (maximum write speed).
| Model | Strengths | Weaknesses | Examples |
|---|---|---|---|
| Leader-Follower | Simplicity, strong consistency, ordering guaranteed | Single write bottleneck, failover requires election | PostgreSQL, MySQL, MongoDB RS |
| Multi-Leader | Low latency across regions, offline work | Conflicts are unavoidable, complex consistency | CouchDB, Google Docs (OT), Cassandra multi-DC |
| Leaderless | High availability, no SPOF, tunable consistency | Eventual consistency, read repair overhead, versioning needed | Cassandra, Riak, DynamoDB |
W+R>N guarantees strong consistency like a single server
W+R>N only guarantees that at least one of the read nodes saw the latest write - but picking the correct value still requires versioning.
When two nodes return different versions, a mechanism is needed to determine which is fresher: vector clocks, timestamps (unreliable without clock sync), or CRDT semantics. Without this, W+R>N does not prevent reading stale data.
Cassandra is configured with N=5, W=3, R=3. How many nodes can fail without losing the ability to read data?
Вопросы для размышления
- When designing a user session storage system (one million active sessions, 5-minute TTL) - which replication model and quorum configuration to choose and why? What matters more: never losing a session or never delaying a response?
Связанные уроки
- ds-02-cap-theorem — CAP constrains what replication can guarantee
- ds-03-consensus — Consensus needed for synchronous replication safety
- ds-04-clocks — Clocks determine replication lag measurement
- dist-07-2pc — 2PC is one way to make replication atomic
- dist-09-raft — Raft implements consensus-based log replication
- db-13-transactions