Distributed Systems

Data Replication

Цели урока

Explain the difference between replication and backup and why one does not replace the other
Compare Leader-Follower, Multi-Leader and Leaderless by latency, consistency and complexity
Calculate quorum configurations N/W/R for given requirements
Choose a conflict resolution strategy for a specific use case

Предварительные знания

CAP theorem: understanding the trade-off between consistency and availability
Basic understanding of network latency and partition tolerance
Familiarity with transactions and ACID guarantees

GitLab lost 300 GB of production data in 5 minutes. Backup existed but did not restore - nobody checked it. Replication is not a magic bullet but a discipline with trade-offs at every step. See CAP theorem and Raft for context.

**PostgreSQL streaming replication** - standard for high load: primary + 2 hot standby, failover in seconds via Patroni or repmgr
**Amazon DynamoDB** - leaderless with tunable consistency: eventual by default, strongly consistent reads at 2x latency cost
**Cassandra at Netflix** - RF=3 with LOCAL_QUORUM: survived an entire AWS region outage without downtime thanks to multi-DC replication
**Google Spanner** - synchronous replication via Paxos with TrueTime API: strong consistency globally at the cost of 7-14 ms commit wait
**MySQL semi-sync at Facebook** - at most one lost transaction guaranteed on leader failure (at least one replica always confirmed)

Amazon Dynamo and the birth of NoSQL

In 2007 Amazon published the Dynamo paper - the system managing shopping carts at peak Black Friday load. The key decision: abandon single leader in favor of availability. Conflicts are resolved by the client (application-level). The paper sparked a wave of NoSQL databases: Cassandra (Facebook, 2008), Riak, Voldemort. Quorum ideas and consistent hashing from Dynamo are used in every modern distributed database.

Why store the same data in three places

**GitLab, January 31, 2017. 23:23 UTC. An engineer deleted 300 GB of production database - wrong rsync direction. Backup scripts had been silently failing for 6 months - nobody verified them. Result: 5,000 projects lost, 6 hours of downtime.** The problem was not unreliable hardware - it was a single copy of data.

**Replication** - storing identical copies of data on multiple independent nodes. Three goals: fault tolerance (node fails - data survives), low latency (read from the nearest node), read scaling (10 replicas = 10x read throughput).

System	Replicas	What they gain
Amazon S3	3+ in one region, 3+ across regions	11 nines durability - losing a file is less likely than winning the lottery
PostgreSQL streaming	1 primary + N standby	Failover in seconds, read scaling via hot standby
Cassandra	RF=3 by default	Survive an entire datacenter failure without data loss
Kafka	replication.factor=3	Consumer keeps reading when a broker fails

The main challenge in replication is not storing copies - it is **synchronizing changes**. A client writes to Leader in DC West. How and when does that appear on Follower in DC East? The strategy chosen determines everything: latency, consistency and availability, and behavior during network partition.

Replication = backup. With 3 replicas the data is safe.

Replication and backup are different things. Replicas hold the current state and will also apply DELETE or corruption. Backup is a separate point-in-time copy.

Running DROP TABLE causes all 3 replicas to apply that operation within seconds. Replication protects against hardware failure, not against operator error or bugs. Both replicas and point-in-time backup in separate storage are needed.

GitLab lost production data even though backup scripts were configured. Which replication principle was violated?

Leader-Follower: one writes, all read

**The most common model:** one node (Leader, Primary, Master) accepts all writes. Others (Followers, Replicas, Standby) receive changes through a replication log and serve read requests. PostgreSQL, MySQL, MongoDB replica set, Redis Sentinel - all use this pattern.

Synchronous vs Asynchronous replication

Mode	Write latency	Guarantee	Loss on failover	Example
Synchronous	High (+roundtrip to replica)	Strong: follower confirmed before client ACK	0 transactions	PostgreSQL synchronous_commit=on
Asynchronous	Low (local commit)	Eventual: replica lags by N operations	Up to several seconds of data	MySQL async, Redis AOF async
Semi-synchronous	Medium	1 replica confirmed before ACK	Minimal	MySQL semi-sync, PostgreSQL remote_apply

**Replication lag** is a fundamental problem of asynchronous replication. The client writes to leader, immediately reads from follower - sees stale data. Solutions: sticky sessions (read from the same replica that received the write), `read-own-writes` consistency (the client reads recent writes back from the leader), monitoring lag via `pg_stat_replication.write_lag`.

PostgreSQL is configured with asynchronous replication. The leader accepted a transaction and crashed 200 ms later. What happens to the data?

Multi-Leader and conflicts: when one leader is the bottleneck

**The single leader problem:** all writes go through one node. With datacenters in Tokyo, Dublin, and Virginia, a user in Tokyo waits for a roundtrip to Virginia on every write (170+ ms). Multi-Leader solves this: each DC has its own leader, clients write to the nearest one.

Conflicts: the unavoidable cost of Multi-Leader

Classic write conflict

DC West: user.name = "Alice" (timestamp 10:00:00.100). DC East: user.name = "Bob" (timestamp 10:00:00.150). Both DCs applied the write locally. When replicas synchronize - whose data is correct? Timestamps do not help: clocks across servers drift by tens of milliseconds.

Strategy	Mechanism	When to use
Last Write Wins (LWW)	Write with the highest timestamp wins	Cassandra default. Simple but risks data loss
First Write Wins	First write wins, subsequent are rejected	Unique username registration - first one matters
Merge / CRDT	Values merged by data type rules	Counters, sets, documents like Google Docs
Custom application logic	Conflict passed to business logic	Shopping cart (Amazon Dynamo), versioned documents

**In practice:** Multi-Leader is rarely used within a single DC - single leader is simpler and faster there. Main uses: geographic distribution (multiple regions), offline clients (Google Docs, CouchDB - each client acts as leader while offline), collaborative editing.

Multi-Leader is better than Single Leader - more leaders means more throughput

Multi-Leader adds conflict complexity. It is suitable only for geographic distribution or offline clients.

Within a single datacenter, Single Leader is simpler and gives predictable consistency. Multi-Leader is needed when latency to a single leader is unacceptable (50+ ms cross-region roundtrip) or when the system must work offline.

Google Docs lets two users edit a document simultaneously without locks. What conflict resolution approach is used?

Leaderless and quorums: Amazon Dynamo against convention

**Amazon, 2007. Werner Vogels publishes the Dynamo paper.** No leader at all - the client writes to N nodes simultaneously and waits for confirmation from W of them. On reads, it queries R nodes and picks the freshest value. When W+R>N the sets overlap: at least one of the R nodes saw the last write. Cassandra, Riak, and DynamoDB are all built on this idea - they place replicas on a hash ring.

Node	Action	Result
Node 1	Receives write request	ACK
Node 2	Receives write request	ACK
Node 3	Down	Write fails
Total	Quorum W=2 reached (2 of 3 ACK)	Write succeeds

**Quorum equation:** N = number of replicas, W = write quorum, R = read quorum. Guarantee: when W + R > N, at least one node in R has seen the latest write. Typical configurations: N=3, W=2, R=2 (balanced); N=3, W=3, R=1 (maximum write durability); N=3, W=1, R=3 (maximum write speed).

Model	Strengths	Weaknesses	Examples
Leader-Follower	Simplicity, strong consistency, ordering guaranteed	Single write bottleneck, failover requires election	PostgreSQL, MySQL, MongoDB RS
Multi-Leader	Low latency across regions, offline work	Conflicts are unavoidable, complex consistency	CouchDB, Google Docs (OT), Cassandra multi-DC
Leaderless	High availability, no SPOF, tunable consistency	Eventual consistency, read repair overhead, versioning needed	Cassandra, Riak, DynamoDB

W+R>N guarantees strong consistency like a single server

W+R>N only guarantees that at least one of the read nodes saw the latest write - but picking the correct value still requires versioning.

When two nodes return different versions, a mechanism is needed to determine which is fresher: vector clocks, timestamps (unreliable without clock sync), or CRDT semantics. Without this, W+R>N does not prevent reading stale data.

Cassandra is configured with N=5, W=3, R=3. How many nodes can fail without losing the ability to read data?

Вопросы для размышления

When designing a user session storage system (one million active sessions, 5-minute TTL) - which replication model and quorum configuration to choose and why? What matters more: never losing a session or never delaying a response?

Связанные уроки

ds-02-cap-theorem — CAP constrains what replication can guarantee
ds-03-consensus — Consensus needed for synchronous replication safety
ds-04-clocks — Clocks determine replication lag measurement
dist-07-2pc — 2PC is one way to make replication atomic
dist-09-raft — Raft implements consensus-based log replication
db-13-transactions