Distributed Systems

Data Replication

Цели урока

Explain three reasons for replication and when each is relevant
Describe the Leader-Follower model and the sync vs async replication tradeoff
Understand the conflict problem in Multi-Leader and main resolution strategies
Calculate W/R/N quorums for a Leaderless system given specific requirements

Предварительные знания

Understanding of the CAP theorem and eventual consistency
Basic knowledge of transactions and WAL (Write-Ahead Log)
Familiarity with distributed failure modes

GitHub, October 2018: 24 seconds of network disruption desynced MySQL replicas and caused 5 hours of degraded service. The choice of replication model is a choice between speed, consistency, and conflict complexity.

**PostgreSQL streaming replication** - Leader-Follower with synchronous_commit=remote_apply for financial data (no data loss on failover)
**Cassandra QUORUM** - Leaderless W=2, R=2, N=3 at Netflix for user settings and queues
**Google Docs** - Multi-Leader with CRDT-based conflict resolution, enabling offline editing and concurrent edits
**CouchDB** - Multi-Leader for mobile apps, conflicts resolved via a user-visible revision tree
**DynamoDB Global Tables** - Multi-Leader across AWS regions, LWW by default with optional custom conflict resolution

Amazon Dynamo and the Birth of NoSQL

In 2007 Amazon published 'Dynamo: Amazon's Highly Available Key-value Store'. Engineers described a system with no single leader: quorums, consistent hashing, vector clocks, hinted handoff. The paper became the manifesto of the NoSQL movement. Cassandra, Riak, DynamoDB all implemented Dynamo ideas. One co-author, DeCandia, later described how the team convinced skeptics: 'we cannot let a single datacenter outage stop a customer's shopping cart'.

Why Replicate Data

**GitHub, October 28, 2018. At 22:52 UTC a network engineer swaps equipment in one datacenter - traffic briefly reroutes through Asia. In 24 seconds, MySQL master-replica pairs fall out of sync. Result: 4 minutes of full outage and 5 hours of degraded mode.** Replication is not just copying - it is an architectural decision about where truth lives and how it propagates across nodes.

Three reasons systems replicate data are not reducible to each other.

Goal	What it provides	Example
Fault tolerance	Data remains available when a node fails	PostgreSQL streaming replication: primary fails, standby takes over within seconds
Low latency	Read from a geographically close node instead of cross-region roundtrip	Netflix: 13 AWS regions, video served from the nearest one (p50 < 20 ms vs 200+ ms)
Read scaling	Load distributed across replicas - horizontal scale without sharding	Instagram used PostgreSQL + 20 read replicas before migrating to Cassandra

**The core challenge of replication** is not copying data - it is managing changes. Every write must reach all replicas in the correct order. If a replica lags 1 second behind, thousands of transactions may have arrived in that window. How to handle conflicts? In what order to apply? This is what drives the choice of replication model.

A replica is the same as a backup

A replica is a live copy synchronized in real time. A backup is a snapshot of past state.

Running DROP TABLE or a mass DELETE replicates to all replicas instantly. Backup is unaffected. Replication does not replace a backup strategy.

A system reads data 10x more often than it writes. What is the primary goal of replication in this case?

Leader-Follower: Single Source of Truth

Leader-Follower (Master-Replica) is the most common replication model. One leader node (elected by quorum) accepts all writes. Follower nodes receive a copy via the replication log and serve read requests.

Synchronous vs Asynchronous Replication

Mode	Write latency	Guarantee	Use case
Synchronous	High (+RTT to follower)	Follower confirms - data exists on 2+ nodes	PostgreSQL synchronous_commit=on, financial data
Asynchronous	Low (local write only)	Eventual - follower will catch up eventually	MySQL default, metrics, logs, non-critical data
Semi-synchronous	Medium	At least 1 follower confirmed	MySQL semi-sync - durability vs latency tradeoff

**Replication lag** - how far a follower is behind the leader. With async replication, a client may read stale data from a replica. PostgreSQL exposes this via `pg_stat_replication.replay_lag`. Typical values: 10-100 ms under normal conditions, seconds under network or disk pressure.

Synchronous replication is always better - more reliable

Synchronous replication adds the follower RTT to every write latency and blocks when the follower is unavailable

PostgreSQL with synchronous_commit=on waits for all synchronous standbys to confirm. If one is temporarily unavailable - all writes wait. The practical compromise: one synchronous standby (not all) plus async for the rest.

Primary fails during async replication. The follower applied 990 out of 1000 transactions. What happens on failover?

Multi-Leader: Multiple Sources of Truth

Multi-Leader (Active-Active) - multiple nodes accept writes. Each leader replicates its changes to the others. The main advantage is writing to the nearest datacenter without cross-region roundtrip. The main problem is **write conflicts**, often resolved through CRDTs.

Write conflict in multi-leader

DC West receives a request: `user.name = "Alice"`. DC East simultaneously receives: `user.name = "Bob"`. Both leaders apply the write locally and send a replica to each other. After 80 ms (US inter-region RTT) both detect a conflict. Whose version wins?

Strategy	How it works	Problem
Last Write Wins (LWW)	Write with the highest timestamp wins	Clocks are not synchronized - NTP drift 10+ ms. Cassandra uses LWW by default
First Write Wins	First write wins, subsequent ones are rejected	Requires coordination to determine 'first' - loses the benefit of multi-leader
Merge / CRDT	Values are merged according to data type logic	Works for counters, sets, but not for arbitrary strings
Application logic	Conflict is passed to the application layer for resolution	Complex to implement correctly, but maximally flexible

**Where multi-leader is used:** CouchDB (offline-first sync), Google Docs (CRDT-based), PlanetScale (Vitess multi-region), GitLab Geo (geo-replication). Avoid for financial data and cases where conflicts cannot be resolved without data loss.

CouchDB is used for a mobile app with offline mode. A user edits a document offline and syncs later. Which conflict strategy fits best?

Leaderless: Quorums Instead of a Leader

**Amazon Dynamo, 2007 - a paper that changed the industry.** Werner Vogels and team described a system with no single leader: the client writes to N nodes, a write is successful when W of them confirm. Reading from R nodes returns the freshest version. DynamoDB, Cassandra, Riak - all grew from this idea. The consistency guarantee: W + R > N.

Node	Action	Result
Node 1	Receives write request	ACK
Node 2	Receives write request	ACK
Node 3	Down	Write fails
Total	Quorum W=2 reached (2 of 3 ACK)	Write succeeds

W/R/N config	Property	Used for
W=1, R=N	Fast write, slow read	Write-heavy systems, logs
W=N, R=1	Slow write, fast read	Read-heavy with strict consistency
W=2, R=2, N=3	Balanced - 1 node can fail	Cassandra default (QUORUM)
W=1, R=1	Maximum speed, no guarantees	Eventual consistency, caches

**Sloppy quorum** - when the responsible replicas are unavailable, Dynamo-style systems write to any W available nodes, even ones not normally responsible for the key. After recovery - hinted handoff transfers data to the correct nodes. Cassandra does this by default.

W + R > N guarantees strong consistency like an RDBMS

W + R > N guarantees that at least one node with the latest version participates in reads, but not atomicity or isolation

Leaderless systems do not support cross-key transactions by default. Two concurrent writes to different replicas can create a conflict. W+R>N is necessary but not sufficient for strong consistency.

Cassandra cluster: N=5, W=3, R=3. Two nodes go down. Does the system continue accepting writes?

Вопросы для размышления

Consider a service used daily (messenger, banking app, cloud storage). What replication model does it most likely use and why is that choice justified for its specific use case?

Связанные уроки

ds-08-vector-clocks — Multi-master replication needs vector clocks for conflicts
ds-10-crdts — CRDT-based replication is a coordination-free strategy
dist-12-consistency — Replication lag directly limits achievable consistency level
db-13-transactions