Distributed Systems
Data Replication
Цели урока
- Explain three reasons for replication and when each is relevant
- Describe the Leader-Follower model and the sync vs async replication tradeoff
- Understand the conflict problem in Multi-Leader and main resolution strategies
- Calculate W/R/N quorums for a Leaderless system given specific requirements
Предварительные знания
- Understanding of the CAP theorem and eventual consistency
- Basic knowledge of transactions and WAL (Write-Ahead Log)
- Familiarity with distributed failure modes
GitHub, October 2018: 24 seconds of network disruption desynced MySQL replicas and caused 5 hours of degraded service. The choice of replication model is a choice between speed, consistency, and conflict complexity.
- **PostgreSQL streaming replication** - Leader-Follower with synchronous_commit=remote_apply for financial data (no data loss on failover)
- **Cassandra QUORUM** - Leaderless W=2, R=2, N=3 at Netflix for user settings and queues
- **Google Docs** - Multi-Leader with CRDT-based conflict resolution, enabling offline editing and concurrent edits
- **CouchDB** - Multi-Leader for mobile apps, conflicts resolved via a user-visible revision tree
- **DynamoDB Global Tables** - Multi-Leader across AWS regions, LWW by default with optional custom conflict resolution
Amazon Dynamo and the Birth of NoSQL
In 2007 Amazon published 'Dynamo: Amazon's Highly Available Key-value Store'. Engineers described a system with no single leader: quorums, consistent hashing, vector clocks, hinted handoff. The paper became the manifesto of the NoSQL movement. Cassandra, Riak, DynamoDB all implemented Dynamo ideas. One co-author, DeCandia, later described how the team convinced skeptics: 'we cannot let a single datacenter outage stop a customer's shopping cart'.
Why Replicate Data
**GitHub, October 28, 2018. At 22:52 UTC a network engineer swaps equipment in one datacenter - traffic briefly reroutes through Asia. In 24 seconds, MySQL master-replica pairs fall out of sync. Result: 4 minutes of full outage and 5 hours of degraded mode.** Replication is not just copying - it is an architectural decision about where truth lives and how it propagates across nodes.
Three reasons systems replicate data are not reducible to each other.
| Goal | What it provides | Example |
|---|---|---|
| Fault tolerance | Data remains available when a node fails | PostgreSQL streaming replication: primary fails, standby takes over within seconds |
| Low latency | Read from a geographically close node instead of cross-region roundtrip | Netflix: 13 AWS regions, video served from the nearest one (p50 < 20 ms vs 200+ ms) |
| Read scaling | Load distributed across replicas - horizontal scale without sharding | Instagram used PostgreSQL + 20 read replicas before migrating to Cassandra |
**The core challenge of replication** is not copying data - it is managing changes. Every write must reach all replicas in the correct order. If a replica lags 1 second behind, thousands of transactions may have arrived in that window. How to handle conflicts? In what order to apply? This is what drives the choice of replication model.
A replica is the same as a backup
A replica is a live copy synchronized in real time. A backup is a snapshot of past state.
Running DROP TABLE or a mass DELETE replicates to all replicas instantly. Backup is unaffected. Replication does not replace a backup strategy.
A system reads data 10x more often than it writes. What is the primary goal of replication in this case?
Leader-Follower: Single Source of Truth
Leader-Follower (Master-Replica) is the most common replication model. One leader node (elected by quorum) accepts all writes. Follower nodes receive a copy via the replication log and serve read requests.
Synchronous vs Asynchronous Replication
| Mode | Write latency | Guarantee | Use case |
|---|---|---|---|
| Synchronous | High (+RTT to follower) | Follower confirms - data exists on 2+ nodes | PostgreSQL synchronous_commit=on, financial data |
| Asynchronous | Low (local write only) | Eventual - follower will catch up eventually | MySQL default, metrics, logs, non-critical data |
| Semi-synchronous | Medium | At least 1 follower confirmed | MySQL semi-sync - durability vs latency tradeoff |
**Replication lag** - how far a follower is behind the leader. With async replication, a client may read stale data from a replica. PostgreSQL exposes this via `pg_stat_replication.replay_lag`. Typical values: 10-100 ms under normal conditions, seconds under network or disk pressure.
Synchronous replication is always better - more reliable
Synchronous replication adds the follower RTT to every write latency and blocks when the follower is unavailable
PostgreSQL with synchronous_commit=on waits for all synchronous standbys to confirm. If one is temporarily unavailable - all writes wait. The practical compromise: one synchronous standby (not all) plus async for the rest.
Primary fails during async replication. The follower applied 990 out of 1000 transactions. What happens on failover?
Multi-Leader: Multiple Sources of Truth
Multi-Leader (Active-Active) - multiple nodes accept writes. Each leader replicates its changes to the others. The main advantage is writing to the nearest datacenter without cross-region roundtrip. The main problem is **write conflicts**, often resolved through CRDTs.
Write conflict in multi-leader
DC West receives a request: `user.name = "Alice"`. DC East simultaneously receives: `user.name = "Bob"`. Both leaders apply the write locally and send a replica to each other. After 80 ms (US inter-region RTT) both detect a conflict. Whose version wins?
| Strategy | How it works | Problem |
|---|---|---|
| Last Write Wins (LWW) | Write with the highest timestamp wins | Clocks are not synchronized - NTP drift 10+ ms. Cassandra uses LWW by default |
| First Write Wins | First write wins, subsequent ones are rejected | Requires coordination to determine 'first' - loses the benefit of multi-leader |
| Merge / CRDT | Values are merged according to data type logic | Works for counters, sets, but not for arbitrary strings |
| Application logic | Conflict is passed to the application layer for resolution | Complex to implement correctly, but maximally flexible |
**Where multi-leader is used:** CouchDB (offline-first sync), Google Docs (CRDT-based), PlanetScale (Vitess multi-region), GitLab Geo (geo-replication). Avoid for financial data and cases where conflicts cannot be resolved without data loss.
CouchDB is used for a mobile app with offline mode. A user edits a document offline and syncs later. Which conflict strategy fits best?
Leaderless: Quorums Instead of a Leader
**Amazon Dynamo, 2007 - a paper that changed the industry.** Werner Vogels and team described a system with no single leader: the client writes to N nodes, a write is successful when W of them confirm. Reading from R nodes returns the freshest version. DynamoDB, Cassandra, Riak - all grew from this idea. The consistency guarantee: W + R > N.
| Node | Action | Result |
|---|---|---|
| Node 1 | Receives write request | ACK |
| Node 2 | Receives write request | ACK |
| Node 3 | Down | Write fails |
| Total | Quorum W=2 reached (2 of 3 ACK) | Write succeeds |
| W/R/N config | Property | Used for |
|---|---|---|
| W=1, R=N | Fast write, slow read | Write-heavy systems, logs |
| W=N, R=1 | Slow write, fast read | Read-heavy with strict consistency |
| W=2, R=2, N=3 | Balanced - 1 node can fail | Cassandra default (QUORUM) |
| W=1, R=1 | Maximum speed, no guarantees | Eventual consistency, caches |
**Sloppy quorum** - when the responsible replicas are unavailable, Dynamo-style systems write to any W available nodes, even ones not normally responsible for the key. After recovery - hinted handoff transfers data to the correct nodes. Cassandra does this by default.
W + R > N guarantees strong consistency like an RDBMS
W + R > N guarantees that at least one node with the latest version participates in reads, but not atomicity or isolation
Leaderless systems do not support cross-key transactions by default. Two concurrent writes to different replicas can create a conflict. W+R>N is necessary but not sufficient for strong consistency.
Cassandra cluster: N=5, W=3, R=3. Two nodes go down. Does the system continue accepting writes?
Вопросы для размышления
- Consider a service used daily (messenger, banking app, cloud storage). What replication model does it most likely use and why is that choice justified for its specific use case?
Связанные уроки
- ds-08-vector-clocks — Multi-master replication needs vector clocks for conflicts
- ds-10-crdts — CRDT-based replication is a coordination-free strategy
- dist-12-consistency — Replication lag directly limits achievable consistency level
- db-13-transactions