Distributed Systems

Data Replication

Цели урока

  • Explain three reasons for replication and when each is relevant
  • Describe the Leader-Follower model and the sync vs async replication tradeoff
  • Understand the conflict problem in Multi-Leader and main resolution strategies
  • Calculate W/R/N quorums for a Leaderless system given specific requirements

Предварительные знания

  • Understanding of the CAP theorem and eventual consistency
  • Basic knowledge of transactions and WAL (Write-Ahead Log)
  • Familiarity with distributed failure modes

GitHub, October 2018: 24 seconds of network disruption desynced MySQL replicas and caused 5 hours of degraded service. The choice of replication model is a choice between speed, consistency, and conflict complexity.

  • **PostgreSQL streaming replication** - Leader-Follower with synchronous_commit=remote_apply for financial data (no data loss on failover)
  • **Cassandra QUORUM** - Leaderless W=2, R=2, N=3 at Netflix for user settings and queues
  • **Google Docs** - Multi-Leader with CRDT-based conflict resolution, enabling offline editing and concurrent edits
  • **CouchDB** - Multi-Leader for mobile apps, conflicts resolved via a user-visible revision tree
  • **DynamoDB Global Tables** - Multi-Leader across AWS regions, LWW by default with optional custom conflict resolution

Amazon Dynamo and the Birth of NoSQL

In 2007 Amazon published 'Dynamo: Amazon's Highly Available Key-value Store'. Engineers described a system with no single leader: quorums, consistent hashing, vector clocks, hinted handoff. The paper became the manifesto of the NoSQL movement. Cassandra, Riak, DynamoDB all implemented Dynamo ideas. One co-author, DeCandia, later described how the team convinced skeptics: 'we cannot let a single datacenter outage stop a customer's shopping cart'.

Why Replicate Data

**GitHub, October 28, 2018. At 22:52 UTC a network engineer swaps equipment in one datacenter - traffic briefly reroutes through Asia. In 24 seconds, MySQL master-replica pairs fall out of sync. Result: 4 minutes of full outage and 5 hours of degraded mode.** Replication is not just copying - it is an architectural decision about where truth lives and how it propagates across nodes.

Three reasons systems replicate data are not reducible to each other.

GoalWhat it providesExample
Fault toleranceData remains available when a node failsPostgreSQL streaming replication: primary fails, standby takes over within seconds
Low latencyRead from a geographically close node instead of cross-region roundtripNetflix: 13 AWS regions, video served from the nearest one (p50 < 20 ms vs 200+ ms)
Read scalingLoad distributed across replicas - horizontal scale without shardingInstagram used PostgreSQL + 20 read replicas before migrating to Cassandra

**The core challenge of replication** is not copying data - it is managing changes. Every write must reach all replicas in the correct order. If a replica lags 1 second behind, thousands of transactions may have arrived in that window. How to handle conflicts? In what order to apply? This is what drives the choice of replication model.

A replica is the same as a backup

A replica is a live copy synchronized in real time. A backup is a snapshot of past state.

Running DROP TABLE or a mass DELETE replicates to all replicas instantly. Backup is unaffected. Replication does not replace a backup strategy.

A system reads data 10x more often than it writes. What is the primary goal of replication in this case?

Leader-Follower: Single Source of Truth

Leader-Follower (Master-Replica) is the most common replication model. One leader node (elected by quorum) accepts all writes. Follower nodes receive a copy via the replication log and serve read requests.

Synchronous vs Asynchronous Replication

ModeWrite latencyGuaranteeUse case
SynchronousHigh (+RTT to follower)Follower confirms - data exists on 2+ nodesPostgreSQL synchronous_commit=on, financial data
AsynchronousLow (local write only)Eventual - follower will catch up eventuallyMySQL default, metrics, logs, non-critical data
Semi-synchronousMediumAt least 1 follower confirmedMySQL semi-sync - durability vs latency tradeoff

**Replication lag** - how far a follower is behind the leader. With async replication, a client may read stale data from a replica. PostgreSQL exposes this via `pg_stat_replication.replay_lag`. Typical values: 10-100 ms under normal conditions, seconds under network or disk pressure.

Synchronous replication is always better - more reliable

Synchronous replication adds the follower RTT to every write latency and blocks when the follower is unavailable

PostgreSQL with synchronous_commit=on waits for all synchronous standbys to confirm. If one is temporarily unavailable - all writes wait. The practical compromise: one synchronous standby (not all) plus async for the rest.

Primary fails during async replication. The follower applied 990 out of 1000 transactions. What happens on failover?

Multi-Leader: Multiple Sources of Truth

Multi-Leader (Active-Active) - multiple nodes accept writes. Each leader replicates its changes to the others. The main advantage is writing to the nearest datacenter without cross-region roundtrip. The main problem is **write conflicts**, often resolved through CRDTs.

Write conflict in multi-leader

DC West receives a request: `user.name = "Alice"`. DC East simultaneously receives: `user.name = "Bob"`. Both leaders apply the write locally and send a replica to each other. After 80 ms (US inter-region RTT) both detect a conflict. Whose version wins?

StrategyHow it worksProblem
Last Write Wins (LWW)Write with the highest timestamp winsClocks are not synchronized - NTP drift 10+ ms. Cassandra uses LWW by default
First Write WinsFirst write wins, subsequent ones are rejectedRequires coordination to determine 'first' - loses the benefit of multi-leader
Merge / CRDTValues are merged according to data type logicWorks for counters, sets, but not for arbitrary strings
Application logicConflict is passed to the application layer for resolutionComplex to implement correctly, but maximally flexible

**Where multi-leader is used:** CouchDB (offline-first sync), Google Docs (CRDT-based), PlanetScale (Vitess multi-region), GitLab Geo (geo-replication). Avoid for financial data and cases where conflicts cannot be resolved without data loss.

CouchDB is used for a mobile app with offline mode. A user edits a document offline and syncs later. Which conflict strategy fits best?

Leaderless: Quorums Instead of a Leader

**Amazon Dynamo, 2007 - a paper that changed the industry.** Werner Vogels and team described a system with no single leader: the client writes to N nodes, a write is successful when W of them confirm. Reading from R nodes returns the freshest version. DynamoDB, Cassandra, Riak - all grew from this idea. The consistency guarantee: W + R > N.

NodeActionResult
Node 1Receives write requestACK
Node 2Receives write requestACK
Node 3DownWrite fails
TotalQuorum W=2 reached (2 of 3 ACK)Write succeeds
W/R/N configPropertyUsed for
W=1, R=NFast write, slow readWrite-heavy systems, logs
W=N, R=1Slow write, fast readRead-heavy with strict consistency
W=2, R=2, N=3Balanced - 1 node can failCassandra default (QUORUM)
W=1, R=1Maximum speed, no guaranteesEventual consistency, caches

**Sloppy quorum** - when the responsible replicas are unavailable, Dynamo-style systems write to any W available nodes, even ones not normally responsible for the key. After recovery - hinted handoff transfers data to the correct nodes. Cassandra does this by default.

W + R > N guarantees strong consistency like an RDBMS

W + R > N guarantees that at least one node with the latest version participates in reads, but not atomicity or isolation

Leaderless systems do not support cross-key transactions by default. Two concurrent writes to different replicas can create a conflict. W+R>N is necessary but not sufficient for strong consistency.

Cassandra cluster: N=5, W=3, R=3. Two nodes go down. Does the system continue accepting writes?

Вопросы для размышления

  • Consider a service used daily (messenger, banking app, cloud storage). What replication model does it most likely use and why is that choice justified for its specific use case?

Связанные уроки

  • ds-08-vector-clocks — Multi-master replication needs vector clocks for conflicts
  • ds-10-crdts — CRDT-based replication is a coordination-free strategy
  • dist-12-consistency — Replication lag directly limits achievable consistency level
  • db-13-transactions
Data Replication

0

1

Sign In