Databases
Replication
GitLab 2017: a database administrator ran a replication resync command on the wrong server - the production primary instead of the replica. 300 GB of data were deleted in minutes. Six backup systems were in place; none produced a usable restore. The incident took the site down for 18 hours and was streamed live on YouTube as engineers attempted recovery. The root cause was not the accidental deletion - it was untested failover procedures and unclear replication topology.
- **Instagram**: 30+ PostgreSQL read replicas serve 95% of read traffic. Replication lag above 10 seconds triggers alerts and read traffic is automatically rerouted to the primary.
- **Booking.com**: Galera 3-node active-active cluster for the core booking database. All three nodes accept writes; Galera's certification handles conflicts without application changes.
- **GitHub Orchestrator**: open-source automated failover tool built at GitHub, now used across the industry. Detects primary failure, promotes the best replica, and reconfigures the replication topology automatically.
Master-Slave Replication
Master-slave (primary-replica) replication streams write-ahead log (WAL) or binlog events from the primary to one or more replicas. Replicas apply the same operations in order, producing an eventually consistent copy. Reads can be served from replicas, distributing load and reducing latency for geographically distributed users.
Instagram ran 30+ PostgreSQL read replicas at peak. Engineers routed 95% of read traffic to replicas, keeping the primary focused on writes. Replication lag monitoring (pg_stat_replication lag column) became a critical SLO metric - lag above 10 seconds triggered alerts.
A user writes a post and immediately reads it back - the post is missing. What is the most likely cause in a master-slave setup?
Synchronous vs Asynchronous Replication
Asynchronous replication acknowledges the write to the client after the primary commits, without waiting for replicas. This minimizes write latency but risks data loss if the primary crashes before the replica applies the change. Synchronous replication waits for at least one replica to confirm receipt before acknowledging the client - zero lag, but higher write latency and availability risk.
Synchronous replication blocks writes if the sync standby goes offline. PostgreSQL will hang all writes until the standby reconnects or the configuration is changed. For financial systems requiring zero data loss, this is acceptable. For high-throughput APIs, most teams use async with point-in-time recovery (PITR) as a fallback.
A payment processing system requires that no committed transaction is ever lost, even if the primary crashes immediately after commit. Which replication mode is required?
Failover and Promotion
When the primary fails, one replica must be promoted to primary. Automated failover tools detect failure (missed heartbeats), elect a new primary, update DNS or load balancer routing, and reconfigure remaining replicas to follow the new primary. The old primary, when it recovers, must rejoin as a replica - not attempt to resume as primary.
GitLab's 2017 incident: a database admin accidentally deleted 300 GB of production data during a manual failover procedure. The incident exposed that 5 backup methods were in place but none had been tested. GitLab now runs Patroni for automated failover and tests restores weekly. The incident directly led to the industry trend of automated, tested failover.
After a primary failure and replica promotion, the old primary recovers. What is the correct action?
Multi-Master Replication
Multi-master (active-active) replication allows writes on multiple nodes simultaneously. This eliminates the single write bottleneck and provides geographic write availability. The fundamental challenge is conflict resolution: two clients on different nodes update the same row concurrently - who wins?
Multi-master adds significant complexity. Galera's certification-based conflict detection adds 1-3ms latency per commit for cluster-wide broadcast. Cross-datacenter multi-master (active-active geo) requires careful application design - most teams use active-passive geo with failover instead.
Two nodes in a multi-master cluster concurrently update the same row. What determines the outcome in Galera's certification-based approach?
Quorum and Consensus Replication
Consensus algorithms (Raft, Paxos) provide strongly consistent replication with automatic leader election. A write is committed only when a quorum (majority) of nodes acknowledge it. With N nodes, the cluster tolerates floor(N/2) failures while remaining operational. Raft is used in etcd (Kubernetes), CockroachDB, and TiDB.
etcd, the backing store for all Kubernetes cluster state, uses Raft. A 3-node etcd cluster is the minimum for production Kubernetes - it tolerates one node failure. Large clusters (5-7 nodes) are used for critical infrastructure. etcd processes ~10,000 write operations per second in large clusters.
A 5-node Raft cluster has 2 nodes fail simultaneously. Can the cluster still accept writes?
Key Ideas
- **Async replication** minimizes write latency but can lose the last few seconds of data on primary failure. Suitable for read scaling when some lag is acceptable.
- **Sync replication** guarantees zero data loss but adds write latency and blocks if the sync replica goes offline. Required for financial and payment workloads.
- **Failover** must be automated and tested. The old primary must rejoin as a replica (pg_rewind), not resume as primary - split-brain causes permanent data divergence.
- **Multi-master** enables geographic write distribution but requires conflict resolution strategy (LWW, CRDTs, or certification). Galera uses optimistic certification with rollback.
- **Consensus (Raft)** provides strongly consistent replication with automatic leader election. N=5 tolerates 2 failures; used in etcd, CockroachDB, and TiDB.
Related Topics
Replication intersects with distributed systems theory and operational practices:
- Sharding — Replication scales reads; sharding scales writes. Production systems combine both - each shard has its own replica set.
- CAP Theorem — Sync replication prioritizes Consistency over Availability. Async replication accepts eventual consistency for higher availability.
- Backup and Recovery — Replication is not a backup - both replicas reflect any corruption or deletion. PITR backups complement replication for point-in-time recovery.
Вопросы для размышления
- If a 3-node synchronous cluster loses its network connection between datacenters, which side should continue accepting writes, and why?
- Instagram uses 30 read replicas. How does the application decide which replica to route a specific request to?
- CRDTs resolve conflicts without coordination. What types of data are naturally CRDT-friendly, and what types are not?