Real-Time Backend
The Concurrent Edits Problem
It's Friday 6pm and two Figma employees drag the same design component at the same time. Whose edit wins, and why is that an architecture question, not a luck question?
- Google Docs handles ~4.5B edits per day. Without solving concurrent edits, every shared document would turn into garbage
- Cassandra (Netflix, Instagram, Discord) applies LWW by default. With node clock skew above 1ms, fresh data can be lost
- Knight Capital Group lost $440M in 45 minutes in 2012 due to a lost update in a trading algorithm. No protection against concurrent writes in a critical system
Concurrent Edits
A concurrent edit is when two or more clients modify the same data at the same time without knowing about each other. Google Docs handles ~4.5 billion edits per day, every one of them potentially racing with another.
The problem comes from network latency: client A sends a change, client B also sends a change, and both believe they are working on the current version of the document. The server receives two conflicting states and has to decide what to do.
Concurrent edits are not a bug in a specific system. They are an inevitable consequence of the CAP theorem: if a system is available (A) and partition tolerant (P), it cannot guarantee consistency (C) at the moment of concurrent edits.
- Figma: ~8M users edit designs simultaneously, concurrent edits happen at the level of individual tree nodes
- Notion: multi-cursor block editing, every block updates independently
- VS Code Live Share: collaborative code editing with character-level operations
Client A and client B both read counter = 10, each adds 1, then writes back. What is the final counter value?
Lost Updates
A lost update is when one client's change is fully overwritten by another client's change, and the first one disappears without a trace. This is the most common conflict in any system with shared state.
In 2012 a lost update bug at Knight Capital Group cost $440M in 45 minutes: the trading algorithm did concurrent writes to shared position state without locking.
Standard fixes for lost updates: pessimistic locking (lock before read), optimistic concurrency control (versioning), atomic operations like INCREMENT instead of READ-MODIFY-WRITE.
- Pessimistic lock: SELECT FOR UPDATE locks the row until commit. Works, but kills throughput under high contention
- Optimistic lock: read version=5, write WHERE version=5. If someone changed it, you get 0 affected rows and retry
- Atomic op: instead of UPDATE SET balance=1500, write UPDATE SET balance=balance+500. Atomic at the DB level
Which mechanism prevents lost updates under high contention (thousands of requests/sec on one row) with the least throughput impact?
Last-Write-Wins
Last-Write-Wins (LWW) is a conflict resolution strategy: when several concurrent writes happen, the one with the highest timestamp wins. It is the simplest strategy and the most dangerous.
Amazon DynamoDB uses LWW by default when multi-master is enabled (Global Tables). In 2019, Slack ran into a problem: with clock skew above 50ms between datacenters, LWW produced unpredictable results when updating user statuses.
LWW is safe only for idempotent operations with monotonically increasing values (last known GPS position, last online/offline status). For accumulating operations (counters, shopping cart), LWW is guaranteed to lose data.
- Redis: LWW via MULTI/EXEC with version check. Used in 50%+ of production Redis installs
- Cassandra: LWW by default for all writes, Lamport clock for ordering
- Riak: LWW as one of three strategies (alongside vector clocks and CRDT)
A system uses LWW. Client A (clock behind by 200ms) writes a value at t=1000ms. Client B (accurate clock) writes a different value at t=900ms. What ends up in the database?
Types of conflicts
Not all conflicts are equal. Knowing the type of conflict determines the resolution strategy: some can be merged automatically, some require an explicit user choice, and some aren't conflicts at all with the right data structure.
- Write-write conflict: two clients change the same field, requires an explicit strategy (LWW, merge, reject)
- Read-write conflict: a client reads data while another modifies it, leading to torn reads
- Phantom conflict: two clients insert rows in overlapping ranges, classic with parallel inserts into range-partitioned tables
Linear conflict resolution (apply in some order) only works for commutative operations. Text editors need either Operational Transformation (Google Docs) or CRDT (Figma, Notion). Both guarantee eventual consistency without losing edits.
Practical rule: if an operation is commutative and associative, it can be merged automatically. Add(x) + Add(y) = Add(y) + Add(x). If not, you need OT, CRDT, or a user-driven merge.
LWW always loses data, it's a bad strategy that shouldn't be used in production
LWW is a great fit for idempotent and monotonically-increasing data: GPS positions, statuses, settings
Data loss under LWW only happens for accumulating operations (counters, lists). For 'last known state' data, LWW is the semantically correct strategy. The older value didn't need to be kept anyway.
Which type of conflict does NOT require an explicit resolution strategy when the data structure is chosen correctly?
Key takeaways
- Concurrent edits happen when two clients read the same state and modify it independently. Networks make this unavoidable in any distributed system
- A lost update is a special case of concurrent edit: one write silently overwrites another. Fix with atomic operations, OCC, or pessimistic locks
- LWW (Last-Write-Wins) resolves conflicts by timestamp. Simple to implement, dangerous under clock skew, unfit for accumulating operations
- The type of conflict drives the strategy: commutative operations (Add-to-Set) merge without loss, non-commutative ones (text inserts) need OT or CRDT
Related topics
The concurrent edits problem is the entry point into real-time architecture:
- CRDT — CRDT is the mathematically rigorous solution to concurrent edits for commutative operations, with no central coordinator
- Operational Transformation — OT resolves concurrent edits on sequence structures (text, lists) by transforming operations with context
- Vector Clocks — Vector clocks give a causally consistent event ordering. They replace wall-clock timestamps in LWW to remove the clock skew problem
Вопросы для размышления
- If a system uses LWW and clock skew between servers is 500ms, what percentage of writes risks being lost under heavy concurrent load?
- Why is a shopping cart the classic case where LWW is unacceptable, while a page view counter is the opposite case where LWW is fine?
- Under what condition does optimistic concurrency control (versioning) turn from a solution into a new problem (retry storm)?
Связанные уроки
- rt-46 — CRDT is the mathematically rigorous solution to concurrent edits for commutative operations, with no central coordinator
- rt-47 — Operational Transformation resolves concurrent edits on sequence structures (text, lists) by transforming operations with context
- db-14-mvcc