Real-Time Backend

The Concurrent Edits Problem

It's Friday 6pm and two Figma employees drag the same design component at the same time. Whose edit wins, and why is that an architecture question, not a luck question?

Google Docs handles ~4.5B edits per day. Without solving concurrent edits, every shared document would turn into garbage
Cassandra (Netflix, Instagram, Discord) applies LWW by default. With node clock skew above 1ms, fresh data can be lost
Knight Capital Group lost $440M in 45 minutes in 2012 due to a lost update in a trading algorithm. No protection against concurrent writes in a critical system

Concurrent Edits

A concurrent edit is when two or more clients modify the same data at the same time without knowing about each other. Google Docs handles ~4.5 billion edits per day, every one of them potentially racing with another.

The problem comes from network latency: client A sends a change, client B also sends a change, and both believe they are working on the current version of the document. The server receives two conflicting states and has to decide what to do.

Concurrent edits are not a bug in a specific system. They are an inevitable consequence of the CAP theorem: if a system is available (A) and partition tolerant (P), it cannot guarantee consistency (C) at the moment of concurrent edits.

Figma: ~8M users edit designs simultaneously, concurrent edits happen at the level of individual tree nodes
Notion: multi-cursor block editing, every block updates independently
VS Code Live Share: collaborative code editing with character-level operations

Client A and client B both read counter = 10, each adds 1, then writes back. What is the final counter value?

Lost Updates

A lost update is when one client's change is fully overwritten by another client's change, and the first one disappears without a trace. This is the most common conflict in any system with shared state.

In 2012 a lost update bug at Knight Capital Group cost $440M in 45 minutes: the trading algorithm did concurrent writes to shared position state without locking.

Standard fixes for lost updates: pessimistic locking (lock before read), optimistic concurrency control (versioning), atomic operations like INCREMENT instead of READ-MODIFY-WRITE.

Pessimistic lock: SELECT FOR UPDATE locks the row until commit. Works, but kills throughput under high contention
Optimistic lock: read version=5, write WHERE version=5. If someone changed it, you get 0 affected rows and retry
Atomic op: instead of UPDATE SET balance=1500, write UPDATE SET balance=balance+500. Atomic at the DB level

Which mechanism prevents lost updates under high contention (thousands of requests/sec on one row) with the least throughput impact?

Last-Write-Wins

Last-Write-Wins (LWW) is a conflict resolution strategy: when several concurrent writes happen, the one with the highest timestamp wins. It is the simplest strategy and the most dangerous.

Amazon DynamoDB uses LWW by default when multi-master is enabled (Global Tables). In 2019, Slack ran into a problem: with clock skew above 50ms between datacenters, LWW produced unpredictable results when updating user statuses.

LWW is safe only for idempotent operations with monotonically increasing values (last known GPS position, last online/offline status). For accumulating operations (counters, shopping cart), LWW is guaranteed to lose data.

Redis: LWW via MULTI/EXEC with version check. Used in 50%+ of production Redis installs
Cassandra: LWW by default for all writes, Lamport clock for ordering
Riak: LWW as one of three strategies (alongside vector clocks and CRDT)

A system uses LWW. Client A (clock behind by 200ms) writes a value at t=1000ms. Client B (accurate clock) writes a different value at t=900ms. What ends up in the database?

Types of conflicts

Not all conflicts are equal. Knowing the type of conflict determines the resolution strategy: some can be merged automatically, some require an explicit user choice, and some aren't conflicts at all with the right data structure.

Write-write conflict: two clients change the same field, requires an explicit strategy (LWW, merge, reject)
Read-write conflict: a client reads data while another modifies it, leading to torn reads
Phantom conflict: two clients insert rows in overlapping ranges, classic with parallel inserts into range-partitioned tables

Linear conflict resolution (apply in some order) only works for commutative operations. Text editors need either Operational Transformation (Google Docs) or CRDT (Figma, Notion). Both guarantee eventual consistency without losing edits.

Practical rule: if an operation is commutative and associative, it can be merged automatically. Add(x) + Add(y) = Add(y) + Add(x). If not, you need OT, CRDT, or a user-driven merge.

LWW always loses data, it's a bad strategy that shouldn't be used in production

LWW is a great fit for idempotent and monotonically-increasing data: GPS positions, statuses, settings

Data loss under LWW only happens for accumulating operations (counters, lists). For 'last known state' data, LWW is the semantically correct strategy. The older value didn't need to be kept anyway.

Which type of conflict does NOT require an explicit resolution strategy when the data structure is chosen correctly?

Key takeaways

Concurrent edits happen when two clients read the same state and modify it independently. Networks make this unavoidable in any distributed system
A lost update is a special case of concurrent edit: one write silently overwrites another. Fix with atomic operations, OCC, or pessimistic locks
LWW (Last-Write-Wins) resolves conflicts by timestamp. Simple to implement, dangerous under clock skew, unfit for accumulating operations
The type of conflict drives the strategy: commutative operations (Add-to-Set) merge without loss, non-commutative ones (text inserts) need OT or CRDT

Вопросы для размышления

If a system uses LWW and clock skew between servers is 500ms, what percentage of writes risks being lost under heavy concurrent load?
Why is a shopping cart the classic case where LWW is unacceptable, while a page view counter is the opposite case where LWW is fine?
Under what condition does optimistic concurrency control (versioning) turn from a solution into a new problem (retry storm)?

Связанные уроки

rt-46 — CRDT is the mathematically rigorous solution to concurrent edits for commutative operations, with no central coordinator
rt-47 — Operational Transformation resolves concurrent edits on sequence structures (text, lists) by transforming operations with context
db-14-mvcc

Real-Time Backend

The Concurrent Edits Problem

It's Friday 6pm and two Figma employees drag the same design component at the same time. Whose edit wins, and why is that an architecture question, not a luck question?

Google Docs handles ~4.5B edits per day. Without solving concurrent edits, every shared document would turn into garbage
Cassandra (Netflix, Instagram, Discord) applies LWW by default. With node clock skew above 1ms, fresh data can be lost
Knight Capital Group lost $440M in 45 minutes in 2012 due to a lost update in a trading algorithm. No protection against concurrent writes in a critical system

Concurrent Edits

Figma: ~8M users edit designs simultaneously, concurrent edits happen at the level of individual tree nodes
Notion: multi-cursor block editing, every block updates independently
VS Code Live Share: collaborative code editing with character-level operations

Client A and client B both read counter = 10, each adds 1, then writes back. What is the final counter value?

Lost Updates

In 2012 a lost update bug at Knight Capital Group cost $440M in 45 minutes: the trading algorithm did concurrent writes to shared position state without locking.

Standard fixes for lost updates: pessimistic locking (lock before read), optimistic concurrency control (versioning), atomic operations like INCREMENT instead of READ-MODIFY-WRITE.

Pessimistic lock: SELECT FOR UPDATE locks the row until commit. Works, but kills throughput under high contention
Optimistic lock: read version=5, write WHERE version=5. If someone changed it, you get 0 affected rows and retry
Atomic op: instead of UPDATE SET balance=1500, write UPDATE SET balance=balance+500. Atomic at the DB level

Which mechanism prevents lost updates under high contention (thousands of requests/sec on one row) with the least throughput impact?

Last-Write-Wins

Last-Write-Wins (LWW) is a conflict resolution strategy: when several concurrent writes happen, the one with the highest timestamp wins. It is the simplest strategy and the most dangerous.

Redis: LWW via MULTI/EXEC with version check. Used in 50%+ of production Redis installs
Cassandra: LWW by default for all writes, Lamport clock for ordering
Riak: LWW as one of three strategies (alongside vector clocks and CRDT)

A system uses LWW. Client A (clock behind by 200ms) writes a value at t=1000ms. Client B (accurate clock) writes a different value at t=900ms. What ends up in the database?

Types of conflicts

Write-write conflict: two clients change the same field, requires an explicit strategy (LWW, merge, reject)
Read-write conflict: a client reads data while another modifies it, leading to torn reads
Phantom conflict: two clients insert rows in overlapping ranges, classic with parallel inserts into range-partitioned tables

Practical rule: if an operation is commutative and associative, it can be merged automatically. Add(x) + Add(y) = Add(y) + Add(x). If not, you need OT, CRDT, or a user-driven merge.

LWW always loses data, it's a bad strategy that shouldn't be used in production

LWW is a great fit for idempotent and monotonically-increasing data: GPS positions, statuses, settings

Data loss under LWW only happens for accumulating operations (counters, lists). For 'last known state' data, LWW is the semantically correct strategy. The older value didn't need to be kept anyway.

Which type of conflict does NOT require an explicit resolution strategy when the data structure is chosen correctly?

Key takeaways

Concurrent edits happen when two clients read the same state and modify it independently. Networks make this unavoidable in any distributed system
A lost update is a special case of concurrent edit: one write silently overwrites another. Fix with atomic operations, OCC, or pessimistic locks
LWW (Last-Write-Wins) resolves conflicts by timestamp. Simple to implement, dangerous under clock skew, unfit for accumulating operations
The type of conflict drives the strategy: commutative operations (Add-to-Set) merge without loss, non-commutative ones (text inserts) need OT or CRDT

Вопросы для размышления

If a system uses LWW and clock skew between servers is 500ms, what percentage of writes risks being lost under heavy concurrent load?
Why is a shopping cart the classic case where LWW is unacceptable, while a page view counter is the opposite case where LWW is fine?
Under what condition does optimistic concurrency control (versioning) turn from a solution into a new problem (retry storm)?

Связанные уроки

rt-46 — CRDT is the mathematically rigorous solution to concurrent edits for commutative operations, with no central coordinator
rt-47 — Operational Transformation resolves concurrent edits on sequence structures (text, lists) by transforming operations with context
db-14-mvcc

The Concurrent Edits Problem

Concurrent Edits

Lost Updates

Last-Write-Wins

Types of conflicts

Key takeaways

Related topics

Вопросы для размышления

Связанные уроки

The Concurrent Edits Problem

Concurrent Edits

Lost Updates

Last-Write-Wins

Types of conflicts

Key takeaways

Related topics

Вопросы для размышления

Связанные уроки