Real-Time Backend

Introduction to CRDT

Two people edit a Figma file at the same time. One adds a button, the other deletes the same frame. The network drops for 3 seconds. Who is right? CRDT makes this question mathematically solvable, with no central arbiter, no locks, and a guarantee that both clients reach the same state.

  • **Figma** uses CmRDT for real-time collaboration: 100+ editors in one file, 50M operations/day, p99 operation-apply latency under 10ms - with no single lock on the document
  • **Redis Enterprise** implements CRDT counters and CRDT sets for geo-distributed clusters: 5 datacenters sync through OR-Set and PN-Counter, conflicts resolved automatically by Add-Wins semantics
  • **Riak 2.0** (Basho, 2014) was the first production NoSQL DB with native CRDT types: Maps, Sets, Counters, Registers. LinkedIn used Riak CRDT to store a social graph of 400M users
  • **Apple Notes** and **iCloud** use state-based CRDT for offline-first sync: edits on an iPhone with no network apply locally, and when the connection returns the merge happens deterministically

What CRDT is

In 2011 the Google Docs and Riak teams ran into the same problem: two users edit one document, the network drops for 200ms. Whose changes win? Classic Last-Write-Wins destroys someone's edits. Operational Transformation needs a central arbiter server. Both break under real partitions.

CRDT (Conflict-free Replicated Data Type) is a mathematically guaranteed way to make sure any two nodes that received the same set of operations in any order end up in the same state. Not 'we will try to resolve the conflict', but provable convergence.

The three properties that make it work

  • **Associativity**: merge(merge(A,B),C) = merge(A,merge(B,C)) - grouping order doesn't matter
  • **Commutativity**: merge(A,B) = merge(B,A) - operand order doesn't matter
  • **Idempotency**: merge(A,A) = A - re-applying does not change the result

Any data structure whose merge operation has these three properties is automatically a CRDT. Two replicas that exchanged all operations converge to the same state with no coordination.

Which of the three CRDT properties guarantees that redelivering one operation will not corrupt the state?

CvRDT: state-based (convergent)

State-based CRDT (CvRDT, Convergent) ships the full state between replicas and combines them via a merge function. Redis uses this model in its CRDT counter; Riak builds its entire eventual consistency on CvRDT. Nodes don't have to be online at the same time. When they reconnect, they sync by exchanging states.

Lattice as the mathematical foundation

CvRDT requires the set of states to form a partially ordered lattice (join-semilattice). Merge is the join operation (least upper bound). It is monotonic: state only grows, never shrinks. That monotonicity is what guarantees convergence.

**CvRDT downside:** with large state, full syncs are expensive. Riak solves this with delta-state CvRDT: only the 'delta' since the last exchange is shipped, not the full state. Message size drops by 90-98% with frequent syncs.

Node A has PN-Counter {P:{A:10}, N:{A:2}}. Node B has {P:{B:7}, N:{B:3}}. What is the value after merge?

CmRDT: operation-based (commutative)

Operation-based CRDT (CmRDT, Commutative) ships operations instead of state. This is what powers real-time Google Docs since 2013 and Figma since 2016: 50M operations per day, sub-100ms editing latency, no central lock on the document.

CmRDT requires reliable causal broadcast: every operation must reach every replica exactly once, and after every operation it depends on. In exchange you get compact messages and O(1) operation application.

**CvRDT vs CmRDT trade-off:** CvRDT is simpler to implement (no transport requirements) but expensive with large state. CmRDT is compact but needs exactly-once causal delivery, which means a reliable message broker (Kafka, NATS JetStream). Figma chose CmRDT + WebSocket with NATS for compact operations with 100+ concurrent editors in a single file.

Two clients act at the same time: A adds element 'x', B removes 'x' (which was already there). OR-Set with Add-Wins semantics - what is in the final state?

The math: join-semilattices and monotonic functions

Behind every CRDT sits one algebraic structure: the join-semilattice. A partially ordered set (S, ≤) with a join operation (⊔) that returns the least upper bound of any two elements. This is what mathematicians call a 'lattice'.

CALM Theorem: the boundary of applicability

CALM (Consistency As Logical Monotonicity) is Hellerstein's 2011 theorem. It became the theoretical foundation for Berkeley's Bloom DSL and influenced distributed systems design for the next decade. The statement: a program converges without coordination if and only if it is monotonic. Non-monotonic operations (delete, overwrite, conditional branches) require either coordination or special packaging into a CRDT.

  • **Monotonic operations** (set-union, append, max, min) - free convergence, CRDT-friendly
  • **Non-monotonic operations** (set-difference, overwrite) - need wrapping in a tombstone/tag mechanism
  • **In practice**: Figma stores all deleted objects as 'tombstone' records, turning remove into a monotonic add-to-deleted-set

CRDT solves all consistency problems - just use CRDT and stop worrying about conflicts

CRDT guarantees state convergence, but the semantics of the result (Add-Wins vs Remove-Wins, last-write wins) is up to the developer. CRDT does not answer 'what is the right result', only 'all nodes will reach the same result'.

Figma had to pick between Add-Wins and Remove-Wins for concurrent delete+create of an object: both are valid CRDTs, but they produce different UX. Choosing semantics is a product decision, not a technical one. CRDT removes the nondeterminism but not the need to design conflict behavior.

By CALM theorem, which operation does NOT require coordination between nodes for convergence?

Key takeaways

  • CRDT is a data structure with mathematically guaranteed convergence: any two nodes that received the same set of operations reach the same state. Three merge properties: associativity, commutativity, idempotency.
  • CvRDT (state-based) ships the full state and combines via a join in a join-semilattice. Simpler to implement, expensive with large state. Used in Redis CRDT and Riak.
  • CmRDT (operation-based) ships operations and requires causal exactly-once delivery. More compact, scales better. Used in Figma and Google Docs.
  • CALM theorem: monotonic functions converge without coordination. Non-monotonic operations (delete, overwrite) need wrapping in tombstone/tag mechanisms.

Related topics

CRDT rests on several core distributed systems ideas:

  • Eventual Consistency — CRDT is a concrete mathematical implementation of eventual consistency with provable convergence instead of 'best effort'
  • Vector Clocks — CmRDT uses vector clocks for causal ordering of operations. Without them you can't determine dependencies between concurrent ops
  • CAP Theorem — CRDT lets you trade strong consistency (C in CAP) in a predictable way, picking availability and partition tolerance with a guarantee of eventual convergence

Вопросы для размышления

  • Figma picks Add-Wins for concurrent add/delete of an object. In what product would Remove-Wins make sense, and how would the UX change?
  • G-Counter only allows increments. How would you design a CRDT for a bank account that needs both increment and decrement but where the balance must never go negative?
  • CALM theorem says monotonic programs need no coordination. Can any distributed system be rewritten using only monotonic operations, or are there hard limits?

Связанные уроки

  • rt-46 — OT solves the same problem (concurrent edits) but encodes order in the algorithm instead of in the data structure
  • rt-48 — Specific CRDT types (G-Counter, OR-Set, etc.) build on the foundational properties introduced here
  • db-04-cap
Introduction to CRDT

0

1

Sign In