Databases
MVCC: How Databases Handle Concurrency
In Postgres `UPDATE accounts SET balance = balance - 100` does not change the row. It creates a new one. The old one stays alongside, marked as dead. Billions of times a day - in every bank, every stream, every LLM agent memory.
- **Amazon Black Friday 2014:** 6M txn/sec, MVCC gave linear read scaling with zero locks
- **Sentry 2015:** autovacuum fell behind, txid wraparound nearly bricked production - 11M transactions from disaster
- **Discord on Postgres:** trillions of row versions in pg_history, autovacuum runs 24/7
- **CockroachDB and Spanner:** distributed MVCC with HLC (hybrid logical clocks) - same xmin/xmax, distributed
MVCC: Writers Do Not Block Readers
Black Friday 2014, Amazon. Six million transactions per second, and every one wants to read and write the same rows. Classic locking would gridlock the whole thing: one writer, all readers wait. Postgres, MySQL InnoDB, Oracle, and CockroachDB solve this the same way - MVCC. Multi-Version Concurrency Control.
The idea is almost embarrassingly simple. UPDATE does not overwrite a row. It creates a *new version* and marks the old one as obsolete from moment T. Each transaction sees a snapshot of the database as it was at its own start - and reads what was current *then*. Writers and readers stop fighting over the same physical row.
Same logic powers LoRA adapters: base weights stay untouched, new versions stack on top. Same logic in Git: a commit does not erase history, it adds a new snapshot. MVCC is Git inside a table row.
**The key inversion:** pessimistic concurrency makes a reader wait for the writer. MVCC means the reader never waits - it sees the old version sitting next to the new one, visible specifically to its own snapshot.
Why does a SELECT in MVCC not block on an UPDATE to the same row?
xmin/xmax: Birth and Death Stamps
Postgres keeps two numbers on every row: `xmin` - the ID of the transaction that created it, and `xmax` - the ID of the transaction that deleted or replaced it. From these two numbers the database decides whether a row is visible to a given transaction.
The rule is brutally simple. A row is visible to transaction T if: `xmin` is already committed and at or below T's snapshot, and `xmax` is empty, not yet committed, or above T's snapshot. Three conditions, one comparison. Same arithmetic that Adam runs against $\beta_1, \beta_2$ - three parameters decide what gets averaged.
UPDATE is really an INSERT plus a marked DELETE. The old row stays, gets its `xmax` stamped. The new row gets a fresh `xmin`. Two versions live on disk at once. Anyone that started before the UPDATE sees the old one. Anyone that started after COMMIT sees the new one.
**Cost of visibility:** every row carries a 23-byte header with xmin, xmax, flags, and pointers. That is why every UPDATE in Postgres is an INSERT of a new row, even for a one-byte change. Same reason LoRA adds a matrix instead of editing original weights - cheap to store, expensive to mutate.
What does an UPDATE physically mean in Postgres?
Snapshot Isolation: Each Transaction Has Its Own Reality
When a transaction starts, Postgres takes a snapshot: the list of *active* transactions at that moment. This snapshot is the visibility passport. Any row whose `xmin` belongs to a transaction committed before the start is visible. Any row whose `xmin` belongs to an active or future transaction is invisible.
That is Snapshot Isolation - an isolation level Postgres calls REPEATABLE READ and Oracle calls SERIALIZABLE. The names disagree, the implementations agree. Reads always see the data as it stood at BEGIN.
RLHF analogy: a policy snapshot is frozen before each PPO step, so that advantage is measured against a stable baseline. MVCC does the same: a transaction freezes reality and operates inside it until COMMIT.
**What Snapshot Isolation cannot catch:** write skew. Two transactions read disjoint snapshots, both write something mutually exclusive, both commit. Classic case - two doctors taking themselves off call, each seeing 'the other is on duty'. Postgres fixes this through SERIALIZABLE (SSI - Serializable Snapshot Isolation), Michael Cahill, 2008.
What is the fundamental difference between REPEATABLE READ and READ COMMITTED in Postgres?
VACUUM: The Sweeper of Dead Versions
Every UPDATE leaves a corpse of the old version. Every DELETE too. Without cleanup the table balloons. In a real project a 100 GB heap can be half-dead - that is bloat. Queries slow down because indexes point at empty slots.
VACUUM scans the heap, finds versions whose `xmax` is committed and below every active snapshot, and marks those slots free. It does not physically delete - it returns space to the free space map. The next INSERT will reuse those slots. Same idea as mark-and-sweep GC in Java or generational GC in V8.
Autovacuum is the daemon that runs VACUUM automatically, watching the dead tuple counter. Same logic as gradient checkpointing - do not free immediately, wait until a critical mass accumulates, then release it in one go.
**The single most common production crash:** autovacuum cannot keep up with write load. Bloat grows, performance drops, somebody runs `VACUUM FULL` during peak hours and locks the table for an hour. GitLab 2017, Sentry 2019, dozens of postmortems with the same root cause.
Why does regular VACUUM not return space to the operating system?
Transaction ID Wraparound: A Bomb in 32 Bits
Postgres identifies transactions with a 32-bit counter. 4.3 billion values sounds like a lot. It is not. At 10 thousand transactions per second the counter overflows in five days.
On overflow, transaction T+1 receives an ID numerically smaller than every existing xmin. Without protection, every row in the database would suddenly look 'from the future' - invisible. The database would brick itself in milliseconds.
That is why VACUUM also does a second invisible job: *freezing*. When a row's xmin grows older than the freeze horizon (default 200 million), it is replaced with a special FrozenXID marker - meaning 'visible to everyone, forever'. Frozen rows no longer belong to any specific transaction.
**Sentry, 2015:** monitoring missed it, autovacuum fell behind, wraparound was 11 million transactions away. Postgres went into read-only mode in production. Scandal, postmortem, now `autovacuum_freeze_max_age` is on every DBA's dashboard.
MVCC is just a read-performance trick - nothing special happens under the hood
MVCC is a trade-off: cheap reads at the price of disk bloat and mandatory background GC
Every UPDATE creates a new version, every 'plain' read goes through xmin/xmax/snapshot visibility checks. Without VACUUM the database dies of bloat, without freeze it dies of wraparound. MVCC is not free - its cost is deferred and paid by background processes, the same way amortized cost works in algorithm analysis
What happens to a Postgres database if the transaction counter overflows without timely freezing?
Related topics
Where MVCC leads next:
- Transactions and ACID — MVCC implements Isolation without read locks
- Locks — MVCC removes read locks, but writers still block writers
- Distributed transactions — CockroachDB and Spanner extend MVCC to the distributed case via HLC
- Vector clocks — xmin/xmax versions are a special case of logical clocks inside one DB
Key ideas
- MVCC: writers do not block readers because multiple row versions coexist
- xmin/xmax stamp birth and death of a row, deciding visibility per transaction
- Snapshot Isolation - each transaction sees the data as of its BEGIN moment
- VACUUM reclaims dead versions, autovacuum runs in the background - without it bloat eats production
- TXID wraparound is a 32-bit bomb and forces freezing of old rows as preventive maintenance
Вопросы для размышления
- Which operations in a real application produce the most bloat and why?
- Which isolation level fits financial transfers and why is READ COMMITTED inadequate?
- What would MVCC look like in a distributed DB where one node's xmin means nothing to another?
Связанные уроки
- db-13-transactions — MVCC lives inside the transactional ACID model
- db-15-locks — MVCC replaces read locks - direct comparison of approaches
- db-03-acid — Isolation is the I in ACID, MVCC is one way to implement it
- os-05-sync — Linux kernel RCU uses the same trick of versions instead of locks
- ds-08-vector-clocks — Row versions are a local cousin of logical clocks inside one DB
- aie-15-conversation-memory — Agent state snapshots use the same versioning approach