PostgreSQL

MVCC: xmin, xmax, Tuple Versioning

A SELECT reads 1000 rows. At the same time an UPDATE changes 500 of them. The SELECT does not see the change and does not wait. The UPDATE does not wait for the SELECT. Both run in parallel without a single lock. That is MVCC, the foundation of PostgreSQL's high concurrency.

  • **Instagram** (Postgres before the Cassandra migration) handled 25M+ reads/sec concurrently with continuous writes on the media table. Without MVCC such concurrency would require read locks and cut throughput by 80%.
  • **Notion** runs bulk background migrations (changing thousands of blocks) while production reads stay unaffected. MVCC hands users old block versions until the migration commits.
  • **Heroku Postgres** monitors n_dead_tup via pg_stat_user_tables. A sharp rise (>20% of live) signals lagging autovacuum and growing MVCC bloat.

Tuple Versioning: a Row as a List of Versions

In PG an UPDATE does not overwrite the row in place. Instead the old version (tuple) is marked as deleted and a new one is inserted next to it, in the same heap file. Both versions physically exist at the same time. This lets different transactions see different versions of one row without read locks.

After an UPDATE on a row in PG, how many physical versions of that row exist in the heap before VACUUM?

xmin and xmax: a Row Version's Passport

Every heap tuple has two system fields. xmin is the transaction ID that created this version (the INSERT or the UPDATE that produced it). xmax is the transaction ID that deleted this version (the DELETE or the UPDATE that replaced it). If xmax = 0 the row is still alive. Visibility for a transaction is determined by: xmin is committed AND xmin is in the snapshot AND (xmax = 0 OR xmax is not yet committed OR xmax is outside the snapshot).

Transaction IDs in PG are 32-bit counters. When txid reaches 2^32 it wraps around: old txids can look as if they are 'in the future' relative to current ones. VACUUM FREEZE marks old rows with the special FrozenTransactionId so they remain visible regardless of wraparound.

A row has xmin=500, xmax=600. Transaction T700 with snapshot (xmin=550, xmax=700) reads this row. Is it visible?

Transaction Snapshots: What a Transaction Sees

A snapshot describes database state: `xmin:xmax:xip`. xmin = the oldest active txid. xmax = the next txid not yet assigned. xip = the list of active (in-progress) txids between xmin and xmax. A transaction with txid < snapshot.xmin that is committed is visible. A transaction with txid >= snapshot.xmax is not visible (it has not started yet).

pg_export_snapshot() exports the snapshot as a string that another transaction can import via SET TRANSACTION SNAPSHOT '...'. This lets two transactions work against the same view of data, useful for parallel dumps or replication.

Snapshot '100:105:101,103'. Is the transaction with txid=102 visible?

pg_xact (CLOG): the Transaction Status Registry

pg_xact (formerly CLOG, Commit Log) is a bitmap that stores the status of each transaction: in-progress, committed, aborted, sub-committed. Two bits per transaction. Files live in `$PGDATA/pg_xact/`. Each file is 256 KB and covers about 1M transactions. VACUUM periodically removes old files.

Hint bits are an optimization. On first access to a tuple PG may read pg_xact and write the result directly into the tuple header (t_infomask). After that no further pg_xact lookup is needed. Hint bits explain why a SELECT can produce dirty pages in shared_buffers.

Why can a SELECT mark pages in shared_buffers as dirty?

Visibility Check: the Algorithm

When reading a heap tuple PG calls HeapTupleSatisfiesVisibility, a snapshot check. Algorithm: verify that xmin is committed and xmin < snapshot.xmax and xmin is not in xip, then verify that xmax = 0 or xmax is not committed or xmax > snapshot.xmin. This runs for every tuple in a Seq Scan: one more argument in favor of indexes.

The visibility map is a bitmap file `$PGDATA/base/{dboid}/{relfilenode}_vm`. One bit per page. If the bit is 1, every tuple on the page is visible to any transaction. VACUUM skips that page and Index Only Scan does not visit the heap.

MVCC blocks readers until the writing transaction commits

In PG readers never block writers and vice versa. Each transaction works against its own snapshot. Old tuple versions stay visible to readers as long as MVCC requires.

That is exactly the value of MVCC: high read-write concurrency without read locks. The price is dead tuples accumulating, which VACUUM must clean up.

An Index Only Scan reports Heap Fetches: 0. What does that mean?

Key Ideas

  • UPDATE does not overwrite a row. It creates a new tuple with xmin=txid and marks the old one xmax=txid. Both versions stay in the heap until VACUUM.
  • Snapshot = xmin:xmax:xip, a description of 'what is committed up to this moment'. Each tuple's visibility is checked against the snapshot via pg_xact.
  • The visibility map lets Index Only Scan skip heap lookups for all-visible pages. Heap Fetches: 0 in EXPLAIN ANALYZE is the proof.

Related Topics

MVCC is the foundation for several other PG mechanisms:

  • VACUUM and garbage collection — VACUUM removes dead tuples (old MVCC versions) that no snapshot needs any more. Without VACUUM the heap grows without bound.
  • Transaction isolation levels — Read Committed takes a snapshot per statement; Repeatable Read takes one per transaction. The visibility algorithm is identical; only the snapshot timing differs.
  • Indexes and Index Only Scan — The visibility map (part of MVCC) lets Index Only Scan skip the heap. A key optimization for covering indexes.

Вопросы для размышления

  • If a transaction runs for an hour (an analytical query, for example), which tuple versions can VACUUM not remove during that time? How does this affect bloat?
  • Hint bits are written into heap pages by SELECTs. Why does this not break the guarantee that readers do not block writers?
  • txid wraparound: what happens if you never run VACUUM FREEZE and the transaction counter overflows?

Связанные уроки

  • db-03-acid
MVCC: xmin, xmax, Tuple Versioning

0

1

Sign In