PostgreSQL

Storage: Pages, Tuples, TOAST

Every row inserted into PostgreSQL is more than just data. It's a structured object with a 23-byte header, a physical address, and references to transactions. Every UPDATE creates a fresh copy. Every large field transparently moves to a separate table. Understanding this layer explains why databases bloat, why UPDATE is slower than INSERT, and how to tune storage for a specific workload.

  • **GitLab (2021):** the merge_request_diffs table grew to 4TB because of large JSON diffs stored with EXTENDED. Switching the binary_data column to EXTERNAL plus moving to external object storage cut the table down to 800GB.
  • **Shopify:** the orders table with billions of rows runs with fillfactor=70. That produced 85% HOT updates on the status column, removing pressure on indexes during the Black Friday surge of bulk status updates.
  • **Notion (2022):** as the database grew to 10TB, the blocks table reached 30% dead tuples due to frequent content updates. Tuning the per-table autovacuum scale factor from 20% to 5% brought bloat down to 3%.

The 8KB Page: Anatomy of Storage

PostgreSQL stores all data in fixed-size 8192-byte pages (8KB). Every table or index file is a sequence of these pages. Understanding the page layout explains why MVCC produces bloat and how heap-only tuple updates work.

The 8KB page size was chosen at PostgreSQL's inception as a compromise between efficiency for large objects and overhead for small ones. It can only be changed at compile time (--with-blocksize). RDS and most managed providers use the standard 8KB.

In which direction do item pointers and tuples grow inside a PostgreSQL page?

Heap Tuples: What Lives Inside Every Row

Every table row is a heap tuple. A tuple consists of a header (HeapTupleHeader, 23 bytes) and the column data. The header holds all MVCC information: xmin, xmax, ctid.

PostgreSQL doesn't update rows in place. UPDATE creates a new tuple with the new data, and the old one is flagged as dead (t_xmax is filled in). This is the foundation of MVCC: readers see the old version until the writing transaction commits. The price is bloat: every UPDATE leaves a dead tuple behind until the next VACUUM.

Why is the old tuple after UPDATE flagged as dead (t_xmax is filled in) instead of being deleted right away?

TOAST: Large Values Outside the Page

A tuple cannot be larger than a page (8KB). But columns like text, bytea, jsonb can hold gigabytes. TOAST (The Oversized-Attribute Storage Technique) solves this: large values are stored in a separate toast table, and the main row holds only a pointer.

GitLab stores git blobs in PostgreSQL via TOAST. Performance analysis showed that TOAST with the EXTENDED strategy spent ~15% of CPU on compressing/decompressing binary data. Switching to EXTERNAL for the blob_data column dropped CPU by 8% on the same workload.

When should you pick EXTERNAL over EXTENDED for a TOAST column?

Fillfactor: Room for HOT Updates

Fillfactor sets the percentage of a page that gets filled on INSERT. The remaining space is reserved for UPDATE. This matters for HOT (Heap Only Tuple) updates, the mechanism that lets PostgreSQL update a row without touching indexes.

Table orders with fillfactor=100 shows 0% HOT updates. Why would lowering fillfactor to 70 help?

ctid and TID Scan: Direct Access to a Tuple

ctid (current tuple ID) is the physical address of a row in the form (page, offset). It's the only built-in way to address a specific tuple on a specific page directly, bypassing any index.

ctid drives the HOT chain: if a row was updated several times, older tuples link via t_ctid to the next version. PostgreSQL walks the chain to find the live tuple. VACUUM clears dead links from the chain.

TOAST is always slower than storing data inline in the main table

TOAST with the EXTENDED strategy is often faster for large values: compressed data takes less space, which reduces I/O on read. If queries only touch small columns (id, status) and never read the big field, TOAST is a win: the main page is smaller and holds more rows.

TOAST performance depends on the access pattern. If almost every query reads the big field, TOAST adds an extra lookup into pg_toast. If the big field is rarely read, TOAST makes the main table more compact and speeds up queries over other columns.

What happens to a row's ctid after VACUUM FULL?

Key Ideas

  • **8KB page is the unit of I/O:** everything lives in pages. Item pointers and tuples grow toward each other. The gap between them is the budget for new rows.
  • **UPDATE = new tuple:** MVCC demands immutable tuples. Every UPDATE creates a new copy and leaves a dead one behind until VACUUM. fillfactor=70-80 reserves room for HOT updates and skips index writes.
  • **TOAST is transparent but not free:** values over 2KB automatically move to a separate table. Storage strategy (EXTENDED/EXTERNAL) is worth tuning per data type.

Related Topics

The storage layer connects to several higher-level mechanisms:

  • MVCC and Row Visibility — xmin/xmax in HeapTupleHeader is the physical implementation of MVCC. Dead tuples from UPDATE are a direct consequence of the MVCC architecture
  • VACUUM and bloat — VACUUM removes dead tuples from heap pages and reclaims free space. Bloat is accumulated dead tuples without timely VACUUM
  • Buffer Cache and Shared Buffers — Table pages are cached in shared_buffers. The buffer manager operates on 8KB pages: the caching unit matches the storage unit

Вопросы для размышления

  • The events table has 100M rows, 80% UPDATE (setting processed=true). How do fillfactor, autovacuum, and HOT updates interact in this scenario? What fillfactor would be optimal?
  • Why can an index on a JSONB column with large documents be much smaller than the table itself? How does TOAST affect index size?
  • Can ctid be used for pagination (keyset pagination by physical order)? What risks come with that approach?

Связанные уроки

  • db-09-indexes-btree
Storage: Pages, Tuples, TOAST

0

1

Sign In