Databases

Indexes: B-Tree and How They Work

Цели урока

Understand B+ Tree structure and why search is O(log n)
Apply the prefix rule when designing composite indexes
Create covering indexes to enable Index Only Scans
Diagnose and fix index bloat

A query without an index on a 50-million-row table takes 8 seconds. The same query with the right index takes 2 milliseconds. B-Tree transforms O(n) into O(log n), but only when the index is designed and maintained correctly.

**E-commerce:** index on (user_id, status) speeds up order history queries
**Analytics:** covering index lets aggregations run without touching the heap
**High-write systems:** fillfactor and bloat monitoring prevent performance degradation
**Debugging:** EXPLAIN ANALYZE reveals which scan type is chosen and why

B-Tree Structure

An `orders` table with 50 million rows. A query `WHERE user_id = 42` without an index reads all 50 million rows - a full sequential scan. With a B-Tree index the same query performs roughly 25 reads. The difference is from minutes to milliseconds.

**B-Tree (Balanced Tree)** is a balanced search tree where each node holds multiple keys and child pointers. In PostgreSQL a B-Tree is implemented as a B+ Tree: all data resides only in leaf nodes, internal nodes store only keys for navigation.

Leaf pages are linked in a doubly-linked list - the defining property of a B+ Tree. This enables efficient **range queries** (`WHERE id BETWEEN 100 AND 200`): locate the first leaf, then follow the chain without returning to the root.

**B-Tree height in PostgreSQL:** one page = 8 KB, each page holds hundreds of keys. Height for 50 million rows = 3-4 levels. Any search costs 3-4 page reads.

Why are leaf pages in a B+ Tree connected in a linked list?

Composite Indexes and Column Order

A query with `WHERE status = 'active' AND created_at > '2024-01-01'` has two conditions. Two separate indexes could be created, but PostgreSQL will likely pick only one. A **composite index** on `(status, created_at)` covers both conditions in a single scan.

The fundamental rule is the **prefix rule**. An index on `(a, b, c)` supports queries on `(a)`, `(a, b)`, and `(a, b, c)`. A query filtering only on `(b)` or `(c)` cannot use the index - there is no entry point into the tree.

**Cardinality of the first column** matters for index design. A column with high cardinality (many distinct values) as the leading key filters data more aggressively. When queries always filter on both columns, the ordering matters less.

**How many indexes?** Every index slows INSERT/UPDATE/DELETE (the B-Tree must be updated). An index should be created for a specific slow query, not speculatively. Check `pg_stat_user_indexes` - unused indexes are visible there.

An index exists on (country, city, street). Which query uses it?

Covering Index

A normal index returns a heap tid, after which the database fetches the actual row from the table - a **heap fetch**. Each heap fetch is a random disk read. With many matching rows this becomes expensive. A **Covering Index** stores all columns the query needs directly in the index, eliminating heap fetches entirely.

PostgreSQL creates a covering index with `INCLUDE`. Columns in INCLUDE do not participate in sorting or search; they are stored as payload in the leaf pages only.

**Visibility Map:** Index Only Scan performs best when a page is marked all-visible in the visibility map (all rows visible to all transactions). Otherwise PostgreSQL still visits the heap for visibility checks. VACUUM updates the visibility map.

How does INCLUDE differ from adding a column to the index key?

Index-Only Scan

**Index Scan** vs **Index Only Scan** is a fundamental distinction. Index Scan: find tid in the index, fetch the row from heap. Index Only Scan: all needed data is in the index, the heap is never touched. Under high load Index Only Scan can be 10-100x faster.

**Bitmap Index Scan** is a fourth type. It applies when a query returns many rows (low selectivity, but the index still beats a seq scan). PostgreSQL first builds a bitmap of matching pages from the index, then reads heap pages in physical order - reducing random I/O.

Query: SELECT amount FROM orders WHERE order_id = 42. Index: PRIMARY KEY (order_id) INCLUDE (amount). Which scan is chosen?

Index Bloat and Maintenance

After a million UPDATEs and DELETEs the index has grown to 8 GB while the table is only 2 GB. This is **index bloat** - dead row versions accumulate in B-Tree pages and are not automatically released. Queries slow down because they read pages full of stale entries.

The root cause is MVCC: old row versions are not deleted immediately. VACUUM removes dead tuples from the heap, but in B-Tree the pages are only marked available for reuse - the file does not shrink and space is not returned to the OS.

**Fillfactor** proactively limits bloat on UPDATE-heavy tables. By default B-Tree fills pages to 100%. With fillfactor=70, 30% of each page is reserved - an UPDATE can store the new row version on the same page (HOT update), avoiding a new index entry.

**Autovacuum and indexes:** autovacuum removes dead tuples and updates the visibility map. When autovacuum cannot keep up with write load, index bloat grows. Monitor `pg_stat_user_tables.n_dead_tup` - if it exceeds 10-20% of live rows, run VACUUM manually.

After REINDEX CONCURRENTLY, the index shrank from 8 GB to 1 GB. What does this indicate?

B-Tree Indexes

B+ Tree: data in leaf pages linked as a list for range queries
Height is 3-4 levels for 50 million rows - any lookup costs ~4 page reads
Composite index prefix rule: (a) and (a,b) work, (b) alone does not
INCLUDE creates a covering index: columns stored in leaves, not used for search
Index Only Scan: heap is not accessed when all columns are in the index
Index bloat: VACUUM cleans the heap but does not shrink B-Tree pages - REINDEX is needed

Вопросы для размышления

How should column order in a composite index be chosen when different queries use different combinations?
In which situations can Index Only Scan be slower than a regular Index Scan?
How frequently should unused indexes be audited in a production database?