PostgreSQL
BRIN Indexes for Large Tables
A logs table: 500GB, 2 billion rows. A B-Tree index on timestamp would take 40GB. BRIN takes 5MB. For the query 'give me yesterday's logs', B-Tree reads the index then heap pages. BRIN reads 5MB of metadata and skips 99.9% of blocks. For append-only data, this is a revolutionary difference.
- **TimescaleDB**: a PostgreSQL extension for time-series. Uses BRIN + partitioning internally. Handles trillions of metric rows. Airbus, Siemens, Cisco use it for IoT monitoring
- **Cloudflare**: PostgreSQL with BRIN for storing DNS logs. 200+ million requests/sec. BRIN on timestamp enables date-range queries without 100GB B-Tree indexes
- **Snowflake / Parquet**: zone maps - the same min/max per data chunk principle. BRIN implements zone maps for PostgreSQL heap files
BRIN: An Index That Only Knows Min and Max
2016. PostgreSQL 9.5. A new index type: BRIN (Block Range INdex). The idea is strikingly simple: instead of recording every key, store just the [min, max] range for each range of physical heap blocks. An index on a 1TB table takes a few megabytes. This is only possible under one condition: the data is physically correlated with the index key.
B-Tree stores one pointer per row -> N entries. BRIN stores one [min,max] per K blocks -> N/K/8192 entries. For a table with 100M rows and pages_per_range=128: B-Tree ~2GB, BRIN ~1MB. A difference of thousands of times. But BRIN only works as a block filter: it reads all blocks where max >= query_val AND min <= query_val.
Analytics parallel: zone maps in columnar stores (Parquet, ORC, Snowflake). Each row group stores min/max metadata - groups outside the query range are skipped during reads. BRIN implements the same principle in PostgreSQL for heap tables. This is why ClickHouse (with a sparse primary index) and Snowflake are efficient on time-series without B-Tree.
Why does a BRIN index take thousands of times less space than B-Tree on the same table?
pages_per_range: Tuning BRIN Granularity
The key BRIN parameter: `pages_per_range` (default: 128). It determines how many heap pages one index entry covers. Smaller value -> more entries in BRIN -> better selectivity -> more blocks skipped -> but the index itself is larger. Trade-off: precision vs size.
Correlation is the key metric for BRIN. `pg_stats.correlation` for a column shows how well the physical row order in the heap matches the logical order of values. correlation = 1.0 -> perfect correlation, BRIN is maximally effective. correlation = 0 -> random order, BRIN is useless. For auto-increment id and timestamp, correlation is typically 0.95-1.0.
Reducing pages_per_range from 128 to 32 - how does BRIN behavior change?
Time-Series: The Classic Case for BRIN
An events, metrics, or logs table - append-only, with a monotonically increasing timestamp. Each INSERT adds rows to the end of the file -> timestamp correlation with physical location is approximately 1.0. BRIN fits perfectly: the [min,max] of each block range does not overlap with others, and date-range queries skip 99%+ of blocks.
Partitioning + BRIN is the standard combination for time-series in production. A partition constrains the date range; BRIN within the partition provides efficient search by day/hour. Partition pruning removes unnecessary partitions; BRIN removes unnecessary blocks within a partition. Two levels of block filtering without B-Tree overhead.
Why does BRIN on recorded_at work well for an IoT table with append-only inserts?
BRIN vs B-Tree: When to Choose Which
BRIN is not a replacement for B-Tree, but a complement for a specific use case. B-Tree is optimal for point queries (WHERE id = 42); BRIN is optimal for range queries on physically correlated columns. The choice is determined by two factors: correlation and query selectivity.
What happens to a BRIN index on created_at when mass UPDATEs are run on random rows?
BRIN autosummarize and Index Maintenance
A BRIN index needs maintenance. When new blocks are INSERTed, a summary is created automatically if `autosummarize = on` (default in PG 10+). But with large batch inserts, the summary may lag. `brin_summarize_new_values()` forces summary creation for unsummarized ranges.
BRIN and vacuum: VACUUM marks dead tuples but does not move rows - correlation is not restored. CLUSTER (blocking) and pg_repack (online) physically reorder the heap, restoring correlation to approximately 1.0 and BRIN effectiveness. For append-only time-series tables, CLUSTER is not needed: data is always inserted at the end.
BRIN works like a sparse B-Tree - it skips some rows but precisely finds the needed ones
BRIN is a lossy index: it narrows the list of candidate blocks but always requires a Recheck. The executor reads all rows from matching blocks and re-applies the filter
In EXPLAIN ANALYZE, 'Rows Removed by Index Recheck' is always present. BRIN says: 'here are blocks where the value MIGHT be'. B-Tree says: 'here are the exact rows'. This is why BRIN is inefficient for very narrow range queries with high selectivity
What does `brin_summarize_new_values()` do and when is it needed?
Related Topics
BRIN operates at the physical storage layer:
- B-Tree Indexes — Baseline index for comparison: point queries vs range queries on correlated data
- Table Partitioning — BRIN + partitioning is the standard combination for time-series tables
- Index Maintenance — BRIN requires understanding bloat, correlation, and cluster for effective maintenance
Key Ideas
- **BRIN = zone map**: [min,max] per pages_per_range blocks. Thousands of times smaller than B-Tree; works only with high correlation.
- **Lossy index**: BRIN always requires Recheck. It skips blocks, not rows - reads all rows from matching blocks.
- **pages_per_range**: reducing it improves selectivity (fewer unnecessary blocks) but increases index size. Default=128.
- **Ideal case**: append-only timestamp with correlation approximately 1.0. Bad case: random UPDATEs or point lookups.
- **Partitioning + BRIN**: two levels of block filtering for time-series without B-Tree overhead on terabyte-scale tables.
Вопросы для размышления
- BRIN on a UUID column (random v4) will be useless. How to check correlation before creating the index, and what to do if it is low?
- VACUUM FULL is run monthly on a table with BRIN on timestamp. Does this restore correlation? Is there an alternative without locking?
- With a batch insert of 10M rows at once, autosummarize may not keep up. How to integrate brin_summarize_new_values into an ETL pipeline?
Связанные уроки
- pg-15-gin-gist — GIN/GiST specialized indexes are the preceding step in the series
- pg-12-btree — B-Tree is the baseline index to compare BRIN against
- pg-34-partitioning — Partitioning + BRIN is the classic combination for time-series
- pg-17-index-maintenance — After BRIN, understanding index maintenance (bloat, vacuum) is the next step
- pg-29-storage — BRIN operates on the physical layout of data in heap pages
- alg-21-dp — BRIN summary per block range is analogous to range minimum query in DP
- db-10-indexes-advanced