PostgreSQL

BRIN Indexes for Large Tables

A logs table: 500GB, 2 billion rows. A B-Tree index on timestamp would take 40GB. BRIN takes 5MB. For the query 'give me yesterday's logs', B-Tree reads the index then heap pages. BRIN reads 5MB of metadata and skips 99.9% of blocks. For append-only data, this is a revolutionary difference.

**TimescaleDB**: a PostgreSQL extension for time-series. Uses BRIN + partitioning internally. Handles trillions of metric rows. Airbus, Siemens, Cisco use it for IoT monitoring
**Cloudflare**: PostgreSQL with BRIN for storing DNS logs. 200+ million requests/sec. BRIN on timestamp enables date-range queries without 100GB B-Tree indexes
**Snowflake / Parquet**: zone maps - the same min/max per data chunk principle. BRIN implements zone maps for PostgreSQL heap files

BRIN: An Index That Only Knows Min and Max

2016. PostgreSQL 9.5. A new index type: BRIN (Block Range INdex). The idea is strikingly simple: instead of recording every key, store just the [min, max] range for each range of physical heap blocks. An index on a 1TB table takes a few megabytes. This is only possible under one condition: the data is physically correlated with the index key.

B-Tree stores one pointer per row -> N entries. BRIN stores one [min,max] per K blocks -> N/K/8192 entries. For a table with 100M rows and pages_per_range=128: B-Tree ~2GB, BRIN ~1MB. A difference of thousands of times. But BRIN only works as a block filter: it reads all blocks where max >= query_val AND min <= query_val.

Analytics parallel: zone maps in columnar stores (Parquet, ORC, Snowflake). Each row group stores min/max metadata - groups outside the query range are skipped during reads. BRIN implements the same principle in PostgreSQL for heap tables. This is why ClickHouse (with a sparse primary index) and Snowflake are efficient on time-series without B-Tree.

Why does a BRIN index take thousands of times less space than B-Tree on the same table?

pages_per_range: Tuning BRIN Granularity

The key BRIN parameter: `pages_per_range` (default: 128). It determines how many heap pages one index entry covers. Smaller value -> more entries in BRIN -> better selectivity -> more blocks skipped -> but the index itself is larger. Trade-off: precision vs size.

Correlation is the key metric for BRIN. `pg_stats.correlation` for a column shows how well the physical row order in the heap matches the logical order of values. correlation = 1.0 -> perfect correlation, BRIN is maximally effective. correlation = 0 -> random order, BRIN is useless. For auto-increment id and timestamp, correlation is typically 0.95-1.0.

Reducing pages_per_range from 128 to 32 - how does BRIN behavior change?

Time-Series: The Classic Case for BRIN

An events, metrics, or logs table - append-only, with a monotonically increasing timestamp. Each INSERT adds rows to the end of the file -> timestamp correlation with physical location is approximately 1.0. BRIN fits perfectly: the [min,max] of each block range does not overlap with others, and date-range queries skip 99%+ of blocks.

Partitioning + BRIN is the standard combination for time-series in production. A partition constrains the date range; BRIN within the partition provides efficient search by day/hour. Partition pruning removes unnecessary partitions; BRIN removes unnecessary blocks within a partition. Two levels of block filtering without B-Tree overhead.

Why does BRIN on recorded_at work well for an IoT table with append-only inserts?

BRIN vs B-Tree: When to Choose Which

BRIN is not a replacement for B-Tree, but a complement for a specific use case. B-Tree is optimal for point queries (WHERE id = 42); BRIN is optimal for range queries on physically correlated columns. The choice is determined by two factors: correlation and query selectivity.

What happens to a BRIN index on created_at when mass UPDATEs are run on random rows?

BRIN autosummarize and Index Maintenance

A BRIN index needs maintenance. When new blocks are INSERTed, a summary is created automatically if `autosummarize = on` (default in PG 10+). But with large batch inserts, the summary may lag. `brin_summarize_new_values()` forces summary creation for unsummarized ranges.

BRIN and vacuum: VACUUM marks dead tuples but does not move rows - correlation is not restored. CLUSTER (blocking) and pg_repack (online) physically reorder the heap, restoring correlation to approximately 1.0 and BRIN effectiveness. For append-only time-series tables, CLUSTER is not needed: data is always inserted at the end.

BRIN works like a sparse B-Tree - it skips some rows but precisely finds the needed ones

BRIN is a lossy index: it narrows the list of candidate blocks but always requires a Recheck. The executor reads all rows from matching blocks and re-applies the filter

In EXPLAIN ANALYZE, 'Rows Removed by Index Recheck' is always present. BRIN says: 'here are blocks where the value MIGHT be'. B-Tree says: 'here are the exact rows'. This is why BRIN is inefficient for very narrow range queries with high selectivity

What does `brin_summarize_new_values()` do and when is it needed?

Key Ideas

**BRIN = zone map**: [min,max] per pages_per_range blocks. Thousands of times smaller than B-Tree; works only with high correlation.
**Lossy index**: BRIN always requires Recheck. It skips blocks, not rows - reads all rows from matching blocks.
**pages_per_range**: reducing it improves selectivity (fewer unnecessary blocks) but increases index size. Default=128.
**Ideal case**: append-only timestamp with correlation approximately 1.0. Bad case: random UPDATEs or point lookups.
**Partitioning + BRIN**: two levels of block filtering for time-series without B-Tree overhead on terabyte-scale tables.

Вопросы для размышления

BRIN on a UUID column (random v4) will be useless. How to check correlation before creating the index, and what to do if it is low?
VACUUM FULL is run monthly on a table with BRIN on timestamp. Does this restore correlation? Is there an alternative without locking?
With a batch insert of 10M rows at once, autosummarize may not keep up. How to integrate brin_summarize_new_values into an ETL pipeline?

Связанные уроки

pg-15-gin-gist — GIN/GiST specialized indexes are the preceding step in the series
pg-12-btree — B-Tree is the baseline index to compare BRIN against
pg-34-partitioning — Partitioning + BRIN is the classic combination for time-series
pg-17-index-maintenance — After BRIN, understanding index maintenance (bloat, vacuum) is the next step
pg-29-storage — BRIN operates on the physical layout of data in heap pages
alg-21-dp — BRIN summary per block range is analogous to range minimum query in DP
db-10-indexes-advanced

PostgreSQL

BRIN Indexes for Large Tables

**TimescaleDB**: a PostgreSQL extension for time-series. Uses BRIN + partitioning internally. Handles trillions of metric rows. Airbus, Siemens, Cisco use it for IoT monitoring
**Cloudflare**: PostgreSQL with BRIN for storing DNS logs. 200+ million requests/sec. BRIN on timestamp enables date-range queries without 100GB B-Tree indexes
**Snowflake / Parquet**: zone maps - the same min/max per data chunk principle. BRIN implements zone maps for PostgreSQL heap files

BRIN: An Index That Only Knows Min and Max

Why does a BRIN index take thousands of times less space than B-Tree on the same table?

pages_per_range: Tuning BRIN Granularity

Reducing pages_per_range from 128 to 32 - how does BRIN behavior change?

Time-Series: The Classic Case for BRIN

Why does BRIN on recorded_at work well for an IoT table with append-only inserts?

BRIN vs B-Tree: When to Choose Which

What happens to a BRIN index on created_at when mass UPDATEs are run on random rows?

BRIN autosummarize and Index Maintenance

BRIN works like a sparse B-Tree - it skips some rows but precisely finds the needed ones

BRIN is a lossy index: it narrows the list of candidate blocks but always requires a Recheck. The executor reads all rows from matching blocks and re-applies the filter

What does `brin_summarize_new_values()` do and when is it needed?

Key Ideas

**BRIN = zone map**: [min,max] per pages_per_range blocks. Thousands of times smaller than B-Tree; works only with high correlation.
**Lossy index**: BRIN always requires Recheck. It skips blocks, not rows - reads all rows from matching blocks.
**pages_per_range**: reducing it improves selectivity (fewer unnecessary blocks) but increases index size. Default=128.
**Ideal case**: append-only timestamp with correlation approximately 1.0. Bad case: random UPDATEs or point lookups.
**Partitioning + BRIN**: two levels of block filtering for time-series without B-Tree overhead on terabyte-scale tables.

Вопросы для размышления

BRIN on a UUID column (random v4) will be useless. How to check correlation before creating the index, and what to do if it is low?
VACUUM FULL is run monthly on a table with BRIN on timestamp. Does this restore correlation? Is there an alternative without locking?
With a batch insert of 10M rows at once, autosummarize may not keep up. How to integrate brin_summarize_new_values into an ETL pipeline?

Связанные уроки

pg-15-gin-gist — GIN/GiST specialized indexes are the preceding step in the series
pg-12-btree — B-Tree is the baseline index to compare BRIN against
pg-34-partitioning — Partitioning + BRIN is the classic combination for time-series
pg-17-index-maintenance — After BRIN, understanding index maintenance (bloat, vacuum) is the next step
pg-29-storage — BRIN operates on the physical layout of data in heap pages
alg-21-dp — BRIN summary per block range is analogous to range minimum query in DP
db-10-indexes-advanced

BRIN Indexes for Large Tables

BRIN: An Index That Only Knows Min and Max

pages_per_range: Tuning BRIN Granularity

Time-Series: The Classic Case for BRIN

BRIN vs B-Tree: When to Choose Which

BRIN autosummarize and Index Maintenance

Related Topics

Key Ideas

Вопросы для размышления

Связанные уроки

BRIN Indexes for Large Tables

BRIN: An Index That Only Knows Min and Max

pages_per_range: Tuning BRIN Granularity

Time-Series: The Classic Case for BRIN

BRIN vs B-Tree: When to Choose Which

BRIN autosummarize and Index Maintenance

Related Topics

Key Ideas

Вопросы для размышления

Связанные уроки