PostgreSQL
Planner: Statistics and Cost Estimation
The planner approaches a query like a detective: it has clues (statistics) but not the full picture. Sometimes it draws conclusions that are 10000% wrong, and then a 200ms query turns into a 20-minute query.
- **Stripe** found a payments * customers JOIN plan that was 500x off because of a Zipf distribution of transactions (the top 0.1% of merchants generated 60% of payments). Extended statistics on (merchant_id, status) fixed the estimate and query time dropped from 45s to 1.2s.
- **GitLab** documented an incident in 2021: after a large issue batch import, stale stats led the planner to pick Nested Loop instead of Hash Join for a notes * issues JOIN. The DB degraded for 40 minutes until a DBA ran ANALYZE manually.
- **Heroku Postgres** auto-raises statistics_target to 500 for foreign key columns (join keys) when an index is created via the UI. The heuristic: FK columns are the most common participants in JOINs, where estimate errors hurt the most.
- **Notion**, while sharding blocks by workspace_id, hit a partition pruning bug caused by stale stats on a partitioned table. Setting autovacuum_analyze_scale_factor = 0.01 on large partitions prevented a recurrence.
pg_statistic: Anatomy of Statistics
PostgreSQL stores data distribution statistics in the system table `pg_statistic` (or its readable view `pg_stats`). This data is collected by `ANALYZE`, and the planner uses it to estimate row counts (cardinality), pick a JOIN algorithm, and order operations.
By default ANALYZE collects stats into 100 slots (default_statistics_target = 100). That means histogram_bounds has 101 boundaries: enough for uniform distributions, but limited for skewed ones. Raising it to 500 improves accuracy at the cost of ANALYZE time and pg_statistic size.
In pg_stats for user_id, n_distinct = -0.95. What does it mean?
Histograms and Most Common Values
The planner uses two mechanisms for selectivity estimation: **MCV (Most Common Values)** for frequent values, and **histogram** for the rest of the distribution. For `WHERE status = 'pending'` it first checks MCV; if 'pending' is there, it uses the exact frequency. If not, it interpolates over the histogram.
The histogram stores **equi-depth** buckets, not equal-width ones. Every bucket holds roughly the same number of rows. For `WHERE amount BETWEEN 100 AND 200` the planner counts the fraction of buckets that fall into the range.
MCV for 'country' has 10 values covering 98% of rows. A query filters by a rare country not in MCV. How does the planner estimate row count?
Correlation: Physical Order of Data
`correlation` in pg_stats shows how much the physical order of rows in the heap matches the logical order of the column's values. Range [-1, 1]: 1.0 means rows are physically ordered ascending, -1 descending, 0 chaotic. This drives the cost of Index Scan.
CLUSTER physically reorders a table by an index, lifting correlation to ~1.0. After CLUSTER on orders(user_id), Index Scan by user_id becomes an order of magnitude more efficient. But CLUSTER locks the table, and the effect erodes as new rows are inserted.
An index on `updated_at` exists, but the planner picks Seq Scan for `WHERE updated_at > NOW() - INTERVAL '1 day'`. correlation = 0.03. Why?
Extended Statistics
Standard statistics are collected per column, independently. That breaks down on conditions over multiple correlated columns: `WHERE city = 'Moscow' AND country = 'Russia'`. The planner multiplies selectivities (0.1 * 0.5 = 0.05) without knowing city depends on country. Real selectivity is 0.5, a 10x miss. PostgreSQL 10+ fixes this via `CREATE STATISTICS`.
Extended statistics support three kinds: `ndistinct` (joint cardinality across columns), `dependencies` (functional dependencies, e.g., city depends on country), `mcv` (Most Common Values for combinations). PostgreSQL 14+ added OR-condition support to extended statistics.
Query `WHERE region = 'EU' AND country = 'Germany'` shows estimate 50 rows vs actual 12000. What helps?
Cardinality Estimation for JOINs
For a two-table JOIN the planner estimates the result size as `rows_A * rows_B / max(ndistinct_A, ndistinct_B)`. This assumes every value of the join key in A matches every value in B (the minmax principle). Errors here cascade: a wrong JOIN estimate changes the algorithm choice and the operation order.
PostgreSQL 16 improved join estimation via Incremental Sort and better use of extended statistics in multi-join queries. If estimates are still bad after all tuning, `pg_hint_plan` can force a specific algorithm, but that is a stopgap, not a permanent fix.
ANALYZE is enough for accurate estimates; if stats are fresh, the planner always picks the right plan
ANALYZE collects stats from a sample (not all rows), independently per column. Correlated columns and data skew still produce wrong estimates
default_statistics_target=100 means a sample of about 30000 rows for ANALYZE. On a 500M-row table that is 0.006%. Uniform distributions are estimated well, but skewed ones (Zipf user activity) can be off by 100x even with fresh stats. Extended statistics and a larger statistics_target reduce, but do not eliminate, the issue.
JOIN orders * users gives estimate 500, actual 250000. Stats are fresh. Most likely cause?
Key Ideas
- **pg_statistic** stores MCV (exact frequencies of top values) and a histogram (equi-depth buckets) per column; ANALYZE refreshes it from a sample.
- **correlation** shows the physical ordering of data; at correlation ≈ 0 Index Scan is expensive (lots of random I/O), and the planner prefers Seq Scan.
- **Extended Statistics** (CREATE STATISTICS) solves the correlated-columns problem by teaching the planner about dependencies between them.
- **Join estimation** misfires on data skew and stale stats; cascading estimate errors can turn a 200ms query into a 20-minute one.
Related Topics
Understanding the planner ties several PostgreSQL topics together:
- EXPLAIN ANALYZE — The only way to see the estimate vs actual gap; diagnosing bad stats starts here
- JOIN Algorithms — Cardinality estimates directly drive the choice between Hash Join, Nested Loop, and Merge Join
- VACUUM and autovacuum — autovacuum runs ANALYZE to keep stats fresh; autovacuum_analyze_scale_factor settings are critical
Вопросы для размышления
- Are there tables in your schema with heavily skewed distributions (e.g., top-user activity), and how does that affect the planner's accuracy?
- Which column pairs in your schema most likely have functional dependencies, and is it worth creating extended statistics for them?
- Is autovacuum configured so that ANALYZE runs often enough on insert-heavy tables?