Databases

Query Tuning: From Slow to Fast

In 2016 Uber migrated from PostgreSQL to MySQL and listed accumulated slow queries as one reason. Most issues were diagnosed only after the fact: N+1 in the ORM, implicit casts, stale statistics after bulk loads. The tools to catch these problems existed all along - nobody was looking at pg_stat_statements.

**GitLab** publicly documented their query tuning process: every merge request automatically runs EXPLAIN on affected tables and blocks the merge if a Seq Scan appears on a table with more than 10k rows.
**Notion** discovered their main page was making 47 SQL queries due to N+1 in GraphQL resolvers. Switching to DataLoader reduced it to 3 queries and cut page load by 400 ms.
**pg_stat_statements** paired with Grafana is a standard monitoring stack: the dashboard refreshes every 60 seconds and surfaces degrading queries before users notice the problem.

Slow Query Antipatterns

90% of slow queries have one of five causes: a function applied to an indexed column in WHERE, a LIKE with a leading wildcard, SELECT *, OR conditions instead of UNION, or a correlated subquery. Each pattern kills indexes and forces the planner into a full Seq Scan.

**Index killers:** `WHERE UPPER(email) = 'TEST@EXAMPLE.COM'` - the function makes the index on `email` useless (need a functional index). `WHERE name LIKE '%smith%'` - leading wildcard means Seq Scan (need pg_trgm + GIN). `WHERE status != 'active'` - negative conditions rarely use indexes. `WHERE created_at::date = '2024-01-01'` - the cast makes the index on `created_at` useless.

The query `WHERE EXTRACT(YEAR FROM created_at) = 2024` does not use the index on `created_at`. How to rewrite it?

N+1: The Most Expensive ORM Mistake

N+1 occurs when loading a list of N objects requires N+1 queries: one for the list and one for each related record. At N=1000 that is 1001 round-trips to the database. Unnoticeable locally, but with 1 ms network latency in production that is 1 second spent solely on network round-trips.

**Signs of N+1 in logs:** the same query repeated with different parameters (`WHERE id = $1` with different values). **ORM solution (TypeORM):** `relations: ['author']` or `leftJoinAndSelect`. **SQL solution:** JOIN or `WHERE id IN (...)`. **For deep hierarchies:** recursive CTE (`WITH RECURSIVE`). **DataLoader pattern** (GraphQL): groups N requests into one batch per event loop tick.

TypeORM application logs show 500 identical queries `SELECT * FROM users WHERE id = ?` in one second. What is the cause?

Implicit Casts: When a Type Kills the Index

Implicit type casting is a treacherous trap: the query looks correct, returns the right data, but the index is not used. PostgreSQL is strictly typed: if the column is `bigint` and the parameter is `varchar`, the planner cannot apply the B-tree index without an explicit cast. ORM libraries and drivers sometimes pass parameters of the wrong type.

**Common cases:** column `user_id bigint`, parameter passed as string `'42'` - Seq Scan. Column `status varchar`, compared with `text` - usually fine. Column `phone varchar(20)`, parameter `char(20)` with trailing spaces - implicit cast. **Diagnosis:** EXPLAIN shows `Filter` instead of `Index Cond` - the index is not used in filtering. **Fix:** explicit cast in the query or correct the parameter type in the application.

EXPLAIN shows `Filter: ((user_id)::text = '42')` instead of `Index Cond`. What is happening?

Statistics: When the Planner Is Wrong

The PostgreSQL planner works with estimates, not real data. Estimates are built from statistics - histograms of value distributions collected by ANALYZE. Stale or inaccurate statistics cause 40% of query plan problems in production. After bulk INSERT/DELETE/UPDATE operations statistics become stale immediately.

**autovacuum and statistics:** autovacuum automatically runs ANALYZE when more than 20% of table rows change (the threshold). For hot tables with frequent updates this may not be enough. **Tuning:** `ALTER TABLE hot_table SET (autovacuum_analyze_scale_factor = 0.01)` - trigger at 1% change instead of 20%. **Skewed data:** values appearing in more than 5% of rows go into Most Common Values - statistics are accurate. The remainder uses a histogram of N buckets (default 100). For skewed columns - `SET STATISTICS 500`.

After loading 5M rows into the `events` table queries became much slower. EXPLAIN shows `rows=100` but `actual rows=500000`. First step?

pg_stat_statements: Find the Most Expensive Queries

Optimization without measurement is impossible. `pg_stat_statements` is a PostgreSQL extension that aggregates statistics across all executed queries: total_time, calls, mean_time, rows. It is the first diagnostic tool in any production PostgreSQL server.

**What pg_stat_statements shows:** query (normalized - parameters replaced by $1, $2), calls, total_exec_time, mean_exec_time, stddev_exec_time, rows, shared_blks_hit/read (cache hit ratio). **Top candidates for optimization:** (1) highest total_exec_time - this query consumes the most CPU overall, (2) highest mean_exec_time - every call is slow, (3) low cache hit ratio (blks_hit / (blks_hit + blks_read)) - query reads a lot from disk.

A slow query always means there is no index

Missing index is just one cause. Stale statistics, implicit cast, N+1, insufficient work_mem, Hash Join spilling to disk - all cause degradation with no relation to index presence.

The planner may have an index and not use it (implicit cast, cost model), or choose a suboptimal join algorithm (work_mem), or hit the database thousands of times instead of once (N+1). Diagnosis starts with EXPLAIN ANALYZE and pg_stat_statements.

pg_stat_statements shows one query with `calls=1, total_exec_time=30000ms` and another with `calls=100000, total_exec_time=5000ms`. Which one to optimize first?

Key Ideas

**Functions in WHERE** on an indexed column make the index useless - use a functional index or rewrite with a range condition on the original column.
**N+1** is an ORM architectural problem: use JOIN via relations or batch-load with WHERE id IN (...). **Implicit cast** - parameter type does not match column type = Seq Scan instead of Index Scan.
**pg_stat_statements** is the first diagnostic tool: sort by total_exec_time, take the top 3, run EXPLAIN ANALYZE, fix. Repeat weekly.

Вопросы для размышления

How can N+1 be detected in production without access to the application code?
When does an implicit cast not cause a problem and PostgreSQL still uses the index?
What is the optimal autovacuum_analyze_scale_factor for a table with 100M rows receiving 10,000 inserts per second?

Связанные уроки

alg-01-big-o

Databases

Query Tuning: From Slow to Fast

**GitLab** publicly documented their query tuning process: every merge request automatically runs EXPLAIN on affected tables and blocks the merge if a Seq Scan appears on a table with more than 10k rows.
**Notion** discovered their main page was making 47 SQL queries due to N+1 in GraphQL resolvers. Switching to DataLoader reduced it to 3 queries and cut page load by 400 ms.
**pg_stat_statements** paired with Grafana is a standard monitoring stack: the dashboard refreshes every 60 seconds and surfaces degrading queries before users notice the problem.

Slow Query Antipatterns

The query `WHERE EXTRACT(YEAR FROM created_at) = 2024` does not use the index on `created_at`. How to rewrite it?

N+1: The Most Expensive ORM Mistake

TypeORM application logs show 500 identical queries `SELECT * FROM users WHERE id = ?` in one second. What is the cause?

Implicit Casts: When a Type Kills the Index

EXPLAIN shows `Filter: ((user_id)::text = '42')` instead of `Index Cond`. What is happening?

Statistics: When the Planner Is Wrong

After loading 5M rows into the `events` table queries became much slower. EXPLAIN shows `rows=100` but `actual rows=500000`. First step?

pg_stat_statements: Find the Most Expensive Queries

A slow query always means there is no index

Missing index is just one cause. Stale statistics, implicit cast, N+1, insufficient work_mem, Hash Join spilling to disk - all cause degradation with no relation to index presence.

pg_stat_statements shows one query with `calls=1, total_exec_time=30000ms` and another with `calls=100000, total_exec_time=5000ms`. Which one to optimize first?

Key Ideas

**Functions in WHERE** on an indexed column make the index useless - use a functional index or rewrite with a range condition on the original column.
**N+1** is an ORM architectural problem: use JOIN via relations or batch-load with WHERE id IN (...). **Implicit cast** - parameter type does not match column type = Seq Scan instead of Index Scan.
**pg_stat_statements** is the first diagnostic tool: sort by total_exec_time, take the top 3, run EXPLAIN ANALYZE, fix. Repeat weekly.

Вопросы для размышления

How can N+1 be detected in production without access to the application code?
When does an implicit cast not cause a problem and PostgreSQL still uses the index?
What is the optimal autovacuum_analyze_scale_factor for a table with 100M rows receiving 10,000 inserts per second?

Связанные уроки

alg-01-big-o

Query Tuning: From Slow to Fast

Slow Query Antipatterns

N+1: The Most Expensive ORM Mistake

Implicit Casts: When a Type Kills the Index

Statistics: When the Planner Is Wrong

pg_stat_statements: Find the Most Expensive Queries

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Query Tuning: From Slow to Fast

Slow Query Antipatterns

N+1: The Most Expensive ORM Mistake

Implicit Casts: When a Type Kills the Index

Statistics: When the Planner Is Wrong

pg_stat_statements: Find the Most Expensive Queries

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки