Databases
Query Tuning: From Slow to Fast
In 2016 Uber migrated from PostgreSQL to MySQL and listed accumulated slow queries as one reason. Most issues were diagnosed only after the fact: N+1 in the ORM, implicit casts, stale statistics after bulk loads. The tools to catch these problems existed all along - nobody was looking at pg_stat_statements.
- **GitLab** publicly documented their query tuning process: every merge request automatically runs EXPLAIN on affected tables and blocks the merge if a Seq Scan appears on a table with more than 10k rows.
- **Notion** discovered their main page was making 47 SQL queries due to N+1 in GraphQL resolvers. Switching to DataLoader reduced it to 3 queries and cut page load by 400 ms.
- **pg_stat_statements** paired with Grafana is a standard monitoring stack: the dashboard refreshes every 60 seconds and surfaces degrading queries before users notice the problem.
Slow Query Antipatterns
90% of slow queries have one of five causes: a function applied to an indexed column in WHERE, a LIKE with a leading wildcard, SELECT *, OR conditions instead of UNION, or a correlated subquery. Each pattern kills indexes and forces the planner into a full Seq Scan.
**Index killers:** `WHERE UPPER(email) = 'TEST@EXAMPLE.COM'` - the function makes the index on `email` useless (need a functional index). `WHERE name LIKE '%smith%'` - leading wildcard means Seq Scan (need pg_trgm + GIN). `WHERE status != 'active'` - negative conditions rarely use indexes. `WHERE created_at::date = '2024-01-01'` - the cast makes the index on `created_at` useless.
The query `WHERE EXTRACT(YEAR FROM created_at) = 2024` does not use the index on `created_at`. How to rewrite it?
N+1: The Most Expensive ORM Mistake
N+1 occurs when loading a list of N objects requires N+1 queries: one for the list and one for each related record. At N=1000 that is 1001 round-trips to the database. Unnoticeable locally, but with 1 ms network latency in production that is 1 second spent solely on network round-trips.
**Signs of N+1 in logs:** the same query repeated with different parameters (`WHERE id = $1` with different values). **ORM solution (TypeORM):** `relations: ['author']` or `leftJoinAndSelect`. **SQL solution:** JOIN or `WHERE id IN (...)`. **For deep hierarchies:** recursive CTE (`WITH RECURSIVE`). **DataLoader pattern** (GraphQL): groups N requests into one batch per event loop tick.
TypeORM application logs show 500 identical queries `SELECT * FROM users WHERE id = ?` in one second. What is the cause?
Implicit Casts: When a Type Kills the Index
Implicit type casting is a treacherous trap: the query looks correct, returns the right data, but the index is not used. PostgreSQL is strictly typed: if the column is `bigint` and the parameter is `varchar`, the planner cannot apply the B-tree index without an explicit cast. ORM libraries and drivers sometimes pass parameters of the wrong type.
**Common cases:** column `user_id bigint`, parameter passed as string `'42'` - Seq Scan. Column `status varchar`, compared with `text` - usually fine. Column `phone varchar(20)`, parameter `char(20)` with trailing spaces - implicit cast. **Diagnosis:** EXPLAIN shows `Filter` instead of `Index Cond` - the index is not used in filtering. **Fix:** explicit cast in the query or correct the parameter type in the application.
EXPLAIN shows `Filter: ((user_id)::text = '42')` instead of `Index Cond`. What is happening?
Statistics: When the Planner Is Wrong
The PostgreSQL planner works with estimates, not real data. Estimates are built from statistics - histograms of value distributions collected by ANALYZE. Stale or inaccurate statistics cause 40% of query plan problems in production. After bulk INSERT/DELETE/UPDATE operations statistics become stale immediately.
**autovacuum and statistics:** autovacuum automatically runs ANALYZE when more than 20% of table rows change (the threshold). For hot tables with frequent updates this may not be enough. **Tuning:** `ALTER TABLE hot_table SET (autovacuum_analyze_scale_factor = 0.01)` - trigger at 1% change instead of 20%. **Skewed data:** values appearing in more than 5% of rows go into Most Common Values - statistics are accurate. The remainder uses a histogram of N buckets (default 100). For skewed columns - `SET STATISTICS 500`.
After loading 5M rows into the `events` table queries became much slower. EXPLAIN shows `rows=100` but `actual rows=500000`. First step?
pg_stat_statements: Find the Most Expensive Queries
Optimization without measurement is impossible. `pg_stat_statements` is a PostgreSQL extension that aggregates statistics across all executed queries: total_time, calls, mean_time, rows. It is the first diagnostic tool in any production PostgreSQL server.
**What pg_stat_statements shows:** query (normalized - parameters replaced by $1, $2), calls, total_exec_time, mean_exec_time, stddev_exec_time, rows, shared_blks_hit/read (cache hit ratio). **Top candidates for optimization:** (1) highest total_exec_time - this query consumes the most CPU overall, (2) highest mean_exec_time - every call is slow, (3) low cache hit ratio (blks_hit / (blks_hit + blks_read)) - query reads a lot from disk.
A slow query always means there is no index
Missing index is just one cause. Stale statistics, implicit cast, N+1, insufficient work_mem, Hash Join spilling to disk - all cause degradation with no relation to index presence.
The planner may have an index and not use it (implicit cast, cost model), or choose a suboptimal join algorithm (work_mem), or hit the database thousands of times instead of once (N+1). Diagnosis starts with EXPLAIN ANALYZE and pg_stat_statements.
pg_stat_statements shows one query with `calls=1, total_exec_time=30000ms` and another with `calls=100000, total_exec_time=5000ms`. Which one to optimize first?
Key Ideas
- **Functions in WHERE** on an indexed column make the index useless - use a functional index or rewrite with a range condition on the original column.
- **N+1** is an ORM architectural problem: use JOIN via relations or batch-load with WHERE id IN (...). **Implicit cast** - parameter type does not match column type = Seq Scan instead of Index Scan.
- **pg_stat_statements** is the first diagnostic tool: sort by total_exec_time, take the top 3, run EXPLAIN ANALYZE, fix. Repeat weekly.
Related Topics
Query Tuning applies knowledge of query plans and indexes:
- EXPLAIN ANALYZE: Reading the Query Plan — Foundational plan analysis tools used during tuning
- Transactions and Isolation — MVCC and table bloat affect query performance
Вопросы для размышления
- How can N+1 be detected in production without access to the application code?
- When does an implicit cast not cause a problem and PostgreSQL still uses the index?
- What is the optimal autovacuum_analyze_scale_factor for a table with 100M rows receiving 10,000 inserts per second?