PostgreSQL
Hunting Slow Queries
The application suddenly got 10x slower after a deploy. App logs are clean. How do you find the guilty query among thousands per second, in 5 minutes instead of 5 hours?
- **Basecamp** publicly described an incident where N+1 queries in a new feature drove 40000 queries per second instead of the expected 500. pg_stat_statements pinpointed it in 2 minutes: one query pattern with 80000 calls/min and a 0.3ms mean produced 70% of total DB load.
- **GitLab** baked a mandatory pg_stat_statements check into its SRE runbook: in any performance incident, the first step is to compare the top 10 queries by total_exec_time against the prior week's baseline. Regressions get isolated in 5-10 minutes.
- **Heroku** enables log_min_duration_statement = 2000 on every Postgres plan and routes slow-query alerts to Datadog. Teams get a Slack notification within a minute of a slow query appearing.
- **Notion**, during post-incident review, found that 60% of slow queries used `SELECT *` on a blocks table with 200+ columns. Switching to named columns cut network traffic by 40% and dropped average API response time from 1.2s to 0.3s.
pg_stat_statements: Per-Query Analytics
`pg_stat_statements` is a PostgreSQL extension that accumulates aggregated statistics across all executed queries: total time, call count, block-level operations. It is the first tool for finding slow queries in production. Unlike logs, it aggregates identical queries with different parameters into one row.
pg_stat_statements normalizes queries: `WHERE id = $1` aggregates every variant of `WHERE id = 1`, `WHERE id = 42`, etc. That gives you aggregation by query pattern rather than concrete value, which is exactly what analysis needs.
pg_stat_statements shows a query with mean_exec_time = 2ms but total_exec_time = 40000000ms (11 hours). What does that say?
auto_explain: Plans for Slow Queries
`auto_explain` is an extension that automatically logs execution plans for queries exceeding a duration threshold. Unlike pg_stat_statements, it shows not only timing but the full EXPLAIN ANALYZE plan, including which nodes were slow.
auto_explain with log_analyze=true runs EXPLAIN ANALYZE for every slow query, which adds minor overhead (timing measurements). On production, keep log_min_duration at 500ms or higher, otherwise log volume becomes unmanageable.
What is the key advantage of auto_explain over manual EXPLAIN ANALYZE for diagnosis?
log_min_duration_statement
`log_min_duration_statement` is the PostgreSQL parameter for logging slow queries directly to the PostgreSQL log. Unlike auto_explain, it only logs the time and query text (no plan), so overhead is much lower. First step in any performance investigation.
Setting log_min_duration_statement = 0 logs every query, which crushes disk and I/O on a busy server. A reasonable production threshold is 500ms-2000ms depending on application latency SLAs.
Logs contain lines like `LOG: temporary file: path ..., size 524288000`. What does that mean?
pgBadger: Log Analysis
`pgBadger` is a tool that parses and analyzes PostgreSQL logs. It generates an HTML report with statistics: top slow queries, distribution over the day, lock waits, connection counts, checkpoint activity. Indispensable for post-incident analysis: you can reconstruct the load picture over a specific window.
pgBadger relies on a correct `log_line_prefix`. Recommended: `'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '`. Without `%t` (timestamp), pgBadger cannot build a timeline.
pgBadger shows a slow-query spike every day at 03:00. Most likely cause?
Common Slow-Query Antipatterns
Most PostgreSQL performance issues come down to a handful of recurring patterns. Knowing these antipatterns lets you diagnose and fix typical problems quickly.
- **N+1 queries**: the app runs 1 query, gets N rows, then runs N individual queries one per row. Fix: JOIN or IN ($ids) in a single query.
- **SELECT * instead of named columns**: extra columns inflate network traffic and prevent Index Only Scan.
- **LIKE '%pattern%'**: leading wildcard makes the index useless; Seq Scan every time. Fix: the pg_trgm extension for trigram indexes.
- **Functions in WHERE**: `WHERE LOWER(email) = $1` does not use the index on email. Fix: a functional index `CREATE INDEX ON users (LOWER(email))`.
- **Missing LIMIT**: an analytical query without LIMIT in an OLTP service reads millions of rows. Always add LIMIT for pagination.
- **Implicit type cast**: `WHERE user_id = '12345'` (integer vs text) triggers an implicit cast and Seq Scan. Types must match.
A slow query can only be fixed by adding an index
Most slow queries are fixed by rewriting the query itself: removing functions from WHERE, adding LIMIT, replacing N+1 with JOIN, selecting named columns instead of SELECT *
An index helps only if the query is written to use it. A function in WHERE, an implicit cast, a leading wildcard in LIKE: all of these make the index invisible to the planner. Rewrite the query first, then consider indexes.
Query `SELECT * FROM events WHERE DATE(created_at) = '2024-03-15'` does not use the index on created_at. Why?
Key Ideas
- **pg_stat_statements** is the first tool: aggregates queries by pattern, exposes total_exec_time and temp_blks_written; always on, no manual setup per query.
- **auto_explain** logs EXPLAIN ANALYZE automatically for queries above a threshold; catches rare degradations with real data, no reproduction needed.
- **log_min_duration_statement** is a cheap way to capture slow queries; log_temp_files surfaces disk spills caused by low work_mem.
- **Top antipatterns**: functions in WHERE, leading wildcard in LIKE, SELECT *, N+1, missing LIMIT, implicit type cast. Most are fixed by rewriting the query, not by adding indexes.
Related Topics
Slow-query diagnosis connects every PostgreSQL optimization tool:
- EXPLAIN ANALYZE — After pg_stat_statements flags a slow query, the next step is its plan via EXPLAIN ANALYZE
- Planner and Statistics — A slow query often means a bad plan from stale stats; understanding how the planner builds estimates is essential
- Index Maintenance — A missing index, or bloat on an existing one, is a frequent cause of slow queries surfaced via Seq Scan in EXPLAIN
Вопросы для размышления
- Is pg_stat_statements enabled on your production cluster, and when did you last review the slow-query stats?
- What log_min_duration_statement threshold makes sense for your app given its latency SLA?
- Are there queries in the codebase with functions in WHERE (LOWER(), DATE(), EXTRACT()) that may not be using an index?