PostgreSQL

Hunting Slow Queries

The application suddenly got 10x slower after a deploy. App logs are clean. How do you find the guilty query among thousands per second, in 5 minutes instead of 5 hours?

**Basecamp** publicly described an incident where N+1 queries in a new feature drove 40000 queries per second instead of the expected 500. pg_stat_statements pinpointed it in 2 minutes: one query pattern with 80000 calls/min and a 0.3ms mean produced 70% of total DB load.
**GitLab** baked a mandatory pg_stat_statements check into its SRE runbook: in any performance incident, the first step is to compare the top 10 queries by total_exec_time against the prior week's baseline. Regressions get isolated in 5-10 minutes.
**Heroku** enables log_min_duration_statement = 2000 on every Postgres plan and routes slow-query alerts to Datadog. Teams get a Slack notification within a minute of a slow query appearing.
**Notion**, during post-incident review, found that 60% of slow queries used `SELECT *` on a blocks table with 200+ columns. Switching to named columns cut network traffic by 40% and dropped average API response time from 1.2s to 0.3s.

pg_stat_statements: Per-Query Analytics

`pg_stat_statements` is a PostgreSQL extension that accumulates aggregated statistics across all executed queries: total time, call count, block-level operations. It is the first tool for finding slow queries in production. Unlike logs, it aggregates identical queries with different parameters into one row.

pg_stat_statements normalizes queries: `WHERE id = $1` aggregates every variant of `WHERE id = 1`, `WHERE id = 42`, etc. That gives you aggregation by query pattern rather than concrete value, which is exactly what analysis needs.

pg_stat_statements shows a query with mean_exec_time = 2ms but total_exec_time = 40000000ms (11 hours). What does that say?

auto_explain: Plans for Slow Queries

`auto_explain` is an extension that automatically logs execution plans for queries exceeding a duration threshold. Unlike pg_stat_statements, it shows not only timing but the full EXPLAIN ANALYZE plan, including which nodes were slow.

auto_explain with log_analyze=true runs EXPLAIN ANALYZE for every slow query, which adds minor overhead (timing measurements). On production, keep log_min_duration at 500ms or higher, otherwise log volume becomes unmanageable.

What is the key advantage of auto_explain over manual EXPLAIN ANALYZE for diagnosis?

log_min_duration_statement

`log_min_duration_statement` is the PostgreSQL parameter for logging slow queries directly to the PostgreSQL log. Unlike auto_explain, it only logs the time and query text (no plan), so overhead is much lower. First step in any performance investigation.

Setting log_min_duration_statement = 0 logs every query, which crushes disk and I/O on a busy server. A reasonable production threshold is 500ms-2000ms depending on application latency SLAs.

Logs contain lines like `LOG: temporary file: path ..., size 524288000`. What does that mean?

pgBadger: Log Analysis

`pgBadger` is a tool that parses and analyzes PostgreSQL logs. It generates an HTML report with statistics: top slow queries, distribution over the day, lock waits, connection counts, checkpoint activity. Indispensable for post-incident analysis: you can reconstruct the load picture over a specific window.

pgBadger relies on a correct `log_line_prefix`. Recommended: `'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '`. Without `%t` (timestamp), pgBadger cannot build a timeline.

pgBadger shows a slow-query spike every day at 03:00. Most likely cause?

Common Slow-Query Antipatterns

Most PostgreSQL performance issues come down to a handful of recurring patterns. Knowing these antipatterns lets you diagnose and fix typical problems quickly.

**N+1 queries**: the app runs 1 query, gets N rows, then runs N individual queries one per row. Fix: JOIN or IN ($ids) in a single query.
**SELECT * instead of named columns**: extra columns inflate network traffic and prevent Index Only Scan.
**LIKE '%pattern%'**: leading wildcard makes the index useless; Seq Scan every time. Fix: the pg_trgm extension for trigram indexes.
**Functions in WHERE**: `WHERE LOWER(email) = $1` does not use the index on email. Fix: a functional index `CREATE INDEX ON users (LOWER(email))`.
**Missing LIMIT**: an analytical query without LIMIT in an OLTP service reads millions of rows. Always add LIMIT for pagination.
**Implicit type cast**: `WHERE user_id = '12345'` (integer vs text) triggers an implicit cast and Seq Scan. Types must match.

A slow query can only be fixed by adding an index

Most slow queries are fixed by rewriting the query itself: removing functions from WHERE, adding LIMIT, replacing N+1 with JOIN, selecting named columns instead of SELECT *

An index helps only if the query is written to use it. A function in WHERE, an implicit cast, a leading wildcard in LIKE: all of these make the index invisible to the planner. Rewrite the query first, then consider indexes.

Query `SELECT * FROM events WHERE DATE(created_at) = '2024-03-15'` does not use the index on created_at. Why?

Key Ideas

**pg_stat_statements** is the first tool: aggregates queries by pattern, exposes total_exec_time and temp_blks_written; always on, no manual setup per query.
**auto_explain** logs EXPLAIN ANALYZE automatically for queries above a threshold; catches rare degradations with real data, no reproduction needed.
**log_min_duration_statement** is a cheap way to capture slow queries; log_temp_files surfaces disk spills caused by low work_mem.
**Top antipatterns**: functions in WHERE, leading wildcard in LIKE, SELECT *, N+1, missing LIMIT, implicit type cast. Most are fixed by rewriting the query, not by adding indexes.

Вопросы для размышления

Is pg_stat_statements enabled on your production cluster, and when did you last review the slow-query stats?
What log_min_duration_statement threshold makes sense for your app given its latency SLA?
Are there queries in the codebase with functions in WHERE (LOWER(), DATE(), EXTRACT()) that may not be using an index?

Связанные уроки

db-11-query-optimization

PostgreSQL

Hunting Slow Queries

The application suddenly got 10x slower after a deploy. App logs are clean. How do you find the guilty query among thousands per second, in 5 minutes instead of 5 hours?

**Basecamp** publicly described an incident where N+1 queries in a new feature drove 40000 queries per second instead of the expected 500. pg_stat_statements pinpointed it in 2 minutes: one query pattern with 80000 calls/min and a 0.3ms mean produced 70% of total DB load.
**GitLab** baked a mandatory pg_stat_statements check into its SRE runbook: in any performance incident, the first step is to compare the top 10 queries by total_exec_time against the prior week's baseline. Regressions get isolated in 5-10 minutes.
**Heroku** enables log_min_duration_statement = 2000 on every Postgres plan and routes slow-query alerts to Datadog. Teams get a Slack notification within a minute of a slow query appearing.
**Notion**, during post-incident review, found that 60% of slow queries used `SELECT *` on a blocks table with 200+ columns. Switching to named columns cut network traffic by 40% and dropped average API response time from 1.2s to 0.3s.

pg_stat_statements: Per-Query Analytics

pg_stat_statements shows a query with mean_exec_time = 2ms but total_exec_time = 40000000ms (11 hours). What does that say?

auto_explain: Plans for Slow Queries

What is the key advantage of auto_explain over manual EXPLAIN ANALYZE for diagnosis?

log_min_duration_statement

Setting log_min_duration_statement = 0 logs every query, which crushes disk and I/O on a busy server. A reasonable production threshold is 500ms-2000ms depending on application latency SLAs.

Logs contain lines like `LOG: temporary file: path ..., size 524288000`. What does that mean?

pgBadger: Log Analysis

pgBadger relies on a correct `log_line_prefix`. Recommended: `'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '`. Without `%t` (timestamp), pgBadger cannot build a timeline.

pgBadger shows a slow-query spike every day at 03:00. Most likely cause?

Common Slow-Query Antipatterns

Most PostgreSQL performance issues come down to a handful of recurring patterns. Knowing these antipatterns lets you diagnose and fix typical problems quickly.

**N+1 queries**: the app runs 1 query, gets N rows, then runs N individual queries one per row. Fix: JOIN or IN ($ids) in a single query.
**SELECT * instead of named columns**: extra columns inflate network traffic and prevent Index Only Scan.
**LIKE '%pattern%'**: leading wildcard makes the index useless; Seq Scan every time. Fix: the pg_trgm extension for trigram indexes.
**Functions in WHERE**: `WHERE LOWER(email) = $1` does not use the index on email. Fix: a functional index `CREATE INDEX ON users (LOWER(email))`.
**Missing LIMIT**: an analytical query without LIMIT in an OLTP service reads millions of rows. Always add LIMIT for pagination.
**Implicit type cast**: `WHERE user_id = '12345'` (integer vs text) triggers an implicit cast and Seq Scan. Types must match.

A slow query can only be fixed by adding an index

Most slow queries are fixed by rewriting the query itself: removing functions from WHERE, adding LIMIT, replacing N+1 with JOIN, selecting named columns instead of SELECT *

Query `SELECT * FROM events WHERE DATE(created_at) = '2024-03-15'` does not use the index on created_at. Why?

Key Ideas

**pg_stat_statements** is the first tool: aggregates queries by pattern, exposes total_exec_time and temp_blks_written; always on, no manual setup per query.
**auto_explain** logs EXPLAIN ANALYZE automatically for queries above a threshold; catches rare degradations with real data, no reproduction needed.
**log_min_duration_statement** is a cheap way to capture slow queries; log_temp_files surfaces disk spills caused by low work_mem.
**Top antipatterns**: functions in WHERE, leading wildcard in LIKE, SELECT *, N+1, missing LIMIT, implicit type cast. Most are fixed by rewriting the query, not by adding indexes.

Вопросы для размышления

Is pg_stat_statements enabled on your production cluster, and when did you last review the slow-query stats?
What log_min_duration_statement threshold makes sense for your app given its latency SLA?
Are there queries in the codebase with functions in WHERE (LOWER(), DATE(), EXTRACT()) that may not be using an index?

Связанные уроки

db-11-query-optimization

Hunting Slow Queries

pg_stat_statements: Per-Query Analytics

auto_explain: Plans for Slow Queries

log_min_duration_statement

pgBadger: Log Analysis

Common Slow-Query Antipatterns

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Hunting Slow Queries

pg_stat_statements: Per-Query Analytics

auto_explain: Plans for Slow Queries

log_min_duration_statement

pgBadger: Log Analysis

Common Slow-Query Antipatterns

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки