JOIN Algorithms: Nested Loop, Hash, Merge
JOIN of two tables, 10 million rows each. Nested Loop: 5 hours. Hash Join: 3 minutes. Merge Join: 45 seconds (if indexes exist). Same result, 400x gap, all from algorithm choice.
- **Airbnb**, on booking-events JOIN with user_attributes (both > 500M rows), set work_mem=2GB for reporting sessions and forced Parallel Hash Join via max_parallel_workers_per_gather=8. Query time dropped from 45 minutes to 4 minutes.
- **Stripe** found the planner picking Nested Loop for payments * customers because of skewed stats on customers (many duplicates in customer_type). Extended statistics fixed the estimates, the planner switched to Hash Join, and latency went from 12s to 0.8s.
- **GitLab** uses Merge Join for the ci_builds * ci_jobs join (the largest tables in the schema) via composite indexes: both inputs are sorted on (pipeline_id, id), sorting is free, which wins over Hash Join under work_mem pressure.
- **Notion**, while replicating blocks between workspaces, hit Hash Join degradation with Batches:16 because work_mem was 4MB on an OLTP server. Solution: a separate connection pool with work_mem=512MB for bulk operations.
Nested Loop Join
**Nested Loop** is the simplest JOIN algorithm: for every row in the outer table, scan the inner one for matches. If inner has an index on the join column, the lookup is O(log N), which makes Nested Loop optimal for small datasets with highly selective join conditions.
Nested Loop turns dangerous on a large outer with no inner index: it degrades to O(M*N). The planner avoids it on big row estimates and picks Hash Join. If the planner misestimates and picks Nested Loop for 100k * 100k, it is a time disaster.
Under what condition is Nested Loop JOIN most efficient?
Hash Join
**Hash Join** builds a hash table from the smaller (build) side of the JOIN, then scans the larger (probe) side, checking each row through the hash. Complexity is O(M+N), linear, no indexes required. That makes Hash Join the workhorse for joining large tables.
Critical parameter: `Batches`. If `Batches > 1`, the hash table did not fit in `work_mem` and spilled to disk (grace hash join). Every extra batch forces another scan of the probe table. `Batches: 8` means the probe table is read 8 times, a potentially huge performance hit.
EXPLAIN ANALYZE shows `Hash Join: Batches: 4`. What does it mean and what should you do?
Merge Join
**Merge Join** requires both sides of the JOIN to be sorted by the join key. Two pointers then advance in sync: O(M+N) like Hash Join, but without a hash table. Advantage: if the data is already sorted (via an index), sorting is free.
| Algorithm | Complexity | Requirements | Best case |
|---|---|---|---|
| Nested Loop | O(M * log N) | Index on inner | Small outer + indexed inner |
| Hash Join | O(M + N) | work_mem for the hash table | Large tables without indexes |
| Merge Join | O(M + N) | Sorted inputs | Indexes on join keys, ORDER BY |
Merge Join was picked for two large tables without indexes on the join keys. What will the planner add to the plan?
work_mem and JOIN Performance
`work_mem` sets how much RAM is available per sort or hash operation. Crucial fact: it is a per-operation limit, not per query. A complex query with 5 hash joins and 3 sorts can use up to `work_mem * 8` of memory at once. At 100 concurrent connections that is `work_mem * 8 * 100` worst-case.
Production recommendation: keep global work_mem at 4-16MB, and set higher values for analytical queries via `SET work_mem` at session start. GitLab uses `statement_timeout` + a separate connection pool with large work_mem for reporting queries so analytics does not compete with OLTP.
Global work_mem was raised from 4MB to 256MB on a server with 32GB RAM and 200 concurrent connections. What's the risk?
Parallel JOINs
PostgreSQL can parallelize Hash Join and Merge Join (but not Nested Loop over inner) given enough resources. Parallelism kicks in when estimated rows exceed `min_parallel_table_scan_size` (default 8MB) and the query is not running inside a function or transaction with specific constraints.
Parallel Hash Join (PG11+) uses a shared hash table across workers. That is more efficient than every worker building its own copy. Before PG11 each worker in a parallel hash join built a separate table, which limited speedup.
Hash Join always beats Nested Loop on large tables
Nested Loop with an index can beat Hash Join even on large tables when the outer side filters down to few rows
If a query filters the outer table down to 10 rows via WHERE, Nested Loop does 10 * log(N) operations. That is incomparably better than Hash Join, which has to scan both sides to build the hash table. The planner accounts for this via cardinality estimates; the problem appears when estimates are wrong.
Parallel Hash Join with 4 workers and work_mem=64MB. How much memory can the hash table use?
Key Ideas
- **Nested Loop**: O(M * log N) with an index; perfect for small sets, disastrous without indexes on a large outer.
- **Hash Join**: O(M+N); the standard for large tables, but needs work_mem; Batches > 1 means disk spill and rereading the probe side.
- **Merge Join**: O(M+N); wins when inputs are already sorted (indexes), otherwise pays for Sort nodes.
- **work_mem**: per-operation limit; raising it globally is risky under many concurrent connections; use SET LOCAL for analytical queries.
Related Topics
JOIN algorithms are tightly bound to other PostgreSQL areas:
- EXPLAIN ANALYZE — The only way to see which algorithm was picked and whether Batches > 1 (disk spill) occurred
- Planner and Statistics — JOIN algorithm choice depends on cardinality estimates; bad stats produce suboptimal plans
- Index Types — Indexes on join keys make Nested Loop and Merge Join efficient; without them Hash Join is often the only sensible choice
Вопросы для размышления
- Which JOIN algorithm shows up most often in the plans of your slowest queries, and does that match your expectations?
- Are there tables in your schema with frequent JOINs where adding an index on the join key could flip the algorithm from Hash Join to Nested Loop?
- How is work_mem managed in your project: one global limit, or separate pools for OLTP and analytics?