PostgreSQL

Subqueries: EXISTS, IN, ANY, ALL

JOIN is the primary tool for combining tables. But some questions express more naturally another way: "find employees whose salary exceeds the average for their own department", or "separate orders with no matching customer". Subqueries allow building queries from queries - like functions from functions. The key is knowing when this is efficient and when it is a trap.

**Fraud detection:** find transactions whose amount exceeds the user's average over the past 30 days - a correlated subquery or window function
**Reporting:** departments with no employee holding a specific skill - NOT EXISTS is far safer than NOT IN when columns are nullable
**Data quality:** find orphan records (orders without a customer, line items without a product) - LEFT JOIN + IS NULL or NOT EXISTS

Scalar Subqueries

A **scalar subquery** is a SELECT nested inside another expression that returns exactly one row and one column. It can appear anywhere a scalar value is allowed: in SELECT, WHERE, HAVING, and ORDER BY.

**Key constraint:** if a scalar subquery returns NULL (no rows) or more than one row, PostgreSQL raises an error. For safety, use `LIMIT 1` or an aggregate function.

A scalar subquery in SELECT runs:

Table Subqueries in FROM

A subquery in the FROM clause returns a virtual table - a **derived table** (or inline view). It can be treated like any real table: JOINed, filtered, grouped. This lets you split a complex query into readable layers without temp tables.

**ANY and ALL:** `> ANY(subquery)` is true if the value is greater than at least one result. `> ALL(subquery)` is true if it is greater than every result. `= ANY(subquery)` is equivalent to `IN (subquery)`.

A subquery in the FROM clause must:

Correlated Subqueries

A **correlated subquery** references columns from the outer query. It re-executes for *every row* of the outer query - so N outer rows means N subquery executions. Flexible, but watch the plan: a nested loop here is O(N*M).

Correlated subqueries are indispensable for "find by own group" queries, but the PostgreSQL planner can often rewrite them as a JOIN or window function. `EXPLAIN ANALYZE` shows whether that optimization happened.

**LATERAL:** an extension of correlated subqueries for FROM. `FROM employees e, LATERAL (SELECT ... WHERE dept_id = e.id LIMIT 3) top3` allows a correlated subquery directly in FROM, returning multiple rows per outer row.

A correlated subquery with 10,000 rows in the outer table executes how many times?

EXISTS vs IN: When to Use Each

The question "does a related row exist" can be asked two ways: `IN (subquery)` and `EXISTS (subquery)`. Semantically equivalent for NOT NULL data, they behave differently when NULL values are involved - and the difference matters.

**PostgreSQL optimization:** the planner often transforms `IN (subquery)` into a semi-join, and `EXISTS` as well. In PostgreSQL 14+, the difference in plan between IN and EXISTS is often zero. The primary rule: use `NOT EXISTS` instead of `NOT IN` when nullable columns are involved.

A query `WHERE id NOT IN (SELECT manager_id FROM dept)` returns 0 rows, but the data should match. The most likely cause:

Subquery Performance

Subqueries are not magic: a poorly written one can turn a millisecond query into a minute-long one. Key antipatterns: a correlated subquery in SELECT instead of a JOIN, NOT IN against large tables, a subquery without an index on the correlation column.

The PostgreSQL planner can automatically transform many subqueries into JOINs. But this is not guaranteed - it depends on statistics, table sizes, and available indexes. `EXPLAIN ANALYZE` is the only way to confirm the plan is efficient.

**Practical heuristic:** if a non-correlated WHERE subquery returns a large dataset, the planner has likely already turned it into a HashJoin. A correlated subquery in SELECT, on the other hand, is almost always N extra table lookups.

EXISTS is always faster than IN

In PostgreSQL 14+, the planner transforms both into a semi-join. The real difference: NOT IN is unsafe with NULL; NOT EXISTS is safe. For performance, check EXPLAIN ANALYZE for the specific case.

The myth originated in Oracle in the 1990s, where EXISTS and IN were processed differently. PostgreSQL has long unified them into a single plan. What matters is correctness (NULL safety), not assumptions about speed.

The query `SELECT name, (SELECT dept_name FROM departments WHERE id = e.dept_id) FROM employees e` - what is its primary problem?

Subqueries in PostgreSQL

Scalar subquery: exactly 1 row, 1 column - valid in SELECT/WHERE/HAVING/ORDER BY
Derived table (FROM subquery): virtual table, alias required; foundation for complex aggregations
Correlated subquery: references the outer row, executes N times - watch for nested-loop O(N*M) in the plan
NOT IN with NULL in the subquery returns 0 rows - always use NOT EXISTS for nullable columns
EXPLAIN ANALYZE is the only way to confirm the plan is truly efficient

Вопросы для размышления

A correlated subquery in WHERE executes N times. Under what conditions can this be faster than a JOIN?
NOT IN is safe only with NOT NULL columns. How can EXPLAIN ANALYZE reveal whether the planner handled NULLs correctly?
In what cases is a derived table in FROM preferable to a CTE from the PostgreSQL planner's perspective?

Связанные уроки

pg-07-joins — Many subqueries can be rewritten as JOINs and vice versa; knowing both lets you pick for readability and performance
pg-10-cte — CTEs are named subqueries; mastering inline subqueries is prerequisite to understanding CTE semantics
pg-08-aggregation — Scalar subqueries in HAVING are a common pattern; aggregation context needed
pg-06-select — Subqueries are nested SELECTs; basic SELECT syntax must be solid first
db-05-sql-basics