PostgreSQL
SELECT: filtering, sorting, LIMIT
`SELECT * FROM orders WHERE status NOT IN ('cancelled', NULL)` - this query returns 0 rows regardless of what the table contains. NULL inside an IN list expands to `status != NULL`, which evaluates to UNKNOWN for every row. SQL's three-valued boolean algebra silently destroys results, without raising an error.
- **Pagination API** - switching from OFFSET to keyset pagination cuts page-500 load time from 2 seconds to 5 ms
- **Analytics dashboard** - `DISTINCT ON` replaces a subquery with ROW_NUMBER for the 'latest event per user' pattern, simplifying the query significantly
- **Financial reports** - NULLIF prevents `division by zero` when aggregating over periods with zero activity
WHERE: operators and types
WHERE filters rows before aggregation and grouping. The choice of operator affects not just correctness but also whether the planner can use an index.
**A leading wildcard in LIKE bypasses the B-tree index.** `LIKE '%gmail.com'` causes a full Seq Scan. For suffix and substring search, use `pg_trgm` with a GIN index: `CREATE INDEX ON users USING GIN (email gin_trgm_ops);`
Implicit type casting can break index usage. When a column is `int4` but the query passes a string literal:
**EXISTS vs IN** for large subqueries: `IN` materializes the entire subquery in memory. `EXISTS` short-circuits on the first match and typically produces a more efficient plan.
**ALL** is the counterpart to ANY: `amount > ALL(ARRAY[10, 20, 50])` means greater than every element. Rarely needed, but useful for defensive range checks.
Column `user_id` is `int4` with a B-tree index. Which query is guaranteed to use the index?
ORDER BY, LIMIT, OFFSET and keyset pagination
ORDER BY determines the row order in the result. Without it, order is undefined - PostgreSQL returns rows in whatever order is convenient for the planner (often a heap scan order, but not always).
**OFFSET scales poorly.** To return page 100 with OFFSET 9900 LIMIT 100, PostgreSQL reads and discards 9,900 rows - the work is done but the result is thrown away.
| Method | Speed at page 1 | Speed at page 500 | Random access | Stable with new rows |
|---|---|---|---|---|
| OFFSET/LIMIT | fast | slow (O(n)) | yes | no (duplicates) |
| Keyset (cursor) | fast | fast (O(1)) | no | yes (stable) |
Keyset pagination requires a stable unique sort key (typically `id` or `(created_at, id)`). For APIs without random page jumps it is almost always the better choice.
The query `SELECT * FROM events ORDER BY id LIMIT 20 OFFSET 50000` is slow. What is happening internally?
DISTINCT ON: PostgreSQL's way to pick the first row per group
`DISTINCT ON` is a PostgreSQL extension to standard SQL. It returns exactly one row for each unique combination of the specified columns, selecting the "first" row according to ORDER BY.
**Rule:** columns in `DISTINCT ON (...)` must match the leading columns in `ORDER BY`. PostgreSQL enforces this - violating it raises: `SELECT DISTINCT ON expressions must match initial ORDER BY expressions`.
**Performance:** PostgreSQL can often use an index on `(user_id, created_at DESC)` for DISTINCT ON without a separate sort step - the plan shows `Index Scan` instead of `Sort`.
The goal is to retrieve the most recent order per user (all columns). Which approach is correct in PostgreSQL?
CASE, COALESCE, NULLIF and guarding against division by zero
`CASE` is the only conditional construct in standard SQL. It works in SELECT (computed columns), WHERE (complex conditions), and ORDER BY (custom sort order).
`COALESCE(a, b, c)` returns the first non-NULL argument. It is the standard tool for default values and column fallback chains.
**COALESCE is lazy:** it evaluates arguments left to right and stops at the first non-null. This matters when arguments contain subqueries or expensive expressions.
A query contains `total_revenue / total_orders`. When `total_orders = 0`, what happens and how should it be fixed?
NULL: three-valued logic and traps
NULL in SQL is not a value - it is the absence of a value. Any comparison involving NULL returns UNKNOWN (the third value in SQL's boolean logic, alongside TRUE and FALSE), and WHERE only passes rows where the result is TRUE.
**The NOT IN trap with NULLs in the list.** If a subquery or list contains even one NULL, `NOT IN` returns an empty result for every row - mathematically correct, but almost never the intended behavior.
**Rule of thumb:** always handle NULL explicitly via `IS NULL` / `IS NOT NULL` / `COALESCE`. Never rely on `!=` or `NOT IN` when the data may contain NULLs.
WHERE col != 'x' returns all rows except those where col = 'x'
WHERE col != 'x' returns rows where col = 'x' is false AND col is not NULL. Rows where col IS NULL are silently excluded.
NULL participates in comparisons as UNKNOWN. UNKNOWN != TRUE, so WHERE does not pass those rows. This behavior follows the SQL standard but consistently surprises even experienced developers.
The `payments` table has 1,000 rows. The query `SELECT * FROM payments WHERE refunded_at != '2024-01-01'` returns 800 rows. Why not 1,000?
SELECT: filtering, sorting, LIMIT
- A leading wildcard in LIKE (`%text`) bypasses the B-tree index - use GIN + pg_trgm instead
- OFFSET degrades on deep pages: keyset pagination (`WHERE id > last`) runs in O(1)
- DISTINCT ON is PostgreSQL's way to pick the first row per group - simpler than ROW_NUMBER for common cases
- NULL inside NOT IN makes the entire result empty - use NOT EXISTS as the NULL-safe alternative
Related topics
SELECT is the foundation for more complex queries and query optimization.
- Indexes in PostgreSQL — WHERE operators directly determine whether an index can be used
- JOIN and subqueries — EXISTS/IN patterns are often rewritten as JOINs with a better plan
- EXPLAIN and the query planner — Always verify the actual plan for any non-trivial query
Вопросы для размышления
- In what circumstances is OFFSET pagination still an acceptable choice, and when does switching to keyset pagination become necessary?
- Why is `NOT IN` with a subquery considered an anti-pattern in production code, and how does its behavior differ from `NOT EXISTS`?
- How does NULL's three-valued logic affect aggregate functions - COUNT(*) vs COUNT(col) - and what does that mean for analytical queries?
Связанные уроки
- pg-05-dml — DML covers INSERT/UPDATE/DELETE; SELECT is the R in CRUD and uses the same table context
- pg-07-joins — JOINs extend SELECT to multiple tables; mastering single-table SELECT is the prerequisite
- pg-08-aggregation — GROUP BY and aggregate functions layer on top of the WHERE/ORDER BY learned in SELECT basics
- pg-12-btree — B-tree indexes accelerate the ORDER BY and range predicates introduced in this lesson
- db-06-sql-advanced