Window Functions: ROW_NUMBER, RANK, LAG

The task: show each employee with their salary, the department average, and their rank within the department - in a single query, without losing rows. Before SQL:2003 this required three queries or black-magic self-joins. Window functions solve it in one `OVER()` clause.

**Financial analysis**: year-to-date revenue, 30-day moving average, period-over-period percentage change - the standard toolkit of any BI dashboard
**Top-N per group**: most recent transaction per user, best product per category - ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) replaces cumbersome subqueries
**A/B tests**: PERCENT_RANK() and NTILE() for analyzing conversion distributions across cohorts without exporting to Python

What Is a Window

Window functions are perhaps the most underrated tool in SQL. Before they existed, computing an employee's rank within a department required a correlated subquery or a self-join. A window function does the same in one line, while **not collapsing rows** - unlike GROUP BY. Every result row keeps its identity, and the function is computed over the 'window' of neighboring rows.

The key difference from aggregates: `GROUP BY` collapses many rows into one. A window function with `OVER()` computes an aggregate over a group but **returns a value per row** in that group. This shows detailed data and the aggregate in the same result set.

What is the fundamental difference between a window function with `OVER()` and an aggregate with `GROUP BY`?

PARTITION BY and ORDER BY

`OVER()` is an empty window: the function sees all rows in the table. `OVER(PARTITION BY col)` splits rows into independent groups - each partition is processed separately, as if it were its own table. `ORDER BY` inside `OVER()` defines row order **within** the partition - it affects ranking functions and the frame clause, but not the output order of the query.

`ORDER BY` inside `OVER()` and the final `ORDER BY` of the query are independent. The first determines how rows are ordered within a window for computation. The second determines output order. They can differ.

Query: `SELECT name, SUM(salary) OVER (PARTITION BY department ORDER BY hire_date) FROM employees`. What does SUM compute?

ROW_NUMBER, RANK, DENSE_RANK

Three ranking functions look similar but behave fundamentally differently when values tie. `ROW_NUMBER()` is always unique - even with equal values, the order within ties is arbitrary. `RANK()` assigns the same rank to tied values and skips subsequent numbers. `DENSE_RANK()` also gives tied values the same rank, but the next rank has no gap. The right choice depends on business logic.

A common pattern: `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` + `WHERE rn = 1` in a CTE is an efficient way to select one row per group (e.g., the latest transaction per user). PostgreSQL also supports `DISTINCT ON`, but ROW_NUMBER is portable across databases.

Three employees have the same salary of 80000. What sequence of ranks will RANK() produce for them?

LAG and LEAD

`LAG(col, n)` returns the value of a column from the row **n positions back** in the window. `LEAD(col, n)` returns the value **n positions ahead**. The default is n=1. This is the standard solution for 'compare the current value with the previous one' tasks - period-over-period change, date differences, time-series analysis. Without these functions a self-join on a shifted index would be required.

`LAG(col, n, default)` accepts a third argument - a default value returned when no prior row exists (the first row in a partition). Without a default the function returns NULL. This matters when computing differences: the first element in a series has no preceding value.

Query: `LAG(price, 2, 0) OVER (PARTITION BY product_id ORDER BY date)`. What does this return for the second row in the partition?

Frame Clause

The frame clause specifies **exactly which rows** are included in the window for computation. By default when `ORDER BY` is present, the frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` - from the start of the partition to the current row. Without `ORDER BY`, the frame is the entire partition. A frame can be defined with `ROWS` (physical rows) or `RANGE` (rows with the same sort value). The frame clause is the foundation of moving averages.

`ROWS` vs `RANGE`: with duplicate ORDER BY values they behave differently. `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING` is exactly 3 physical rows. `RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING` is all rows whose value falls in [current-1, current+1]. For moving averages over a fixed-size window, use `ROWS`.

Which frame clause is needed to compute a moving average over the last 5 days including the current day?

Running Totals and Practical Patterns

Running totals (cumulative sums) are the most common application of window functions in analytics. Classic scenarios: year-to-date revenue, account balance, percentile within a sample. PostgreSQL also supports `PERCENT_RANK()`, `CUME_DIST()`, and `NTILE(n)` for statistical distribution analysis.

Window functions are computed **after** WHERE, GROUP BY, and HAVING, but **before** the outer SELECT and ORDER BY. For this reason a window function cannot be used in WHERE - a CTE or subquery is needed. This is the key constraint when debugging complex queries.

Window functions in a query are always slow and should be replaced with subqueries

PostgreSQL optimizes window functions effectively. A self-join or correlated subquery for the same task is almost always slower. EXPLAIN ANALYZE will show the real plan.

The PostgreSQL planner can reuse an already-sorted set of rows for multiple window functions with the same OVER(). The WindowAgg algorithm runs in O(n); a self-join runs in O(n^2).

Why can't `WHERE ROW_NUMBER() OVER (ORDER BY salary) = 1` be written directly in a query?

Key ideas

**OVER()** marks a window function: computes an aggregate over a window of rows without collapsing the result, unlike GROUP BY
**PARTITION BY** creates independent windows; **ORDER BY** inside OVER defines order for ranking and running totals
**Frame clause** (ROWS/RANGE BETWEEN ...) precisely sets window boundaries - the foundation of moving averages; window functions run after WHERE, so filtering on them requires a CTE

Вопросы для размышления

How would you choose between ROW_NUMBER, RANK, and DENSE_RANK when building a leaderboard with possible ties?
Why is ROWS BETWEEN preferred over RANGE BETWEEN for moving averages on time series with duplicate dates?
How would you implement 'top 3 orders by amount per customer' without window functions, and how much more complex would that solution be?

Связанные уроки

pg-08-aggregation — Window functions are non-collapsing aggregates; aggregation semantics must be understood first
pg-10-cte — CTEs are the natural way to pre-filter data before applying window functions in analytics queries
pg-12-btree — Window ORDER BY can use B-tree indexes to avoid sorting; index knowledge improves window query tuning
pg-09-subqueries — Correlated subqueries vs window functions: both compute per-row context, but windows are far more efficient
db-06-sql-advanced