Databases

SQL: SELECT, JOIN, GROUP BY

Chamberlin and Boyce designed SEQUEL in 1974 with one goal: a business analyst without a math degree should be able to ask a database 'show me last quarter sales'. That same language now processes 40 million queries per second at TikTok and stores training data for large language models. The math Codd wrote in 1970 turned out to be universal.

ML pipeline: SELECT + JOIN to assemble training datasets from multiple normalized tables with metadata
Feature engineering: GROUP BY + aggregations as efficient server-side replacement for pandas groupby on large datasets
A/B testing: GROUP BY variant + AVG(metric) for experiment statistics directly in the database
Experiment tracking: subqueries to find the best-performing model per type from MLflow backend tables
Recommendation systems: multi-table JOINs across users, items, and interaction event logs

SELECT and WHERE: Query Anatomy

PostgreSQL processes a query in 0.1ms. On 10 million rows. Without breaking a sweat. The secret is not a trick - it is that SELECT with WHERE maps directly to relational algebra that Edgar Codd formalized in 1970. Chamberlin and Boyce built SEQUEL in 1974 so that non-mathematicians could write these queries. The language survived five decades because the underlying math is sound.

**Execution order matters:** FROM -> WHERE -> SELECT -> ORDER BY -> LIMIT. This is why column aliases defined in SELECT are not visible in WHERE - WHERE runs first.

**BETWEEN is inclusive on both ends:** `WHERE age BETWEEN 18 AND 65` equals `age >= 18 AND age <= 65`. For date ranges: `BETWEEN '2024-01-01' AND '2024-12-31'` includes the entire last day (up to midnight).

The query `SELECT expr AS alias FROM t WHERE alias > 5` will:

JOIN: Connecting Tables

The claim that JOINs are slow is a myth. A slow JOIN is a symptom of a missing index on the join column. Without an index, the database scans both tables - that is the bottleneck, not JOIN itself. Codd's 1970 relational model was designed precisely around this idea: data lives in normalized tables, JOIN reconstructs the relationships on demand.

**When to use each:** INNER - only matching data needed (most common). LEFT - preserve all records from the primary table even without matches (audit reports, activity tracking). FULL OUTER - detect discrepancies between two data sources.

A LEFT JOIN between users and orders returns a row with user.name='Bob' and order.amount=NULL. This means:

GROUP BY and HAVING: Aggregation

GROUP BY does not group data in the colloquial sense - it partitions the table into buckets and collapses each bucket into one row. HAVING exists for one precise reason: WHERE cannot filter on aggregated values because aggregation happens after row filtering. This is not a design flaw - it follows directly from the execution order. Feature engineering in ML is, at its core, GROUP BY with aggregation functions applied to raw event logs.

**GROUP BY rule:** SELECT can only contain columns listed in GROUP BY or aggregate functions. PostgreSQL enforces this strictly. MySQL historically allowed violations - the returned values were undefined.

**A/B test analysis in pure SQL:** `SELECT variant, COUNT(*) AS users, AVG(metric) AS avg_metric FROM experiment_events GROUP BY variant HAVING COUNT(*) > 100` - standard pattern for quick significance checks without leaving the database.

Why does `WHERE COUNT(*) > 5` cause an error while `HAVING COUNT(*) > 5` works?

Subqueries: Queries Inside Queries

IBM System R in 1974 was the first database to implement subqueries. The motivation was practical: some questions cannot be answered in a single pass. Find all employees earning above average salary - that requires knowing the average first, then filtering. A subquery computes one result, hands it to the outer query, and the outer query uses it as a value or as a condition.

**Uncorrelated vs correlated:** an uncorrelated subquery runs once (fast). A correlated subquery runs once per row of the outer query - expensive on large tables. Replace with JOIN or window functions when performance matters.

A correlated subquery executes:

Set Operations: UNION, INTERSECT, EXCEPT

Oracle introduced UNION in its first commercial release in 1979, implementing Codd's relational algebra operators directly. The counterintuitive part: UNION removes duplicates by default, and to do so it sorts the entire result set. UNION ALL skips that sort. On millions of rows, the difference is measurable. When duplicates are impossible or acceptable, UNION ALL is always the right choice.

**UNION vs UNION ALL:** if duplicates are either impossible or acceptable, always use UNION ALL. UNION sorts the entire result set to find and remove duplicates, adding O(n log n) overhead.

**ORDER BY in UNION:** cannot be applied to individual subqueries. A single ORDER BY clause at the end of the entire UNION expression sorts the final combined result.

UNION works like append - it simply concatenates two result sets

UNION is a set operation that removes duplicates by sorting. UNION ALL is the append equivalent. Use UNION ALL when duplicates are acceptable - it is significantly faster

Set theory defines union as A union B containing unique elements. SQL implements this mathematically. UNION ALL is a practical extension for cases where uniqueness is not required and performance matters

UNION and UNION ALL both combine [1,2,3] and [2,3,4]. What does each return?