Databases
SQL: SELECT, JOIN, GROUP BY
Chamberlin and Boyce designed SEQUEL in 1974 with one goal: a business analyst without a math degree should be able to ask a database 'show me last quarter sales'. That same language now processes 40 million queries per second at TikTok and stores training data for large language models. The math Codd wrote in 1970 turned out to be universal.
- ML pipeline: SELECT + JOIN to assemble training datasets from multiple normalized tables with metadata
- Feature engineering: GROUP BY + aggregations as efficient server-side replacement for pandas groupby on large datasets
- A/B testing: GROUP BY variant + AVG(metric) for experiment statistics directly in the database
- Experiment tracking: subqueries to find the best-performing model per type from MLflow backend tables
- Recommendation systems: multi-table JOINs across users, items, and interaction event logs
SELECT and WHERE: Query Anatomy
PostgreSQL processes a query in 0.1ms. On 10 million rows. Without breaking a sweat. The secret is not a trick - it is that SELECT with WHERE maps directly to relational algebra that Edgar Codd formalized in 1970. Chamberlin and Boyce built SEQUEL in 1974 so that non-mathematicians could write these queries. The language survived five decades because the underlying math is sound.
**Execution order matters:** FROM -> WHERE -> SELECT -> ORDER BY -> LIMIT. This is why column aliases defined in SELECT are not visible in WHERE - WHERE runs first.
**BETWEEN is inclusive on both ends:** `WHERE age BETWEEN 18 AND 65` equals `age >= 18 AND age <= 65`. For date ranges: `BETWEEN '2024-01-01' AND '2024-12-31'` includes the entire last day (up to midnight).
The query `SELECT expr AS alias FROM t WHERE alias > 5` will:
JOIN: Connecting Tables
The claim that JOINs are slow is a myth. A slow JOIN is a symptom of a missing index on the join column. Without an index, the database scans both tables - that is the bottleneck, not JOIN itself. Codd's 1970 relational model was designed precisely around this idea: data lives in normalized tables, JOIN reconstructs the relationships on demand.
**When to use each:** INNER - only matching data needed (most common). LEFT - preserve all records from the primary table even without matches (audit reports, activity tracking). FULL OUTER - detect discrepancies between two data sources.
A LEFT JOIN between users and orders returns a row with user.name='Bob' and order.amount=NULL. This means:
GROUP BY and HAVING: Aggregation
GROUP BY does not group data in the colloquial sense - it partitions the table into buckets and collapses each bucket into one row. HAVING exists for one precise reason: WHERE cannot filter on aggregated values because aggregation happens after row filtering. This is not a design flaw - it follows directly from the execution order. Feature engineering in ML is, at its core, GROUP BY with aggregation functions applied to raw event logs.
**GROUP BY rule:** SELECT can only contain columns listed in GROUP BY or aggregate functions. PostgreSQL enforces this strictly. MySQL historically allowed violations - the returned values were undefined.
**A/B test analysis in pure SQL:** `SELECT variant, COUNT(*) AS users, AVG(metric) AS avg_metric FROM experiment_events GROUP BY variant HAVING COUNT(*) > 100` - standard pattern for quick significance checks without leaving the database.
Why does `WHERE COUNT(*) > 5` cause an error while `HAVING COUNT(*) > 5` works?
Subqueries: Queries Inside Queries
IBM System R in 1974 was the first database to implement subqueries. The motivation was practical: some questions cannot be answered in a single pass. Find all employees earning above average salary - that requires knowing the average first, then filtering. A subquery computes one result, hands it to the outer query, and the outer query uses it as a value or as a condition.
**Uncorrelated vs correlated:** an uncorrelated subquery runs once (fast). A correlated subquery runs once per row of the outer query - expensive on large tables. Replace with JOIN or window functions when performance matters.
A correlated subquery executes:
Set Operations: UNION, INTERSECT, EXCEPT
Oracle introduced UNION in its first commercial release in 1979, implementing Codd's relational algebra operators directly. The counterintuitive part: UNION removes duplicates by default, and to do so it sorts the entire result set. UNION ALL skips that sort. On millions of rows, the difference is measurable. When duplicates are impossible or acceptable, UNION ALL is always the right choice.
**UNION vs UNION ALL:** if duplicates are either impossible or acceptable, always use UNION ALL. UNION sorts the entire result set to find and remove duplicates, adding O(n log n) overhead.
**ORDER BY in UNION:** cannot be applied to individual subqueries. A single ORDER BY clause at the end of the entire UNION expression sorts the final combined result.
UNION works like append - it simply concatenates two result sets
UNION is a set operation that removes duplicates by sorting. UNION ALL is the append equivalent. Use UNION ALL when duplicates are acceptable - it is significantly faster
Set theory defines union as A union B containing unique elements. SQL implements this mathematically. UNION ALL is a practical extension for cases where uniqueness is not required and performance matters
UNION and UNION ALL both combine [1,2,3] and [2,3,4]. What does each return?
SQL: SELECT, JOIN, GROUP BY
- WHERE executes before SELECT - column aliases unavailable, aggregates impossible
- LEFT JOIN preserves all rows from the primary table (NULL where no match exists)
- HAVING filters groups after GROUP BY - the only place for aggregate conditions
- UNION removes duplicates via sorting; UNION ALL is faster when uniqueness is not required
Related Topics
SQL is the language; the relational model is the math behind it. Index structure determines query execution speed.
- Relational Model — Mathematical foundation - tables, keys, normalization that SQL operates on
- B-Tree Indexes — Determine whether SELECT and JOIN scan O(n) rows or O(log n)
- ACID Transactions — Consistency guarantees when multiple queries run concurrently
- Big-O Complexity — Complexity analysis of JOIN, GROUP BY, and UNION operations
Вопросы для размышления
- GROUP BY collapses each bucket into one row. What exactly determines which row survives when MAX finds multiple identical maximum values - and does SQL guarantee a specific one?
- A correlated subquery runs N times (once per outer row). Under what conditions is EXISTS faster than IN, and when does IN outperform EXISTS?
- UNION removes duplicates by sorting the combined result. Does this implicit sort interfere with a final ORDER BY clause, and why is ordering individual UNION subqueries forbidden?