Databases
PostgreSQL: Architecture and Key Features
In 2012, Instagram was serving 30 million users on MySQL. Over one weekend, the team migrated to PostgreSQL - zero downtime, zero data loss. The secret: PostgreSQL lets one transaction read the old version of data while another transaction writes a new version. Two concurrent states of the database, no blocking.
- **Instagram 2012**: MySQL to PostgreSQL migration over a weekend - readers never blocked by writers thanks to MVCC
- **Apple, Reddit, Twitch, Discord**: PostgreSQL in production at loads from 100K to 10M queries per second
- **pgvector**: built-in ANN search in PostgreSQL - embeddings and RAG without a separate Qdrant or Pinecone
- **Logical replication over WAL**: foundation for zero-downtime migrations and event sourcing in high-SLA systems
MVCC and WAL: Reading and Writing at the Same Time
2012. The Instagram engineering team migrates from MySQL to PostgreSQL over a single weekend - zero downtime, zero data loss. The secret is not clever tooling or heroic engineering hours. The secret is that PostgreSQL can serve reads and writes simultaneously without blocking readers. MySQL InnoDB also supports MVCC, but PostgreSQL builds it differently - and that architectural difference is what made the Instagram migration possible.
MVCC - Multi-Version Concurrency Control - works on one core principle: instead of locking a row during a write, PostgreSQL creates a new version of the row. Every transaction sees a **snapshot** of the database as it existed at transaction start time. Writers do not block readers. Readers do not block writers. SELECT never waits for UPDATE.
WAL - Write-Ahead Log - is the second pillar of PostgreSQL's architecture. Before modifying data pages on disk, PostgreSQL writes a description of the change to the log. If the server crashes, WAL replays changes on restart. This is the physical implementation of Durability from ACID: data is considered durable the moment the WAL record hits disk, not when heap files are updated.
WAL is not only for crash recovery. Logical Replication in PostgreSQL streams WAL to replicas. pgvector, TimescaleDB, Citus - all tap into WAL. It is the foundation of the entire PostgreSQL extension ecosystem.
Transaction A starts at txid=100. Transaction B (txid=90) is still running and has updated a row but not committed. What does transaction A see when it SELECTs that row?
VACUUM and TOAST: The Cost of Versioning
MVCC's versioning has a price. Every UPDATE creates a new row version - the old one does not disappear immediately. Without a cleanup mechanism, tables grow without bound. The phenomenon has an apt name: **table bloat**. PostgreSQL handles this through VACUUM.
VACUUM scans the table and marks dead row versions as space available for reuse. VACUUM does not return space to the OS - that requires VACUUM FULL (which takes an exclusive lock). Autovacuum is a background process that fires automatically once the dead-tuple count crosses a threshold.
**Transaction ID Wraparound** is one of PostgreSQL's rare but catastrophic failure modes. The txid counter is 32-bit: at 2 billion transactions, old data appears to be 'in the future' and the database goes read-only. Autovacuum is designed to prevent this, but high-throughput systems with lagging VACUUM are a real production risk. Monitor with: `SELECT max_age FROM pg_database WHERE datname = current_database()`.
TOAST - The Oversized-Attribute Storage Technique - handles large values. PostgreSQL cannot store a value longer than roughly 8 KB in a single heap page. Long strings, JSON documents, arrays - all are transparently moved to a companion TOAST table (`pg_toast.pg_toast_XXXXXX`, created automatically for any table with potentially large columns).
| TOAST Strategy | Description | When Applied |
|---|---|---|
| PLAIN | No TOAST, always store inline | Numbers, booleans, fixed-size types |
| EXTENDED | Compression + external storage (default) | text, jsonb, bytea by default |
| EXTERNAL | External storage, no compression | When substring access without decompression is needed |
| MAIN | Compression first, external only as last resort | Prefer inline, but allow TOAST if necessary |
Why is `SELECT *` on a jsonb-heavy table slower than `SELECT id, name`? TOAST. When fetching all columns, PostgreSQL must dereference the TOAST table for every row with a large value. Selecting only needed columns is not just about network bandwidth - it eliminates TOAST I/O and decompression entirely.
After one million UPDATEs on a table, with no VACUUM running, the table occupies 10 GB - but the actual live data is only 500 MB. What is this phenomenon called?
GiST, GIN, BRIN: Beyond B-tree
B-tree is the workhorse of relational databases. But PostgreSQL ships with a suite of specialized indexes, each of which beats B-tree in its own domain by orders of magnitude. GIN on full-text search is 10-100x faster than B-tree. BRIN on time-series data occupies 1000x less space at comparable query speed. GiST covers use cases for which B-tree has no concept at all.
| Index Type | Data Structure | Best For | Example |
|---|---|---|---|
| B-tree | Balanced tree | =, <, >, BETWEEN, LIKE 'prefix%' | WHERE created_at > '2024-01-01' |
| Hash | Hash table | Equality only (faster than B-tree for =) | WHERE user_id = 42 |
| GiST | Generalized search tree | Geometry, full-text, ranges, overlaps | WHERE point <-> '(0,0)' < 10 |
| GIN | Inverted index | Arrays, jsonb, tsvector, containment | WHERE tags @> '{python}' |
| BRIN | Block range index | Huge tables with physical ordering (time-series) | WHERE event_time BETWEEN ... |
| SP-GiST | Space-partitioned tree | Quad-trees, k-d trees, IP ranges | Geocoordinates, network subnets |
Partial indexes cover a subset of rows. If 95% of queries target active users but 80% of stored users are inactive, an index `WHERE is_active = true` is 5x smaller and faster than a full index. Expression indexes let the system index the result of a function: `CREATE INDEX ON users (lower(email))` makes `WHERE lower(email) = 'test@example.com'` index-scannable.
pgvector adds the `vector` type and HNSW/IVFFlat indexes for nearest-neighbor search. This is exactly the stack behind RAG pipelines: embeddings from OpenAI or sentence-transformers stored in pgvector, and `SELECT ... ORDER BY embedding <-> query_vec LIMIT 10` runs ANN search directly inside PostgreSQL - no separate Qdrant or Pinecone required.
Adding an index on every WHERE column improves performance
Indexes speed up reads but slow down writes. Index decisions are driven by real query patterns from pg_stat_statements and EXPLAIN ANALYZE
Every INSERT/UPDATE/DELETE must update all indexes on the table. A write-heavy table with 10 indexes incurs 10x write overhead. Partial indexes, expression indexes, and BRIN often deliver 10x gains over naive 'index everything'.
A table `events` has 500 million rows. Data is always inserted in chronological order, and all queries use date range filters. Which index gives the best size-to-speed ratio?
Key Ideas
- **MVCC** gives each transaction a snapshot via xmin/xmax fields in every row - SELECT never waits for UPDATE
- **WAL** records changes before applying them: the foundation of Durability, replication, and logical decoding
- **VACUUM** reclaims dead row versions left by MVCC - without it table bloat kills performance
- **TOAST** transparently moves large values (jsonb, text > 8KB) to a companion table - design schemas with this in mind
- **GIN/GiST/BRIN** are specialized indexes: BRIN on 500M rows uses 1 MB vs 10 GB for B-tree
Related Topics
PostgreSQL is a simultaneous implementation of several foundational database concepts:
- ACID Transactions — WAL implements Durability, MVCC implements Isolation
- MVCC in Depth — xmin/xmax mechanics, transaction IDs, vacuum and bloat
- B-tree Indexes — Foundation for understanding GiST/GIN/BRIN as specializations
- Transactions and Isolation — READ COMMITTED vs REPEATABLE READ isolation levels in PostgreSQL
Вопросы для размышления
- VACUUM FULL returns space to the OS but takes an exclusive lock. How would one reclaim space on a production table without downtime? (hint: pg_repack, table partitioning)
- If MVCC creates multiple row versions, how does PostgreSQL prevent two concurrent transactions from writing different values to the same row at the same moment?
- A GIN index on a jsonb column speeds up reads but adds overhead to every INSERT. At what read/write ratio does GIN stop paying for itself?
Связанные уроки
- db-14-mvcc — Internals in depth: xmin/xmax, transaction snapshots, visibility
- db-13-transactions — MVCC only makes sense through the transaction model
- db-15-locks — Locks vs MVCC: two philosophies of concurrency
- db-09-indexes-btree — B-tree is the foundation; GiST/GIN/BRIN are extensions
- db-03-acid — WAL is the physical implementation of Durability and Atomicity
- os-07-memory