PostgreSQL
Safe PostgreSQL Migrations
GitHub, 2014. The `pull_requests` table held 20M rows. They needed a new index for a new filter. CREATE INDEX without CONCURRENTLY: 8 minutes of locking on production. 8 minutes during which thousands of developers could not open a pull request. After that incident GitHub published the 'How we made GitHub fast' guide, and CONCURRENTLY became the industry standard.
- **GitHub** published a public guide on zero-downtime migrations: expand-contract for renames, CONCURRENTLY for indexes, batching for backfills. It became the de facto standard
- **Shopify** at RailsConf 2019: Large-Scale Migrations at Shopify. The LHM (Large Hadron Migrator) tool performs ALTER TABLE without downtime on tables with 100B+ rows
- **Stripe** has a strict rule: no migration without a rollback plan. Expand-contract for every schema change. Staging with production volume is mandatory before rollout
CREATE INDEX CONCURRENTLY: Without Locking
**CREATE INDEX** without CONCURRENTLY takes a SHARE lock, blocking INSERT/UPDATE/DELETE for the duration of the build (minutes on a large table). **CREATE INDEX CONCURRENTLY** builds the index in multiple passes without blocking DML. It takes longer, but production stays available.
**CREATE INDEX CONCURRENTLY cannot run inside a transaction.** If interrupted, it leaves an INVALID index that slows down writes but is not used by the planner. Monitoring indisvalid after migrations is mandatory.
CREATE INDEX CONCURRENTLY on a 500M-row table failed after 30 minutes due to an error. What is the state of the index?
Safe ALTER TABLE: Minimal Locking
**ALTER TABLE takes an ACCESS EXCLUSIVE lock**, blocking every query for the duration of the operation. Some operations are instant (a metadata change); others require rewriting the entire table (hours on large tables). Knowing the difference is critical for zero-downtime deploys.
**PostgreSQL 12+**: `ADD COLUMN column_name TYPE DEFAULT value` is instant even for non-null defaults. The value is stored in metadata rather than rewritten into every row. Before PostgreSQL 11, this required a full table rewrite.
You need to add CHECK (amount > 0) to a 200M-row table without downtime. What is the problem with a plain ADD CONSTRAINT?
Expand-Contract: Renaming Columns Without Downtime
**Expand-Contract** is a pattern for changing a schema without downtime. Use it when you need to rename a column, change a type, or migrate data. The idea: first expand the schema, then migrate the data, then remove the old (contract).
**Batch backfills** are critical: UPDATE on a 200M-row table is a single huge transaction, holds locks for minutes, and creates massive bloat from dead tuples. Batches of 10K rows with pauses give autovacuum time to clean up, keep locks short, and stay invisible to production.
Backfilling 100M rows in a single UPDATE without batching. What happens to table performance afterward?
pg_repack: Defragmentation Without Locking
**pg_repack** is a PostgreSQL extension that defragments (repacks) tables and indexes without long-running locks. Like VACUUM FULL, but it holds ACCESS EXCLUSIVE only at the very end (seconds). It creates a new table, copies the data, and swaps atomically.
**When to use pg_repack:** bloat > 30-40% of the table after large DELETE/UPDATE operations, heavily inflated indexes (REINDEX CONCURRENTLY is an alternative for indexes alone). Cloudflare uses pg_repack for weekly maintenance on tables with high UPDATE rates.
pg_repack is running on a 500 GB table. Are production queries blocked?
Migration Tools: Flyway, Liquibase, sqitch
**Migration tools** manage the history of schema changes. Each migration is a versioned SQL file, applied once in the right order. A migration history table records what has been applied. Rolling back migrations is a separate strategy.
| Tool | Language | Highlights |
|---|---|---|
| Flyway | Java (CLI available) | Simple, version-based, broad ecosystem |
| Liquibase | Java | XML/YAML/SQL formats, rollback out of the box, changelog |
| sqitch | Perl/Any | Git-like deploy/revert/verify, no version numbers |
| golang-migrate | Go | Lightweight, CLI + library, embed support |
ALTER TABLE is fast: it is just a metadata change
Some ALTER TABLE operations are instant (adding a nullable column, DROP COLUMN), others rewrite the entire table (type changes, adding NOT NULL with validation). On a 100M-row table, a rewrite means hours of locking
PostgreSQL must check or rewrite every row for some changes. Changing INT to BIGINT requires checking each value. Adding NOT NULL requires verifying that every value is present. PostgreSQL 12+ optimized ADD COLUMN DEFAULT, making it instant. Knowing which operations are instant is a critical skill for zero-downtime deploys.
Flyway migration V5 contains CREATE INDEX CONCURRENTLY. Flyway wraps each migration in a transaction. What happens?
Key Ideas
- **CREATE INDEX CONCURRENTLY** is always required in production. An interrupted CONCURRENTLY leaves an INVALID index and needs a DROP + rebuild.
- **ALTER TABLE**: check before applying. ADD COLUMN nullable is instant; a type change is a full rewrite.
- **NOT VALID + VALIDATE CONSTRAINT** is the safe way to add CHECK without a full scan under ACCESS EXCLUSIVE.
- **Expand-Contract** is the pattern for zero-downtime renames and type changes: add new -> sync trigger -> backfill -> switch app -> drop old.
- **Batch backfills** of 10K rows with pauses give autovacuum room to keep up, hold locks briefly, and stay invisible to production.
Related Topics
Migrations touch several core PostgreSQL areas:
- Locks — Understanding lock levels (ACCESS EXCLUSIVE vs SHARE UPDATE EXCLUSIVE) is the basis for choosing a safe migration strategy
- MVCC — Batching UPDATEs matters because of MVCC dead tuples: every UPDATE creates a dead row version
- PostgreSQL upgrades — Zero-downtime major upgrades through logical replication apply the same expand-contract principles at the database level
Вопросы для размышления
- The `transactions` table has an `amount NUMERIC(10,2)` column. It needs to become `NUMERIC(15,4)`. How do you change it without downtime? What happens if you simply run `ALTER TABLE transactions ALTER COLUMN amount TYPE NUMERIC(15,4)`?
- Backfill of 50M rows: a developer wrote a loop that UPDATEs 10K rows at a time. After 1000 iterations (10M rows) pg_stat_user_tables shows n_dead_tup = 9M. Autovacuum is running. Should the backfill be stopped?
- Migration V10 creates a CONCURRENTLY index and adds a CHECK constraint without NOT VALID. Flyway runs everything in one transaction. What breaks, and how should the migration be restructured?