PostgreSQL
WAL: Write-Ahead Log and Durability
Every time PostgreSQL says 'COMMIT', the data is already on disk. Not in memory, not 'about to be written'. On disk. That is the durability guarantee, the D in ACID. The mechanism that delivers it is WAL. Without understanding WAL you cannot tune performance correctly, set up replication, or explain why a database recovered from a crash so quickly.
- **Stripe (2022):** moving from HDD to NVMe SSD, the main win was WAL fsync latency dropping from 3-5 ms to 0.15 ms. Throughput rose 40% with no application code change. WAL simply got faster.
- **Heroku PostgreSQL:** promises crash recovery < 10 minutes on instances up to 500GB thanks to optimized WAL replay on NVMe. An SLA guarantee made possible by the WAL architecture.
- **Citus (distributed PostgreSQL):** uses WAL replication to synchronize when adding shards. wal_level = logical is required for logical replication between nodes. WAL configuration directly enables horizontal scaling.
What WAL Is and Why It Exists
The Write-Ahead Log (WAL) is the core durability mechanism in PostgreSQL. One rule: before changing any data on disk, write to WAL first. If the server crashes at any moment, WAL has enough information for full recovery.
WAL lets PostgreSQL defer writing dirty pages from shared_buffers to disk. Having WAL on disk is enough. This dramatically speeds up COMMIT: sequential writes to a WAL file beat random writes into table heap files.
Stripe migrated to NVMe SSD specifically for WAL: on SATA SSD a WAL fsync took 2-5 ms; on NVMe 0.1-0.3 ms. At 5000 TPS that is the difference between a 99th percentile of 8 ms and 50 ms for transactional queries.
Why does COMMIT return success to the client only after WAL is on disk, but not wait for table pages to be written?
WAL Segments and LSN
WAL is stored as a sequence of 16MB segments (default) in $PGDATA/pg_wal/. Each WAL record has an address: the LSN (Log Sequence Number), a 64-bit monotonically increasing value.
PostgreSQL pre-allocates several WAL segments ahead (wal_keep_size or min_wal_size). The number of WAL files is controlled by max_wal_size (default 1GB). When that is exceeded, a checkpoint starts or old files are recycled.
What is an LSN (Log Sequence Number)?
Checkpoints: Synchronizing WAL and Heap
A checkpoint is the moment when PostgreSQL guarantees that every dirty page from shared_buffers has been written to disk. After a checkpoint, crash recovery only needs to replay WAL written after that point, not the entire WAL history.
checkpoint_completion_target = 0.9 means: spread dirty-page writes over 90% of the time between checkpoints. This smooths the I/O load. If checkpoints_req >> checkpoints_timed in pg_stat_bgwriter, max_wal_size is too small and checkpoints are happening too often because of WAL overflow.
What does a high checkpoints_req value in pg_stat_bgwriter mean?
Full Page Writes: Protection from Torn Pages
The OS and the disk use 512B or 4KB blocks; PostgreSQL uses 8KB pages. On a power failure an 8KB page write can be interrupted halfway: half of the page is old, half is new. That is a torn page.
CloudNativePG (the Kubernetes operator for PostgreSQL) always keeps full_page_writes = on even on enterprise SSDs with battery backup. The reason: protection is needed not only from power failures but also from disk driver and firmware bugs.
Why does PostgreSQL write a full page image into WAL only on the first modification after a checkpoint, instead of on every modification?
Crash Recovery: from LSN to Live Database
On startup after a crash PostgreSQL automatically launches recovery. It finds the last checkpoint, reads WAL from that point, and replays each record. Transactions without COMMIT are rolled back. The process is fully automatic.
Heroku PostgreSQL guarantees RTO (Recovery Time Objective) < 10 minutes for instances up to 500GB. Possible because crash recovery (WAL replay) runs at sequential disk read speeds, which on NVMe is 3-5 GB/s.
synchronous_commit = off is dangerous and must never be used
synchronous_commit = off trades the durability of the last ~200 ms of transactions (wal_writer_delay) for lower latency. That is acceptable for non-critical data: view counters, analytics events, logs. A transaction will not be partially lost; it is lost entirely or not at all.
With synchronous_commit = off, COMMIT returns success before WAL fsync. If the server crashes at that moment, the last few hundred milliseconds of transactions are lost. They are lost completely: WAL does not contain them at all. This is different from data corruption: the data is consistent, just some operations roll back during recovery.
How does PostgreSQL know which LSN to start WAL replay from during crash recovery?
Wrap-up
- **WAL = durability:** WAL writes precede any change to data on disk. COMMIT returns success only after WAL fsync. That is the D in ACID.
- **Checkpoint = recovery boundary:** crash recovery only replays WAL after the last checkpoint. Frequent checkpoints speed recovery but raise I/O.
- **full_page_writes protects against torn pages:** the first modification of a page after a checkpoint writes the full 8KB page into WAL so deltas can be applied even after a partial page write.
Related Topics
WAL is the foundation for many PostgreSQL mechanisms:
- Replication (streaming replication) — Standby servers receive the same WAL records and replay them. Replication runs on top of the WAL mechanism.
- Shared Buffers and Buffer Cache — WAL lets PostgreSQL defer dirty-page writes from the buffer cache. Understanding the interaction is critical for tuning.
- Background Workers and processes — The walwriter process is responsible for periodic flushes of the WAL buffer; checkpointer runs checkpoints and works with bgwriter.
Вопросы для размышления
- If max_wal_size goes from 1GB to 8GB, how does that affect checkpoint frequency, crash recovery time, and I/O load?
- Why does logical replication require wal_level = logical, while streaming replication works with wal_level = replica? What extra information goes into WAL under logical?
- How would you compute the optimal checkpoint_timeout for a system where RTO (Recovery Time Objective) = 5 minutes and WAL is generated at 100 MB/s?