PostgreSQL

WAL and Checkpoint Tuning

Discord: 100K messages/sec in PostgreSQL. Default synchronous_commit = on. TPS = 12K. After synchronous_commit = off for presence counters and read receipts, TPS = 190K. The same 5 lines of config. Understanding WAL parameters is the difference between renting 10 servers and using one.

  • **Discord** synchronous_commit=off for non-critical data (presence, read receipts): 15x TPS gain without changing app code
  • **GitLab** max_wal_size = 4 GB, checkpoint_timeout = 15 min: I/O spikes during checkpoints dropped by 70%, stable P99 latency
  • **AWS RDS** automatically tunes wal_buffers, max_wal_size, and checkpoint parameters when the instance class changes

wal_buffers: The WAL Write Buffer

**wal_buffers** is a shared memory area that buffers WAL records before they are flushed to disk. The WAL writer flushes the buffer on COMMIT or when the buffer is full. Default: -1 (auto-calculated = 1/32 of shared_buffers, max 64 MB). Usually sufficient for high OLTP loads.

**synchronous_commit** affects wal_buffers: with synchronous_commit=off PostgreSQL does not wait for the WAL fsync on COMMIT. The transaction is acknowledged to the client before WAL hits the disk. Faster, but a crash can lose the last ~200 ms of transactions. Suitable for non-critical data (counters, statistics).

synchronous_commit = off. PostgreSQL acknowledged COMMIT to the client. The server crashes 50 ms later. What happens to the transaction?

Checkpoint: Tuning Dirty Page Flush

**Checkpoint** is a periodic flush of all dirty pages from shared_buffers to disk. After a checkpoint, WAL up to that point can be deleted (no longer needed for recovery). The problem: a naive checkpoint flushes everything at once and creates an I/O spike. checkpoint_completion_target spreads the writes over 90% of checkpoint_timeout.

**A large checkpoint_timeout** means longer recovery after a crash. PostgreSQL has to replay WAL since the last checkpoint. Compromise: checkpoint_timeout = 15 min gives a good balance between I/O and RTO. For fast recovery: checkpoint_timeout = 5 min.

checkpoints_req >> checkpoints_timed (forced checkpoints happen more often than scheduled ones). What should you change?

max_wal_size: WAL Buffer

**max_wal_size** is the maximum WAL volume between checkpoints. If WAL grows to max_wal_size, a forced checkpoint runs off schedule. Larger max_wal_size = fewer forced checkpoints and fewer I/O spikes, but more WAL on disk and longer recovery on crash.

**Bulk load pattern:** when loading huge data volumes (billions of rows), temporarily raise max_wal_size to 32-64 GB and checkpoint_timeout to 1 hour. After loading, run CHECKPOINT and revert. This reduces I/O overhead during ETL by 3 to 5x.

max_wal_size = 16 GB. checkpoint_timeout = 15 min. PostgreSQL crashes. How much WAL maximum needs to be replayed during recovery?

commit_delay: Group COMMIT

**commit_delay** is the delay before WAL fsync on COMMIT. If other transactions reach COMMIT during that delay, PostgreSQL does one fsync for the group. Group commit reduces I/O load under high OLTP. Default: 0 (no delay). Active only when active_backends > commit_siblings (default: 5).

**commit_delay helps on HDDs, less critical on NVMe.** On NVMe SSD fsync takes ~50-100 us and group commit gives no significant win. On HDD fsync = 10-30 ms, group commit cuts operations and yields 2 to 5x TPS gain on OLTP.

commit_delay = 100, commit_siblings = 5. The system has only 3 active backends. What happens on COMMIT?

synchronous_commit: Latency vs Durability Tradeoff

**synchronous_commit** controls durability guarantees on COMMIT. The parameter can be changed per session or per transaction. Different parts of an application can have different requirements. Critical financial transactions: synchronous_commit=on. View counters: synchronous_commit=off.

**synchronous_commit=off does not break DB consistency.** Transactions remain atomic: they either fully apply or not at all. The only risk is losing the last few transactions on server crash. There is no in-server race.

synchronous_commit = off can lead to a partially applied transaction (some rows present, some missing)

synchronous_commit = off does not break atomicity or consistency. A transaction is either fully applied or not at all. The only risk is losing the most recent transactions on a power failure

WAL guarantees atomicity: the transaction is written to WAL fully before COMMIT or rolled back. synchronous_commit only controls whether to wait for WAL fsync. On a crash without fsync, the transaction simply will not find itself in WAL and rolls back, but partial application does not happen

App: 70% transactions are clicks/analytics, 30% are financial operations. How do you tune synchronous_commit?

Key Ideas

  • **wal_buffers = 64 MB**: the default (auto-calc) is usually enough. Monitoring: buffers_backend in pg_stat_bgwriter
  • **checkpoint_completion_target = 0.9** + raising checkpoint_timeout to 15-30 min = fewer I/O spikes. checkpoints_req > checkpoints_timed = raise max_wal_size
  • **max_wal_size = 4-16 GB** for production. Larger means fewer forced checkpoints but longer recovery
  • **commit_delay** helps on HDDs under high OLTP. On NVMe the benefit is negligible
  • **synchronous_commit** can be changed per transaction: off for analytics, on for financial operations

Related Topics

WAL parameters tie into several PostgreSQL mechanisms:

  • WAL: Write-Ahead Log — Base WAL concepts: segments, LSN, checkpoint. Required to understand these parameters
  • Streaming Replication — synchronous_commit = on/remote_write/remote_apply only works with a synchronous standby
  • Memory Tuning — shared_buffers and wal_buffers both live in shared memory. The size of one affects what is available for the other

Вопросы для размышления

  • An application runs 50K INSERT/sec. synchronous_commit = on. Disk: 7200 RPM HDD (150 IOPS fsync). Calculate maximum TPS under this constraint. What does synchronous_commit = off give you?
  • max_wal_size = 2 GB. At 14:00 a bulk load of 10M rows starts and lasts 5 minutes. pg_stat_bgwriter shows checkpoints_req spiking. How do you tune to avoid excess checkpoints during bulk load?
  • checkpoint_timeout = 5 min, max_wal_size = 1 GB. The server crashes. What is the maximum RTO (recovery time)? How does raising checkpoint_timeout to 30 min change RTO?

Связанные уроки

  • db-03-acid
WAL and Checkpoint Tuning

0

1

Sign In