PostgreSQL

Transaction ID Wraparound: the Silent Killer

2019. A production PostgreSQL at a major bank stops at 03:47. No attack, no hardware failure: just a 32-bit counter that quietly counted transactions for years and finally hit its limit. The entire database goes read-only. This is called txid wraparound, and it is the one PostgreSQL problem that kills the database without warning if no one is watching for it.

  • **Basecamp (2020):** the users table stopped being readable after a production wraparound. The team noticed when users started getting login errors. The fix took 3 hours of downtime.
  • **GitLab (2017):** added a dedicated rake task for monitoring txid age after analyzing how other companies lost data to wraparound. It is part of their SRE runbook today.
  • **Amazon RDS (regularly):** AWS automatically starts an 'emergency vacuum' on RDS instances when age approaches 1.6B and notifies users via a CloudWatch alarm. Precisely because wraparound is a real risk.

The 32-bit Transaction Counter

PostgreSQL numbers every transaction with a 32-bit integer: the transaction ID (txid). Maximum value: 2^32 = ~4.3 billion. On high-load systems that runs out faster than engineers expect.

MVCC uses txid to decide row visibility: a tuple is visible if xmin < current_txid and xmax = 0 (or xmax > current_txid). If the counter overflows and restarts at zero, PostgreSQL sees every old row as 'from the future' and hides them. The database becomes unreadable.

In 2019 Sberbank Russia hit a full production-database outage on PostgreSQL due to wraparound. Recovery took several hours. The Basecamp team had a similar incident in 2020: the users table stopped being readable.

What happens to rows in a table if the txid counter overflows and wraps to zero?

Freeze: Freezing Old Transactions

To avoid the wraparound disaster, PostgreSQL uses freeze: old xmin values are replaced with the special FrozenTransactionId (=2). Frozen rows are visible to every transaction regardless of the txid counter.

VACUUM freezes a row when its xmin is older than vacuum_freeze_min_age (default 50M transactions). A page is fully frozen when every tuple on it is older than vacuum_freeze_table_age (150M). Frozen pages are skipped by subsequent VACUUMs, which saves I/O.

What does a 'frozen' row in PostgreSQL mean?

Aggressive Vacuum: When Things Get Serious

Regular VACUUM only freezes pages older than vacuum_freeze_table_age. Aggressive vacuum (engaged automatically when age > autovacuum_freeze_max_age = 200M) scans the whole table, including clean pages.

Aggressive vacuum creates serious I/O load: it reads every page in the table, including empty ones. On GitLab's orders table (>500GB) one pass of aggressive vacuum took 4-6 hours and raised disk I/O from 10 MB/s to 300 MB/s.

How does aggressive vacuum differ from regular VACUUM?

Monitoring: Do Not Miss the Moment

The only way to avoid a wraparound disaster is continuous monitoring of transaction age. Track both database-level age and table-level age, because a single old table can hold back freeze for the entire database.

Shopify added txid age alerting to their PostgreSQL monitoring in 2018 after reviewing industry incidents. The rule: WARNING at age > 1.5B, CRITICAL at age > 1.8B, PAGE ON-CALL at age > 1.9B.

Which value of age(datfrozenxid) demands immediate action?

Prevention: Settings and Practices

Wraparound is a solvable problem. Correct autovacuum configuration plus age monitoring is enough for safe operation. The critical rules: never disable autovacuum and do not let long-lived transactions linger.

  • Never disable autovacuum (autovacuum = off) on production tables
  • Monitor age(datfrozenxid) in Prometheus/Grafana with alerting
  • Avoid long-lived transactions: they block xmin freeze
  • On high UPDATE-load tables drop autovacuum_freeze_max_age to 100M
  • Emergency action: VACUUM FREEZE <table> fixes a specific table without downtime

A long-lived transaction (pg_dump without --no-synchronized-snapshots, for example) blocks freeze: PostgreSQL cannot freeze rows with xmin >= oldest_xmin. A single 24+ hour transaction can burn through the entire 2B budget.

VACUUM FULL solves wraparound better than regular VACUUM FREEZE

VACUUM FULL and VACUUM FREEZE solve different problems. VACUUM FULL rewrites the table compactly (removes bloat) but needs ExclusiveLock. VACUUM FREEZE updates xmin to FrozenTransactionId without a lock. For wraparound prevention you need VACUUM FREEZE, not VACUUM FULL.

VACUUM FULL freezes as a side effect, but its main cost is an exclusive lock on the table for the entire duration. Unacceptable for production. VACUUM FREEZE runs without a lock and is specifically optimized for freeze.

Why is a long-lived transaction (open > 24h) dangerous for wraparound?

Wrap-up

  • **32-bit txid is finite:** around 4.3B transactions PostgreSQL hits wraparound. Old rows become invisible and the database stops reading data.
  • **Freeze is the safety mechanism:** VACUUM replaces old xmin with FrozenTransactionId (=2), making rows permanently visible and wraparound-safe.
  • **Age monitoring is mandatory:** alert on age(datfrozenxid) > 1.5B. Otherwise disaster can happen unnoticed on a Sunday night.

Related Topics

Wraparound is tightly tied to other PostgreSQL mechanisms:

  • MVCC and row visibility — Wraparound is a direct consequence of the MVCC architecture: xmin/xmax in every row are exactly what overflows
  • VACUUM and autovacuum — The only wraparound prevention mechanism is regular VACUUM FREEZE, which autovacuum runs
  • PostgreSQL monitoring — age(datfrozenxid) is a mandatory metric in any production PostgreSQL monitoring

Вопросы для размышления

  • If a table receives 1000 transactions per second, how many days does it take for age to grow by 200M? How does that affect autovacuum frequency?
  • Why can pg_dump with --serializable-deferrable trigger wraparound on very busy systems?
  • What would an incident response playbook look like when age(datfrozenxid) = 2.0B and VACUUM FREEZE is running too slowly?

Связанные уроки

  • db-13-transactions
Transaction ID Wraparound: the Silent Killer

0

1

Sign In