PostgreSQL
PostgreSQL Backup and Recovery
August 2017. GitLab.com. A DBA accidentally runs `rm -rf` on the production DB instead of staging. 300 GB of data gone. It turns out replication had been broken for 6 hours, S3 backups were not working, and pg_dump was not working. Only a single WAL archive existed, 6 hours stale. 6 hours of lost data, 18 hours of downtime, a public postmortem. That story changed the industry: GitLab now has 5 independent backup systems. PostgreSQL has everything required for reliable backup. You just need to configure it and verify it.
- **GitLab** after the 2017 incident: 5 backup tiers including pgBackRest on S3, daily restore verification on a test server, monitoring on every tier
- **Supabase** pgBackRest + S3 for all 1M+ projects: continuous WAL archiving, PITR accurate to the second, daily full backups with a weekly retention
- **Dropbox** PITR for restoring user files: after accidental deletion the DB is restored to a point before the error, then the needed rows are migrated into production
pg_dump: Logical Backup
**pg_dump creates a logical backup**: it exports data as SQL commands or in a special format. It works at the level of tables and schemas, independent of the PostgreSQL version and physical file layout. You can restore on a different version, a different platform, into a different schema.
**pg_dump does not block queries**, but it takes a snapshot of the data at the moment the dump starts (MVCC snapshot). For a consistent cluster-wide backup, use pg_dumpall or a physical backup. pg_dump does not save WAL data. Recovery is only possible to the moment of the dump, not to an arbitrary point in time.
pg_dump is running on a 100 GB production database. INSERT/UPDATE activity continues during the dump. Will these changes appear in the dump?
pg_basebackup: Physical Backup
**pg_basebackup creates a physical copy of the whole cluster**: all data files exactly as they are on disk. Unlike pg_dump, the backup is created quickly (independent of the row count), but it is tied to the major PostgreSQL version and the platform. It is the foundation for streaming replication and PITR.
**pg_basebackup vs pg_dump:** basebackup is 5 to 10 times faster on large DBs (I/O bound, not CPU bound). But a basebackup cannot be restored to a different PostgreSQL version and you cannot pick individual tables. Production usually needs both: basebackup for fast full-DB restore, dump for targeted table restore.
pg_basebackup is complete. PostgreSQL is upgraded from 15 to 16. Can the pg15 backup be restored directly on pg16?
WAL Archiving: Continuous Backup
**WAL archiving** saves each WAL segment to reliable storage as it fills up. Combined with a base backup, this enables Point-In-Time Recovery (PITR). PostgreSQL invokes `archive_command` when it closes each 16 MB WAL segment.
**If archive_command returns a non-zero code, PostgreSQL retries and does not delete the WAL segment.** This guarantees that no WAL segment is lost. But if archiving stays broken for a long time, WAL piles up on disk. Monitoring failed_count is mandatory.
archive_command = `cp %p /mnt/nfs/wal/%f`. NFS is down. What happens to PostgreSQL?
PITR: Point-In-Time Recovery
**Point-In-Time Recovery (PITR) restores the database to any moment** in the past. You need: a base backup (pg_basebackup) plus every WAL segment since the backup. PostgreSQL replays WAL up to a specified time, transaction, LSN, or transaction ID.
**Dropbox restored user data via PITR** after accidental deletions: thousands of files recovered within seconds, accurate to the second. PostgreSQL replays transactions from WAL in exactly the order they were committed.
PITR is configured with recovery_target_time = '2024-03-15 12:00:00'. The last base backup was at 00:00. What is required for a successful recovery?
pgBackRest and Barman: Production-Grade Tools
**pgBackRest and Barman** are specialized tools for managing PostgreSQL backups. They solve the problems of the vanilla approach: parallel backup/restore, incremental backups, built-in encryption, integration with S3/GCS/Azure, retention policy management, and automatic verification.
| Tool | Strengths | Users |
|---|---|---|
| pgBackRest | Parallelism, S3, encryption, delta restore | AWS RDS, Aiven, Digital Ocean managed PG |
| Barman | WAL streaming, SSH-based, Ansible integration | 2ndQuadrant clients, Zalando |
| pg_dump + cron | Simplicity, logical backup | Small projects, dev environments |
**pgBackRest delta restore** restores only changed pages, not the entire backup. On a 1 TB DB with minor corruption, delta restore takes minutes, full restore takes hours. Supabase uses pgBackRest for backups of every project on S3.
Streaming replication counts as a backup because the data exists on a standby
Streaming replication is not a backup. An erroneous DELETE or data corruption replicates to the standby immediately. A real backup is a separate copy from a point in time before the error
The job of a backup is to protect against logical errors (accidental DROP TABLE, incorrect UPDATE). Streaming replication replicates every operation, including the bad ones. Both are needed: replication for HA (protects against hardware failure), backup + PITR for protection against data errors
pgBackRest is configured with repo1-retention-full=2. Five full backups have been taken. How many backups remain after the 6th?
Key Ideas
- **pg_dump** logical backup: MVCC snapshot, cross-version, but slow on big DBs and no PITR
- **pg_basebackup** physical backup: fast, full cluster, required for PITR and replication, tied to major version
- **WAL archiving** (archive_command) + basebackup = continuous backup. Every 16MB WAL segment is copied to storage
- **PITR** restores the DB to any moment: basebackup + WAL chain up to recovery_target_time/lsn/xid
- **pgBackRest** the production-grade tool: parallel backup, S3, encryption, delta restore, auto-verification
Related Topics
Backup and replication complement each other in production:
- Streaming Replication — pg_basebackup is the foundation for creating a standby. WAL archiving complements streaming replication for PITR
- WAL: Write-Ahead Log — WAL archiving is built on top of the WAL mechanism. Understanding WAL segments and their lifecycle is critical for correct backup setup
- PostgreSQL Upgrade — A full backup before a major upgrade is mandatory. pg_basebackup is the rollback point if the upgrade goes wrong
Вопросы для размышления
- The team runs pg_dump every night at 03:00. At 14:30 a developer runs `DELETE FROM users WHERE 1=1`. How much data is lost? How would WAL archiving change the situation?
- Backing up a 500 GB DB takes 2 hours. Restore also takes 2 hours. RTO requirement is 30 minutes. How does pgBackRest delta restore solve this? What does it require?
- archive_command copies WAL to NFS. pg_stat_archiver shows failed_count growing for 4 hours. What happens to the disk under PGDATA? How does PostgreSQL handle the situation when archiving is unavailable?