PostgreSQL

High Availability: Patroni, etcd, failover

Black Friday 2019. Shopify processes $2.9 billion in 24 hours. The primary PostgreSQL server crashes at 14:03. By 14:04:30, Patroni has completed the failover. 90 seconds, zero lost transactions. Without HA, that becomes hours of manual recovery and millions in lost revenue. The difference between catastrophe and incident: Patroni + etcd + HAProxy configured ahead of time.

**Zalando** authors of Patroni, run it for 500+ PostgreSQL clusters: automatic failover < 30 seconds, scheduled daily switchovers for config updates without downtime
**GitLab.com** Patroni + Consul: 99.99% uptime, primary + 2 standbys, every failover from 2020 to 2024 happened without data loss, thin PgBouncer layer in front of every cluster
**Timescale Cloud** Patroni powering of every database: automatic failover, transparent PgBouncer, HAProxy read/write split as a managed service

The High Availability Concept

**High Availability (HA) for PostgreSQL means automatic service recovery after a primary failure with no manual intervention.** Three components: failure detection, leader election, and traffic routing. PostgreSQL by itself does not provide automatic failover. External tools are required.

**RTO and RPO** are the key HA metrics. RTO (Recovery Time Objective) is the maximum recovery time. RPO (Recovery Point Objective) is the maximum acceptable data loss. Patroni with async replication: RTO ~30 sec, RPO > 0. With sync replication: RPO = 0, RTO ~30 sec, but higher latency.

A financial application requires RPO = 0 (zero data loss) and RTO < 1 minute. What do you need to enable?

Patroni + etcd: Leader Election

**Patroni** is a Python agent for managing a PostgreSQL cluster. It runs on each node and talks to a DCS (Distributed Configuration Store: etcd, Consul, or ZooKeeper). The DCS stores cluster state: who the leader is, the cluster configuration, and the TTL of the leader lock. If the leader fails to refresh the TTL lock, it is considered down and election of a new leader begins.

Patroni TTL = 30 seconds. The primary hangs (does not respond, but the process is alive). After how many seconds will failover begin?

Automatic Failover

**Failover through Patroni:** TTL expires in etcd, the standby with the lowest lag grabs the lock, becomes the new primary, Patroni reconfigures the remaining standbys, and pg_rewind resynchronizes the old primary when it returns. Total time: 30 to 60 seconds.

**pg_rewind** (use_pg_rewind: true) synchronizes only changed pages. It is much faster than a full pg_basebackup. It also prevents split-brain: a node without a DCS lock stops PostgreSQL (fencing).

The old primary returns after failover. Patroni detects the new leader in etcd. What happens?

PgBouncer: Connection Pooling in HA

**PgBouncer in the HA stack** limits the number of connections to PostgreSQL and buffers queries during failover. During failover, PgBouncer pauses client requests for a few seconds while it reconnects to the new primary. Clients see latency, not an error.

Pool mode	PG connection lifetime	Use case
session	For the entire client session	SET, PREPARE, advisory locks
transaction	Only during a transaction	Most applications (recommended)
statement	For a single SQL command	Only without multi-statement transactions

An application runs `SET search_path = myschema` at the start of a session. PgBouncer is in transaction pool mode. What happens?

HAProxy: Routing to Primary and Standby

**HAProxy routes requests to the correct PostgreSQL node.** Patroni exposes a REST API (port 8008): `/primary` returns 200 only for the current primary, `/replica` for standbys. HAProxy polls these endpoints and updates routing automatically on failover.

**Two ports for two roles** is the standard pattern: :5000 for writes (primary), :5001 for reads (any standby). The application keeps two connection strings. On failover both update automatically via HAProxy health checks.

HA for PostgreSQL only requires streaming replication

Streaming replication is only the data transport. A full HA stack consists of Patroni (cluster management + failover), etcd (distributed consensus without split-brain), PgBouncer (connection pooling), and HAProxy (traffic routing)

Streaming replication does not reroute traffic automatically, does not prevent split-brain, and does not manage client connection lifecycle. With manual failover and no surrounding components, you face 10 to 30 minutes of DBA work, errors for every client, and a risk of data loss from an incorrect promote

HAProxy polls the Patroni REST API every 3 seconds (inter=3s, fall=3). After the primary fails, what is the maximum delay before HAProxy stops sending traffic to it?

Key Ideas

**HA stack**: Patroni (cluster manager) + etcd (consensus) + PgBouncer (pooling) + HAProxy (routing). Streaming replication is just transport
**Failover in ~30 to 60 sec**: TTL expires in etcd, election, promote, pg_rewind for the old primary, routing update through HAProxy
**Fencing through DCS**: a node without the etcd lock stops PostgreSQL, preventing split-brain
**PgBouncer transaction mode**: optimal for most applications, but breaks session-level SET and prepared statements
**Patroni REST API** `/primary` and `/replica`: the health check mechanism for HAProxy and any load balancer

Вопросы для размышления

Patroni TTL = 30 seconds, maximum_lag_on_failover = 1 MB. Standby-1 is 2 MB behind, Standby-2 is 0.5 MB behind. The primary fails. Who becomes the new primary and why?
An application uses PREPARE statements through PgBouncer in transaction mode. After a few hours, errors appear: 'prepared statement does not exist'. What is happening and how do you fix it?
A third datacenter is added for etcd. DC1 is cut off from the network (network partition), but the PostgreSQL primary there keeps running. How does Patroni react? Is there a risk of split-brain?

Связанные уроки

dist-11-replication

PostgreSQL

High Availability: Patroni, etcd, failover

**Zalando** authors of Patroni, run it for 500+ PostgreSQL clusters: automatic failover < 30 seconds, scheduled daily switchovers for config updates without downtime
**GitLab.com** Patroni + Consul: 99.99% uptime, primary + 2 standbys, every failover from 2020 to 2024 happened without data loss, thin PgBouncer layer in front of every cluster
**Timescale Cloud** Patroni powering of every database: automatic failover, transparent PgBouncer, HAProxy read/write split as a managed service

The High Availability Concept

A financial application requires RPO = 0 (zero data loss) and RTO < 1 minute. What do you need to enable?

Patroni + etcd: Leader Election

Patroni TTL = 30 seconds. The primary hangs (does not respond, but the process is alive). After how many seconds will failover begin?

Automatic Failover

**pg_rewind** (use_pg_rewind: true) synchronizes only changed pages. It is much faster than a full pg_basebackup. It also prevents split-brain: a node without a DCS lock stops PostgreSQL (fencing).

The old primary returns after failover. Patroni detects the new leader in etcd. What happens?

PgBouncer: Connection Pooling in HA

Pool mode	PG connection lifetime	Use case
session	For the entire client session	SET, PREPARE, advisory locks
transaction	Only during a transaction	Most applications (recommended)
statement	For a single SQL command	Only without multi-statement transactions

An application runs `SET search_path = myschema` at the start of a session. PgBouncer is in transaction pool mode. What happens?

HAProxy: Routing to Primary and Standby

HA for PostgreSQL only requires streaming replication

HAProxy polls the Patroni REST API every 3 seconds (inter=3s, fall=3). After the primary fails, what is the maximum delay before HAProxy stops sending traffic to it?

Key Ideas

**HA stack**: Patroni (cluster manager) + etcd (consensus) + PgBouncer (pooling) + HAProxy (routing). Streaming replication is just transport
**Failover in ~30 to 60 sec**: TTL expires in etcd, election, promote, pg_rewind for the old primary, routing update through HAProxy
**Fencing through DCS**: a node without the etcd lock stops PostgreSQL, preventing split-brain
**PgBouncer transaction mode**: optimal for most applications, but breaks session-level SET and prepared statements
**Patroni REST API** `/primary` and `/replica`: the health check mechanism for HAProxy and any load balancer

Вопросы для размышления

Patroni TTL = 30 seconds, maximum_lag_on_failover = 1 MB. Standby-1 is 2 MB behind, Standby-2 is 0.5 MB behind. The primary fails. Who becomes the new primary and why?
An application uses PREPARE statements through PgBouncer in transaction mode. After a few hours, errors appear: 'prepared statement does not exist'. What is happening and how do you fix it?
A third datacenter is added for etcd. DC1 is cut off from the network (network partition), but the PostgreSQL primary there keeps running. How does Patroni react? Is there a risk of split-brain?

Связанные уроки

dist-11-replication

High Availability: Patroni, etcd, failover

The High Availability Concept

Patroni + etcd: Leader Election

Automatic Failover

PgBouncer: Connection Pooling in HA

HAProxy: Routing to Primary and Standby

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

High Availability: Patroni, etcd, failover

The High Availability Concept

Patroni + etcd: Leader Election

Automatic Failover

PgBouncer: Connection Pooling in HA

HAProxy: Routing to Primary and Standby

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки