PostgreSQL

PostgreSQL: History and Architecture

2010. Instagram launches. One Postgres server. 100 million users. Everyone said: NoSQL is needed, Postgres won't hold. It held. MVCC let readers see a snapshot of past state while writers were writing - no locks. In 2024, every other startup picks Postgres by default. Michael Stonebraker started INGRES in 1973, rewrote it as Postgres in 1986, received the Turing Award in 2014 - the only one ever awarded for contributions to databases.

**Instagram** - 100 million users on one Postgres server, MVCC and JSONB for flexible schema
**GitHub** (repository search), **Shopify** (Postgres sharding for millions of stores), **Apple** (iCloud partially) - Postgres wherever reliability is non-negotiable
**pgvector** - embeddings directly in Postgres without a separate vector store; **PostGIS** - geospatial data; **JSONB** - documents. One DBMS, endless extensibility

Why PostgreSQL?

**1986. Michael Stonebraker at Berkeley launches the POSTGRES project** - Post-Ingres, the successor to his earlier DBMS INGRES. Core idea: a database should be extensible like regular software. Add custom types, operators, index methods - without patching the kernel. That 1986 decision lives today in PostGIS, pgvector, TimescaleDB - thousands of extensions that make Postgres a universal tool.

PostgreSQL Timeline

1973 - Stonebraker builds INGRES at Berkeley. 1986 - rewrites it as POSTGRES with an extensible architecture. 1995 - SQL added, renamed PostgreSQL. 1996 - first open-source release. 2005 - Windows support. 2017 - logical replication (v10). 2014 - Stonebraker wins the Turing Award. 2024 - PostgreSQL 17, #1 in DB-Engines by growth rate. MySQL lost ground after Oracle's acquisition. Postgres didn't.

**Four pillars of PostgreSQL** that make it the default choice for serious projects:

Property	What it means	In practice
ACID	Atomicity, Consistency, Isolation, Durability	Data will not be lost even on a power failure
MVCC	Multi-Version Concurrency Control	Readers do not block writers and vice versa
Extensibility	Custom types, operators, and index methods	PostGIS, TimescaleDB, pgvector - all are extensions
SQL conformance	Most complete implementation of the SQL standard	Window functions, CTEs, lateral joins - all built in

**Who uses Postgres?** Apple (iCloud), Instagram (billions of photo rows), Spotify (recommendations), Reddit, Discord, Notion, Supabase. Even Microsoft offers Azure Database for PostgreSQL. In 2024, Postgres surpassed MySQL in developer popularity in the Stack Overflow survey for the first time ever.

**PostgreSQL vs MySQL:** MySQL is faster for simple SELECTs, but PostgreSQL wins on complex queries, large data volumes, and analytics. For a new project in 2024+, PostgreSQL is the safe default choice.

What key architectural decision allows PostgreSQL to support PostGIS, pgvector, and thousands of other extensions?

Process model: process-per-connection

**PostgreSQL uses a process-per-connection model.** Every new connection gets a dedicated OS process with its own address space. MySQL uses threads. Node.js uses an event loop. Postgres uses processes. This was Stonebraker's decision in 1986, and it hasn't changed since.

**A process, not a thread!** Each backend process is a full OS process with its own address space. This means greater isolation (a crash in one client does not take down the server), but also greater memory consumption (~5-10 MB per connection).

**Auxiliary processes** run continuously in the background, maintaining server health:

Process	Responsibility	What happens if stopped
autovacuum	Cleans up dead rows after UPDATE/DELETE	Tables bloat, performance drops
WAL writer	Flushes WAL buffers to disk	Data loss on crash
checkpointer	Periodically writes dirty pages to disk	Long recovery after failure
stats collector	Gathers statistics for the query planner	Planner chooses bad execution plans
background writer	Gradually writes dirty pages	I/O spikes at checkpoint

**Practical implication:** because of the process-per-connection model, PostgreSQL has a limit on simultaneous connections (typically 100-300). For thousands of clients, a connection pooler is needed - PgBouncer or an application-level pool.

Why does PostgreSQL use a separate process (rather than a thread) for each client?

Shared Memory: memory shared between processes

If every client is a separate process with its own memory, how do they all work with the same data? Through **shared memory** - a region of RAM accessible to all PostgreSQL processes simultaneously. One cache for everyone. One lock table. One WAL buffer. Everything passes through it.

**shared_buffers** is the most important component. The page cache, 8 KB per page. When a backend needs a row - it looks in shared_buffers first. Only if the page isn't cached does it go to disk, then load it into shared_buffers for everyone. The bigger shared_buffers, the fewer disk reads. This is the primary performance lever.

**WAL Buffers** hold the Write-Ahead Log. Before a change reaches disk, it is written to the WAL. This guarantees durability: even after a power failure, PostgreSQL can recover data from the WAL.

**CLOG (Commit Log)** stores the status of every transaction: in progress, committed, or aborted. When MVCC checks row visibility, it consults the CLOG to determine whether the transaction that created a given row has been committed.

**shared_buffers defaults to 128 MB.** That is fine for a developer's laptop, but not for production. Recommendation: 25% of total server RAM. On a 16 GB server, set shared_buffers = 4GB.

A backend process needs a row from the users table. In what order does it look for the data?

Postmaster: the master conductor

**The postmaster is the first and main PostgreSQL process.** Start the server - start the postmaster. Three roles: initialize shared memory, launch auxiliary processes (autovacuum, WAL writer, and others), listen on port 5432, and fork() a new backend process on every incoming connection.

**pg_ctl reload** re-reads the configuration without stopping the server. Most parameters in postgresql.conf can be changed without a restart. However, shared_buffers, max_connections, and a few others require a full restart.

**The postmaster is a single point of failure.** If the postmaster crashes, all clients disconnect. For this reason, the postmaster code is kept as simple and stable as possible - all complex logic lives in backend and auxiliary processes.

**The postmaster.pid file** in the data directory contains the PID, port, and data path. If PostgreSQL terminated abnormally and this file remains, delete it manually before restarting.

shared_buffers was changed in postgresql.conf. What must be done to apply the change?

Client connection: from TCP to query

When an application connects to PostgreSQL, a multi-step protocol unfolds. The client library (libpq, psycopg, node-postgres) establishes a TCP connection with the postmaster. Authentication. Fork of a new backend process. And only then - the first SQL query. This takes 50-100 ms. That's exactly why connection pooling is mandatory in production.

**Every connection = a separate OS process.** With 500 simultaneous connections, there are 500 processes, each consuming ~5-10 MB of RAM. For applications with many clients (microservices, serverless), this can exhaust memory and CPU.

**Connection pooling solves the problem.** PgBouncer or pgcat maintain 20-50 real connections to PostgreSQL and multiplex thousands of client requests through them. Instead of 500 backend processes at 5-10 MB each - 20 processes and a pooler. 25x memory savings.

**In Node.js/NestJS**, a connection pool is already built into the driver (pg, TypeORM). The `max` parameter in pool settings is the number of real connections to PostgreSQL. Do not set it above 20-30 per application instance.

Every SQL query opens a new connection to PostgreSQL

A connection is established once and reused for many queries. In production, connection pooling is mandatory - PgBouncer or a driver-level pool

Establishing a connection is expensive: TCP handshake, authentication, fork() of a new process. Per query this costs 50-100 ms. A connection pool keeps connections open and reuses them, reducing latency to <1 ms

A system has 10 microservices, each with a pool of max=50 connections. How many backend processes will PostgreSQL create at full load?

Key Ideas

**PostgreSQL - an extensible, ACID-compliant DBMS** with 35+ years of history. Stonebraker 1986 → Instagram 100M users → Turing Award 2014
**Process-per-connection model**: each client gets a dedicated OS process. Reliable - a crash in one doesn't take down the server. But requires PgBouncer at scale
**Shared memory (shared_buffers)** - the central cache for all backend processes. Default 128 MB - in production, raise it to 25% of RAM
**Postmaster** - the conductor: initializes shared memory, listens on the port, forks a backend on each new connection
MVCC is the key mechanism: readers and writers don't block each other. That's exactly how Instagram held 100M users on a single server

Вопросы для размышления

Instagram in 2010 held 100M users on one Postgres server. Which specific mechanisms from this lesson made that possible? What would have happened without MVCC?
An application creates 200 connections to PostgreSQL. How much RAM is consumed by backend processes alone? When does this become a problem?
Stonebraker chose process-per-connection over threads in 1986. Was that the right call for 2024? What did it cost, what did it gain?

Связанные уроки

pg-02-install-config — Installation and configuration builds directly on the architecture described here
alg-01-big-o — Query planning is Big-O analysis applied to database operations
ds-01-arrays — PostgreSQL heap storage is essentially an array of 8KB pages
st-01-feedback-loops — PostgreSQL's process model implements self-healing feedback loops via postmaster supervision
sd-01-intro — System design decisions about SQL vs NoSQL require understanding what PostgreSQL offers
db-01-intro

PostgreSQL

PostgreSQL: History and Architecture

**Instagram** - 100 million users on one Postgres server, MVCC and JSONB for flexible schema
**GitHub** (repository search), **Shopify** (Postgres sharding for millions of stores), **Apple** (iCloud partially) - Postgres wherever reliability is non-negotiable
**pgvector** - embeddings directly in Postgres without a separate vector store; **PostGIS** - geospatial data; **JSONB** - documents. One DBMS, endless extensibility

Why PostgreSQL?

PostgreSQL Timeline

**Four pillars of PostgreSQL** that make it the default choice for serious projects:

Property	What it means	In practice
ACID	Atomicity, Consistency, Isolation, Durability	Data will not be lost even on a power failure
MVCC	Multi-Version Concurrency Control	Readers do not block writers and vice versa
Extensibility	Custom types, operators, and index methods	PostGIS, TimescaleDB, pgvector - all are extensions
SQL conformance	Most complete implementation of the SQL standard	Window functions, CTEs, lateral joins - all built in

What key architectural decision allows PostgreSQL to support PostGIS, pgvector, and thousands of other extensions?

Process model: process-per-connection

**Auxiliary processes** run continuously in the background, maintaining server health:

Process	Responsibility	What happens if stopped
autovacuum	Cleans up dead rows after UPDATE/DELETE	Tables bloat, performance drops
WAL writer	Flushes WAL buffers to disk	Data loss on crash
checkpointer	Periodically writes dirty pages to disk	Long recovery after failure
stats collector	Gathers statistics for the query planner	Planner chooses bad execution plans
background writer	Gradually writes dirty pages	I/O spikes at checkpoint

Why does PostgreSQL use a separate process (rather than a thread) for each client?

Shared Memory: memory shared between processes

**WAL Buffers** hold the Write-Ahead Log. Before a change reaches disk, it is written to the WAL. This guarantees durability: even after a power failure, PostgreSQL can recover data from the WAL.

**shared_buffers defaults to 128 MB.** That is fine for a developer's laptop, but not for production. Recommendation: 25% of total server RAM. On a 16 GB server, set shared_buffers = 4GB.

A backend process needs a row from the users table. In what order does it look for the data?

Postmaster: the master conductor

**The postmaster.pid file** in the data directory contains the PID, port, and data path. If PostgreSQL terminated abnormally and this file remains, delete it manually before restarting.

shared_buffers was changed in postgresql.conf. What must be done to apply the change?

Client connection: from TCP to query

Every SQL query opens a new connection to PostgreSQL

A connection is established once and reused for many queries. In production, connection pooling is mandatory - PgBouncer or a driver-level pool

A system has 10 microservices, each with a pool of max=50 connections. How many backend processes will PostgreSQL create at full load?

Key Ideas

**PostgreSQL - an extensible, ACID-compliant DBMS** with 35+ years of history. Stonebraker 1986 → Instagram 100M users → Turing Award 2014
**Process-per-connection model**: each client gets a dedicated OS process. Reliable - a crash in one doesn't take down the server. But requires PgBouncer at scale
**Shared memory (shared_buffers)** - the central cache for all backend processes. Default 128 MB - in production, raise it to 25% of RAM
**Postmaster** - the conductor: initializes shared memory, listens on the port, forks a backend on each new connection
MVCC is the key mechanism: readers and writers don't block each other. That's exactly how Instagram held 100M users on a single server

Вопросы для размышления

Instagram in 2010 held 100M users on one Postgres server. Which specific mechanisms from this lesson made that possible? What would have happened without MVCC?
An application creates 200 connections to PostgreSQL. How much RAM is consumed by backend processes alone? When does this become a problem?
Stonebraker chose process-per-connection over threads in 1986. Was that the right call for 2024? What did it cost, what did it gain?

Связанные уроки

pg-02-install-config — Installation and configuration builds directly on the architecture described here
alg-01-big-o — Query planning is Big-O analysis applied to database operations
ds-01-arrays — PostgreSQL heap storage is essentially an array of 8KB pages
st-01-feedback-loops — PostgreSQL's process model implements self-healing feedback loops via postmaster supervision
sd-01-intro — System design decisions about SQL vs NoSQL require understanding what PostgreSQL offers
db-01-intro

PostgreSQL: History and Architecture

Why PostgreSQL?

PostgreSQL Timeline

Process model: process-per-connection

Shared Memory: memory shared between processes

Postmaster: the master conductor

Client connection: from TCP to query

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

PostgreSQL: History and Architecture

Why PostgreSQL?

PostgreSQL Timeline

Process model: process-per-connection

Shared Memory: memory shared between processes

Postmaster: the master conductor

Client connection: from TCP to query

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки