PostgreSQL

PostgreSQL: History and Architecture

2010. Instagram launches. One Postgres server. 100 million users. Everyone said: NoSQL is needed, Postgres won't hold. It held. MVCC let readers see a snapshot of past state while writers were writing - no locks. In 2024, every other startup picks Postgres by default. Michael Stonebraker started INGRES in 1973, rewrote it as Postgres in 1986, received the Turing Award in 2014 - the only one ever awarded for contributions to databases.

  • **Instagram** - 100 million users on one Postgres server, MVCC and JSONB for flexible schema
  • **GitHub** (repository search), **Shopify** (Postgres sharding for millions of stores), **Apple** (iCloud partially) - Postgres wherever reliability is non-negotiable
  • **pgvector** - embeddings directly in Postgres without a separate vector store; **PostGIS** - geospatial data; **JSONB** - documents. One DBMS, endless extensibility

Why PostgreSQL?

**1986. Michael Stonebraker at Berkeley launches the POSTGRES project** - Post-Ingres, the successor to his earlier DBMS INGRES. Core idea: a database should be extensible like regular software. Add custom types, operators, index methods - without patching the kernel. That 1986 decision lives today in PostGIS, pgvector, TimescaleDB - thousands of extensions that make Postgres a universal tool.

PostgreSQL Timeline

1973 - Stonebraker builds INGRES at Berkeley. 1986 - rewrites it as POSTGRES with an extensible architecture. 1995 - SQL added, renamed PostgreSQL. 1996 - first open-source release. 2005 - Windows support. 2017 - logical replication (v10). 2014 - Stonebraker wins the Turing Award. 2024 - PostgreSQL 17, #1 in DB-Engines by growth rate. MySQL lost ground after Oracle's acquisition. Postgres didn't.

**Four pillars of PostgreSQL** that make it the default choice for serious projects:

PropertyWhat it meansIn practice
ACIDAtomicity, Consistency, Isolation, DurabilityData will not be lost even on a power failure
MVCCMulti-Version Concurrency ControlReaders do not block writers and vice versa
ExtensibilityCustom types, operators, and index methodsPostGIS, TimescaleDB, pgvector - all are extensions
SQL conformanceMost complete implementation of the SQL standardWindow functions, CTEs, lateral joins - all built in

**Who uses Postgres?** Apple (iCloud), Instagram (billions of photo rows), Spotify (recommendations), Reddit, Discord, Notion, Supabase. Even Microsoft offers Azure Database for PostgreSQL. In 2024, Postgres surpassed MySQL in developer popularity in the Stack Overflow survey for the first time ever.

**PostgreSQL vs MySQL:** MySQL is faster for simple SELECTs, but PostgreSQL wins on complex queries, large data volumes, and analytics. For a new project in 2024+, PostgreSQL is the safe default choice.

What key architectural decision allows PostgreSQL to support PostGIS, pgvector, and thousands of other extensions?

Process model: process-per-connection

**PostgreSQL uses a process-per-connection model.** Every new connection gets a dedicated OS process with its own address space. MySQL uses threads. Node.js uses an event loop. Postgres uses processes. This was Stonebraker's decision in 1986, and it hasn't changed since.

**A process, not a thread!** Each backend process is a full OS process with its own address space. This means greater isolation (a crash in one client does not take down the server), but also greater memory consumption (~5-10 MB per connection).

**Auxiliary processes** run continuously in the background, maintaining server health:

ProcessResponsibilityWhat happens if stopped
autovacuumCleans up dead rows after UPDATE/DELETETables bloat, performance drops
WAL writerFlushes WAL buffers to diskData loss on crash
checkpointerPeriodically writes dirty pages to diskLong recovery after failure
stats collectorGathers statistics for the query plannerPlanner chooses bad execution plans
background writerGradually writes dirty pagesI/O spikes at checkpoint

**Practical implication:** because of the process-per-connection model, PostgreSQL has a limit on simultaneous connections (typically 100-300). For thousands of clients, a connection pooler is needed - PgBouncer or an application-level pool.

Why does PostgreSQL use a separate process (rather than a thread) for each client?

Shared Memory: memory shared between processes

If every client is a separate process with its own memory, how do they all work with the same data? Through **shared memory** - a region of RAM accessible to all PostgreSQL processes simultaneously. One cache for everyone. One lock table. One WAL buffer. Everything passes through it.

**shared_buffers** is the most important component. The page cache, 8 KB per page. When a backend needs a row - it looks in shared_buffers first. Only if the page isn't cached does it go to disk, then load it into shared_buffers for everyone. The bigger shared_buffers, the fewer disk reads. This is the primary performance lever.

**WAL Buffers** hold the Write-Ahead Log. Before a change reaches disk, it is written to the WAL. This guarantees durability: even after a power failure, PostgreSQL can recover data from the WAL.

**CLOG (Commit Log)** stores the status of every transaction: in progress, committed, or aborted. When MVCC checks row visibility, it consults the CLOG to determine whether the transaction that created a given row has been committed.

**shared_buffers defaults to 128 MB.** That is fine for a developer's laptop, but not for production. Recommendation: 25% of total server RAM. On a 16 GB server, set shared_buffers = 4GB.

A backend process needs a row from the users table. In what order does it look for the data?

Postmaster: the master conductor

**The postmaster is the first and main PostgreSQL process.** Start the server - start the postmaster. Three roles: initialize shared memory, launch auxiliary processes (autovacuum, WAL writer, and others), listen on port 5432, and fork() a new backend process on every incoming connection.

**pg_ctl reload** re-reads the configuration without stopping the server. Most parameters in postgresql.conf can be changed without a restart. However, shared_buffers, max_connections, and a few others require a full restart.

**The postmaster is a single point of failure.** If the postmaster crashes, all clients disconnect. For this reason, the postmaster code is kept as simple and stable as possible - all complex logic lives in backend and auxiliary processes.

**The postmaster.pid file** in the data directory contains the PID, port, and data path. If PostgreSQL terminated abnormally and this file remains, delete it manually before restarting.

shared_buffers was changed in postgresql.conf. What must be done to apply the change?

Client connection: from TCP to query

When an application connects to PostgreSQL, a multi-step protocol unfolds. The client library (libpq, psycopg, node-postgres) establishes a TCP connection with the postmaster. Authentication. Fork of a new backend process. And only then - the first SQL query. This takes 50-100 ms. That's exactly why connection pooling is mandatory in production.

**Every connection = a separate OS process.** With 500 simultaneous connections, there are 500 processes, each consuming ~5-10 MB of RAM. For applications with many clients (microservices, serverless), this can exhaust memory and CPU.

**Connection pooling solves the problem.** PgBouncer or pgcat maintain 20-50 real connections to PostgreSQL and multiplex thousands of client requests through them. Instead of 500 backend processes at 5-10 MB each - 20 processes and a pooler. 25x memory savings.

**In Node.js/NestJS**, a connection pool is already built into the driver (pg, TypeORM). The `max` parameter in pool settings is the number of real connections to PostgreSQL. Do not set it above 20-30 per application instance.

Every SQL query opens a new connection to PostgreSQL

A connection is established once and reused for many queries. In production, connection pooling is mandatory - PgBouncer or a driver-level pool

Establishing a connection is expensive: TCP handshake, authentication, fork() of a new process. Per query this costs 50-100 ms. A connection pool keeps connections open and reuses them, reducing latency to <1 ms

A system has 10 microservices, each with a pool of max=50 connections. How many backend processes will PostgreSQL create at full load?

Key Ideas

  • **PostgreSQL - an extensible, ACID-compliant DBMS** with 35+ years of history. Stonebraker 1986 → Instagram 100M users → Turing Award 2014
  • **Process-per-connection model**: each client gets a dedicated OS process. Reliable - a crash in one doesn't take down the server. But requires PgBouncer at scale
  • **Shared memory (shared_buffers)** - the central cache for all backend processes. Default 128 MB - in production, raise it to 25% of RAM
  • **Postmaster** - the conductor: initializes shared memory, listens on the port, forks a backend on each new connection
  • MVCC is the key mechanism: readers and writers don't block each other. That's exactly how Instagram held 100M users on a single server

Related Topics

PostgreSQL's architecture is the foundation for understanding all subsequent topics:

  • Installation and Configuration — Tuning shared_buffers, max_connections, and other parameters mentioned in this lesson
  • MVCC and Transactions — MVCC operates through shared memory and CLOG - the mechanisms covered in this lesson
  • WAL and Reliability — WAL writer and WAL buffers are part of the architecture seen in shared memory

Вопросы для размышления

  • Instagram in 2010 held 100M users on one Postgres server. Which specific mechanisms from this lesson made that possible? What would have happened without MVCC?
  • An application creates 200 connections to PostgreSQL. How much RAM is consumed by backend processes alone? When does this become a problem?
  • Stonebraker chose process-per-connection over threads in 1986. Was that the right call for 2024? What did it cost, what did it gain?

Связанные уроки

  • pg-02-install-config — Installation and configuration builds directly on the architecture described here
  • alg-01-big-o — Query planning is Big-O analysis applied to database operations
  • ds-01-arrays — PostgreSQL heap storage is essentially an array of 8KB pages
  • st-01-feedback-loops — PostgreSQL's process model implements self-healing feedback loops via postmaster supervision
  • sd-01-intro — System design decisions about SQL vs NoSQL require understanding what PostgreSQL offers
  • db-01-intro
PostgreSQL: History and Architecture

0

1

Sign In