PostgreSQL

Data Types: From integer to jsonb

A developer creates a table: `price FLOAT, created_at TIMESTAMP, id SERIAL`. Three columns, three mistakes. float silently sheds cents across millions of transactions - one fintech startup lost USD 12,000 in a month from exactly this bug. timestamp without timezone shows different times in Moscow and New York. SERIAL is legacy with surprise edge cases. Picking the right data types is not pedantry - it is insurance against bugs that surface six months into production.

**A fintech startup** lost USD 12,000 in a single month due to float rounding in commission calculations - switching to numeric solved the problem immediately
**A SaaS with users in 20 countries** - timestamptz automatically shows the correct local time to every user without a single line of conversion code
**An API platform** uses UUID for public identifiers - it is impossible to guess another user's ID or infer the total record count

Предварительные знания

Installation and Configuration

Michael Stonebraker and the Extensibility of Postgres

In 1986, Michael Stonebraker of UC Berkeley designed Postgres (Post-Ingres) around type extensibility - a developer adds a new data type the same way a language designer adds a new class. That decision opened the door to jsonb, hstore, PostGIS, and vector types. Stonebraker won the Turing Award in 2014. Thirty-eight years later, the same extensible-types architecture powers pgvector and modern vector database integrations.

Numeric Types: From smallint to numeric

A fintech startup lost USD 12,000 in one month. Root cause: `price FLOAT`. Rounding in commission calculations stacked up transaction by transaction. PostgreSQL ships three integer sizes, floating-point numbers, and a numeric type for exact arithmetic. The wrong pick burns disk space or, far worse, burns money.

Type	Size	Range	When to use
smallint (int2)	2 bytes	-32,768 .. 32,767	Status codes, small counters, ages
integer (int4)	4 bytes	-2.1B .. 2.1B	IDs, quantities, the majority of use cases
bigint (int8)	8 bytes	±9.2 × 10¹⁸	Financial amounts in cents, large counters, IDs in high-load systems
numeric(p,s)	variable	up to 131,072 digits	Money, exact calculations (no precision loss)
real (float4)	4 bytes	6 significant digits	Scientific computations, coordinates (where approximation is acceptable)
double precision	8 bytes	15 significant digits	Statistics, ML - where speed matters more than exactness

**Never store money in real or double precision.** Demo: `SELECT 0.1::real + 0.2::real` → `0.30000001`. In financial systems processing millions of transactions, those rounding errors compound. Use `numeric(p,s)`, or store amounts in cents as `bigint`.

**Selection rule:** if a value fits in integer (up to 2 billion), grab integer. For IDs in high-load systems, jump to bigint. For money, numeric(p,s). For scientific calculations where some imprecision is acceptable, double precision.

A table for bank transactions is being designed. Which type should be used for the amount column?

String Types: char, varchar, text

**PostgreSQL ships three string types, and in practice they are nearly interchangeable.** This catches MySQL refugees off guard - in MySQL, varchar(255) and text are different beasts with different performance profiles.

Type	Constraint	Performance	When to use
char(n)	Fixed length, padded with spaces	No advantage over text	Almost never (legacy)
varchar(n)	Maximum n characters	Identical to text	When a length limit is a database-level business rule
text	Unlimited (up to 1 GB)	Identical to varchar	The general default choice

**In PostgreSQL, text, varchar, and varchar(n) share one internal storage mechanism: varlena.** All three store the same way and perform the same way. varchar(n) just adds a length check on write. Not like MySQL, where varchar(255) and text live in different storage layouts.

**LIKE '%something%'** (leading %) cannot use a regular B-tree index - PostgreSQL falls back to a full table scan. For substring search, reach for the pg_trgm extension with a GIN index, or use full-text search.

**PostgreSQL core team recommendation:** default to `text`. Reach for `varchar(n)` only when a length limit is a genuine business rule (email, phone). Do not reflex-type varchar(255) out of MySQL habit - it gains nothing in PostgreSQL.

What is the performance difference between text and varchar(255) in PostgreSQL?

Date and Time: timestamptz Forever

A SaaS serving users in 20 countries. One events table with `created_at TIMESTAMP`. Server in UTC, users in Tokyo, Moscow, New York. Timezone bugs piled up across three releases. One rule fixes it: **always use timestamptz** (timestamp with time zone).

Type	Size	What it stores	Recommendation
timestamp	8 bytes	Date + time WITHOUT timezone	Avoid (source of bugs)
timestamptz	8 bytes	Date + time IN UTC (converts on input/output)	Use ALWAYS
date	4 bytes	Date only (2024-03-15)	Birthdays, dates without time
time	8 bytes	Time only (14:30:00)	Schedules (rarely needed)
interval	16 bytes	Duration (3 days 04:05:06)	Difference between dates, durations

**timestamp (without tz) breeds elusive bugs.** The server runs in UTC; a Moscow user writes '2024-03-15 15:00'. PostgreSQL stores 15:00 with no timezone marker. Another service assumes UTC and renders 15:00 as 18:00 Moscow time. The data is corrupted, and nobody can tell when the event actually happened.

**timestamptz does NOT store the timezone.** Internally it is just microseconds since 2000-01-01 UTC (8 bytes, same as timestamp). Writes convert to UTC; reads convert back to the session timezone. The name "with time zone" really means "timezone-aware during conversion".

**For NestJS/TypeORM:** entities use `@Column({ type: 'timestamptz' })`. Migrations use `timestamp with time zone`. Configure the PostgreSQL server with `timezone = 'UTC'`. Run display conversion on the frontend.

A user in Moscow (UTC+3) recorded an event at 15:00. What will a user in London (UTC+0) see if the column type is timestamptz?

Identifiers: SERIAL, IDENTITY, UUID

**Every row needs a unique identifier - a Primary Key.** PostgreSQL ships three options: legacy SERIAL, modern GENERATED ALWAYS AS IDENTITY, and UUID. The choice ripples into security, replication performance, and everyday usability.

**UUID is the alternative to numeric IDs** for distributed systems, public APIs, and any case where leaking the creation order of records is unacceptable.

Approach	Size	Pros	Cons
integer + IDENTITY	4 bytes	Compact, fast JOINs, B-tree friendly	Predictable (IDs can be guessed), issues with merge replication
bigint + IDENTITY	8 bytes	Like integer, but up to 9.2×10¹⁸	Same predictability drawbacks
UUID v4 (random)	16 bytes	Globally unique, safe for public APIs	Poor B-tree locality → fragmentation, 2x the size of integer
UUID v7 (time-sorted)	16 bytes	Unique + naturally time-sortable	Relatively new standard (2024+)

**UUID v7** (RFC 9562) embeds a timestamp in the first 48 bits, giving natural time-based ordering. That solves UUID v4's biggest problem: random values produce terrible B-tree locality, leading to page splits and fragmentation. PostgreSQL 17+ supports `uuidv7()` via an extension.

**Never expose sequential numeric IDs in public APIs.** A URL like `/users/42` advertises that the system has only 42 users. A competitor walks `/users/1` through `/users/10000` and scrapes every profile. UUID kills that vector: `/users/f47ac10b-58cc-4372-a567-0e02b2c3d479` is unguessable.

**Practical advice:** integer/bigint IDENTITY for internal IDs (foreign keys, JOINs); UUID for external ones (APIs, public links). The two can coexist: `id integer IDENTITY` + `public_id uuid DEFAULT gen_random_uuid()`.

What makes GENERATED ALWAYS AS IDENTITY better than SERIAL?

JSONB and Arrays: Semi-Structured Data

**JSONB is one of PostgreSQL's killer features.** MongoDB-style flexibility wrapped in ACID guarantees and JOINs. Stash semi-structured data (settings, metadata, nested objects) right inside a relational table, index it, and query by content.

**PostgreSQL arrays** are a native type for lists of homogeneous values. Distinct from JSON arrays - separate type, separate operators.

**json vs jsonb:** PostgreSQL ships two JSON types. `json` stores the raw text (duplicate keys and whitespace preserved). `jsonb` is a binary format: parsed on write, keys deduplicated, indexable. Always pick `jsonb`.

**JSONB vs separate columns?** If a field gets filtered or joined often, promote it to its own column. JSONB shines for metadata, settings, and logs - data with a variable structure that rarely shows up in JOINs.

How to find all events where payload contains the key "source" with value "google"?

Custom Types: enum, composite, domain

**PostgreSQL lets developers define custom data types.** That extensibility traces back to Stonebraker's 1986 design. The three most useful flavors: enum (enumerations), composite (structured types), and domain (types with constraints).

**Adding an ENUM value is easy; removing one is not (pre-PostgreSQL 16).** `ALTER TYPE order_status ADD VALUE 'refunded' AFTER 'cancelled'` works fine. Removing or renaming values demands recreating the entire type. Think twice before using ENUM for value sets that churn often.

**Domain vs CHECK constraint:** both validate data, but a domain is reusable. Define the `email` domain once, use it across 10 tables. Change the rule once with `ALTER DOMAIN email ADD CHECK(...)` and it propagates everywhere automatically.

Type	TypeScript analogy	When to use
ENUM	enum Status { ... }	Statuses, roles, categories - fixed value sets
Composite	interface Address { ... }	Repeating groups of fields (address, coordinates)
Domain	type Email = string & Brand	Format validation: email, phone numbers, positive amounts

**Alternative to ENUM for fast-moving lists:** a lookup table with a foreign key. `CREATE TABLE statuses (id serial, name text unique)` + `REFERENCES statuses(id)`. Values come and go freely, no type recreation required.

JSONB can replace the relational model - store everything in a single JSONB column and forget about JOINs and normalization

JSONB is designed for semi-structured data with a variable schema (metadata, settings, logs). Core application data should live in normalized columns with proper types and foreign keys.

JSONB does not support foreign keys (no referential integrity), occupies more space than typed columns, and is harder to index for composite conditions. The query `WHERE data->>'user_id' = '42'` is slower and less safe than `WHERE user_id = 42` with a properly typed integer column and a foreign key. PostgreSQL's real strength lies in combining the relational model with JSONB for flexible supplementary fields.

Email addresses need to be stored in 15 tables with the same format validation. What is the best approach?

Key Takeaways

**Numbers:** integer for most cases, bigint for large values, numeric(p,s) for money. Never float for financial data.
**Strings:** text by default, varchar(n) only when a length limit is a genuine business requirement. There is no performance difference in PostgreSQL.
**Time:** always timestamptz. Never plain timestamp. This will save from elusive timezone bugs.
**Identifiers:** GENERATED ALWAYS AS IDENTITY instead of SERIAL. UUID for public-facing APIs.
**JSONB:** for semi-structured data with a GIN index. But the core data model should remain relational.
Three mistakes from the hook: `price numeric(12,2)`, `created_at timestamptz`, `id integer GENERATED ALWAYS AS IDENTITY`.

Вопросы для размышления

Why did PostgreSQL not make numeric the default type for all numbers, given its precision advantage over float? Consider performance and storage size.
In what situations would a JSONB column be chosen over individual typed columns? And when would the opposite apply - extracting data from JSONB into a normalized relational structure?
A colleague proposes using UUID for all IDs for consistency. What arguments would be made for and against this approach?

Связанные уроки

pg-04-ddl — DDL is the practical application of types in CREATE TABLE and ALTER TABLE
pg-12-btree — Type choice directly affects B-tree index size and search speed
pg-32-jsonb — Advanced JSONB operations: jsonpath, updating nested fields
db-03-acid — Correct types (numeric for money) are the foundation of ACID Consistency
pg-24-mvcc — MVCC versions rows - type affects tuple size and MVCC overhead
db-02-relational-model