PostgreSQL
Data Types: From integer to jsonb
A developer creates a table: `price FLOAT, created_at TIMESTAMP, id SERIAL`. Three columns, three mistakes. float silently sheds cents across millions of transactions - one fintech startup lost USD 12,000 in a month from exactly this bug. timestamp without timezone shows different times in Moscow and New York. SERIAL is legacy with surprise edge cases. Picking the right data types is not pedantry - it is insurance against bugs that surface six months into production.
- **A fintech startup** lost USD 12,000 in a single month due to float rounding in commission calculations - switching to numeric solved the problem immediately
- **A SaaS with users in 20 countries** - timestamptz automatically shows the correct local time to every user without a single line of conversion code
- **An API platform** uses UUID for public identifiers - it is impossible to guess another user's ID or infer the total record count
Предварительные знания
Michael Stonebraker and the Extensibility of Postgres
In 1986, Michael Stonebraker of UC Berkeley designed Postgres (Post-Ingres) around type extensibility - a developer adds a new data type the same way a language designer adds a new class. That decision opened the door to jsonb, hstore, PostGIS, and vector types. Stonebraker won the Turing Award in 2014. Thirty-eight years later, the same extensible-types architecture powers pgvector and modern vector database integrations.
Numeric Types: From smallint to numeric
A fintech startup lost USD 12,000 in one month. Root cause: `price FLOAT`. Rounding in commission calculations stacked up transaction by transaction. PostgreSQL ships three integer sizes, floating-point numbers, and a numeric type for exact arithmetic. The wrong pick burns disk space or, far worse, burns money.
| Type | Size | Range | When to use |
|---|---|---|---|
| smallint (int2) | 2 bytes | -32,768 .. 32,767 | Status codes, small counters, ages |
| integer (int4) | 4 bytes | -2.1B .. 2.1B | IDs, quantities, the majority of use cases |
| bigint (int8) | 8 bytes | ±9.2 × 10¹⁸ | Financial amounts in cents, large counters, IDs in high-load systems |
| numeric(p,s) | variable | up to 131,072 digits | Money, exact calculations (no precision loss) |
| real (float4) | 4 bytes | 6 significant digits | Scientific computations, coordinates (where approximation is acceptable) |
| double precision | 8 bytes | 15 significant digits | Statistics, ML - where speed matters more than exactness |
**Never store money in real or double precision.** Demo: `SELECT 0.1::real + 0.2::real` → `0.30000001`. In financial systems processing millions of transactions, those rounding errors compound. Use `numeric(p,s)`, or store amounts in cents as `bigint`.
**Selection rule:** if a value fits in integer (up to 2 billion), grab integer. For IDs in high-load systems, jump to bigint. For money, numeric(p,s). For scientific calculations where some imprecision is acceptable, double precision.
A table for bank transactions is being designed. Which type should be used for the amount column?
String Types: char, varchar, text
**PostgreSQL ships three string types, and in practice they are nearly interchangeable.** This catches MySQL refugees off guard - in MySQL, varchar(255) and text are different beasts with different performance profiles.
| Type | Constraint | Performance | When to use |
|---|---|---|---|
| char(n) | Fixed length, padded with spaces | No advantage over text | Almost never (legacy) |
| varchar(n) | Maximum n characters | Identical to text | When a length limit is a database-level business rule |
| text | Unlimited (up to 1 GB) | Identical to varchar | The general default choice |
**In PostgreSQL, text, varchar, and varchar(n) share one internal storage mechanism: varlena.** All three store the same way and perform the same way. varchar(n) just adds a length check on write. Not like MySQL, where varchar(255) and text live in different storage layouts.
**LIKE '%something%'** (leading %) cannot use a regular B-tree index - PostgreSQL falls back to a full table scan. For substring search, reach for the pg_trgm extension with a GIN index, or use full-text search.
**PostgreSQL core team recommendation:** default to `text`. Reach for `varchar(n)` only when a length limit is a genuine business rule (email, phone). Do not reflex-type varchar(255) out of MySQL habit - it gains nothing in PostgreSQL.
What is the performance difference between text and varchar(255) in PostgreSQL?
Date and Time: timestamptz Forever
A SaaS serving users in 20 countries. One events table with `created_at TIMESTAMP`. Server in UTC, users in Tokyo, Moscow, New York. Timezone bugs piled up across three releases. One rule fixes it: **always use timestamptz** (timestamp with time zone).
| Type | Size | What it stores | Recommendation |
|---|---|---|---|
| timestamp | 8 bytes | Date + time WITHOUT timezone | Avoid (source of bugs) |
| timestamptz | 8 bytes | Date + time IN UTC (converts on input/output) | Use ALWAYS |
| date | 4 bytes | Date only (2024-03-15) | Birthdays, dates without time |
| time | 8 bytes | Time only (14:30:00) | Schedules (rarely needed) |
| interval | 16 bytes | Duration (3 days 04:05:06) | Difference between dates, durations |
**timestamp (without tz) breeds elusive bugs.** The server runs in UTC; a Moscow user writes '2024-03-15 15:00'. PostgreSQL stores 15:00 with no timezone marker. Another service assumes UTC and renders 15:00 as 18:00 Moscow time. The data is corrupted, and nobody can tell when the event actually happened.
**timestamptz does NOT store the timezone.** Internally it is just microseconds since 2000-01-01 UTC (8 bytes, same as timestamp). Writes convert to UTC; reads convert back to the session timezone. The name "with time zone" really means "timezone-aware during conversion".
**For NestJS/TypeORM:** entities use `@Column({ type: 'timestamptz' })`. Migrations use `timestamp with time zone`. Configure the PostgreSQL server with `timezone = 'UTC'`. Run display conversion on the frontend.
A user in Moscow (UTC+3) recorded an event at 15:00. What will a user in London (UTC+0) see if the column type is timestamptz?
Identifiers: SERIAL, IDENTITY, UUID
**Every row needs a unique identifier - a Primary Key.** PostgreSQL ships three options: legacy SERIAL, modern GENERATED ALWAYS AS IDENTITY, and UUID. The choice ripples into security, replication performance, and everyday usability.
**UUID is the alternative to numeric IDs** for distributed systems, public APIs, and any case where leaking the creation order of records is unacceptable.
| Approach | Size | Pros | Cons |
|---|---|---|---|
| integer + IDENTITY | 4 bytes | Compact, fast JOINs, B-tree friendly | Predictable (IDs can be guessed), issues with merge replication |
| bigint + IDENTITY | 8 bytes | Like integer, but up to 9.2×10¹⁸ | Same predictability drawbacks |
| UUID v4 (random) | 16 bytes | Globally unique, safe for public APIs | Poor B-tree locality → fragmentation, 2x the size of integer |
| UUID v7 (time-sorted) | 16 bytes | Unique + naturally time-sortable | Relatively new standard (2024+) |
**UUID v7** (RFC 9562) embeds a timestamp in the first 48 bits, giving natural time-based ordering. That solves UUID v4's biggest problem: random values produce terrible B-tree locality, leading to page splits and fragmentation. PostgreSQL 17+ supports `uuidv7()` via an extension.
**Never expose sequential numeric IDs in public APIs.** A URL like `/users/42` advertises that the system has only 42 users. A competitor walks `/users/1` through `/users/10000` and scrapes every profile. UUID kills that vector: `/users/f47ac10b-58cc-4372-a567-0e02b2c3d479` is unguessable.
**Practical advice:** integer/bigint IDENTITY for internal IDs (foreign keys, JOINs); UUID for external ones (APIs, public links). The two can coexist: `id integer IDENTITY` + `public_id uuid DEFAULT gen_random_uuid()`.
What makes GENERATED ALWAYS AS IDENTITY better than SERIAL?
JSONB and Arrays: Semi-Structured Data
**JSONB is one of PostgreSQL's killer features.** MongoDB-style flexibility wrapped in ACID guarantees and JOINs. Stash semi-structured data (settings, metadata, nested objects) right inside a relational table, index it, and query by content.
**PostgreSQL arrays** are a native type for lists of homogeneous values. Distinct from JSON arrays - separate type, separate operators.
**json vs jsonb:** PostgreSQL ships two JSON types. `json` stores the raw text (duplicate keys and whitespace preserved). `jsonb` is a binary format: parsed on write, keys deduplicated, indexable. Always pick `jsonb`.
**JSONB vs separate columns?** If a field gets filtered or joined often, promote it to its own column. JSONB shines for metadata, settings, and logs - data with a variable structure that rarely shows up in JOINs.
How to find all events where payload contains the key "source" with value "google"?
Custom Types: enum, composite, domain
**PostgreSQL lets developers define custom data types.** That extensibility traces back to Stonebraker's 1986 design. The three most useful flavors: enum (enumerations), composite (structured types), and domain (types with constraints).
**Adding an ENUM value is easy; removing one is not (pre-PostgreSQL 16).** `ALTER TYPE order_status ADD VALUE 'refunded' AFTER 'cancelled'` works fine. Removing or renaming values demands recreating the entire type. Think twice before using ENUM for value sets that churn often.
**Domain vs CHECK constraint:** both validate data, but a domain is reusable. Define the `email` domain once, use it across 10 tables. Change the rule once with `ALTER DOMAIN email ADD CHECK(...)` and it propagates everywhere automatically.
| Type | TypeScript analogy | When to use |
|---|---|---|
| ENUM | enum Status { ... } | Statuses, roles, categories - fixed value sets |
| Composite | interface Address { ... } | Repeating groups of fields (address, coordinates) |
| Domain | type Email = string & Brand | Format validation: email, phone numbers, positive amounts |
**Alternative to ENUM for fast-moving lists:** a lookup table with a foreign key. `CREATE TABLE statuses (id serial, name text unique)` + `REFERENCES statuses(id)`. Values come and go freely, no type recreation required.
JSONB can replace the relational model - store everything in a single JSONB column and forget about JOINs and normalization
JSONB is designed for semi-structured data with a variable schema (metadata, settings, logs). Core application data should live in normalized columns with proper types and foreign keys.
JSONB does not support foreign keys (no referential integrity), occupies more space than typed columns, and is harder to index for composite conditions. The query `WHERE data->>'user_id' = '42'` is slower and less safe than `WHERE user_id = 42` with a properly typed integer column and a foreign key. PostgreSQL's real strength lies in combining the relational model with JSONB for flexible supplementary fields.
Email addresses need to be stored in 15 tables with the same format validation. What is the best approach?
Key Takeaways
- **Numbers:** integer for most cases, bigint for large values, numeric(p,s) for money. Never float for financial data.
- **Strings:** text by default, varchar(n) only when a length limit is a genuine business requirement. There is no performance difference in PostgreSQL.
- **Time:** always timestamptz. Never plain timestamp. This will save from elusive timezone bugs.
- **Identifiers:** GENERATED ALWAYS AS IDENTITY instead of SERIAL. UUID for public-facing APIs.
- **JSONB:** for semi-structured data with a GIN index. But the core data model should remain relational.
- Three mistakes from the hook: `price numeric(12,2)`, `created_at timestamptz`, `id integer GENERATED ALWAYS AS IDENTITY`.
Related Topics
Data types affect indexing, performance, and schema design:
- DDL: Creating Tables — Applying data types in CREATE TABLE, ALTER TABLE, and constraints
- JSONB in Depth — Advanced JSONB operations: jsonpath, updating nested fields, performance tuning
- B-tree Indexes — How the choice of data type affects index size and search speed
Вопросы для размышления
- Why did PostgreSQL not make numeric the default type for all numbers, given its precision advantage over float? Consider performance and storage size.
- In what situations would a JSONB column be chosen over individual typed columns? And when would the opposite apply - extracting data from JSONB into a normalized relational structure?
- A colleague proposes using UUID for all IDs for consistency. What arguments would be made for and against this approach?
Связанные уроки
- pg-04-ddl — DDL is the practical application of types in CREATE TABLE and ALTER TABLE
- pg-12-btree — Type choice directly affects B-tree index size and search speed
- pg-32-jsonb — Advanced JSONB operations: jsonpath, updating nested fields
- db-03-acid — Correct types (numeric for money) are the foundation of ACID Consistency
- pg-24-mvcc — MVCC versions rows - type affects tuple size and MVCC overhead
- db-02-relational-model