Databases

Why Databases Exist

The 1980s. Banks store transactions in CSV files. Two processes read the same balance simultaneously: both see 1,000 dollars, both deduct 700. Result: 300 dollars instead of a correct rejection. This bug is called lost update - the first problem databases were built to solve. Edgar Codd formulated the relational model in 1970 at IBM. His managers called the idea impractical. Today Oracle earns 50 billion dollars a year.

**MySQL** powers WordPress and Shopify - millions of stores, tens of thousands of queries per second on standard hardware
**PostgreSQL** at Instagram and Heroku: strict types, JSONB, extensions - a relational model that doesn't get in the way
**Oracle** at banks and airlines - where ACID guarantees literally mean money doesn't disappear
**SQLite** on every iPhone and in Chrome: 4 billion devices, zero servers
**Polyglot persistence**: Shopify uses MySQL + Redis + ClickHouse simultaneously. Each database does what it's best at

Files vs Databases

The 1980s. A bank. Two tellers open the same balance file simultaneously. Both read: account balance is 1,000 dollars. Both deduct 700. Both save. Final balance: 300 dollars instead of a correct rejection of the second operation. 700 dollars vanish. This is **lost update** - and it happens every time two processes read and write the same file concurrently.

Lost update is just the tip. Files create an entire class of problems that 1970s engineers patched with workarounds - until Edgar Codd published his paper.

Criterion	Files (CSV/JSON)	Database (DBMS)
Concurrent access	Overwrites each other	Transactions and locks
Search by field	O(n) - full file scan	O(log n) - indexes (B-tree)
Data integrity	No guarantees	Constraints, types, FK
Failure during write	File can be corrupted	Transactions: all or nothing
Relationships between data	Manual ID lookups in code	JOIN, FK - built in
Scale	Fine up to ~100 MB	Terabytes without issue

**Files aren't the enemy.** Configs, logs, static assets - fine to store in files. But the moment data needs to be **searched, concurrently updated, and protected from loss** - a DBMS is required. The boundary is sharp: a file has no concept of a transaction.

A database is not a "file with an interface". It's a **specialized system** built around specific problems: atomicity (all or nothing), concurrent access without data loss, O(log n) indexing instead of O(n) scans, durability guarantees on crash. That's what earned Codd the Turing Award in 1981.

A team has a CSV file with 10 million orders. Two managers are simultaneously updating the same row. What problem arises?

Types of Databases

No "best" database exists - only the right tool for the problem. Relational databases have dominated for 50 years, but the last 15 brought dozens of alternatives. Each is optimized for the physics of a specific data access pattern.

**Relational (SQL)** - tables with strict schema, links through keys, standardized SQL. PostgreSQL: the most advanced open-source DBMS (Instagram, Heroku). MySQL: most common in web apps (WordPress, Shopify). SQLite: embedded, no server - every iPhone carries it.

**Document (NoSQL)** - data as JSON, no rigid schema: each document with its own structure. MongoDB leads. Works well when schema evolves fast or data is hierarchical. Struggles with complex joins and transactions.

**Key-Value** - key to value, data in RAM. Redis is the gold standard. Cache, sessions, counters, pub/sub. A single get takes microseconds. Not for complex queries - only get/set.

**Graph** - nodes and edges. Neo4j leads. "Friends of friends who bought X" - Neo4j handles in milliseconds, PostgreSQL with a triple JOIN takes seconds. LinkedIn's entire professional connections graph is built on this.

**Columnar** - data stored by column, not row. ClickHouse, Cassandra. A year-long analytics query on one field reads a single column from a billion rows instead of all the data. Kafka at LinkedIn processes 7 trillion messages per day - columnar storage underneath.

DB Type	Example	When to Use	When NOT to Use
Relational	PostgreSQL, MySQL	Finance, e-commerce, any relational data	Huge volumes of unstructured data
Document	MongoDB, CouchDB	Catalogs, CMS, prototypes with evolving schema	Complex entity relationships, transactions
Key-Value	Redis, Memcached	Cache, sessions, queues, counters	Complex queries, data relationships
Graph	Neo4j, ArangoDB	Social networks, recommendations, routing, fraud detection	Simple CRUD without relationships
Columnar	ClickHouse, Cassandra	Analytics, logs, time-series, aggregations	Frequent updates to individual rows

**Polyglot persistence** - one product, multiple databases. Shopify: MySQL for orders, Redis for cache, ClickHouse for analytics. Airbnb: PostgreSQL for bookings, Elasticsearch for search, Druid for metrics. Not complexity for its own sake - each database playing to its strengths.

A recommendation system is being built: 'users who are friends with their friends and bought similar products'. Which type of database is optimal?

CRUD Operations

Any application does exactly four things with data: **creates, reads, updates, deletes**. Four. Not five, not twenty. These operations have an acronym: **CRUD** - Create, Read, Update, Delete. Every REST API, every admin panel, every ORM in any language - it's CRUD on top of a database. All the architecture is just wrapping four commands.

**UPDATE and DELETE without WHERE** are the two most dangerous commands in SQL. They will affect ALL rows in the table. Always start with a SELECT using the same conditions to check which rows will be affected.

CRUD	SQL	HTTP (REST)	Description
Create	INSERT	POST	Create a new record
Read	SELECT	GET	Read data
Update	UPDATE	PUT / PATCH	Modify an existing record
Delete	DELETE	DELETE	Remove a record

**Soft delete** - instead of DELETE, set a flag `deleted_at = NOW()`. Data is recoverable, the audit trail is preserved. GitHub, Slack, Notion all do this. A real DELETE in production is rare. Data is worth more than disk space.

The goal is to change the email for the user with id = 42. What is the safest approach?

Client-Server Model

A database is a separate **server** (process) listening on a TCP port. The application never touches data files directly - it **sends SQL over the network**, and the server executes it and returns results. That's why PostgreSQL can run on another machine, in Docker, in the cloud - the application doesn't care.

A **driver** is a translator library. Code writes `db.query('SELECT ...')`, the driver packs it into a binary PostgreSQL Wire Protocol packet, sends it over TCP, parses the response, and returns rows. pg for Node.js, psycopg2 for Python, JDBC for Java - all doing the same thing.

**Connection pooling** - a critical optimization. TCP handshake + TLS + auth = 5-50 ms per new connection. A pool holds N ready connections and hands them out instantly. Without pooling at 500 RPS, half the response time goes to connecting. PgBouncer, HikariCP, Prisma all solve this.

**A common mistake** - a new connection per HTTP request. At 1000 RPS that's 1000 simultaneous TCP connections, each taking 20-50 ms. PostgreSQL default: max_connections = 100. 1000 > 100 = the server rejects connections. Without a pool, the application dies under load.

SQLite is not a real database - just a wrapper around a file

SQLite is a fully-featured relational DBMS with SQL, transactions, indexes, and ACID. It runs on billions of devices: every iPhone, Android, Chrome, and Firefox

SQLite differs only in architecture: embedded - runs inside the application process without a server. Telegram stores all messages in SQLite. It's not a simplified version - it's a different deployment model. By number of instances, SQLite is the most widely used DBMS in the world.

An application handles 500 HTTP requests/sec. Each request makes 1 query to PostgreSQL. A DB connection takes 20 ms to establish. Without a connection pool, how much time is spent just on connecting?

Key Takeaways

**Files break under concurrent access**: lost update - two processes read the same value and both overwrite it. Databases solve this through transactions and locks
**No universal database**: relational (linked data, ACID), document (flexible schema), key-value (cache, speed), graph (traversing relationships), columnar (analytics)
**CRUD** - four operations behind all backend work: INSERT, SELECT, UPDATE, DELETE in SQL; POST, GET, PUT, DELETE in REST
**Connection pool**: without it every request wastes 20-50 ms on TCP handshake. With a pool - 0.1 ms. At 500 RPS the difference is felt immediately
**Codd 1970, IBM said impractical** - Oracle 50 billion, every bank in the world. Impractical ideas sometimes win

Вопросы для размышления

In what scenario is a CSV file still preferable to a database? Name a specific case.
Designing a messenger - which database types for which data? Messages, contacts, online status, search history - each has its own type.
Why does a connection pool have an upper bound (maxconn)? Why can't 10,000 connections be opened simultaneously?
Codd proposed the relational model in 1970. IBM didn't believe in its practicality. What ideas in software today are considered impractical but might eventually win?

Связанные уроки

db-02-relational-model — After understanding why databases exist - the relational model as the dominant approach
os-01-intro — A DBMS runs on top of the operating system
ds-06-hash-intro — Hash tables are the foundation of fast indexes in relational databases
net-01-intro — The client-server model of a DBMS is analogous to network communication
alg-10-binary-search