PostgreSQL

DML: INSERT, UPDATE, DELETE, UPSERT

Six seconds that destroyed 300 GB

January 31, 2017. 23:00 UTC. GitLab SRE engineer Gyorgy Dacho was trying to stop a spam attack on the production database. Wrong terminal window - staging vs production. One command: DELETE FROM. No WHERE clause. Enter. Six seconds. 300 gigabytes of data - repository histories, issues, merge requests for hundreds of thousands of users - gone. The last valid backup was six hours old. GitLab documented the entire incident publicly at gitlab.com/gitlab-com/gl-infra/runbooks. Approximately 5000 projects were permanently lost. DML is the most dangerous tool a developer holds. Also the most capable.

INSERT adds a row. UPDATE changes it. DELETE removes it. That is the syntax. Behind it live architectural patterns: upsert solves the race condition that breaks at high concurrency, RETURNING eliminates the extra SELECT after an insert, CTE turns a series of mutations into one atomic operation. And one forgotten WHERE in DELETE can cost six seconds and three hundred gigabytes.

  • **Upsert in counters**: UPDATE page_views SET count = count + 1 ON CONFLICT DO UPDATE - atomic increment without race conditions
  • **RETURNING in APIs**: INSERT ... RETURNING id, created_at - get the generated PK without a second round-trip
  • **Batch INSERT**: inserting 10,000 rows with one INSERT VALUES (...), (...) instead of 10,000 separate queries - 50-100x faster
  • **Soft delete via UPDATE**: UPDATE records SET deleted_at = NOW() WHERE id = $1 - data preserved, recoverable
  • **CTE with DML**: WITH inserted AS (INSERT ... RETURNING) UPDATE ... - multi-step mutations as a single atomic operation

INSERT and RETURNING

The classic scenario: a user is registered and the API needs to return their new ID immediately. The naive path: INSERT, then SELECT WHERE email = $1. Two queries. Two round-trips. A potential race if another process inserts the same user in between. PostgreSQL closed this in 2001 with one keyword: RETURNING.

RETURNING is not INSERT-only. UPDATE and DELETE return rows too - in their post-change state. This collapses the 'find-then-delete' pattern into one atomic operation: DELETE FROM jobs WHERE id = (SELECT id FROM jobs ORDER BY created_at LIMIT 1) RETURNING *. That is how PostgreSQL-backed job queues work - Que, Solid Queue, pgqueuer all rely on this.

**Batch INSERT performance:** inserting 10,000 rows in one INSERT VALUES (...), (...), ... is 50-100x faster than 10,000 individual INSERTs. Reason: single round-trip, single parse/plan, single transaction. For very large volumes, the unnest() pattern with a prepared statement is even more efficient.

INSERT INTO users (email) VALUES ('test@example.com') RETURNING id - what does this query return?

UPDATE and UPDATE FROM

UPDATE looks simple. The internals are not. PostgreSQL does not modify a row in place: MVCC creates a new row version; the old one is marked dead and later cleaned up by VACUUM. Each UPDATE is effectively an INSERT plus 'mark old version dead'. On write-heavy tables this produces table bloat and demands careful autovacuum tuning. That is the MVCC lesson. Here: syntax and patterns.

UPDATE FROM is a PostgreSQL extension to the SQL standard. It allows a JOIN inside UPDATE. Standard SQL requires a correlated subquery - PostgreSQL offers a FROM clause with an explicit join. The difference shows in the query plan: the planner builds a hash join instead of a nested loop for the subquery, which matters at scale.

**UPDATE without WHERE is a disaster.** UPDATE orders SET status = 'cancelled' without WHERE updates every row in the table. In production: always open a transaction (BEGIN), run a SELECT with the same WHERE first to verify the target set, then UPDATE, check rowCount, and only then COMMIT.

The task: increase price by 10% for all products in the 'electronics' category. Which approach is safest?

DELETE and safe deletion patterns

GitLab lost 300 GB in 6 seconds precisely because DELETE FROM without WHERE is near-instant - PostgreSQL does not iterate rows one by one, it marks all of them dead in a single pass. A transaction would have saved the situation - but the engineer never opened BEGIN. PostgreSQL has no recycle bin. Without PITR backup, the data is gone permanently.

**Soft delete vs hard delete:** DELETE physically removes the row (MVCC creates a dead tuple; VACUUM eventually reclaims it). Soft delete - UPDATE SET deleted_at = NOW() - keeps the data, allows recovery, but the table grows. The pragmatic middle ground: soft delete plus periodic archival of old 'deleted' rows into an archive table.

DELETE FROM sessions WHERE user_id = 42 RETURNING id - what happens if the user has no sessions?

Upsert: INSERT ON CONFLICT

Classic high-concurrency problem: insert a row if it does not exist, update it if it does. The naive approach - SELECT, check, then INSERT or UPDATE - falls apart under concurrency: between the SELECT and the INSERT two processes can both see 'row absent' and both attempt INSERT. One will fail with a unique violation. PostgreSQL solved this in 2015 with PostgreSQL 9.5: INSERT ... ON CONFLICT.

EXCLUDED is the key detail. It is a pseudo-table containing the values that attempted insertion during the conflict. This enables sophisticated logic: update only if the new value is larger, increment a counter, ignore if the row is 'frozen'. The 'insert or increment counter' pattern is the foundation of high-throughput event analytics systems.

**ON CONFLICT requires a unique constraint or index:** PostgreSQL needs to know which columns define a conflict. ON CONFLICT (column) only works if a UNIQUE constraint or UNIQUE index exists on those columns. ON CONFLICT ON CONSTRAINT requires a named constraint.

In the query INSERT ... ON CONFLICT DO UPDATE SET views = page_stats.views + EXCLUDED.views - what is EXCLUDED?

CTE with DML: atomic mutation chains

CTE (WITH) in PostgreSQL is not just for SELECT. Since PostgreSQL 9.1 a WITH block can contain INSERT, UPDATE, DELETE with RETURNING. The result of one mutation becomes the input for the next. Everything in a single transaction, a single round-trip. The pattern is indispensable for: creating a record in one table and immediately using its ID in another, atomically moving a row between tables, implementing a queue without stored procedures.

Atomic row migration between tables is the archetypal use case. DELETE from source with RETURNING, INSERT into destination with the RETURNING data. If any part fails - the entire operation rolls back. No intermediate states, no duplicates, no lost rows. This is how hot-to-cold data archival works in PostgreSQL without the risk of losing rows mid-operation.

**DML CTE and snapshot isolation:** all DML operations in one WITH share the same data snapshot taken at query start - they do not see each other's changes. An UPDATE in one CTE will not see rows inserted by an INSERT in another CTE of the same query. This matters when designing complex mutation chains.

CTEs in PostgreSQL are lazy - if a CTE's result is unused, it does not execute

DML inside a CTE always executes, even if the RETURNING result is never referenced

PostgreSQL 12+ made SELECT-only CTEs lazy (inline optimization). But DML CTEs (INSERT/UPDATE/DELETE) are always materialized and executed, because the side effects (data changes) must happen regardless of whether RETURNING is consumed. This guarantees predictable behavior.

CTE with DML: WITH del AS (DELETE FROM queue WHERE id = 1 RETURNING *) INSERT INTO archive SELECT * FROM del - what happens if the row with id=1 does not exist?

Key ideas

  • **INSERT VALUES / SELECT**: single or bulk; INSERT ... SELECT for copying and transforming data between tables
  • **RETURNING**: returns rows after INSERT/UPDATE/DELETE - eliminates the redundant SELECT
  • **ON CONFLICT DO NOTHING / UPDATE**: upsert - atomic 'insert or update' without race conditions
  • **UPDATE FROM / DELETE USING**: JOIN in DML - update/delete based on data from another table
  • **CTE with DML**: WITH ... AS (INSERT/UPDATE/DELETE RETURNING) - chain of mutations in one atomic query
  • **The GitLab rule**: always verify WHERE before DELETE/UPDATE; always use a transaction

Related topics

DML is at the center of data work in PostgreSQL, connected to transactions, optimization, and migrations:

  • Transactions and ACID — BEGIN / COMMIT / ROLLBACK wrap DML - without them DELETE without WHERE is permanent
  • SELECT and reading data — INSERT ... SELECT and RETURNING bridge writing and reading
  • CTE (Common Table Expressions) — WITH ... AS (INSERT/UPDATE/DELETE RETURNING) - multi-step mutations
  • Locks — UPDATE without an index on WHERE triggers Sequential Scan plus RowExclusiveLock on many rows
  • Schema migrations — Flyway and Liquibase contain DML for data backfills during migrations

Вопросы для размышления

  • GitLab erased data in 6 seconds. What three technical measures at the SQL interface level could have prevented the disaster?
  • INSERT ... ON CONFLICT DO UPDATE and SELECT + INSERT both add a row if one is absent. Why is the first approach superior under high concurrency?
  • CTE with RETURNING lets one INSERT feed a subsequent UPDATE in a single statement. When is that preferable to two separate queries?

Связанные уроки

  • pg-04-ddl — DDL creates tables - DML works with their contents
  • pg-06-select — SELECT reads what DML writes
  • pg-10-cte — CTE with DML: atomic multi-step mutations
  • pg-23-transactions — Safe DML always runs inside a transaction with rollback
  • pg-26-locks — UPDATE without index on WHERE clause triggers table-level scan locks
  • pg-49-migrations — Data migrations are applied DML in a managed, versioned form
  • db-05-sql-basics
DML: INSERT, UPDATE, DELETE, UPSERT

0

1

Sign In