PostgreSQL
Extensions: pgvector, PostGIS, pg_stat
Why store embeddings in Pinecone, geo data in MongoDB, logs in InfluxDB, and primary data in PostgreSQL when PostgreSQL with extensions handles all of it? pgvector, PostGIS, and TimescaleDB turn one database into a platform while keeping unified transactions, permissions, and SQL.
- **Supabase:** pgvector underpins semantic search and RAG. Embeddings live next to the data, JOINs and RLS work natively without syncing to a separate vector store.
- **Uber:** PostGIS computes surge pricing zones and finds nearest drivers. Spatial queries against 5M+ active devices run in milliseconds via GiST indexes.
- **Grafana Cloud:** TimescaleDB stores 10B+ metrics per day. Continuous aggregates render year-long graphs under 100 ms without recomputing billions of points at query time.
The PostgreSQL Extension System
Extensions are the official mechanism for adding new data types, functions, index methods, and operators to PostgreSQL without modifying the core. Everything installs with a single `CREATE EXTENSION` command and integrates fully with SQL syntax, transactions, and access control.
Extensions install into a specific schema (default `public`). On managed PostgreSQL (AWS RDS, Google Cloud SQL, Supabase), only a limited list of extensions is available. Not every open-source extension is allowed by the provider. Check the provider documentation for the supported list.
What happens with `DROP EXTENSION pgvector CASCADE` if the database has tables with `vector` columns?
pgvector: Vector Search for AI Applications
pgvector adds a `vector` type for storing embeddings (numeric representations of text, image, or audio semantics) and operators for approximate nearest neighbor (ANN) search. Instead of a separate vector database (Pinecone, Weaviate), everything lives in PostgreSQL alongside metadata.
Supabase uses pgvector for semantic search and RAG applications. Embeddings live next to the data in PostgreSQL, JOINs with user_id, project_id and RLS work natively. That removes the synchronization problem between PostgreSQL and a separate vector database.
Which pgvector operator should you use for nearest-neighbor search with cosine similarity?
PostGIS: Geospatial Data
PostGIS is PostgreSQL's largest extension. It adds geometry and geography types, 1000+ geo functions, support for WKT/WKB/GeoJSON, and the GiST/SP-GiST spatial indexes. It is used wherever distances, polygons, routes, or geo filters matter.
Uber uses PostGIS for surge pricing zones and routing. With 5M+ active rides per day, spatial queries run in milliseconds thanks to GiST indexes. PostGIS turns PostgreSQL into a full GIS platform without additional software.
When should you use `geography` instead of `geometry`?
pg_trgm: Fuzzy Search and Typos
pg_trgm splits strings into trigrams (three-character chunks) and computes similarity from their overlap. This makes it possible to find strings with typos, partial matches, and case-insensitive matches. GIN/GiST indexes over trigrams give fast `LIKE '%text%'` searches.
A GIN index over pg_trgm makes `WHERE column LIKE '%text%'` fast, which is one of its main use cases. A regular B-tree index does not help when LIKE starts with `%`. On a table with 10M+ rows, this is the difference between 5 seconds and 5 milliseconds.
The query `WHERE name LIKE '%iphone%'` is slow. Which solution speeds it up without changing the query?
pg_cron: A Scheduler Inside PostgreSQL
pg_cron is a scheduler extension that runs SQL queries and functions on a cron schedule directly inside PostgreSQL. No external cron daemon is required and no external script needs to connect to the database. Tasks live alongside the data.
pg_cron only works in the postgres database (shared_preload_libraries). Run jobs in another database via `cron.schedule_in_database()`. Important: if PostgreSQL crashes, unfinished jobs are not restarted automatically. Monitor `cron.job_run_details`.
What is the advantage of pg_cron over an external cron script that connects to PostgreSQL?
TimescaleDB: Time-Series on PostgreSQL
TimescaleDB is an extension that turns PostgreSQL into a specialized time-series database. Core concepts: hypertable (auto-partitioned by time), continuous aggregates (auto-refreshed materialized views), plus retention and compression policies.
Grafana uses TimescaleDB for Grafana Cloud metrics: 10B+ points per day. Continuous aggregates render year-long graphs in milliseconds. Without them, every query would recompute billions of rows. Compression reduces storage by 10 to 20 times.
Time-series data always needs a dedicated database (InfluxDB, Prometheus TSDB). PostgreSQL cannot handle high write throughput.
TimescaleDB matches or beats InfluxDB on most benchmarks at 100k+ writes/sec while keeping all PostgreSQL advantages: SQL, JOIN, ACID, extensions.
TimescaleDB uses time-partitioned chunks for write amplification, columnar compression for reads, and automatic continuous aggregates. The same techniques as specialized TSDBs, but on top of PostgreSQL. Grafana, Timescale Cloud, and many companies store time-series data in PostgreSQL+TimescaleDB without paying for extra infrastructure.
What is the key difference between a TimescaleDB hypertable and standard PostgreSQL partitioning?
Key Ideas
- **pgvector:** store and search embeddings directly in PostgreSQL. HNSW index for ANN search. Operators: cosine `<=>`, L2 `<->`, inner product `<#>`.
- **PostGIS + pg_trgm:** geo data (geography for global distances, GiST index) and fuzzy search (a GIN index makes `LIKE '%text%'` fast).
- **pg_cron + TimescaleDB:** scheduled jobs without external services. Time-series with auto partitioning, compression, and continuous aggregates.
Related Topics
Extensions amplify other PostgreSQL mechanisms:
- Full-Text Search — pg_trgm complements built-in FTS. Where tsvector cannot handle typos, trigram search finds similar strings.
- Table Partitioning — TimescaleDB automates partition management for time-series. pg_cron automates partition creation for standard partitioning.
- JSONB and GIN Indexes — pgvector adds a new data type and index methods (HNSW, IVFFlat), similar to how JSONB extends PostgreSQL through GIN.
Вопросы для размышления
- An application stores user embeddings in Pinecone and primary data in PostgreSQL. What problems does this architecture create? When is pgvector the better choice?
- The `products` table has 50M rows. The query `WHERE name ILIKE '%samsung%'` takes 8 seconds. How can it be accelerated without changing the SQL?
- TimescaleDB compression saves 90 percent of storage but makes data read-only for INSERT/UPDATE. For which data is this acceptable and for which is it not?