PostgreSQL
Full-Text Search: tsvector and tsquery
A user types `databse indexs` with a typo and still needs to find the right thing. LIKE does a Seq Scan. Elasticsearch needs its own cluster. PostgreSQL FTS with the right dictionaries and a GIN index gives smart search inside the database itself, without sync, without eventual consistency, without a separate service.
- **Stack Overflow:** PostgreSQL FTS over 50M+ questions returns answers in 20-50 ms without Elasticsearch. A GIN index on a generated tsvector column covers title and body with different weights
- **Basecamp:** full-text search across project documents, tasks, and messages. Everything stays in PostgreSQL. Dropped the external search engine for simplicity and consistency
- **Discourse (forum):** search over millions of posts via PostgreSQL FTS. `websearch_to_tsquery` parses user input, `ts_headline` generates snippets in the SQL query itself
tsvector and tsquery: The FTS Foundation
PostgreSQL full-text search is built on two types: `tsvector`, a normalized vector of lexemes (words without endings, stop words, or duplicates), and `tsquery`, a search expression with boolean logic. The `@@` operator checks a match between them.
Always prefer `websearch_to_tsquery` for user input. It supports quotes for phrases, minus for exclusion, OR. `to_tsquery` requires explicit `&`, `|`, `!` operators and throws an error on plain text.
Which function correctly handles the user search query `"machine learning" -deep`?
Search Configurations and Languages
A text search configuration (TSConfig) defines which dictionaries to use for stemming, which words count as stop words, how to handle numbers and abbreviations. PostgreSQL ships with ready configurations for 20+ languages.
For Russian search use the `russian` configuration, but note: the standard Snowball stemmer for Russian works worse than Ispell dictionaries. For production Russian search consider the `hunspell_ru_ru` extension or `pg_trgm` as a supplement.
How do you check which lexemes PostgreSQL extracts from text under a given search configuration?
Ranking Search Results
Finding the documents is only half the work. Ordering them by relevance is where FTS becomes serious. PostgreSQL provides `ts_rank` and `ts_rank_cd` to compute relevance from match frequency and position.
The normalization parameter in `ts_rank`: `0` no normalization; `1` divide by 1 + log(length); `2` divide by length; `32` divide by rank + 1. Without normalization, long documents always win over short ones at the same term density.
What does `setweight(to_tsvector('english', title), 'A')` do before a `ts_rank` call?
Dictionaries and Result Highlighting
Dictionaries in FTS perform normalization: the Snowball stemmer strips endings, Ispell does spell checking, synonym maps words to canonical forms, thesaurus groups synonyms. The dictionary chain is applied in sequence.
`ts_headline` works on the raw text, not on a tsvector, so it cannot use a GIN index. For production, store the tsvector in a separate column: `ALTER TABLE articles ADD COLUMN fts tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;`. Search then uses the index, and `ts_headline` runs over the raw text only for the matching rows.
Why doesn't `ts_headline` benefit from a GIN index on tsvector?
GIN Index for Full-Text Search
Without a GIN index, FTS performs a Seq Scan and calls `to_tsvector` for every row. On a 1M row table that's seconds. A GIN index on a stored tsvector turns search into milliseconds.
Stack Overflow uses PostgreSQL FTS to search questions. With 50M+ questions, a GIN index over a tsvector column returns results in 20-50 ms. The alternative, Elasticsearch, requires a separate cluster, data synchronization, and extra infrastructure.
PostgreSQL FTS is a weak alternative to Elasticsearch. You need Elasticsearch for any serious search.
PostgreSQL FTS covers 80% of product needs: boolean queries, ranking, highlighting, multilingual support, phrase search. Elasticsearch is justified for complex search analytics, billions of documents, or distributed clusters.
Stack Overflow, Basecamp, GitHub (partially) use PostgreSQL FTS in production. The big advantage: the data is already in PostgreSQL. No eventual-consistency problems between the main DB and the search engine. Elasticsearch adds operational complexity: sync, an extra cluster, double storage.
Which approach gives the most efficient FTS on the table `articles(title, body)`?
Key Ideas
- **tsvector + tsquery + @@:** the FTS foundation. `to_tsvector` normalizes text, `websearch_to_tsquery` safely parses user input
- **GIN + STORED column:** add a column `fts tsvector GENERATED ALWAYS AS (...) STORED`, create a GIN index. Search becomes fast and stays current automatically
- **Ranking:** `ts_rank_cd` + `setweight` (A for title, B for body) + length normalization deliver a quality result order
Related Topics
Full-text search interacts with several PostgreSQL mechanisms:
- JSONB and GIN indexes — FTS and JSONB queries share the same GIN index mechanism. Understanding one helps with the other
- Extensions: pg_trgm — pg_trgm complements FTS for fuzzy search with typos and similarity matching. Often used together
- PostgreSQL Indexes — GIN is one index type. Choosing between GIN and GiST for FTS requires understanding the trade-offs of both
Вопросы для размышления
- A documentation site stores articles in Russian and English. How do you configure FTS so that search works correctly in both languages at once?
- A user complains that searching `PostgreSQL` doesn't find articles that say `Postgres`. How do you fix this without modifying existing data?
- A 10M row table has a GIN index on tsvector, but EXPLAIN still shows Seq Scan. What could have gone wrong?