PostgreSQL
GIN and GiST: Full-Text Search and Geo
Booking.com processes 1.5M requests per second. Some of them are searches for hotels by geolocation and filters by tags ('wifi', 'pool', 'pet-friendly'). B-Tree does not fit here. GIN and GiST are specialized indexes that make such queries run in milliseconds.
- **Uber/Lyft**: finding nearest drivers - GiST (PostGIS ST_DWithin) on a table of geo-positions. Location updates ~4 times/sec across millions of devices
- **GitHub Search**: searching code and issues - GIN on tsvector. Indexing 200M+ repositories. An alternative to Elasticsearch for transactional data
- **StackOverflow**: question tags - GIN on a tags array. Queries like 'find questions with tags [postgresql] AND [indexing]' via the @> operator
GIN: Inverted Index Like in Search Engines
Elasticsearch at its core is an inverted index. PostgreSQL GIN (Generalized Inverted Index) is the same idea, built directly into the database. Structure: a dictionary of keys → posting list (list of heap blocks where the key appears). For the string 'postgresql database', two entries are created: 'postgresql' and 'database', each referencing the row.
GIN is optimized for types with multiple values within a single row: arrays, JSONB, tsvector, hstore. B-Tree builds one index key per row; GIN builds N keys. For a query `@>` or `&&`, PostgreSQL intersects posting lists - fast even across millions of rows.
ML parallel: TF-IDF and BM25 are built on top of an inverted index. GIN is the data structure; ranking is a separate layer. pgvector for vector search uses a different approach (HNSW/IVFFlat), but GIN remains the best for exact multi-value matches (tags, ACLs, JSON paths).
Why is GIN faster than B-Tree for `tags @> '{go,rust}'` on a table with 10M rows?
GIN for JSONB and Arrays: Operators and Indexability
Two classes of GIN operators. For arrays: `@>` (contains), `&&` (overlap), `<@` (contained by). For JSONB: the same plus `?` (key exists), `?|` (any of keys exists), `?&` (all keys exist). GIN automatically indexes all keys and values in JSONB - so a query `data ? 'email'` hits the index.
Two opclasses for JSONB: `jsonb_ops` (default) indexes keys and values separately, supports all operators. `jsonb_path_ops` indexes only paths to values, does not support `?`/`?|`/`?&`, but is 2-3x more compact and faster for `@>`. The choice depends on the query pattern.
Which opclass to choose for GIN on JSONB when queries are only `data @> '{"status":"active"}'`?
GIN for Full-Text Search: tsvector and tsquery
PostgreSQL full-text search: a document is converted into a `tsvector` (normalized lexemes with positions), a query into a `tsquery`. The `@@` operator checks for a match. A GIN index on a tsvector column turns this into instant search via posting lists.
Why not Elasticsearch? For volumes up to 10-50M documents, PostgreSQL FTS is competitive: no additional infrastructure, transactional consistency, JOIN with primary data. Elasticsearch wins for: sharding across multiple machines, complex scoring (BM25 with fields), real-time indexing >100K doc/sec.
What is the purpose of `setweight()` when building a tsvector?
GiST for Spatial Types and PostGIS
GiST (Generalized Search Tree) is an extensible index framework where each tree node stores not an exact key but a predicate (bounding box, range, circle). Search is a recursive traversal: subtrees whose predicate is incompatible with the query are skipped. This is R-Tree with generalized predicates.
GiST vs GIN: the choice is determined by the query type. GiST is for spatial predicates (ST_Within, KNN, range overlap), always one key per row. GIN is for multi-value containment queries (arrays, JSONB, FTS). A single data type can support both: tsvector supports GIN (faster for search) and GiST (lower memory on writes).
Which index does PostGIS use for the KNN query 'find 10 nearest points'?
GiST for Range Types and Exclusion Constraints
PostgreSQL range types: `int4range`, `tsrange`, `daterange`, etc. A GiST index on a range type supports operators `&&` (overlap), `@>` (contains), `<@` (contained by). This enables queries like 'find all bookings overlapping with interval [check_in, check_out]' in O(log N).
Exclusion constraint is more powerful than UNIQUE: UNIQUE = 'no two equal values', EXCLUDE = 'no two rows satisfying an arbitrary operator'. The classic use case is calendar booking without double-booking. The constraint is checked via GiST index on insert/update. Insert performance: O(log N) versus O(N) for a trigger-based solution.
GIN and GiST are just different names for the same type of specialized index
GIN is an inverted index for multi-value types (N keys per row); GiST is an extensible tree with predicates (1 key/predicate per row). Different structures for different tasks
GIN: posting list per key -> fast membership lookup. GiST: balanced tree with lossy keys -> fast spatial/range predicate search. Tsvector supports both, but GIN is faster for search, GiST is better for frequent updates
Why is EXCLUDE USING gist better than a CHECK constraint with a subquery for preventing booking overlaps?
Related Topics
GIN and GiST are specialized indexes for specific data types:
- JSONB in PostgreSQL — GIN indexes are the primary tool for queries on JSONB documents
- Full-Text Search — GIN indexes tsvector - the foundation of FTS in PostgreSQL
- Extensions: PostGIS — PostGIS uses GiST for spatial indexes
Key Ideas
- **GIN - inverted index**: key -> posting list of rows. Optimal for @>, &&, ?, when a row has N values (arrays, JSONB, tsvector).
- **jsonb_ops vs jsonb_path_ops**: jsonb_ops is universal (?, ?|, ?&, @>); jsonb_path_ops is 2-3x more compact for pure @> queries.
- **Full-text: GIN + tsvector**: setweight for field-boosting, ts_rank for ranking, phraseto_tsquery for phrase search.
- **GiST - extensible tree**: predicates instead of exact keys. KNN via <->, ST_DWithin for radius, range && for overlaps.
- **EXCLUDE USING gist**: exclusion constraint via GiST index - atomic protection against overlaps in O(log N) instead of trigger-based O(N).
Вопросы для размышления
- A GIN index on JSONB with jsonb_ops indexes all keys and values. What happens to the index size when value cardinality is high (UUIDs as JSON values)?
- An exclusion constraint with GiST is checked on insert. How does this affect performance for bulk importing booking data?
- GIN supports a 'fast update' mode - buffering changes. When is this useful and when is it harmful?
Связанные уроки
- pg-14-partial-expr — Partial and expression indexes are the preceding level of abstraction
- pg-16-brin — BRIN is the next specialized index in the series
- pg-32-jsonb — GIN indexes are the primary tool for JSONB queries
- pg-33-fulltext — GIN indexes tsvector for full-text search
- pg-36-extensions — PostGIS uses GiST for spatial queries
- alg-05 — GiST is a generalized search tree: same BST principle, different predicates
- db-10-indexes-advanced