Search Engines
GitHub hosts 330 million repositories, indexed as roughly 28 billion code documents. Searching 'django authentication middleware' across all public code returns results in under 100ms, whereas a SQL LIKE '%django%' query across 28 billion rows would take hours. Elasticsearch's inverted index, which maps each token to a posting list of documents, makes this possible. The same technology powers Wikipedia's multilingual search, Shopify's product catalog, and Airbnb's listing search.
- **GitHub**: Elasticsearch indexes 28 billion code documents (files, commits, issues). Code search across all public repositories returns in under 100ms using inverted index lookups and shard-level parallelism.
- **Wikipedia**: Elasticsearch serves search for 6+ million articles across 300+ languages. Language-specific analyzers (stemming, stop words) are configured per field to handle morphologically complex languages like Arabic and Finnish.
- **Shopify**: Elasticsearch powers product search across 10+ billion listings. Filter context clauses (merchant_id, status=active) are cached as bitsets; BM25 with popularity boost ranks results by relevance and sales velocity.
Inverted Index
An inverted index maps each unique term to the list of documents containing it (its posting list). At index time, text is tokenized and each token's posting list is updated. At query time, the engine looks up each query term in the index and intersects the posting lists (AND) or unions them (OR). Because the work depends on the query terms and their postings rather than on the size of the corpus, individual term lookups stay in the sub-millisecond range even over billions of documents, something a sequential scan or a B-Tree on raw text cannot do.
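The index-time and query-time steps can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores postings: the whitespace tokenizer, the in-memory sets, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query, mode="and"):
    """Intersect (AND) or union (OR) the posting lists of the query terms."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical sample corpus.
docs = {
    1: "django authentication middleware",
    2: "flask authentication helpers",
    3: "django template tags",
}
index = build_index(docs)

print(search(index, "django authentication"))             # AND: only doc 1 has both terms
print(search(index, "django authentication", mode="or"))  # OR: any doc with either term
```

Note that the query never touches document text: it only reads the posting lists of the terms it asks for.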
GitHub uses Elasticsearch with 28 billion code documents to serve searches like 'django authentication middleware', and results return in under 100ms. A LIKE '%django%' query across 28 billion rows in PostgreSQL would take hours. The inverted index accounts for the difference.
An Elasticsearch query searches for 'quick brown fox' with AND logic. Does the document 'The quick red fox jumps' match?
Analyzers and Field Mapping
An analyzer transforms raw text into tokens before indexing. It consists of a character filter (strip HTML), a tokenizer (split on whitespace or punctuation), and token filters (lowercase, stop words, stemming). The same analyzer must be applied at query time for consistent matching. Mapping defines the data type of each field and which analyzer it uses.
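The three stages can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not Elasticsearch's implementation: the stop-word list is a tiny sample and `toy_stem` is a crude suffix stripper standing in for a real stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list


def strip_html(text):
    # Character filter: replace anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    # Tokenizer: split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def toy_stem(token):
    # Token filter: crude suffix stripping, NOT a real stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word filter
    return [toy_stem(t) for t in tokens]                  # stemming filter


indexed = analyze("<p>The Quick Foxes</p>")  # applied at index time
queried = analyze("quick fox")               # applied at query time
print(indexed, queried)
```

Both calls produce the same tokens, which is why the document matches the query. If the query side skipped stemming or lowercasing, the terms would no longer line up with what was indexed.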
The type or analyzer of an existing field cannot be changed in place; doing so requires reindexing all documents. For zero-downtime schema changes, reindex into a new index with the updated mapping and then switch an index alias to it. Always define explicit mappings for production indexes: dynamic mapping guesses field types and can guess wrong.
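As a rough sketch, the request bodies involved in such a change can be written as Python dictionaries. The index names (`products_v1`, `products_v2`), the alias `products`, and the field names are hypothetical. The functions only build the JSON bodies for index creation, `POST /_reindex`, and `POST /_aliases`; sending them is left to whichever client is in use.

```python
def explicit_mapping():
    # Body for creating the new index with explicit field types.
    # Field names are hypothetical.
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "english"},
                "merchant_id": {"type": "keyword"},
                "published_at": {"type": "date"},
            }
        }
    }


def reindex_body(source_index, dest_index):
    # Body for POST /_reindex: copy documents into the new index.
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias, old_index, new_index):
    # Body for POST /_aliases: both actions are applied atomically,
    # so clients reading through the alias never see a gap.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


steps = [
    ("PUT /products_v2", explicit_mapping()),
    ("POST /_reindex", reindex_body("products_v1", "products_v2")),
    ("POST /_aliases", alias_swap_body("products", "products_v1", "products_v2")),
]
for endpoint, body in steps:
    print(endpoint, body)
```

Because the application only ever queries the alias, the swap in the last step is the only moment at which it notices the new index.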
A field 'status' contains values: 'active', 'inactive', 'banned'. Which Elasticsearch mapping type is correct?
Query DSL: Query vs Filter Context
Elasticsearch Query DSL distinguishes query context (affects relevance score) from filter context (binary yes/no, no score, cached). Queries in the must/should clause contribute to scoring; filters in the filter clause are cached as bitsets and reused across requests. Mixing the two correctly is critical for performance and relevance quality.
Shopify uses Elasticsearch to search over 10 billion product listings. Filter context clauses (status=active, merchant_id=X) are cached as bitsets and reused across thousands of requests per second for the same merchant. The must clause (full-text product title match) computes relevance only for documents passing the cached filter.
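A query of this shape might look like the following sketch, written as a Python dictionary that serializes to the JSON request body. The field names (`title`, `merchant_id`, `status`) are assumptions for illustration and do not reflect Shopify's actual schema.

```python
import json


def product_search(text, merchant_id):
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"title": text}}],
                # Filter context: yes/no only, no scoring, cacheable.
                "filter": [
                    {"term": {"merchant_id": merchant_id}},
                    {"term": {"status": "active"}},
                ],
            }
        }
    }


body = product_search("wireless headphones", "m-123")
print(json.dumps(body, indent=2))
```

Moving the `term` clauses into `must` would return the same documents but would add scoring work and give up filter caching, without improving the ranking.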
A query searches articles by text (with relevance) and filters by publication date. In which context should the date condition appear?
Relevance Scoring: BM25
BM25 (Best Match 25) is the default relevance algorithm in Elasticsearch, having replaced classic TF-IDF. It scores documents based on term frequency (how often the term appears in the field), inverse document frequency (how rare the term is across the corpus), and field length normalization (a match in a short field counts for more than the same match in a long one, where it is diluted). BM25 also saturates the TF contribution, so a document that repeats a term 100 times does not automatically outrank a document that matches a rarer, higher-IDF term.
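The interaction of the three factors can be seen in a small scoring function. This is a sketch of the textbook BM25 formula for a single term, using the Elasticsearch default parameters k1 = 1.2 and b = 0.75; Lucene's exact implementation differs in constant factors, and the corpus statistics passed in below are made-up numbers.

```python
import math


def idf(num_docs, docs_with_term):
    # Rare terms get a larger weight; the +1 inside the log keeps it positive.
    return math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    # Length normalization: fields longer than average are penalized.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturating TF: approaches (k1 + 1) as tf grows and never exceeds it.
    tf_part = tf * (k1 + 1) / (tf + k1 * norm)
    return idf(num_docs, docs_with_term) * tf_part


# Hypothetical corpus: 1000 documents, term present in 10 of them,
# document length equal to the average.
for tf in (1, 10, 100):
    print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

Going from 1 to 10 occurrences nearly doubles the score, while going from 10 to 100 adds only a little more. That flattening is the saturation effect controlled by k1.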
Document A contains the word 'search' 10 times. Document B contains it once, but 'search' is very rare in the corpus. Which gets a higher BM25 score?
Elasticsearch Cluster Architecture
An Elasticsearch cluster consists of nodes with different roles: master nodes coordinate cluster state, data nodes store shards, and coordinating nodes route requests. Indexes are divided into primary shards for write parallelism and replica shards for read scaling and high availability. Each shard is a complete Lucene index.
The number of primary shards is fixed at index creation and cannot be changed (without reindexing). Over-sharding a small index wastes resources; under-sharding prevents scaling. A common mistake is creating 10 shards for an index that will only ever hold 1 GB of data.
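The reason the primary shard count is fixed follows from how documents are routed: in simplified form, the target shard is a hash of the routing key (by default the document ID) modulo the number of primary shards. The sketch below uses `zlib.crc32` purely as a stand-in hash, which is an assumption for illustration since Elasticsearch uses its own hash function, and the document IDs are made up.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    # Stand-in hash: Elasticsearch uses its own hash function, not CRC32.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Count how many documents would be looked up on a different shard
# if the primary shard count changed from 3 to 4.
moved = sum(1 for d in doc_ids if shard_for(d, 3) != shard_for(d, 4))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Changing the modulus silently changes where existing documents are expected to live, so lookups by ID would go to the wrong shard. Rebuilding the index with the new shard count avoids that.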
A 3-node Elasticsearch cluster has an index with 3 primary shards and 1 replica each. One node fails. What is the cluster state?
Summary
- **Inverted index**: maps each token to a posting list of documents. AND queries intersect posting lists; OR queries union them. Enables sub-millisecond full-text search over billions of documents.
- **Analyzers**: transform text to tokens at index and query time (tokenize, lowercase, stem, remove stop words). Must be consistent between indexing and querying. keyword fields skip analysis entirely.
- **Query vs filter context**: must/should clauses affect BM25 relevance score. filter clauses are binary, cached as bitsets, and do not touch the score - use filter for dates, status, IDs.
- **BM25**: a saturating TF term (controlled by k1) multiplied by IDF (rare terms score higher), with field length normalization (controlled by b) applied inside the TF term. Use the _explain API to debug unexpected rankings.
- **Cluster**: primary shards for write parallelism, replicas for read scaling and HA. Primary shard count is fixed at creation. When a node fails, replicas of the lost primaries are promoted to primary; the cluster is yellow while any replica copy remains unassigned.
Related Topics
Search engines integrate with the broader data architecture:
- Polyglot Persistence — Elasticsearch is commonly a search layer on top of PostgreSQL or MongoDB as the source of truth. Changes sync to Elasticsearch via CDC or application-level dual writes.
- Vector Databases — Elasticsearch 8.x supports dense vector fields and approximate nearest-neighbor search, enabling hybrid keyword + semantic search in one query.
- Database Monitoring — The ELK Stack (Elasticsearch, Logstash, Kibana) is a standard log analytics platform. Elasticsearch indexes log events; Kibana visualizes them.
Questions for Reflection
- How is zero-downtime reindexing implemented when a mapping change requires rebuilding an index? What role do index aliases play?
- An application stores products in PostgreSQL as the source of truth and syncs to Elasticsearch for search. A product update in PostgreSQL fails to sync to Elasticsearch. How is this inconsistency detected and repaired?
- When is PostgreSQL full-text search (tsvector + tsquery) sufficient, and when does Elasticsearch become necessary?