System Design
Search Systems
Why you need a search engine
**Why SQL LIKE is not enough**
Problems: 1. **Full table scan** - O(n), no usable index 2. **No ranking** - every result has the same weight 3. **Exact match only** - 'headphone' will not find 'headphones' 4. **No typo tolerance** - 'headphons' returns nothing 5. **No synonyms** - 'earbuds' will not find 'headphones'
**Search engine capabilities:** | Feature
| **Full-text search** | Search by words, not substrings | | **Relevance ranking** | Best results at the top | | **Fuzzy matching** | Tolerates typos | | **Synonyms** | 'laptop' = 'notebook' | | **Stemming** | 'running' -> 'run' | | **Faceted search** | Filters by category | | **Autocomplete** | Suggestions as you type | | **Highlighting** | Highlights matched terms |
What did you learn about why you need a search engine?
Inverted Index
**Inverted Index, the backbone of full-text search** Regular index: document -> words Inverted index: word -> documents Think of it as the index at the back of a book: - 'algorithm' -> pages 15, 42, 89 - 'database' -> pages 23, 45, 67
**Text analysis pipeline:**
**N-grams for autocomplete and fuzzy match:**
What did you learn about the Inverted Index?
Elasticsearch Architecture
**Elasticsearch is a distributed search engine** Foundation: Apache Lucene (Java library for full-text search). Components: - **Cluster** - a group of nodes - **Node** - a single ES server - **Index** - the equivalent of a database - **Shard** - a partition of an index - **Document** - a JSON record (the equivalent of a row)
**Sharding strategy:**
**Important:** the number of primary shards is fixed at index creation. Changing it requires a full reindex.
**Node roles:** | Role
| **Master** | Manages cluster state, creates/deletes indices | | **Data** | Stores data, runs search/index operations | | **Ingest** | Pre-processes documents through pipelines | | **Coordinating** | Routes requests (default for every node) | | **ML** | Machine learning jobs |
What did you learn about Elasticsearch architecture?
Ranking and Relevance
**How do you decide what is relevant?** Main factors: 1. **TF (Term Frequency)** - how often the word appears in the document 2. **IDF (Inverse Document Frequency)** - how rare the word is across the corpus 3. **Field length** - short fields matter more 4. **Boost** - manual weighting of fields
**BM25, an improved TF-IDF** Problem with TF-IDF: TF grows linearly (10 repetitions = 10x the score). BM25 adds: - **Saturation** - TF saturates (after about 5 repetitions there is barely any gain) - **Field length normalisation** - short titles count for more than long descriptions
Elasticsearch has used BM25 by default since version 5.0.
**Function Score, for custom ranking:**
That gives you control over popularity, freshness, price, geolocation, and more.
What did you learn about Ranking and Relevance?
Advanced search features
**Autocomplete (search-as-you-type)** Three approaches: 1. **Edge N-grams** - index the prefixes 2. **Completion Suggester** - a special data structure 3. **Search-as-you-type field** - a ready-made field type
**Fuzzy search, tolerant to typos**
**Levenshtein distance:** - 'headphons' -> 'headphones' (distance = 1, insert 'e') - AUTO: 0 edits for 1-2 chars, 1 edit for 3-5, 2 edits for more than 5
**Highlighting - call out the matches in the results:**
What did you learn about advanced search features?