System Design

Search Systems

Why you need a search engine

**Why SQL LIKE is not enough**

Problems: 1. **Full table scan** - O(n), no usable index 2. **No ranking** - every result has the same weight 3. **Exact match only** - 'headphone' will not find 'headphones' 4. **No typo tolerance** - 'headphons' returns nothing 5. **No synonyms** - 'earbuds' will not find 'headphones'

**Search engine capabilities:** | Feature

What did you learn about why you need a search engine?

Inverted Index

**Inverted Index, the backbone of full-text search** Regular index: document -> words Inverted index: word -> documents Think of it as the index at the back of a book: - 'algorithm' -> pages 15, 42, 89 - 'database' -> pages 23, 45, 67

**Text analysis pipeline:**

**N-grams for autocomplete and fuzzy match:**

What did you learn about the Inverted Index?

Elasticsearch Architecture

**Elasticsearch is a distributed search engine** Foundation: Apache Lucene (Java library for full-text search). Components: - **Cluster** - a group of nodes - **Node** - a single ES server - **Index** - the equivalent of a database - **Shard** - a partition of an index - **Document** - a JSON record (the equivalent of a row)

**Sharding strategy:**

**Important:** the number of primary shards is fixed at index creation. Changing it requires a full reindex.

**Node roles:** | Role

What did you learn about Elasticsearch architecture?

Ranking and Relevance

**How do you decide what is relevant?** Main factors: 1. **TF (Term Frequency)** - how often the word appears in the document 2. **IDF (Inverse Document Frequency)** - how rare the word is across the corpus 3. **Field length** - short fields matter more 4. **Boost** - manual weighting of fields

**BM25, an improved TF-IDF** Problem with TF-IDF: TF grows linearly (10 repetitions = 10x the score). BM25 adds: - **Saturation** - TF saturates (after about 5 repetitions there is barely any gain) - **Field length normalisation** - short titles count for more than long descriptions

Elasticsearch has used BM25 by default since version 5.0.

**Function Score, for custom ranking:**

That gives you control over popularity, freshness, price, geolocation, and more.

What did you learn about Ranking and Relevance?

Advanced search features

**Autocomplete (search-as-you-type)** Three approaches: 1. **Edge N-grams** - index the prefixes 2. **Completion Suggester** - a special data structure 3. **Search-as-you-type field** - a ready-made field type

**Fuzzy search, tolerant to typos**

**Levenshtein distance:** - 'headphons' -> 'headphones' (distance = 1, insert 'e') - AUTO: 0 edits for 1-2 chars, 1 edit for 3-5, 2 edits for more than 5

**Highlighting - call out the matches in the results:**

What did you learn about advanced search features?

Связанные уроки

db-09-indexes-btree