Information Retrieval
Introduction to Information Retrieval
1998. AltaVista - the world's largest search engine, 200 million indexed pages. Slow. Irrelevant. Two Stanford students launch Google with half the index - and destroy AltaVista in two years. The secret? One trick: inverted index plus PageRank. Not more hardware, not more money - a different data structure. Today Google processes 8.5 billion queries a day, each in 200 milliseconds.
- **Google Search** - inverted index of 100B+ pages plus BERT reranking: from the 1998 data structure to 2019 neural nets
- **Elasticsearch** powers search at GitHub (500M+ repositories), Wikipedia, Uber - full-text search in milliseconds
- **Algolia** delivers e-commerce search in under 10 ms; **Qdrant and Pinecone** do vector search for semantics, where "automobile" finds "car"
Indexing
One billion documents. One query. 200 milliseconds. Linear scan is physically impossible at this scale - the speed of light won't help. A different idea is needed. **Indexing** means building a data structure that makes "find" cheap. Not faster hardware - a smarter structure.
Google indexes over **100 billion** web pages. Without an index, scanning them all per query would take days. With an inverted index, the answer arrives in **0.2 seconds**. The gap isn't in CPU speed - it's algorithmic: O(n) versus O(1).
| Approach | Complexity | 100B documents |
|---|---|---|
| Linear scan | O(n) | ~hours |
| Index (hash lookup) | O(1) | ~milliseconds |
| Inverted index | O(k) where k = posting list length | ~milliseconds |
The idea is almost offensively simple: instead of "document → words", store "word → documents". Flip the mapping direction. That's the **inverted index** - the structure every search engine runs on. AltaVista had it too. Google added PageRank on top - and won.
From Card Catalogs to Google
1876 - Melvil Dewey invents the Decimal Classification for libraries. 1960s - Gerard Salton at Cornell builds SMART: the first system with TF-IDF and a vector model. 1990s - the web explodes, AltaVista tries to index everything. 1998 - Brin and Page add the link graph and PageRank. One algorithm changes the industry. 2019 - Google adds BERT: now it understands query meaning, not just keywords.
Why is an inverted index called "inverted"?
Ranking
The index found a million pages for "Python". Congratulations. Now the real question: which one comes first? This is **ranking** - and this is where all the value of a search engine lives. Finding is easy. Ordering correctly is hard.
75% of clicks go to the top three results. Most users never scroll past the first page. Ranking determines which site gets traffic and which doesn't. That's billions of dollars in advertising packed into a single algorithm.
Two fundamental approaches. **Content-based**: look inside the document - word frequency, rarity in the corpus, position in the text (TF-IDF, BM25). **Link-based**: look outside - who links here, how authoritative are those sources (PageRank). Google beat AltaVista precisely because it was the first to combine both.
Modern Google combines hundreds of signals: BM25 for word relevance, PageRank for domain authority, CTR and dwell time for user behavior, freshness for recency. Since 2019 - BERT for understanding the full query meaning, not just individual words. Elasticsearch for in-product search uses the same stack: BM25 plus an optional neural reranker.
| Signal | Type | Example |
|---|---|---|
| TF-IDF / BM25 | Content-based | Word frequency and rarity in document |
| PageRank | Link-based | Number and quality of inbound links |
| CTR | Behavioral | Click-through rate from search results |
| Freshness | Temporal | Date of last update |
| BERT score | Semantic | Neural understanding of meaning |
What is the key idea behind PageRank?
Relevance
Ranking tries to put "relevant" documents higher. But what does **relevant** mean? This is the central and surprisingly uncomfortable question in IR. It can't be written into code. It can't be computed by formula. It's the degree to which a document matches what the user actually wanted to find - not just what they typed.
Relevance is **subjective**. Two professional assessors disagree on the same document in 30-40% of cases - Cohen's kappa around 0.6-0.7. That's why Google employs thousands of "raters" worldwide and spends billions on data annotation. Subjectivity can't be removed - only averaged away.
Three levels. **Topical**: the document is on the same topic. **User**: it fits this specific person with their context. **Situational**: it helps solve the task right now. A query for "Python" from a student - tutorial. From a biologist - a snake. From DevOps - a package manager. One query, three correct answers. That's why search engines look at history, location, and device - not just the text.
| Relevance type | Question | Example |
|---|---|---|
| Topical | Is the document on the same topic? | Query "Python" → article about Python |
| User | Is it right for this user? | Beginner → tutorial, expert → PEP |
| Situational | Does it solve the task? | "Python sorting" → specific algorithm |
In practice, **relevance grades** are used: 0 (not relevant), 1 (partial), 2 (relevant), 3 (perfect). TREC (Text REtrieval Conference), running since 1992, organizes competitions where humans annotate thousands of query-document pairs. These datasets became the standard benchmark for comparing search systems - the ImageNet of IR.
Why is assessing relevance a difficult task?
Precision & Recall
Subjective relevance doesn't prevent measuring search with numbers. Two fundamental metrics do the job: **Precision** and **Recall**. The first asks: how much noise is in the results? The second: how much useful content was missed? They pull in opposite directions - and that tension defines which search engine users choose.
**Precision** - of what the system returned, how much is actually useful? **Recall** - of all the useful content that exists, how much did the system find? Improve one, hurt the other. Return everything - Recall = 1, Precision = garbage. Return only one perfect result - Precision = 1, Recall = disaster. The balance between them is the art of ranking.
The trade-off determines the application. A medical system searching for drug side effects needs maximum Recall - missing something is dangerous. Legal precedent search - same. Web search and e-commerce need Precision: users don't scroll past the third result; noise at the top means they go to a competitor.
| Metric | Formula | Priority in... |
|---|---|---|
| Precision | TP / (TP + FP) | Web search, recommendations |
| Recall | TP / (TP + FN) | Medicine, legal search |
| F1-score | 2PR / (P + R) | Balanced tasks |
| F0.5 | 1.25PR / (0.25P + R) | When Precision matters more |
| F2 | 5PR / (4P + R) | When Recall matters more |
For ranked lists, P/R isn't enough - they ignore position. The first result matters more than the tenth. That's why **NDCG** (Normalized Discounted Cumulative Gain) is used - a metric that penalizes a relevant result stuck on a low position. NDCG is what Google uses internally to evaluate quality. **MAP** and **MRR** are for systems where a single correct answer is expected.
Search is just grep over files, only faster
Information Retrieval solves the problem of ranking billions of documents by relevance to a fuzzy user query
grep searches for exact string matches - returns everything or nothing, no ranking. IR works with semantics: "automobile" finds "car", "vehicle", "auto". It considers query context, source authority, and behavior of other users. This isn't "faster grep" - it's a different class of problem
A system returned 10 documents, 4 of which are relevant. Total relevant in collection: 8. What is Recall?
Key Ideas
- **Inverted index** - the core trick of IR: flipping "document → words" into "word → documents", O(n) to O(1). The structure AltaVista had but Google weaponized with PageRank
- **Ranking** is harder than retrieval: from a million matching documents, the right one must come first. BM25, PageRank, BERT - each generation solved this better than the last
- **Precision** (noise in results) and **Recall** (useful content missed) are in constant trade-off: medicine wants Recall, web search wants Precision
- Relevance is subjective: two experts disagree 30-40% of the time. That's why search quality is measured through A/B tests, not just metrics
Related Topics
Information retrieval unifies ideas from several fields:
- Inverted Index — The core data structure for search - covered in detail in the next lesson
- TF-IDF and BM25 — Mathematical ranking models based on word frequency
Вопросы для размышления
- AltaVista had twice as many indexed pages as early Google. Why didn't that help? What exactly decided the outcome of the competition?
- When is Recall more important than Precision, and when is the opposite true? Give real-life examples where the wrong choice would be costly.
- When building a search engine for an internal company portal - which architecture fits best: inverted index, vector search, or a hybrid? Why?
Связанные уроки
- nlp-01 — NLP processes query and document text: tokenization, stemming and embeddings feed the IR pipeline
- db-01-intro — Databases store indexes; SQL full-text search is the simplest IR built on top of a DBMS
- ml-01-intro — Learning-to-rank and neural rerankers are ML applied to the ranking problem
- alg-01-big-o — The shift from O(n) to O(1) via the inverted index is a fundamental algorithmic idea
- ir-02 — The next lesson dissects the inverted index data structure in detail
- dm-01