Information Retrieval

Query Understanding

Users type 'iphone 15 pric', 'best headfones under 100', and 'where to buy nike shos cheap'. A retrieval system that takes queries literally returns zero results or completely irrelevant documents for all three. Query understanding is the layer that transforms messy human language into structured retrieval signals - it is the difference between a search engine and a text-matching system.

**Google** reports that about 15% of the queries it sees each day have never been seen before (2023). Query understanding - including spell correction, synonym expansion, and entity disambiguation - handles this by mapping novel surface forms to known concepts. Without it, queries about every new product release would return near-zero results until the index was updated.
**Amazon** classifies each query into 200+ fine-grained intent categories. 'Batteries' classified as 'replenishment' (AA/AAA batteries for remotes) returns different results than 'batteries' classified as 'accessory' (phone battery packs). This routing alone contributes $1B+ in annual GMV according to Amazon's 2022 re:Invent talk.
**LinkedIn Search** uses entity linking to normalize company names: 'Google', 'Alphabet', 'GOOGL', and 'Google Inc.' all resolve to the same company entity, ensuring that a recruiter searching 'Google engineers' finds profiles mentioning any form of the name across 930 million member profiles.

Spell Correction and Normalization

Search queries contain 10-15% misspelled terms (Google's internal data). Spell correction operates in two stages: candidate generation and candidate selection. Candidates are generated by edit distance (Levenshtein <= 2) or phonetic matching (Soundex, Metaphone). Selection ranks candidates by a noisy-channel model: P(correction | query) ∝ P(query | correction) * P(correction), where P(correction) is the language model unigram/bigram probability from query logs and P(query | correction) is the confusion matrix probability of the observed typo given the intended word.
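The two stages can be sketched in a few lines of Python. This is a minimal illustration, not a production corrector: the word counts are invented, and the error model (a fixed penalty per edit) is an assumed stand-in for the confusion matrix described above.

```python
from collections import Counter

# Hypothetical query-log word counts; they stand in for the prior P(correction).
WORD_COUNTS = Counter({"price": 900, "prize": 300, "pride": 200,
                       "shoes": 800, "shows": 700, "iphone": 1500})
TOTAL = sum(WORD_COUNTS.values())


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def channel_prob(typo: str, candidate: str, per_edit: float = 0.05) -> float:
    """Assumed error model P(query | correction): each edit costs a factor of per_edit."""
    distance = levenshtein(typo, candidate)
    return 0.9 if distance == 0 else per_edit ** distance


def correct(word: str, max_distance: int = 2) -> str:
    # Candidate generation: vocabulary words within the edit-distance budget.
    candidates = [w for w in WORD_COUNTS if levenshtein(word, w) <= max_distance]
    if not candidates:
        return word  # nothing plausible, leave the term untouched
    # Candidate selection: maximize P(query | correction) * P(correction).
    return max(candidates,
               key=lambda w: channel_prob(word, w) * WORD_COUNTS[w] / TOTAL)
```

With these toy counts, `correct("pric")` selects 'price' over 'prize' and 'pride': it needs one edit instead of two and also has the largest prior.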

**Context-sensitive correction** (used by Google since 2011) uses the surrounding words to choose among candidates. An isolated 'teh' can be corrected to 'the' from word frequency alone; in 'I was shocked by teh results', the neighbors 'shocked' and 'results' further constrain the language model, so 'the' wins over other one-edit candidates such as 'ten' or 'tea'. Modern systems use BERT-based models (SpellBERT, MLM masking) that attend to full context, achieving <1% error rate on real query logs.
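The effect of context is easiest to see when a typo has two equally close corrections. The sketch below uses a smoothed bigram model rather than a BERT-style model, purely to keep the example small; the counts, the `vocab_size` value, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical bigram and unigram counts from a query log.
BIGRAMS = Counter({("nike", "shoes"): 50, ("nike", "shows"): 1,
                   ("tv", "shows"): 60, ("tv", "shoes"): 1})
UNIGRAMS = Counter({"nike": 100, "tv": 120})


def bigram_prob(prev: str, word: str, vocab_size: int = 1000) -> float:
    """P(word | prev) with add-one smoothing so unseen pairs stay non-zero."""
    return (BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + vocab_size)


def pick_in_context(prev: str, candidates: dict) -> str:
    """candidates maps each correction to its channel probability P(typo | correction)."""
    return max(candidates, key=lambda w: candidates[w] * bigram_prob(prev, w))


# 'shos' is one edit away from both words, so the channel model cannot decide.
CANDIDATES = {"shoes": 0.05, "shows": 0.05}
```

Here `pick_in_context("nike", CANDIDATES)` returns 'shoes' while `pick_in_context("tv", CANDIDATES)` returns 'shows': the preceding word breaks a tie that the error model alone cannot.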

In the noisy-channel spell correction model, what does P(query | correction) represent?

Query Expansion

Query expansion adds or substitutes terms to improve recall for queries that are too specific or use non-standard vocabulary. Approaches: (1) thesaurus-based - add synonyms from WordNet or a domain lexicon; (2) pseudo-relevance feedback (PRF) - retrieve top-k documents, extract frequent terms from them, add to query; (3) embedding-based - find lexically different terms close in embedding space. Bing's query expansion (2019 disclosure) expanded 40% of queries, improving click-through rate by 6% on long-tail queries.
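Pseudo-relevance feedback, approach (2), is the most mechanical of the three and can be sketched directly. The scorer below is a plain term-count match standing in for a real ranker such as BM25, and the documents, stopword list, and parameter defaults are assumptions chosen for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "for", "and", "with", "of", "in", "to"}

# Hypothetical mini-corpus.
DOCS = [
    "wireless headphones with noise cancelling",
    "noise cancelling headphones for travel",
    "running shoes for trail",
    "bluetooth earbuds wireless",
]


def tokenize(text: str) -> list:
    return text.lower().split()


def retrieve(query_terms: list, docs: list, k: int) -> list:
    """Toy first-pass retrieval: rank documents by query-term occurrences."""
    scored = []
    for index, doc in enumerate(docs):
        tokens = tokenize(doc)
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docs[index] for _, index in scored[:k]]


def expand_with_prf(query: str, docs: list, k: int = 2, n_terms: int = 2) -> list:
    query_terms = tokenize(query)
    top_docs = retrieve(query_terms, docs, k)  # assumed relevant, never verified
    counts = Counter(
        token
        for doc in top_docs
        for token in tokenize(doc)
        if token not in STOPWORDS and token not in query_terms
    )
    # Append the most frequent feedback terms to the original query.
    return query_terms + [term for term, _ in counts.most_common(n_terms)]
```

For the query 'headphones', the two top documents both mention 'noise' and 'cancelling', so those terms are added. The sketch also makes the weak point visible: nothing checks that the top-k documents are actually relevant.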

**Query drift** occurs when expansion terms change the query meaning. 'Apple' expanded with 'fruit, orchard, cider' would be catastrophic for device searches. Production systems use selective expansion: only expand low-frequency queries (head queries already return good results), gate expansions by query-term co-occurrence probability in click logs, and apply strict topical coherence filters. Google's Multitask Unified Model (MUM) uses semantic understanding to avoid drift.
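A selective-expansion gate of this kind might look like the following sketch. The thresholds, the shape of the frequency and co-occurrence tables, and the function name are hypothetical; a real system would estimate them from click logs.

```python
def select_expansions(query, candidates, query_freq, cooccur,
                      head_threshold=1000, min_cooccur=0.2):
    """Return only the expansion terms that pass both gates.

    query_freq: query -> number of times seen in logs (hypothetical table)
    cooccur:    (query, term) -> co-occurrence probability in click logs
    """
    # Gate 1: head queries already return good results, so skip expansion.
    if query_freq.get(query, 0) >= head_threshold:
        return []
    # Gate 2: keep terms that co-occur with the query often enough.
    return [term for term in candidates
            if cooccur.get((query, term), 0.0) >= min_cooccur]


# Hypothetical log statistics.
QUERY_FREQ = {"apple": 50000, "apple watch band": 40}
COOCCUR = {
    ("apple watch band", "strap"): 0.6,
    ("apple watch band", "fruit"): 0.01,
}
```

With these numbers, the head query 'apple' is never expanded, and 'apple watch band' gains 'strap' but not 'fruit' or 'orchard', which is the drift the gates are meant to block.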

Pseudo-relevance feedback assumes that:

Intent Classification

Intent classification determines what action the user wants to take with a query. The classic taxonomy (Broder 2002): navigational (go to a specific site: 'youtube'), informational (learn something: 'how does HNSW work'), and transactional (do something: 'buy iPhone 15'). Modern systems use fine-grained multi-label intent (40+ categories at Google, 200+ at Amazon). Intent routes the query to different ranking pipelines: navigational → direct URL redirect; transactional → product carousel + ads; informational → web results + Knowledge Panel.

| Intent Type | Example Query | Serving Strategy | Metric |
| --- | --- | --- | --- |
| Navigational | "facebook login" | Redirect to facebook.com | Navigation success rate |
| Informational | "what is HNSW" | Web results + Knowledge Panel | NDCG@10 |
| Transactional | "buy AirPods Pro" | Product listings + ads | RPM, CTR on products |
| Local | "coffee near me" | Map pack + local listings | Store visit rate |
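The routing in the table can be expressed as a classifier followed by a lookup. The keyword rules below are a deliberately crude assumption standing in for a learned multi-label classifier, and the site list and pipeline names are hypothetical.

```python
# Serving strategy per intent, following the table above.
ROUTES = {
    "navigational": "redirect",
    "informational": "web_results+knowledge_panel",
    "transactional": "product_listings+ads",
    "local": "map_pack+local_listings",
}

KNOWN_SITES = {"youtube", "facebook"}          # hypothetical navigational targets
PURCHASE_VERBS = {"buy", "order", "purchase"}  # hypothetical transactional cues


def classify_intent(query: str) -> str:
    """Rule-based stand-in for a trained intent model."""
    text = query.lower().strip()
    tokens = text.split()
    if not tokens:
        return "informational"
    if "near me" in text:
        return "local"
    if tokens[0] in PURCHASE_VERBS:
        return "transactional"
    if tokens[-1] == "login" or text in KNOWN_SITES:
        return "navigational"
    return "informational"


def route(query: str) -> str:
    """Pick the ranking pipeline before any ranking work is done."""
    return ROUTES[classify_intent(query)]
```

Because `route` runs before retrieval, each pipeline only has to rank the kind of result it was built for.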

Why is intent classification applied before ranking rather than after?

Named Entity Recognition and Linking

Entity linking identifies named entities in a query (people, places, products, companies) and maps them to entries in a knowledge base (Wikipedia, Wikidata, Google KG). The pipeline: (1) NER detects entity spans ('Apple' in 'Apple quarterly earnings'); (2) candidate generation retrieves KB entries for the surface form; (3) entity disambiguation selects the correct entry using context (Apple Inc. vs apple the fruit). Google processes 2 trillion queries per year with entity linking active on ~30% (Orr et al., 2021), enabling direct Knowledge Panel answers.
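The three pipeline steps can be sketched with a dictionary lookup and a context-overlap score. Everything in the toy knowledge base is an assumption for illustration: only Q312 is a real Wikidata ID, the other identifier is a placeholder, and a production linker would use a trained NER model and learned disambiguation rather than set overlap.

```python
# Toy knowledge base: surface form -> candidate entries with typical context words.
KB = {
    "apple": [
        {"id": "Q312", "label": "Apple Inc.",
         "context": {"earnings", "quarterly", "iphone", "ceo", "stock"}},
        {"id": "KB:apple_fruit", "label": "apple (fruit)",  # placeholder ID
         "context": {"pie", "orchard", "cider", "recipe"}},
    ],
}


def link_entities(query: str) -> list:
    tokens = query.lower().split()
    links = []
    for position, token in enumerate(tokens):
        # Steps 1 and 2: dictionary-based span detection and candidate generation.
        candidates = KB.get(token)
        if not candidates:
            continue
        # Step 3: disambiguate by overlap between the query context and each entry.
        context = set(tokens[:position] + tokens[position + 1:])
        best = max(candidates, key=lambda entry: len(entry["context"] & context))
        links.append((token, best["id"], best["label"]))
    return links
```

'Apple quarterly earnings' links to Apple Inc. because two context words match that entry, while 'apple pie recipe' links to the fruit.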

**Entity linking enables structured answers**: once 'Apple' is linked to Q312 (Apple Inc.) in Wikidata, the search engine can directly answer 'Apple CEO' by traversing the CEO relation in the knowledge graph rather than retrieving web documents. This powers Google's 'One Box' features - 40% of searches return a direct answer without a click (2023 data).
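Once the entity is resolved, answering becomes a lookup rather than a document search. The sketch below assumes a tiny in-memory graph; the relation name, the trigger-word table, and the stored value are illustrative and may be out of date.

```python
# Toy knowledge graph: (entity id, relation) -> value.
# Q312 is the Wikidata ID quoted in the text; the rest is illustrative.
KG = {("Q312", "chief_executive_officer"): "Tim Cook"}
ENTITY_IDS = {"apple": "Q312"}
RELATION_WORDS = {"ceo": "chief_executive_officer"}


def answer(query: str):
    """Return a direct answer if the query names a known entity and relation."""
    tokens = query.lower().split()
    entity = next((ENTITY_IDS[t] for t in tokens if t in ENTITY_IDS), None)
    relation = next((RELATION_WORDS[t] for t in tokens if t in RELATION_WORDS), None)
    if entity is None or relation is None:
        return None  # fall back to ordinary document retrieval
    return KG.get((entity, relation))
```

`answer("Apple CEO")` follows the edge from Q312 and returns the stored value, while a query without a recognized relation returns `None` and would go to the regular ranking pipeline.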

A common misconception: query understanding is preprocessing that can be applied independently of the retrieval model, and any combination of spell correction, expansion, and entity linking always improves search quality.

Query understanding modules interact with each other and with the retrieval model in non-trivial ways. Incorrect entity linking can corrupt expansion; over-aggressive expansion can dilute entity-specific signals. Each module must be evaluated end-to-end on the full pipeline, not in isolation.

Suppose 'apple iphone' is entity-linked to Apple Inc. and expanded with terms related to 'apple' (MacBook, Mac, iPad). If spell correction then 'corrects' the linked entity surface form to 'Apple iPhone Pro', the downstream retrieval may return Pro models only. Evaluating each module's NDCG in isolation masks this interference. Google's query understanding team runs A/B tests on the full serving pipeline, not on module-level metrics.

In the query 'mercury poisoning symptoms', entity linking to 'Mercury (planet)' or 'Freddie Mercury' would be incorrect. What mechanism prevents this error?

Key Ideas

**Spell correction** uses the noisy-channel model (P(correction) * P(query | correction)) to select the most probable intended word. Context-sensitive BERT-based correctors outperform edit-distance heuristics by using surrounding query terms.
**Query expansion** adds related terms to improve recall, but risks query drift. Production systems apply expansion selectively: only on low-frequency queries, gated by co-occurrence probability in click logs.
**Intent classification and entity linking** route queries to the right pipeline and enable structured answers. They are evaluated end-to-end in the full serving system because errors in one module propagate and amplify in downstream components.

Questions for Reflection

A user searches 'python' on a programming platform. Design a full query understanding pipeline: which intents are possible, how would spell correction interact with the programming language name, and how would entity linking disambiguate from python-the-snake? What signals would be most useful?
Pseudo-relevance feedback can reinforce its own errors: if the initial retrieval is poor, the extracted expansion terms are irrelevant, and the expanded query retrieves even worse results. How would a production system detect and break this loop?
Google's Multitask Unified Model (MUM) processes queries in 75 languages simultaneously. How does this affect the query expansion design - specifically, can expansion terms from one language's knowledge base be used to improve results in another language's index?

Related Lessons

ml-01-intro

