Information Retrieval
Query Understanding
Users type 'iphone 15 pric', 'best headfones under 100', and 'where to buy nike shos cheap'. A retrieval system that takes queries literally returns zero results or completely irrelevant documents for all three. Query understanding is the layer that transforms messy human language into structured retrieval signals - it is the difference between a search engine and a text-matching system.
- **Google** reports that 15% of its daily queries are ones it has never seen before (2023). Query understanding - including spell correction, synonym expansion, and entity disambiguation - handles this by mapping novel surface forms to known concepts. Without it, queries for a newly released product would return near-zero results until the index caught up.
- **Amazon** classifies each query against a taxonomy of 200+ fine-grained intents. 'Batteries' classified as 'replenishment' (AA/AAA batteries for remotes) returns different results than 'batteries' classified as 'accessory' (phone battery packs). This routing alone contributes $1B+ in annual GMV according to Amazon's 2022 re:Invent talk.
- **LinkedIn Search** uses entity linking to normalize company names: 'Google', 'Alphabet', 'GOOGL', and 'Google Inc.' all resolve to the same company entity, ensuring that a recruiter searching 'Google engineers' finds profiles mentioning any form of the name across 930 million member profiles.
Spell Correction and Normalization
Search queries contain 10-15% misspelled terms (Google's internal data). Spell correction operates in two stages: candidate generation and candidate selection. Candidates are generated by edit distance (Levenshtein <= 2) or phonetic matching (Soundex, Metaphone). Selection ranks candidates by a noisy-channel model: P(correction | query) ∝ P(query | correction) * P(correction), where P(correction) is the language model unigram/bigram probability from query logs and P(query | correction) is the confusion matrix probability of the observed typo given the intended word.
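The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production corrector: the `UNIGRAM_COUNTS` table is a made-up stand-in for a query-log language model, and the constant `EDIT_PROB` is an assumed per-edit typo probability that replaces the learned confusion matrix.

```python
import math
from collections import Counter

# Hypothetical unigram counts standing in for a language model built from query logs.
UNIGRAM_COUNTS = Counter({
    "price": 900, "pride": 60, "prick": 5, "pic": 300,
    "headphones": 500, "shoes": 700, "shows": 650,
})
TOTAL = sum(UNIGRAM_COUNTS.values())

# Assumed probability of a single character edit. A real system would replace
# this constant with a confusion matrix estimated from observed typos.
EDIT_PROB = 0.01


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct(token: str, max_distance: int = 2) -> str:
    """Pick the candidate maximizing log P(correction) + log P(token | correction)."""
    if token in UNIGRAM_COUNTS:
        return token  # known word: leave untouched
    best, best_score = token, -math.inf
    for candidate, count in UNIGRAM_COUNTS.items():
        distance = levenshtein(token, candidate)
        if distance > max_distance:
            continue  # candidate generation: keep only Levenshtein <= 2
        log_prior = math.log(count / TOTAL)            # P(correction)
        log_channel = distance * math.log(EDIT_PROB)   # P(token | correction)
        score = log_prior + log_channel
        if score > best_score:
            best, best_score = candidate, score
    return best
```

With these toy counts, `correct("pric")` selects 'price' over 'prick' and 'pic': all three are one edit away, so the language-model prior decides. A token with no candidate within two edits is returned unchanged.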
**Context-sensitive correction** (used by Google since 2011) uses the surrounding words to disambiguate. An isolated-word corrector can map 'teh' to 'the' from frequency alone, but in 'I was shocked by teh results' the neighboring words 'shocked' and 'results' further constrain the language model and confirm 'the' over other close candidates such as 'tea' or 'ten'. Modern systems use BERT-based models (SpellBERT, MLM masking) that attend to full context, achieving <1% error rate on real query logs.
Query Expansion
Query expansion adds or substitutes terms to improve recall for queries that are too specific or use non-standard vocabulary. Approaches: (1) thesaurus-based - add synonyms from WordNet or a domain lexicon; (2) pseudo-relevance feedback (PRF) - retrieve top-k documents, extract frequent terms from them, add to query; (3) embedding-based - find lexically different terms close in embedding space. Bing's query expansion (2019 disclosure) expanded 40% of queries, improving click-through rate by 6% on long-tail queries.
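Pseudo-relevance feedback, approach (2) above, can be sketched as follows. The corpus, the stopword list, and the overlap-count retriever are all illustrative assumptions; a real system would use BM25 or a dense retriever for the first pass and a weighting scheme such as TF-IDF for term selection.

```python
from collections import Counter

# Hypothetical stopword list and toy corpus, for illustration only.
STOPWORDS = {"the", "a", "for", "and", "of", "with", "to", "in"}
CORPUS = [
    "wireless headphones with noise cancelling",
    "bluetooth headphones wireless earbuds",
    "running shoes for trail",
    "noise cancelling wireless earbuds",
]


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def retrieve(query_terms: list[str], corpus: list[str], k: int) -> list[str]:
    """Toy first-pass retrieval: rank documents by distinct query terms matched."""
    wanted = set(query_terms)
    scored = []
    for doc in corpus:
        overlap = len(wanted & set(tokenize(doc)))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps corpus order on ties
    return [doc for _, doc in scored[:k]]


def expand_with_prf(query: str, corpus: list[str], k: int = 2, n_terms: int = 1) -> list[str]:
    """Treat the top-k documents as relevant and add their most frequent new terms."""
    terms = tokenize(query)
    feedback_docs = retrieve(terms, corpus, k)
    counts = Counter(
        tok
        for doc in feedback_docs
        for tok in tokenize(doc)
        if tok not in STOPWORDS and tok not in terms
    )
    return terms + [tok for tok, _ in counts.most_common(n_terms)]
```

Here `expand_with_prf("headphones", CORPUS)` adds 'wireless', the only term that appears in both feedback documents. The sketch also makes the method's core assumption visible: it trusts the top-k first-pass results to be relevant, so a poor first pass yields poor expansion terms.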
**Query drift** occurs when expansion terms change the query meaning. 'Apple' expanded with 'fruit, orchard, cider' would be catastrophic for device searches. Production systems use selective expansion: only expand low-frequency queries (head queries already return good results), gate expansions by query-term co-occurrence probability in click logs, and apply strict topical coherence filters. Google's Multitask Unified Model (MUM) uses semantic understanding to avoid drift.
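The first two gates of selective expansion can be expressed as a small filter. The function name, the thresholds, and the two lookup tables below are hypothetical; in practice both tables would be derived from query and click logs, and a topical coherence filter would follow.

```python
def select_expansions(
    query: str,
    candidates: list[str],
    query_frequency: dict[str, int],
    cooccurrence: dict[tuple[str, str], float],
    head_threshold: int = 1000,      # assumed cutoff between head and tail queries
    min_cooccurrence: float = 0.2,   # assumed minimum click-log co-occurrence probability
) -> list[str]:
    """Keep only expansion terms that pass the frequency and co-occurrence gates."""
    # Gate 1: head queries already return good results, so leave them alone.
    if query_frequency.get(query, 0) >= head_threshold:
        return []
    # Gate 2: keep a candidate only if users who issued the query also
    # clicked results containing it often enough.
    return [
        term
        for term in candidates
        if cooccurrence.get((query, term), 0.0) >= min_cooccurrence
    ]


# Hypothetical log statistics.
QUERY_FREQUENCY = {"apple": 50_000, "apple m3 laptop sleeve": 12}
COOCCURRENCE = {
    ("apple m3 laptop sleeve", "macbook"): 0.6,
    ("apple m3 laptop sleeve", "orchard"): 0.0,
}
```

With these numbers, the head query 'apple' is never expanded, while the tail query 'apple m3 laptop sleeve' accepts 'macbook' and rejects 'orchard'.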
Intent Classification
Intent classification determines what action the user wants to take with a query. The classic taxonomy (Broder 2002): navigational (go to a specific site: 'youtube'), informational (learn something: 'how does HNSW work'), and transactional (do something: 'buy iPhone 15'). Modern systems use fine-grained multi-label intent (40+ categories at Google, 200+ at Amazon). Intent routes the query to different ranking pipelines: navigational → direct URL redirect; transactional → product carousel + ads; informational → web results + Knowledge Panel.
| Intent Type | Example Query | Serving Strategy | Metric |
|---|---|---|---|
| Navigational | "facebook login" | Redirect to facebook.com | Navigation success rate |
| Informational | "what is HNSW" | Web results + Knowledge Panel | NDCG@10 |
| Transactional | "buy AirPods Pro" | Product listings + ads | RPM, CTR on products |
| Local | "coffee near me" | Map pack + local listings | Store visit rate |
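The routing in the table can be sketched as a classifier followed by a lookup. The keyword rules below are a deliberately crude, hypothetical stand-in for a trained multi-label intent model, and the strategy strings simply mirror the table's Serving Strategy column.

```python
from enum import Enum


class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"
    LOCAL = "local"


# Serving strategies copied from the table above.
SERVING_STRATEGY = {
    Intent.NAVIGATIONAL: "redirect",
    Intent.INFORMATIONAL: "web results + Knowledge Panel",
    Intent.TRANSACTIONAL: "product listings + ads",
    Intent.LOCAL: "map pack + local listings",
}

# Hypothetical cue lists; a real classifier would be learned from labeled queries.
TRANSACTIONAL_CUES = {"buy", "order", "purchase", "cheap", "price"}
KNOWN_SITES = {"facebook", "youtube"}


def classify_intent(query: str) -> Intent:
    """Rule-based stand-in for an intent model; rules are checked in priority order."""
    text = query.lower()
    tokens = set(text.split())
    if "near me" in text:
        return Intent.LOCAL
    if tokens & TRANSACTIONAL_CUES:
        return Intent.TRANSACTIONAL
    if tokens & KNOWN_SITES:
        return Intent.NAVIGATIONAL
    return Intent.INFORMATIONAL  # default when no cue fires


def route(query: str) -> str:
    """Choose the serving pipeline before any ranking happens."""
    return SERVING_STRATEGY[classify_intent(query)]
```

Because `route` selects the pipeline, intent has to be known before ranking runs: each pipeline retrieves and scores its own kind of candidate, so a ranker cannot be chosen after the fact.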
Named Entity Recognition and Linking
Entity linking identifies named entities in a query (people, places, products, companies) and maps them to entries in a knowledge base (Wikipedia, Wikidata, Google KG). The pipeline: (1) NER detects entity spans ('Apple' in 'Apple quarterly earnings'); (2) candidate generation retrieves KB entries for the surface form; (3) entity disambiguation selects the correct entry using context (Apple Inc. vs apple the fruit). Google processes 2 trillion queries per year with entity linking active on ~30% (Orr et al., 2021), enabling direct Knowledge Panel answers.
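The three pipeline steps can be sketched with a toy knowledge base. Everything in `KB` is an illustrative assumption: the alias lookup stands in for a trained NER model, the hand-written context words stand in for learned entity representations, and `apple_fruit` is a placeholder identifier rather than a real knowledge-base ID.

```python
# Toy knowledge base. "Q312" is used for Apple Inc.; "apple_fruit" is a placeholder ID.
KB = {
    "Q312": {
        "name": "Apple Inc.",
        "aliases": {"apple"},
        "context": {"iphone", "earnings", "quarterly", "ceo", "stock", "mac"},
    },
    "apple_fruit": {
        "name": "apple (fruit)",
        "aliases": {"apple"},
        "context": {"pie", "recipe", "orchard", "cider", "tree"},
    },
}


def link_entities(query: str) -> dict[str, str]:
    """Map each recognized surface form in the query to a knowledge-base ID."""
    tokens = query.lower().split()
    links = {}
    for token in tokens:
        # Steps 1-2: span detection and candidate generation via alias lookup
        # (single-token aliases only, a stand-in for real NER).
        candidates = [eid for eid, entry in KB.items() if token in entry["aliases"]]
        if not candidates:
            continue
        # Step 3: disambiguate by overlap between the remaining query words and
        # each candidate's context words. On a tie, max() keeps the first
        # candidate; a real system would fall back to a popularity prior.
        context = set(tokens) - {token}
        links[token] = max(
            candidates, key=lambda eid: len(KB[eid]["context"] & context)
        )
    return links
```

For example, `link_entities("Apple quarterly earnings")` links 'apple' to `Q312`, while `link_entities("apple pie recipe")` links it to the fruit entry, because the context words differ.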
**Entity linking enables structured answers**: once 'Apple' is linked to Q312 (Apple Inc.) in Wikidata, the search engine can directly answer 'Apple CEO' by traversing the CEO relation in the knowledge graph rather than retrieving web documents. This powers Google's 'One Box' features - 40% of searches return a direct answer without a click (2023 data).
A common misconception is that query understanding is preprocessing that can be applied independently of the retrieval model, so that any combination of spell correction, expansion, and entity linking always improves search quality.
In practice, query understanding modules interact with each other and with the retrieval model in non-trivial ways. Incorrect entity linking can corrupt expansion; over-aggressive expansion can dilute entity-specific signals. Each module must be evaluated end-to-end on the full pipeline, not in isolation.
Suppose 'apple iphone' is entity-linked to Apple Inc. and expanded with synonyms of 'apple' (MacBook, Mac, iPad). If spell correction then 'corrects' the linked entity surface to 'Apple iPhone Pro', the downstream retrieval may return Pro models only. Evaluating each module's NDCG in isolation masks this interference. Google's query understanding team runs A/B tests on the full serving pipeline, not on module-level metrics.
Key Ideas
- **Spell correction** uses the noisy-channel model (P(correction) * P(query | correction)) to select the most probable intended word. Context-sensitive BERT-based correctors outperform edit-distance heuristics by using surrounding query terms.
- **Query expansion** adds related terms to improve recall, but risks query drift. Production systems apply expansion selectively: only on low-frequency queries, gated by co-occurrence probability in click logs.
- **Intent classification and entity linking** route queries to the right pipeline and enable structured answers. They are evaluated end-to-end in the full serving system because errors in one module propagate and amplify in downstream components.
Related Topics
Query understanding is the entry point to the retrieval pipeline:
- Vector Databases: After query understanding, the processed query is embedded into a vector and sent to the ANN index for dense retrieval. Entity linking can produce multiple query vectors (one per detected entity) for a multi-vector search strategy.
- Learning to Rank: Intent classification determines which ranking model is applied. A transactional query uses a revenue-optimized ranker; an informational query uses a quality-and-freshness ranker. The same candidate set is ranked differently based on intent signals.
Questions for Reflection
- A user searches 'python' on a programming platform. Design a full query understanding pipeline: which intents are possible, how would spell correction interact with the programming language name, and how would entity linking disambiguate from python-the-snake? What signals would be most useful?
- Pseudo-relevance feedback can cause a negative feedback loop: if the initial retrieval is poor, the extracted expansion terms are irrelevant, further degrading recall. How would a production system detect and break this loop?
- Google's Multitask Unified Model (MUM) processes queries in 75 languages simultaneously. How does this affect the query expansion design - specifically, can expansion terms from one language's knowledge base be used to improve results in another language's index?