Information Retrieval
Search System Design
In September 2008 Bing engineers ran an internal experiment: inject an artificial 200 ms delay on every query. The result after two weeks - queries dropped by 0.4%, revenue dropped by 2.5%. After six months users were returning to Bing less often at all - the habit was gone. Google ran the analogous experiment: every 100 ms of latency cost ~0.6% queries per user. These numbers (known as 'Greg Linden's 100ms rule', originally from Amazon 2006) drive the entire architecture of modern search: shard size, replica count, tail-latency budget, hedged requests. Search system design is not about 'the right ranking algorithm', it is about disciplined achievement of sub-200 ms p99 over billion-document indexes.
- **Google** - 100K QPS at peak, 200 ms p99 SLA, fanout to hundreds of leaf shards with hedged requests; the Caffeine indexer refreshes hot content in the index in <60 seconds after publication.
- **Amazon A9** - lexical search plus learned-to-rank over 100+ signals; sponsored placements drive 10-15% of revenue but require careful balancing against organic relevance.
- **Yandex Search** - Russian morphology adds a dedicated lemmatizer in front of the inverted index; geolocation-based personalization (Moscow vs regions) matters more than for Google in the US.
Design: Web Search at Google Scale
When the interviewer asks 'design Google', most candidates open with an inverted index - and that is correct. But the details decide whether the level is passed. The real Google indexes ~400B pages, serves 100K QPS, replies under 200 ms. To meet that SLA, the system is built as three parallel pipelines: crawl (fetch pages, politeness policies, refresh), index (parse, tokenize, shard by doc-id), serve (query parsing, federation, ranking, snippet generation). The key numbers to call out in an interview: every query fans out to hundreds of shards in parallel (root -> leaf), each shard holds a slice of the inverted index in RAM, latency = max shard latency plus merge.
Tail latency: at fanout 1000 shards, even a 100 ms p99 per shard yields a final p99 of ~700 ms (probability that 'all below 100 ms' is 0.99^1000). Mitigations: hedged requests (the same query to a replica after 50 ms), spare compute capacity, and cutting off shards slower than median + 2*MAD.
With a fanout of 1000 leaf shards each at p99=100 ms, what is the final p99 of the overall response?
Design: E-commerce Search (Amazon, Wildberries)
E-commerce search differs from web search because the ranking object is a product, not a page. Relevance is mixed with business logic: stock availability, price, margin, sponsored placements. A query for 'iphone 15 pro' should not return iPhone 14 cases (lexical match on 'iphone'); it should filter by structured fields: brand=Apple, model=iPhone 15 Pro, category=Smartphone. Amazon A9 combines lexical search (BM25 over title/description) with structured filters and learned-to-rank over 100+ signals: CTR, conversion rate, return rate, price competitiveness. Wildberries and Ozon add their own twists: geolocation-aware personalization (delivery tomorrow vs in 5 days), recommendation widgets inside the search result list.
Facets and aggregations: in e-commerce the result page needs a 'filter' side panel with counts per category, color, brand. This requires aggregations over search results - extra cost on every query. Solution: pre-computed facet counts in the index, refreshed in batches.
Why is the 'in stock' filter applied to e-commerce results as a hard filter rather than softly as a ranking signal?
Typeahead and Query Suggestion
Every keystroke in Google's search box fires an API call that must answer in <50 ms. At scale that is 10x the QPS of main search. The typeahead architecture is fundamentally different: no inverted index over documents, but a Trie or FST (Finite State Transducer) over the most popular queries, weighted by frequency and time of day. Input is a prefix; output is the top-10 continuations. The key data structure is a prefix tree with an array of popular continuations at each node. A memcached layer sits in front of the trie for the hottest prefixes. Additional signals: personal history (autocompleting recent queries), trending (boost for today's spikes), safety (filter for forbidden patterns).
FST (Finite State Transducer) is more compact than a Trie thanks to suffix sharing. Lucene Solr uses FST for completion: 100 million queries fit in ~2 GB, entirely in RAM on a single machine. Lookup speed is O(prefix length), independent of dictionary size.
Why does typeahead typically use a Trie/FST over queries instead of an inverted index over documents?
Personalization Without Catastrophe
Personalization in search is a double-edged sword. Too weak and everyone sees the same thing, wasting the user signal. Too strong and a filter bubble forms: the user is never shown what lies outside their interest bubble. Google strikes a balance: personalization affects ~15-20% of positions, the rest is driven by global relevance. Personalization signals are search history (7-30 days), geolocation, interface language. Architecture: a global ranker yields top-100, then a re-ranker with user features reorders only the first 30. E-commerce is more aggressive: up to 50% of the ranking depends on the user (Amazon places 'things you usually buy' above category results). Risks: cold start (new user has no history), shadow profile (search from another device), drift (user interests change).
Two-tower model: the classic personalized ranking architecture - two independent networks, one encoding the user into an embedding, the other encoding the document. Match via dot product. The user tower refreshes often (every few minutes) on short history; the document tower rarely (hours). User embeddings cache on the edge for sub-10 ms latency.
A great search system equals a more powerful ML ranking model
A great search system equals data quality (crawl, indexing, freshness) plus honest signals plus infrastructure with predictable tail latency; ML adds ~10-15% on top of that foundation
Bing and Yandex experiments show: moving from BM25 to neural ranking yields 5-15% NDCG. The same delta comes from improving index freshness from 24h to 1h, or from fixing crawler bugs. ML is the final polish, not the substance of search.
Google limits personalization to ~20% of positions. Why not push it to 100%?
Key ideas
- **Web search at scale** runs three parallel pipelines (crawl/index/serve); the main serve pain is tail latency under hundreds-fold fanout.
- **E-commerce search** ranks products, not pages; relevance mixes with availability, price, sponsored placements; hard filter for in-stock.
- **Typeahead** uses a separate architecture: Trie/FST over popular queries in RAM, not an inverted index over documents; sub-50 ms SLA.
- **Personalization** is a double-edged sword: weak is useless, strong creates filter bubbles; the sensible balance is 15-30% of positions.
Related topics
Search system design weaves all earlier IR topics into one infrastructure:
- Search Infrastructure at Scale — Tiering, caching, and federation are the bricks; here they assemble into a product-level Google or Amazon design
- Learning to Rank — LTR models are the final stage of the serve pipeline; their role is re-ranking top-K after lexical retrieval, not primary retrieval
Вопросы для размышления
- Back to the hook: 100 ms of latency = -0.6% queries. Which architectural trade-offs are justified to keep p99 < 200 ms, and which are not?
- Cold start for a new user in personalized search - how should explore (suggest something new) and exploit (show something relevant from sparse signals) be balanced?
- If the interviewer asked to design search for a video platform (TikTok, YouTube), which components from this lesson survive and which need fundamentally different solutions?
Связанные уроки
- ir-12 — System design builds on scalable search infrastructure
- ir-04 — Learning to Rank is the core ranking component in design
- ir-07 — Vector DB integrates into semantic search architecture
- ir-08 — Query understanding is the first layer of the search pipeline
- ds-02-cap-theorem — Precision/recall/latency trilemma mirrors CAP theorem
- net-37-load-balancing — Search load balancing is a special case of general LB
- dist-12-consistency