AI Engineering
Advanced RAG: hybrid search, re-ranking, query expansion, self-RAG
Lesson Objectives
- Understand why naive RAG fails on real data and how to diagnose the problem
- Implement hybrid search (BM25 + dense vectors) with Reciprocal Rank Fusion
- Integrate cross-encoder reranking via Cohere or sentence-transformers
- Apply query transformation: HyDE (Gao 2022), multi-query, step-back prompting
- Master self-RAG and contextual compression for high-stakes domains
Prerequisites
- RAG pipeline, pgvector, embeddings
Basic RAG finds similar text. Advanced RAG finds the right answer. The gap - precision@5: naive 60-70%, with reranking + hybrid search 85-92%. That's the boundary between "it works" and "we can sell this". Notion AI switched from naive RAG to hybrid + reranking and cut hallucination rate by 40%. Perplexity built their entire search on these same principles - 100M+ queries a month.
- Perplexity AI - hybrid search + re-ranking as the core of their search technology, 100M+ queries/month
- Notion AI - moving from naive RAG to hybrid + reranking cut hallucination rate by 40%
- Bloomberg GPT - domain-specific re-ranking for financial documents with specialized terminology
- Abridge and Nabla (medical AI) - self-RAG to verify every claim before showing it to a physician
- Anthropic Contextual Retrieval (2024) - BM25 + embeddings + contextual chunk summaries, combined with reranking, cuts retrieval failures by 67%
From Lewis 2020 to Contextual Retrieval 2024
**Lewis et al. 2020 (Facebook AI Research)** - the original RAG paper: DPR encoder + BART generator, the first demonstration that retrieval + generation outperforms a pure LM on knowledge-intensive tasks. **Khattab & Zaharia 2020 - ColBERT**: late interaction - instead of one embedding per document, a matrix of token embeddings with MaxSim scoring. More accurate than a bi-encoder, faster than a cross-encoder. **Gao et al. 2022 - HyDE**: hypothetical document embedding as a pre-retrieval transformation, +5-10% recall on the BEIR benchmark. **Anthropic Contextual Retrieval 2024**: BM25 + embeddings + LLM-generated chunk context (each chunk is prefixed with a short description of where it sits in the document) - 49% fewer failed retrievals, 67% with reranking on top.
Where Naive RAG Breaks Down
The naive RAG demo shows 95% accuracy - 10 test questions, clean chunks, friendly queries. Production shows 55% on real users. First instinct: upgrade the model. Wrong. GPT-4 won't help if the context was never retrieved. The problem is always in retrieval.
| Scenario | Naive RAG Problem | Example |
|---|---|---|
| Keyword mismatch | Vector search can't find exact terms | Query "payment error 500" - the document contains "PaymentGatewayException", embeddings don't match |
| Complex question | A single embedding can't express a multi-hop query | "How are rate limits related to API cost?" - needs chunks from different sections |
| Imprecise retrieval | Top-5 chunks contain 2 relevant and 3 noisy ones | The LLM gets distracted by irrelevant chunks and gives a vague answer |
| Ambiguous query | The user asks imprecisely | "How to set up authentication?" - OAuth? JWT? API keys? RBAC? |
Advanced RAG is three layers of surgery around the search. Not a replacement - an upgrade:
- **Pre-retrieval** - improve the query before searching (query transformation, HyDE by Gao et al. 2022)
- **Retrieval** - improve the search itself (BM25 + dense hybrid, multi-query with RRF)
- **Post-retrieval** - filter results after searching (cross-encoder reranking, contextual compression)
Each layer is added independently. Hybrid search - always. Re-ranking - when precision matters more than latency. HyDE - when queries are short and ambiguous. Stacking layers compounds the gain: Anthropic's Contextual Retrieval (2024) pairs contextualized chunks with BM25 + embeddings and reports 49% fewer retrieval failures, 67% once reranking is added.
A user asks "deployment error", but the documentation describes the issue as "DeploymentFailedException in CI/CD pipeline". Naive RAG doesn't find the answer. What's the problem?
Hybrid Search: BM25 + dense vectors
Two worlds - two failures in isolation. Vector search finds "how to reduce latency" near "latency optimization" - semantics works. But ask "error E-4012" - the embedding has no idea what that code is. BM25 (TF-IDF on steroids) finds E-4012 instantly, but won't understand that "reduce latency" and "optimize latency" mean the same thing.
| Method | Strengths | Weaknesses |
|---|---|---|
| Vector (semantic) | "How to reduce latency?" → finds "latency optimization" | Misses exact terms: "error E-4012" |
| BM25 (keyword) | Exact match: "E-4012" → finds the document with that code | Doesn't understand synonyms: "reduce latency" ≠ "optimize latency" |
| Hybrid (BM25 + dense) | Combines: semantics + keywords | Requires weight tuning (alpha) |
**Reciprocal Rank Fusion (RRF)** bridges both worlds - it doesn't average scores (they're incomparable), it merges rank lists. A document ranked 1st in BM25 and 3rd in vector search scores higher than one ranked 10th in both:
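The rank-merging idea can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name, the document ids, and the constant `k=60` (a commonly used default, treated here as tunable) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document earns 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Raw scores from the retrievers are never used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# "A" is 1st in BM25 and 3rd in vector search; "B" is 10th in both lists.
bm25_ranking = ["A"] + [f"kw{i}" for i in range(8)] + ["B"]
vector_ranking = ["vec0", "vec1", "A"] + [f"vec{i}" for i in range(2, 8)] + ["B"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused[:2])  # "A" comes first, then "B"
```

Because only ranks enter the formula, BM25 scores and cosine similarities never have to be normalized against each other.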
Implementing hybrid search with pgvector + PostgreSQL full-text search:
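A possible shape for such a query is sketched below. Everything schema-related is hypothetical: a `chunks` table with an `embedding vector` column and a `tsv tsvector` column, and a psycopg-style connection using named placeholders. Note also that PostgreSQL's built-in `ts_rank_cd` is not BM25; it stands in for the keyword side here, and a true BM25 score would need an extension or an external engine.

```python
# Hypothetical schema: chunks(id, content, embedding vector, tsv tsvector).
HYBRID_SQL = """
WITH semantic AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(vec)s::vector) AS rank
    FROM chunks
    ORDER BY embedding <=> %(vec)s::vector
    LIMIT %(candidates)s
),
keyword AS (
    SELECT c.id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(c.tsv, q) DESC) AS rank
    FROM chunks c, plainto_tsquery('english', %(text)s) AS q
    WHERE c.tsv @@ q
    ORDER BY ts_rank_cd(c.tsv, q) DESC
    LIMIT %(candidates)s
)
SELECT COALESCE(s.id, k.id) AS id,
       %(alpha)s * COALESCE(1.0 / (%(rrf_k)s + s.rank), 0.0)
       + (1.0 - %(alpha)s) * COALESCE(1.0 / (%(rrf_k)s + k.rank), 0.0) AS score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY score DESC
LIMIT %(limit)s
"""

def hybrid_search(conn, query_text, query_embedding,
                  alpha=0.5, limit=5, candidates=20, rrf_k=60):
    """Run weighted rank fusion in SQL; alpha is the weight of the vector side."""
    # pgvector accepts a '[x,y,...]' text literal cast to ::vector.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    params = {
        "vec": vec_literal,
        "text": query_text,
        "alpha": alpha,
        "rrf_k": rrf_k,
        "candidates": candidates,
        "limit": limit,
    }
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, params)
        return cur.fetchall()
```

The `FULL OUTER JOIN` keeps documents found by only one of the two searches; `COALESCE` gives the missing side a contribution of zero.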
The **alpha** parameter is tuned empirically. For technical documentation (lots of error codes, API names) - alpha 0.3-0.4 (more weight on keywords). For general questions - alpha 0.6-0.7 (more weight on semantic). Perplexity uses a similar balance - it's the core of their search technology.
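To make the effect of alpha concrete, here is a small sketch of weighted rank fusion. It assumes alpha is the weight of the vector side, which matches the ranges above (lower alpha means more keyword weight); the function and document names are made up for illustration.

```python
def weighted_rrf(vector_ranking, keyword_ranking, alpha, k=60):
    """Fuse two ranked lists; alpha weights the vector list, 1 - alpha the keyword list."""
    scores = {}
    for rank, doc_id in enumerate(vector_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(keyword_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - alpha) / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_ranking = ["latency_guide", "tuning_faq", "error_E4012"]
keyword_ranking = ["error_E4012", "changelog", "latency_guide"]

keyword_heavy = weighted_rrf(vector_ranking, keyword_ranking, alpha=0.3)
semantic_heavy = weighted_rrf(vector_ranking, keyword_ranking, alpha=0.7)
print(keyword_heavy[0])   # error_E4012
print(semantic_heavy[0])  # latency_guide
```

The same two candidate lists produce a different winner depending on alpha, which is why the value is tuned against a labeled query set rather than guessed.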
More chunks = better recall
More chunks = worse precision, an overloaded reranker, and more noise in the context
Fetching top-100 instead of top-20 does technically improve recall - the right document will almost certainly make the cut. But the cross-encoder reranker does a forward pass for every (query, doc) pair: 100 pairs instead of 20 means +400ms latency. More importantly, the LLM receives a noisy context - 80 irrelevant chunks mixed with 20 useful ones - and starts hallucinating or giving vague answers. The sweet spot: retrieve 15-25 candidates, rerank to 5-7, pass no more than 5 to the context.
In hybrid search, alpha = 0.3 means...
Re-ranking: Cross-encoder and Cohere Rerank
Hybrid search returned the top-20 documents. But the ranking is still rough - cosine similarity between a query embedding and a document embedding doesn't capture subtle semantic relationships. A bi-encoder encodes query and doc independently: they never "see" each other before comparison.
A cross-encoder is a different architecture entirely. Query and document are fed together as a single input, the model runs full attention between every token in the query and every token in the document. Slower - but precise:
| | Bi-encoder (embedding) | Cross-encoder (reranker) |
|---|---|---|
| How it works | Query and doc are encoded separately, compared via cosine similarity | Query and doc are fed together, model outputs a relevance score |
| Speed | Fast - doc is already encoded, comparison is instant | Slow - needs a forward pass for each pair |
| Accuracy | Good | Significantly better - the model sees both texts simultaneously |
| Scale | Millions of documents | Top 20-50 candidates (post-retrieval) |
The **retrieve-and-rerank** pattern is the production RAG standard. Cast a wide net at retrieval, apply a precise filter at reranking:
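A minimal sketch of the rerank step follows. The scorer is injected so the example runs without any model; `toy_overlap_scorer` is a stand-in invented for this illustration, and the commented lines show how a sentence-transformers `CrossEncoder` could be plugged in instead.

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Reorder candidates by pairwise relevance and keep the best top_k.

    score_pairs takes a list of (query, document) pairs and returns one
    score per pair, the same shape as CrossEncoder.predict.
    """
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def toy_overlap_scorer(pairs):
    """Stand-in scorer: fraction of query words that appear in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / max(len(q_words), 1))
    return scores

candidates = [
    "Billing overview and invoices",
    "How to rotate API keys safely",
    "API keys: create, rotate and revoke",
]
top = rerank("rotate api keys", candidates, toy_overlap_scorer, top_k=2)
print(top)

# With sentence-transformers installed, a real cross-encoder can replace the toy scorer:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-v2-m3")
#   top = rerank("rotate api keys", candidates, model.predict, top_k=2)
```

Keeping the scorer behind a plain callable also makes it easy to swap a self-hosted model for a hosted reranking API later.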
Cohere Rerank - 1 dollar per 1000 requests (each up to 100 documents). For a self-hosted option - **bge-reranker-v2-m3** by BAAI (sentence-transformers), runs on GPU via Hugging Face Inference. On CPU with 20 documents - around 300ms.
Why is a cross-encoder more accurate than a bi-encoder for ranking?
Query Transformation: HyDE, Multi-Query, Step-back
A user types "why doesn't the deployment work" - four words. The embedding of those four words is a single point in vector space, small and poorly oriented. The actual answer lives in a multi-page document with rich vocabulary. The distance between the question-point and the document-cloud is enormous. Query transformation moves the point before the search begins.
HyDE - Hypothetical Document Embedding
The idea from Gao et al. (2022): don't search by the query embedding - generate a **hypothetical answer** and search by its embedding instead. A hypothetical answer is written in the style of documentation, rich in terminology, long - its embedding sits much closer to real documents in vector space.
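The sketch below shows the mechanics with toy parts: a bag-of-words `embed` function instead of a real embedding model, and a `fake_llm` that returns a canned hypothetical answer. Both are assumptions made so the example runs on its own; only the control flow (generate, embed the generated text, search with that embedding) reflects HyDE.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_search(query, generate_hypothetical, docs, top_k=1):
    """Search by the embedding of a generated answer instead of the raw query."""
    hypothetical = generate_hypothetical(query)
    query_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]

def fake_llm(query):
    # Canned output standing in for an LLM call.
    return ("deployment fails when the ci/cd pipeline raises "
            "DeploymentFailedException during rollout")

docs = [
    "DeploymentFailedException in ci/cd pipeline: rollout aborted, check runner logs",
    "Why teams adopt feature flags",
]
query = "why doesn't the deployment work"

print(hyde_search(query, lambda q: q, docs))  # raw query: picks the feature-flags doc
print(hyde_search(query, fake_llm, docs))     # HyDE: picks the exception doc
```

The hypothetical answer does not need to be correct; it only needs to use the vocabulary of the documents it is meant to find.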
Multi-Query - Split the Question into Sub-queries
Step-back Prompting - Generalize the Question
Instead of the specific "why doesn't NestJS middleware catch async errors?" - first ask "how does error handling work in NestJS middleware?". A more general query finds foundational documents that contain the specific answer as a special case.
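One way to wire this up is to search with both the original and the generalized question and merge the hits. The prompt wording, `fake_llm`, and `fake_search` below are illustrative assumptions; in a real pipeline `llm` would call a model and `search` would hit the retriever.

```python
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the underlying "
    "principle. Return only the rewritten question.\n\nQuestion: {question}"
)

def step_back_retrieve(question, llm, search, top_k=3):
    """Retrieve for the specific question and for its step-back version."""
    general_question = llm(STEP_BACK_PROMPT.format(question=question)).strip()
    seen, merged = set(), []
    for q in (question, general_question):
        for doc in search(q, top_k):
            if doc not in seen:  # keep first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return general_question, merged

def fake_llm(prompt):
    return "How does error handling work in NestJS middleware?\n"

INDEX = {
    "async": ["Troubleshooting: unhandled promise rejections"],
    "handling": ["NestJS error handling: exception filters and middleware"],
}

def fake_search(query, top_k):
    hits = []
    for keyword, docs in INDEX.items():
        if keyword in query.lower():
            hits.extend(docs)
    return hits[:top_k]

question = "Why doesn't NestJS middleware catch async errors?"
general, found = step_back_retrieve(question, fake_llm, fake_search)
print(general)
print(found)
```

Keeping the original question in the search preserves hits on the specific symptom, while the step-back question adds the foundational document.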
HyDE excels at Q&A over documentation - short questions, long documents. Multi-Query - for complex analytical questions with multiple aspects. Step-back - for "why doesn't X work?" where the cause is a principle, not a detail. The three can be combined in one pipeline, since each targets a different kind of query.
HyDE (Hypothetical Document Embedding) searches by the embedding of...
Self-RAG and Contextual Compression
What if the user asks "what is 2+2?" - why run retrieval at all? And if 5 chunks were found but only 1 is actually relevant - why send all five to the context? **Self-RAG** is a pattern where the model makes both decisions itself: whether to search, and what to use from what was found.
Self-RAG: The Model as Critic
- LLM receives the question and decides: is retrieval needed? (some questions the model already knows)
- If yes - retrieval, get chunks
- LLM evaluates each chunk: relevant? Useful for the answer?
- Generates an answer based only on approved chunks
- LLM evaluates the final answer: is it grounded in context? Useful to the user?
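The five steps above can be sketched as a control loop in which every judgment is a callable. In practice each callable would be an LLM prompt; the lambdas below are toy stand-ins, and all parameter names are assumptions made for this illustration.

```python
FALLBACK = "I could not find a grounded answer in the available documents."

def self_rag(question, needs_retrieval, retrieve, is_relevant,
             generate, is_grounded, max_attempts=2):
    """Decide whether to retrieve, filter chunks, generate, then check the answer."""
    # Step 1: skip retrieval when the model can answer on its own.
    if not needs_retrieval(question):
        return generate(question, [])
    # Steps 2-3: retrieve, then keep only chunks judged relevant.
    approved = [c for c in retrieve(question) if is_relevant(question, c)]
    if not approved:
        return FALLBACK
    # Steps 4-5: generate from approved chunks and verify grounding.
    # A retry only helps when generate is non-deterministic (e.g. sampled LLM output).
    for _ in range(max_attempts):
        answer = generate(question, approved)
        if is_grounded(answer, approved):
            return answer
    return FALLBACK

KNOWLEDGE = [
    "Refunds are processed within 5 business days.",
    "Our office is in Berlin.",
]

answer = self_rag(
    "How long do refunds take?",
    needs_retrieval=lambda q: "refund" in q.lower(),
    retrieve=lambda q: KNOWLEDGE,
    is_relevant=lambda q, chunk: "refund" in chunk.lower(),
    generate=lambda q, chunks: " ".join(chunks) if chunks else "No context needed.",
    is_grounded=lambda ans, chunks: any(ans in c or c in ans for c in chunks),
)
print(answer)
```

Returning an explicit fallback instead of an ungrounded answer is the design choice that matters in high-stakes domains.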
Contextual Compression
A retrieved chunk is 500 tokens - but only 2 of its 20 sentences actually answer the question. The other 18 are noise, consuming context window and distracting the model. **Contextual compression** (popularized by LangChain) extracts only what matters.
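A minimal extractive version of this idea: split the chunk into sentences and keep only those a relevance check approves. Here the check is a naive keyword overlap (`keyword_relevance`, a hypothetical helper); in a real pipeline it would typically be an LLM or a small classifier.

```python
import re

def compress_chunk(question, chunk, is_relevant, max_sentences=2):
    """Keep only the sentences of a chunk that are relevant to the question."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if is_relevant(question, s)]
    return " ".join(kept[:max_sentences])

def keyword_relevance(question, sentence):
    """Toy check: does the sentence share any longer word with the question?"""
    q_terms = {w for w in re.findall(r"\w+", question.lower()) if len(w) > 3}
    s_terms = set(re.findall(r"\w+", sentence.lower()))
    return bool(q_terms & s_terms)

chunk = (
    "Our API was launched in 2019. "
    "Rate limits are 100 requests per minute per key. "
    "The team is based in Berlin. "
    "Exceeding the limit returns HTTP 429."
)
compressed = compress_chunk("What are the rate limits?", chunk, keyword_relevance)
print(compressed)
```

The toy check misses the last sentence because "limit" and "limits" differ, which illustrates why exact keyword matching is too brittle for the relevance step in production.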
Self-RAG and compression add extra LLM calls. In production that's +200-500ms latency and +0.001-0.01 dollars per query. Use when accuracy matters more than speed - medical, legal, financial chatbots. Abridge and Nabla apply exactly this to verify every claim before showing it to a physician.
Contextual compression in RAG solves the problem of...
RAG Fusion and Parent Document Retriever
RAG Fusion
RAG Fusion combines Multi-Query and RRF into a single pipeline. Multiple query variations are generated, each searches independently, and results are merged via Reciprocal Rank Fusion. The insight: one query formulation is one point in vector space. Four formulations are four points - covering far more territory.
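A compact sketch of that pipeline, with the query generator and the retriever injected as callables. The canned variants and search results are invented for the example; a real `generate_variants` would call an LLM and `search` would query the index.

```python
from collections import defaultdict

def rag_fusion(question, generate_variants, search, top_k=5, rrf_k=60):
    """Search with the original question plus generated variants, merge with RRF."""
    queries = [question] + list(generate_variants(question))
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(search(q), start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]

# Canned retrieval results per query formulation (illustrative only).
RESULTS = {
    "how do rate limits affect api cost": ["pricing", "rate_limits"],
    "api rate limit tiers": ["rate_limits", "quotas"],
    "api pricing per request": ["pricing", "billing"],
    "cost of exceeding rate limits": ["rate_limits", "overage", "pricing"],
}

def fake_variants(question):
    return [q for q in RESULTS if q != question]

def fake_search(query):
    return RESULTS.get(query, [])

best = rag_fusion("how do rate limits affect api cost", fake_variants, fake_search, top_k=2)
print(best)  # ['rate_limits', 'pricing']
```

Documents that surface under several formulations accumulate score, so a chunk that is only second-best for each query can still end up first overall.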
Parent Document Retriever
Small chunks give precise retrieval, poor context. Large chunks give poor retrieval, good context. A classic trade-off. **Parent Document Retriever** sidesteps it: it stores small chunks for search (128 tokens) and large parent chunks for context (1024 tokens). It searches the small and returns the large.
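An in-memory sketch of the search-small, return-large idea. Scoring is a toy token-overlap count and the child size is shrunk to 6 tokens so the example stays readable; function names and data are assumptions for illustration.

```python
def build_children(parents, child_size=128):
    """Split each parent (a list of tokens) into small windows that remember their parent id."""
    children = []
    for parent_id, tokens in parents.items():
        for start in range(0, len(tokens), child_size):
            children.append((parent_id, tokens[start:start + child_size]))
    return children

def retrieve_parents(query_tokens, children, parents, top_k=1):
    """Rank the small chunks, then return the full parent of each best hit."""
    query = set(query_tokens)
    scored = sorted(children, key=lambda c: len(query & set(c[1])), reverse=True)
    parent_ids = []
    for parent_id, _ in scored:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
        if len(parent_ids) == top_k:
            break
    return [(pid, " ".join(parents[pid])) for pid in parent_ids]

parents = {
    "auth": "oauth tokens expire after one hour . refresh tokens are rotated on every use .".split(),
    "billing": "invoices are issued monthly . refunds take five business days .".split(),
}
children = build_children(parents, child_size=6)
hits = retrieve_parents("refresh tokens rotated".split(), children, parents)
print(hits[0][0])  # auth
print(hits[0][1])  # the whole parent text, not just the matching window
```

Deduplicating by parent id matters: without it, two matching children of the same parent would put the same large chunk into the context twice.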
Summary of all advanced RAG techniques and when to apply them:
| Technique | When to use | Overhead |
|---|---|---|
| Hybrid Search | Always - baseline for production | +10ms (SQL query) |
| Re-ranking | When precision matters more than latency | +200ms, USD 0.001/query |
| HyDE | Q&A over documentation | +300ms, USD 0.001/query |
| Multi-Query | Complex analytical questions | +300ms, USD 0.001/query |
| RAG Fusion | Maximum recall, critical accuracy | +500ms, USD 0.003/query |
| Self-RAG | Medicine, law, finance | +500ms, USD 0.005/query |
| Parent Doc Retriever | Long documents requiring context | +5ms (different data schema) |
| Contextual Compression | Limited context window | +200ms, USD 0.001/query |
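The overheads in the table can be added up to check a combination against a latency budget. The sketch below copies the table's figures and assumes the stages run sequentially, so latencies simply add; techniques listed without a per-query cost are treated as zero cost, which is also an assumption.

```python
# (added latency in ms, added cost in USD per query), taken from the table above.
OVERHEAD = {
    "hybrid_search": (10, 0.0),
    "reranking": (200, 0.001),
    "hyde": (300, 0.001),
    "multi_query": (300, 0.001),
    "rag_fusion": (500, 0.003),
    "self_rag": (500, 0.005),
    "parent_doc_retriever": (5, 0.0),
    "contextual_compression": (200, 0.001),
}

def pipeline_overhead(techniques):
    """Sum latency and cost for a chosen set of techniques (sequential execution assumed)."""
    latency_ms = sum(OVERHEAD[t][0] for t in techniques)
    cost_usd = round(sum(OVERHEAD[t][1] for t in techniques), 6)
    return latency_ms, cost_usd

latency_ms, cost_usd = pipeline_overhead(["hybrid_search", "reranking", "hyde"])
print(latency_ms, cost_usd)  # 510 0.002
```

Stages that can run in parallel (for example, several query variants searched concurrently) would make the real latency lower than this sum.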
If naive RAG produces poor answers, just add more documents to the index. The bigger the corpus, the better the retrieval.
Retrieval quality is bottlenecked by query formulation and relevance distribution, not corpus size. On the same index, hybrid search, re-ranking and query transformation deliver roughly 3-5x the improvement that corpus growth does.
The intuition is borrowed from classical SQL or full-text systems: more data, sharper answer. In RAG the opposite often holds - a larger corpus increases the chance that naive vector search drowns in semantic near-duplicates and misses an exact match. The fix is an ensemble of methods on top of the same base, not raw volume.
Parent Document Retriever searches by small chunks but returns large parent chunks. Why?
Key Takeaways
- Naive RAG fails on keyword mismatch, complex questions, and ambiguous queries - the problem is always in retrieval, not the model
- Hybrid search (BM25 + dense + RRF) is the mandatory production baseline - covers keyword mismatch with near-zero latency cost
- Cross-encoder reranking (sentence-transformers, Cohere) delivers +15-20% precision on top-5: the model sees query and doc together
- HyDE (Gao 2022) searches by the embedding of a hypothetical answer - for short questions against long documents
- Self-RAG + contextual compression - for medicine, law, finance where hallucinations are unacceptable
- Contextual Retrieval (Anthropic 2024): BM25 + embeddings + chunk context summaries, plus reranking - 67% fewer retrieval failures
Questions for Reflection
- Of the four naive RAG failure scenarios (keyword mismatch, multi-hop, noisy retrieval, ambiguous query), which one would hurt a specific product the most?
- Hybrid search adds +10ms, reranking +200ms, HyDE +300ms. What's a reasonable latency budget for a specific use case - and which combination fits?
- Self-RAG adds approximately 5 LLM calls per user query. At what point does the accuracy gain justify that cost?
What's Next
Retrieval is tuned, but its quality critically depends on how documents are split. Chunking strategy is the next optimization lever.
- Chunking Strategies — How to properly split documents - fixed, recursive, semantic chunking
- Conversation Memory — RAG in the context of a chatbot - how to combine retrieval with conversation history
- Evaluation — How to systematically measure RAG pipeline quality on golden datasets
Related Lessons
- aie-12-rag-fundamentals — Baseline RAG is the foundation for advanced techniques
- aie-09-embeddings — Dense retrieval quality depends on embedding quality
- aie-10-vector-databases — Qdrant HNSW index is critical for hybrid search performance
- aie-14-chunking-strategies — Proper chunking doubles precision before any reranking
- prob-04-bayes — Reranking is a posterior update over prior retrieval scores
- ml-52-search-ranking — Cross-encoder reranking from IR learning-to-rank
- db-26-caching
- db-28-search