AI Engineering

Advanced RAG: hybrid search, re-ranking, query expansion, self-RAG

Lesson Objectives

  • Understand why naive RAG fails on real data and how to diagnose the problem
  • Implement hybrid search (BM25 + dense vectors) with Reciprocal Rank Fusion
  • Integrate cross-encoder reranking via Cohere or sentence-transformers
  • Apply query transformation: HyDE (Gao 2022), multi-query, step-back prompting
  • Master self-RAG and contextual compression for high-stakes domains

Prerequisites

  • RAG pipeline, pgvector, embeddings
  • RAG Fundamentals

Basic RAG finds similar text. Advanced RAG finds the right answer. The gap shows up in precision@5: roughly 60-70% for naive RAG versus 85-92% with hybrid search + reranking. That's the boundary between "it works" and "we can sell this". Notion AI switched from naive RAG to hybrid + reranking and cut hallucination rate by 40%. Perplexity built their entire search on these same principles - 100M+ queries a month.

  • Perplexity AI - hybrid search + re-ranking as the core of their search technology, 100M+ queries/month
  • Notion AI - moving from naive RAG to hybrid + reranking cut hallucination rate by 40%
  • BloombergGPT - domain-specific re-ranking for financial documents with specialized terminology
  • Abridge and Nabla (medical AI) - self-RAG to verify every claim before showing it to a physician
  • Anthropic Contextual Retrieval (2024) - BM25 + embeddings + contextual chunk summaries cuts retrieval failures by 49%, and by 67% when reranking is added

From Lewis 2020 to Contextual Retrieval 2024

  • **Lewis et al. 2020 (Facebook AI Research)** - the original RAG paper: DPR retriever + BART generator, the first demonstration that retrieval + generation outperforms a pure LM on knowledge-intensive tasks.
  • **Khattab and Zaharia 2020 - ColBERT**: late interaction - instead of one embedding per document, a matrix of token embeddings scored with MaxSim. More accurate than a bi-encoder, faster than a cross-encoder.
  • **Gao et al. 2022 - HyDE**: hypothetical document embeddings as a pre-retrieval transformation, +5-10% recall on the BEIR benchmark.
  • **Anthropic Contextual Retrieval 2024**: each chunk is prefixed with a short LLM-generated description of where it sits in the document. Combined with BM25 + embeddings this cut failed retrievals by 49%, and by 67% once reranking was added.

Where Naive RAG Breaks Down

The naive RAG demo shows 95% accuracy - 10 test questions, clean chunks, friendly queries. Production shows 55% on real users. First instinct: upgrade the model. Wrong. GPT-4 won't help if the context was never retrieved. The problem is almost always in retrieval.

| Scenario | Naive RAG Problem | Example |
|---|---|---|
| Keyword mismatch | Vector search can't find exact terms | Query "payment error 500" - the document contains "PaymentGatewayException", embeddings don't match |
| Complex question | A single embedding can't express a multi-hop query | "How are rate limits related to API cost?" - needs chunks from different sections |
| Imprecise retrieval | Top-5 chunks contain 2 relevant and 3 noisy ones | The LLM gets distracted by irrelevant chunks and gives a vague answer |
| Ambiguous query | The user asks imprecisely | "How to set up authentication?" - OAuth? JWT? API keys? RBAC? |

Advanced RAG is three layers of surgery around the search. Not a replacement - an upgrade:

  1. **Pre-retrieval** - improve the query before searching (query transformation, HyDE by Gao et al. 2022)
  2. **Retrieval** - improve the search itself (BM25 + dense hybrid, multi-query with RRF)
  3. **Post-retrieval** - filter results after searching (cross-encoder reranking, contextual compression)

Each layer is added independently. Hybrid search - always. Re-ranking - when precision matters more than latency. HyDE - when queries are short and ambiguous. Anthropic's Contextual Retrieval (2024) shows how the layers stack: contextualized chunks with BM25 + embeddings cut retrieval failures by 49%, and adding a reranker brings the reduction to 67%.

A user asks "deployment error", but the documentation describes the issue as "DeploymentFailedException in CI/CD pipeline". Naive RAG doesn't find the answer. What's the problem?

Hybrid Search: BM25 + dense vectors

Two worlds - two failures in isolation. Vector search finds "how to reduce latency" near "latency optimization" - semantics works. But ask "error E-4012" - the embedding has no idea what that code is. BM25 (TF-IDF on steroids) finds E-4012 instantly, but won't understand that "reduce latency" and "optimize latency" mean the same thing.

| Method | Strengths | Weaknesses |
|---|---|---|
| Vector (semantic) | "How to reduce latency?" → finds "latency optimization" | Misses exact terms: "error E-4012" |
| BM25 (keyword) | Exact match: "E-4012" → finds the document with that code | Doesn't understand synonyms: "reduce latency" ≠ "optimize latency" |
| Hybrid (BM25 + dense) | Combines semantics and keywords | Requires weight tuning (alpha) |

**Reciprocal Rank Fusion (RRF)** bridges both worlds - it doesn't average scores (they're incomparable), it merges rank lists. A document ranked 1st in BM25 and 3rd in vector search scores higher than one ranked 10th in both:
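A minimal sketch of RRF, assuming each retriever returns document ids ordered best-first. The constant `k = 60` is a commonly used default rather than something this lesson prescribes, and the function name and sample ids are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse best-first id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# "a" is 1st in BM25 and 3rd in vector search; "z" is 10th in both lists.
bm25_ranking = ["a"] + [f"b{i}" for i in range(2, 10)] + ["z"]
vector_ranking = ["v1", "v2", "a"] + [f"v{i}" for i in range(4, 10)] + ["z"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused[:3])
```

Only ranks enter the formula, so the raw BM25 and cosine scores never need to be put on a common scale.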

Implementing hybrid search with pgvector + PostgreSQL full-text search:
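One way this could look, as a sketch under stated assumptions: a hypothetical table `chunks(id, content, embedding)` where `embedding` is a pgvector column, cosine distance via pgvector's `<=>` operator, and PostgreSQL's built-in `ts_rank_cd` standing in for the keyword side (it is not true BM25). The helper only builds the SQL and its parameters; nothing is executed here, and the names are illustrative.

```python
def build_hybrid_query(query_text, query_embedding, candidates=20, limit=5, k=60):
    """Return (sql, params) for a hybrid search fused with RRF inside PostgreSQL."""
    sql = """
    WITH semantic AS (
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY embedding <=> %(embedding)s::vector) AS pos
        FROM chunks
        ORDER BY embedding <=> %(embedding)s::vector
        LIMIT %(candidates)s
    ),
    keyword AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   ORDER BY ts_rank_cd(to_tsvector('english', content),
                                       plainto_tsquery('english', %(query)s)) DESC
               ) AS pos
        FROM chunks
        WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %(query)s)
        ORDER BY pos
        LIMIT %(candidates)s
    )
    SELECT COALESCE(s.id, kw.id) AS id,
           COALESCE(1.0 / (%(k)s + s.pos), 0.0)
         + COALESCE(1.0 / (%(k)s + kw.pos), 0.0) AS rrf_score
    FROM semantic s
    FULL OUTER JOIN keyword kw ON s.id = kw.id
    ORDER BY rrf_score DESC
    LIMIT %(limit)s
    """
    params = {
        # pgvector accepts the text form '[x1,x2,...]' cast to vector.
        "embedding": "[" + ",".join(repr(float(x)) for x in query_embedding) + "]",
        "query": query_text,
        "candidates": candidates,
        "limit": limit,
        "k": k,
    }
    return sql, params


sql, params = build_hybrid_query("error E-4012", [0.1, 0.2, 0.3])
```

The `%(name)s` placeholders follow the pyformat style accepted by drivers such as psycopg, so the pair could be passed to a cursor's `execute(sql, params)`. In a real deployment the `to_tsvector` expression would normally be precomputed and indexed rather than evaluated per row.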

The **alpha** parameter is the weight given to the semantic (dense) side; the keyword side gets 1 - alpha. It is tuned empirically. For technical documentation (lots of error codes, API names) - alpha 0.3-0.4 (more weight on keywords). For general questions - alpha 0.6-0.7 (more weight on semantic). Perplexity uses a similar balance - it's the core of their search technology.
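The effect of alpha can be illustrated with a score-based blend. Note that this is a different fusion method from RRF: scores are min-max normalized first so that the two retrievers become comparable, then combined as `alpha * dense + (1 - alpha) * keyword`. The function names, document ids and scores below are made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def blend(dense, sparse, alpha):
    """alpha weights the dense (semantic) side, 1 - alpha the keyword side."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    return sorted(fused, key=lambda d: (-fused[d], d))


dense_scores = {"latency_guide": 0.82, "error_e4012": 0.40, "faq": 0.35}  # cosine similarities
sparse_scores = {"error_e4012": 12.5, "faq": 3.1}                         # BM25-style scores

keyword_heavy = blend(dense_scores, sparse_scores, alpha=0.3)
semantic_heavy = blend(dense_scores, sparse_scores, alpha=0.7)
print(keyword_heavy[0], semantic_heavy[0])
```

With alpha = 0.3 the exact-match document wins; with alpha = 0.7 the semantically closest one does.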

Misconception: More chunks = better recall

Reality: More chunks = worse precision, an overloaded reranker, and more noise in the context

Fetching top-100 instead of top-20 does technically improve recall - the right document will almost certainly make the cut. But the cross-encoder reranker does a forward pass for every (query, doc) pair: 100 pairs instead of 20 means +400ms latency. More importantly, the LLM receives a noisy context - 80 irrelevant chunks mixed with 20 useful ones - and starts hallucinating or giving vague answers. The sweet spot: retrieve 15-25 candidates, rerank to 5-7, pass no more than 5 to the context.

In hybrid search, alpha = 0.3 means...

Re-ranking: Cross-encoder and Cohere Rerank

Hybrid search returned the top-20 documents. But the ranking is still rough - cosine similarity between a query embedding and a document embedding doesn't capture subtle semantic relationships. A bi-encoder encodes query and doc independently: they never "see" each other before comparison.

A cross-encoder is a different architecture entirely. Query and document are fed together as a single input, and the model runs full attention between every token in the query and every token in the document. Slower - but precise:

| | Bi-encoder (embedding) | Cross-encoder (reranker) |
|---|---|---|
| How it works | Query and doc are encoded separately, compared via cosine similarity | Query and doc are fed together, model outputs a relevance score |
| Speed | Fast - doc is already encoded, comparison is instant | Slow - needs a forward pass for each pair |
| Accuracy | Good | Significantly better - the model sees both texts simultaneously |
| Scale | Millions of documents | Top 20-50 candidates (post-retrieval) |

The **retrieve-and-rerank** pattern is the production RAG standard. Cast a wide net at retrieval, apply a precise filter at reranking:
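A sketch of the pattern with a pluggable scorer. `retrieve_and_rerank` and `toy_overlap_scorer` are illustrative names; the toy scorer only counts shared words so the example runs without a model. The commented lines show how a sentence-transformers `CrossEncoder` could be plugged in instead, assuming that library and the model weights are available.

```python
def retrieve_and_rerank(query, candidates, score_pairs, top_n=5):
    """Score every (query, doc) pair with score_pairs and keep the top_n documents."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def toy_overlap_scorer(pairs):
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / len(q_words))
    return scores


candidates = [
    "How to rotate API keys safely",
    "Payment gateway timeout error handling",
    "Office lunch menu for friday",
]
top = retrieve_and_rerank("payment gateway error", candidates, toy_overlap_scorer, top_n=2)
print(top)

# With sentence-transformers installed, a real cross-encoder could replace the toy scorer:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-v2-m3")
#   top = retrieve_and_rerank(query, candidates, model.predict, top_n=5)
```

The candidate list would come from the hybrid search stage; the reranker never sees the full corpus.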

Cohere Rerank costs USD 1 per 1000 requests (each with up to 100 documents). For a self-hosted option there is **bge-reranker-v2-m3** by BAAI, which loads through sentence-transformers and runs on GPU via Hugging Face Inference. On CPU with 20 documents it takes around 300ms.

Why is a cross-encoder more accurate than a bi-encoder for ranking?

Query Transformation: HyDE, Multi-Query, Step-back

A user types "why doesn't the deployment work" - four words. The embedding of those four words is a single point in vector space, small and poorly oriented. The actual answer lives in a multi-page document with rich vocabulary. The distance between the question-point and the document-cloud is enormous. Query transformation moves the point before the search begins.

HyDE - Hypothetical Document Embedding

The idea from Gao et al. (2022): don't search by the query embedding - generate a **hypothetical answer** and search by its embedding instead. A hypothetical answer is written in the style of documentation, rich in terminology, long - its embedding sits much closer to real documents in vector space.
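The flow can be sketched as follows. Everything here is a stand-in: `embed` is a toy bag-of-words vector rather than a real embedding model, and `fake_llm` returns a canned hypothetical answer where a real system would call an LLM. The point is only the order of operations: generate, embed the generated text, then search.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: a real model would go here)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query, generate_hypothetical, documents, top_k=1):
    """Search by the embedding of a generated answer instead of the raw query."""
    hypothetical = generate_hypothetical(query)
    query_vec = embed(hypothetical)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def fake_llm(query):
    # Canned output; the content may be factually wrong, only its vocabulary matters.
    return ("The deployment fails with DeploymentFailedException when the "
            "CI/CD pipeline cannot find the build artifact")


docs = [
    "DeploymentFailedException in the CI/CD pipeline is raised when the build artifact is missing",
    "Rotate API keys every ninety days to limit exposure",
]
query = "why doesn't the deployment work"

results = hyde_search(query, fake_llm, docs)
sim_query = cosine(embed(query), embed(docs[0]))
sim_hyde = cosine(embed(fake_llm(query)), embed(docs[0]))
print(results[0], round(sim_query, 2), round(sim_hyde, 2))
```

The hypothetical answer shares far more vocabulary with the target document than the four-word question does, which is the effect HyDE relies on.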

Multi-Query - Split the Question into Sub-queries
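A minimal sketch of the idea, assuming an LLM splits the question into sub-queries and each sub-query is retrieved separately. `decompose` and `retrieve` are hypothetical callables, stubbed here with fixed data; the merge step simply interleaves results and drops duplicates.

```python
def multi_query_retrieve(question, decompose, retrieve, per_query=3):
    """Run one retrieval per sub-query and interleave the results without duplicates."""
    sub_queries = decompose(question)
    result_lists = [retrieve(q)[:per_query] for q in sub_queries]
    merged, seen = [], set()
    for position in range(per_query):
        for results in result_lists:
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
    return merged


def fake_decompose(question):
    # Stand-in for an LLM call that splits a multi-hop question.
    return ["What are the API rate limits?", "How is API cost calculated?"]


FAKE_INDEX = {
    "What are the API rate limits?": ["rate_limits.md", "quotas.md"],
    "How is API cost calculated?": ["pricing.md", "quotas.md"],
}

merged = multi_query_retrieve(
    "How are rate limits related to API cost?",
    fake_decompose,
    lambda q: FAKE_INDEX.get(q, []),
)
print(merged)
```

Each sub-query reaches a different section of the documentation, which a single embedding of the original question could not do.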

Step-back Prompting - Generalize the Question

Instead of the specific "why doesn't NestJS middleware catch async errors?" - first ask "how does error handling work in NestJS middleware?". A more general query finds foundational documents that contain the specific answer as a special case.
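A sketch of how this could be wired up, assuming an `llm` callable that takes a prompt and returns text and a `retrieve` callable that returns document ids. Both are stubbed below, and the prompt wording is an illustration rather than a canonical template.

```python
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the underlying "
    "concept or mechanism.\nQuestion: {question}\nGeneral question:"
)


def step_back_retrieve(question, llm, retrieve, top_k=3):
    """Retrieve for both the original question and its generalized form."""
    general = llm(STEP_BACK_PROMPT.format(question=question)).strip()
    specific_hits = retrieve(question)[:top_k]
    general_hits = retrieve(general)[:top_k]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    merged = list(dict.fromkeys(specific_hits + general_hits))
    return general, merged


def fake_llm(prompt):
    return "How does error handling work in NestJS middleware?\n"


FAKE_INDEX = {
    "Why doesn't NestJS middleware catch async errors?": ["async_pitfalls.md"],
    "How does error handling work in NestJS middleware?": ["error_handling.md", "async_pitfalls.md"],
}

general, hits = step_back_retrieve(
    "Why doesn't NestJS middleware catch async errors?",
    fake_llm,
    lambda q: FAKE_INDEX.get(q, []),
)
print(general, hits)
```

Keeping the hits for the original question alongside the general ones means the specific detail is not lost when the foundational document is added.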

HyDE excels at Q&A over documentation - short questions, long documents. Multi-Query - for complex analytical questions with multiple aspects. Step-back - for "why doesn't X work?" where the cause is a principle, not a detail. All three run before retrieval, so they can be combined in one pipeline when the latency budget allows.

HyDE (Hypothetical Document Embedding) searches by the embedding of...

Self-RAG and Contextual Compression

What if the user asks "what is 2+2?" - why run retrieval at all? And if 5 chunks were found but only 1 is actually relevant - why send all five to the context? **Self-RAG** is a pattern where the model makes both decisions itself: whether to search, and what to use from what was found.

Self-RAG: The Model as Critic

  1. LLM receives the question and decides: is retrieval needed? (some questions the model already knows)
  2. If yes - retrieval, get chunks
  3. LLM evaluates each chunk: relevant? Useful for the answer?
  4. Generates an answer based only on approved chunks
  5. LLM evaluates the final answer: is it grounded in context? Useful to the user?
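The five steps above can be sketched as a small orchestration function. This is the prompt-driven pattern as described here, with every model decision passed in as a callable; the lambdas below are stubs standing in for LLM calls, and all names are illustrative.

```python
def self_rag(question, needs_retrieval, retrieve, is_relevant, generate, is_grounded):
    """Follow the five steps; returns (answer, chunks_used)."""
    if not needs_retrieval(question):                              # step 1
        return generate(question, []), []
    chunks = retrieve(question)                                    # step 2
    approved = [c for c in chunks if is_relevant(question, c)]     # step 3
    answer = generate(question, approved)                          # step 4
    if not is_grounded(answer, approved):                          # step 5
        return "I could not find a grounded answer.", approved
    return answer, approved


# Stubs: in a real system each of these would be an LLM call or a retriever.
needs_retrieval = lambda q: "2+2" not in q
retrieve = lambda q: ["Rate limit is 100 requests per minute.", "Our office is in Berlin."]
is_relevant = lambda q, chunk: "rate limit" in chunk.lower()
generate = lambda q, chunks: chunks[0] if chunks else "4"
is_grounded = lambda answer, chunks: any(answer in c for c in chunks)

answer, used = self_rag("What is the rate limit?", needs_retrieval, retrieve,
                        is_relevant, generate, is_grounded)
direct, used_direct = self_rag("what is 2+2?", needs_retrieval, retrieve,
                               is_relevant, generate, is_grounded)
print(answer, used, direct, used_direct)
```

Each callable is a separate model call in practice, which is where the extra latency and cost of this pattern come from.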

Contextual Compression

A retrieved chunk is 500 tokens - but only 2 sentences actually answer the question. The other 18 are noise, consuming context window and distracting the model. **Contextual compression** (popularized by LangChain) extracts only what matters.
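A toy version of the idea: split the chunk into sentences and keep only those a relevance test accepts. In practice the test would be an LLM or an embedding filter; `shares_keyword` below is a deliberately simple stand-in, and both function names are illustrative.

```python
import re


def compress_chunk(query, chunk, keep_sentence, max_sentences=2):
    """Keep at most max_sentences sentences that keep_sentence judges relevant."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    kept = [s for s in sentences if keep_sentence(query, s)]
    return " ".join(kept[:max_sentences])


def shares_keyword(query, sentence, min_len=4):
    """Toy relevance test: the sentence shares a word of min_len+ characters with the query."""
    q_words = {w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) >= min_len}
    s_words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return bool(q_words & s_words)


chunk = (
    "Our API was launched in 2019. "
    "The rate limit is 100 requests per minute per key. "
    "Billing happens monthly. "
    "Exceeding the rate limit returns HTTP 429."
)
compressed = compress_chunk("What is the rate limit?", chunk, shares_keyword)
print(compressed)
```

Only the compressed text goes into the prompt, so the context window is spent on sentences that bear on the question.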

Self-RAG and compression add extra LLM calls. In production that's +200-500ms latency and USD 0.001-0.01 per query. Use them when accuracy matters more than speed - medical, legal, financial chatbots. Abridge and Nabla apply exactly this to verify every claim before showing it to a physician.

Contextual compression in RAG solves the problem of...

RAG Fusion and Parent Document Retriever

RAG Fusion

RAG Fusion combines Multi-Query and RRF into a single pipeline. Multiple query variations are generated, each searches independently, and results are merged via Reciprocal Rank Fusion. The insight: one query formulation is one point in vector space. Four formulations are four points - covering far more territory.
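A compact sketch of that pipeline. `generate_variations` and `retrieve` are hypothetical callables, stubbed with fixed data, and `k = 60` is a common RRF default rather than a value taken from this lesson.

```python
from collections import defaultdict


def rag_fusion(question, generate_variations, retrieve, k=60, top_n=5):
    """Retrieve for the question plus its variations, then merge with RRF."""
    queries = [question] + generate_variations(question)
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


def fake_variations(question):
    # Stand-in for an LLM that rephrases the question.
    return [
        "deploy fails in CI",
        "deployment pipeline error",
        "DeploymentFailedException cause",
    ]


FAKE_INDEX = {
    "deployment error": ["faq.md", "ci_guide.md"],
    "deploy fails in CI": ["ci_guide.md", "runbook.md"],
    "deployment pipeline error": ["ci_guide.md", "faq.md"],
    "DeploymentFailedException cause": ["ci_guide.md"],
}

fused = rag_fusion("deployment error", fake_variations, lambda q: FAKE_INDEX.get(q, []))
print(fused)
```

A document that surfaces for several formulations accumulates score from each list, so it outranks one that tops only a single formulation.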

Parent Document Retriever

Small chunks give precise retrieval but poor context. Large chunks give good context but imprecise retrieval. A classic trade-off. **Parent Document Retriever** sidesteps it by splitting the two roles: it stores small chunks for search (128 tokens) and large parent chunks for context (1024 tokens). It searches the small and returns the large.
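The mechanics can be shown with a toy index. Chunk sizes here are counted in words rather than tokens, and `overlap` is a stand-in for vector similarity; all names and sample texts are illustrative.

```python
def build_index(parents, child_size=8):
    """Split each parent into small word windows, remembering the parent id of each."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append((" ".join(words[start:start + child_size]), parent_id))
    return children


def retrieve_parents(query, children, parents, score, top_k=2):
    """Rank the small chunks, then return the distinct parents of the best ones."""
    ranked = sorted(children, key=lambda c: score(query, c[0]), reverse=True)
    parent_ids = list(dict.fromkeys(pid for _, pid in ranked[:top_k]))
    return [parents[pid] for pid in parent_ids]


def overlap(query, text):
    """Toy similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


parents = {
    "auth": ("OAuth tokens expire after one hour. Refresh tokens are valid for "
             "thirty days and must be stored securely on the server."),
    "billing": ("Invoices are issued monthly. Payment is due within fourteen "
                "days of the invoice date."),
}
children = build_index(parents)
context = retrieve_parents("refresh tokens valid", children, parents, overlap)
print(context)
```

The match happens on an eight-word window, but the model receives the whole parent passage, including the storage requirement that the window alone would have cut off.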

Summary of all advanced RAG techniques and when to apply them:

| Technique | When to use | Overhead |
|---|---|---|
| Hybrid Search | Always - baseline for production | +10ms (SQL query) |
| Re-ranking | When precision matters more than latency | +200ms, USD 0.001/query |
| HyDE | Q&A over documentation | +300ms, USD 0.001/query |
| Multi-Query | Complex analytical questions | +300ms, USD 0.001/query |
| RAG Fusion | Maximum recall, critical accuracy | +500ms, USD 0.003/query |
| Self-RAG | Medicine, law, finance | +500ms, USD 0.005/query |
| Parent Doc Retriever | Long documents requiring context | +5ms (different data schema) |
| Contextual Compression | Limited context window | +200ms, USD 0.001/query |

Misconception: If naive RAG produces poor answers, just add more documents to the index. The bigger the corpus, the better the retrieval.

Reality: Retrieval quality is bottlenecked by query formulation and relevance distribution, not corpus size. On the same index, hybrid search, re-ranking and query transformation often outperform corpus growth by a factor of 3-5.

Intuition borrowed from classical SQL or full-text systems: more data, sharper answer. In RAG the opposite often holds - a larger corpus increases the chance that naive vector search drowns in semantic near-duplicates and misses an exact match. The fix is an ensemble of methods on top of the same base, not raw volume.

Parent Document Retriever searches by small chunks but returns large parent chunks. Why?

Key Takeaways

  • Naive RAG fails on keyword mismatch, complex questions, and ambiguous queries - the problem is almost always in retrieval, not the model
  • Hybrid search (BM25 + dense + RRF) is the mandatory production baseline - covers keyword mismatch with near-zero latency cost
  • Cross-encoder reranking (sentence-transformers, Cohere) delivers +15-20% precision on top-5: the model sees query and doc together
  • HyDE (Gao 2022) searches by the embedding of a hypothetical answer - for short questions against long documents
  • Self-RAG + contextual compression - for medicine, law, finance where hallucinations are unacceptable
  • Contextual Retrieval (Anthropic 2024): BM25 + embeddings + chunk context summaries - 49% fewer retrieval failures, 67% with reranking added

Questions for Reflection

  • Of the four naive RAG failure scenarios (keyword mismatch, multi-hop, noisy retrieval, ambiguous query), which one would hurt a specific product the most?
  • Hybrid search adds +10ms, reranking +200ms, HyDE +300ms. What's a reasonable latency budget for a specific use case - and which combination fits?
  • Self-RAG adds approximately 5 LLM calls per user query. At what point does the accuracy gain justify that cost?

What's Next

Retrieval is tuned, but its quality critically depends on how documents are split. Chunking strategy is the next optimization lever.

  • Chunking Strategies — How to properly split documents - fixed, recursive, semantic chunking
  • Conversation Memory — RAG in the context of a chatbot - how to combine retrieval with conversation history
  • Evaluation — How to systematically measure RAG pipeline quality on golden datasets

Related Lessons

  • aie-12-rag-fundamentals — Baseline RAG is the foundation for advanced techniques
  • aie-09-embeddings — Dense retrieval quality depends on embedding quality
  • aie-10-vector-databases — Qdrant HNSW index is critical for hybrid search performance
  • aie-14-chunking-strategies — Proper chunking doubles precision before any reranking
  • prob-04-bayes — Reranking is a posterior update over prior retrieval scores
  • ml-52-search-ranking — Cross-encoder reranking from IR learning-to-rank
  • db-26-caching
  • db-28-search