Information Retrieval
Neural Ranking
2019. Google deploys BERT for search. Before: keyword matching plus PageRank. After: semantic query understanding. CTR up 10% with the same documents. A user types 'what is the best medicine for cough' - before: results matched words; after: results matched intent. But that is just the start. Rough latency figures: BM25 (a TF-IDF variant) takes about 1ms over a million documents; a bi-encoder (dense) about 5ms plus ANN search; a cross-encoder about 100ms to score 100 documents. A RAG retrieval pipeline that chains BM25, bi-encoder retrieval, and cross-encoder reranking lands at roughly 200ms end-to-end. Three neural ranking architectures follow (bi-encoder, cross-encoder, ColBERT), each with its own speed-quality trade-off.
- **Google Search**: BERT for query understanding + MUM for multimodal - semantic ranking since 2019
- **Perplexity.ai**: bi-encoder retrieval + cross-encoder rerank + LLM synthesis in one pipeline
- **Qdrant + ColBERT**: late interaction for enterprise search with 5x smaller index than full cross-encoder
- **GitHub Copilot**: bi-encoder for code retrieval + cross-encoder for context selection in RAG pipeline
Bi-encoder: separate encoding of query and document
The bi-encoder is the architectural foundation of the shift from matching words to matching intent. Two transformers (or one shared model) encode query and document independently into fixed-size vectors. Similarity: cosine or dot product between the vectors. Because the document side does not depend on the query, documents can be encoded offline and stored in an ANN index (FAISS, HNSW). Training uses a contrastive loss with in-batch negatives (other documents in the batch serve as negatives) and hard negatives (mined via BM25). The sentence-transformers library wraps the full pipeline. Key models: BGE-M3, E5-large, GTR-XXL.
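As a rough illustration of the two ideas above (scoring against precomputed document vectors, and in-batch negatives), here is a minimal standard-library sketch. The 2- and 3-dimensional vectors, the document ids, the `temperature` value, and the function names are all made up for the example; a real system would get its vectors from a trained encoder and search them through an ANN index rather than a linear scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical document embeddings, computed offline and stored.
doc_index = {
    "doc_cough_remedies": [0.9, 0.1, 0.0],
    "doc_car_repair": [0.0, 0.2, 0.9],
    "doc_cold_symptoms": [0.7, 0.6, 0.1],
}

def search(query_vec, index, top_k=2):
    """Brute-force stand-in for ANN search: score every stored vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """doc_vecs[i] is the positive for query_vecs[i]; every other
    document in the batch acts as a negative (softmax cross-entropy)."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(query_vecs)

# Only the query is encoded at request time (vector is illustrative).
results = search([1.0, 0.2, 0.0], doc_index)

batch_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(batch_queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = in_batch_contrastive_loss(batch_queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each query sits closest to its own positive and high when the pairs are mismatched, which is the signal that pulls matching query and document vectors together during training.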
Bi-encoder latency profile: document encoding is offline, O(N) once per corpus. Query encoding is online, ~10ms. HNSW ANN search takes ~5ms for 1M documents. Total per query: ~15ms. This makes the bi-encoder the only viable neural option for first-stage retrieval over large corpora. Qdrant, Pinecone, and Weaviate all implement this same pattern.
Hard negative mining is critical for bi-encoder quality. Strategy: BM25 returns top-100 for a query, from these select documents with high BM25 score but low relevance label. Such hard negatives force the model to distinguish semantically similar but non-relevant documents.
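The mining strategy described above can be sketched in a few lines. The function name, document ids, and scores below are hypothetical; the input is assumed to be a BM25 result list already sorted by score, plus the set of documents labeled relevant for the query.

```python
def mine_hard_negatives(bm25_ranked, relevant_ids, num_negatives=2):
    """Pick the highest-scoring BM25 hits that are NOT labeled relevant.

    bm25_ranked: list of (doc_id, bm25_score), sorted by score descending.
    relevant_ids: set of doc ids labeled relevant for this query.
    """
    negatives = [doc_id for doc_id, _ in bm25_ranked if doc_id not in relevant_ids]
    return negatives[:num_negatives]

# Illustrative BM25 output for one query (ids and scores are made up).
bm25_top = [
    ("d12", 14.2),  # labeled relevant
    ("d07", 13.9),  # lexically close, not relevant -> hard negative
    ("d33", 11.5),  # labeled relevant
    ("d41", 10.8),  # hard negative
    ("d02", 4.1),
]
hard_negatives = mine_hard_negatives(bm25_top, relevant_ids={"d12", "d33"})
```

Taking negatives from the top of the BM25 list is what makes them hard: they share vocabulary with the query, so the model cannot separate them from the positives on word overlap alone.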
Why does the bi-encoder scale to first-stage retrieval over millions of documents while the cross-encoder does not?
Cross-encoder: full attention between query and document
A dangerous illusion: 'bi-encoder is enough for search'. The numbers: bi-encoder MRR@10 on MS MARCO - 0.33. Cross-encoder on the same data - 0.39. An 18% relative quality gap. The cross-encoder concatenates query and document into a single input: [CLS] query [SEP] document [SEP]. The transformer processes this as one sequence - every query token attends to every document token through full self-attention. Output: a scalar relevance score in [0, 1], produced by a linear head on the [CLS] token followed by a sigmoid. The problem: the score depends on the query, so there are no precomputed embeddings and no index. For a corpus of 1M documents that means 1M forward passes per query. At 100ms per inference - about 28 hours per query. Models: ms-marco-MiniLM-L-6-v2 (fast), MonoT5-3B (quality).
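The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above; `build_cross_encoder_input` is a hypothetical helper that shows the input layout as plain text, whereas a real tokenizer inserts the special tokens itself.

```python
# Full-corpus scoring cost with a cross-encoder, using the figures above.
corpus_size = 1_000_000
seconds_per_forward_pass = 0.100

full_scan_seconds = corpus_size * seconds_per_forward_pass
full_scan_hours = full_scan_seconds / 3600

# Relative quality gap between the two MRR@10 values quoted above.
bi_encoder_mrr = 0.33
cross_encoder_mrr = 0.39
relative_gap = (cross_encoder_mrr - bi_encoder_mrr) / bi_encoder_mrr

def build_cross_encoder_input(query, document):
    """Illustrative only: the joint sequence the model attends over."""
    return f"[CLS] {query} [SEP] {document} [SEP]"

example_input = build_cross_encoder_input(
    "what is the best medicine for cough",
    "Honey and warm fluids can soothe a cough.",
)

print(f"{full_scan_hours:.1f} hours per query, {relative_gap:.0%} relative gap")
```

Because every (query, document) pair needs its own forward pass, the cost grows linearly with the number of candidates, which is why the candidate list has to be cut down before this model runs.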
Cross-encoder with O(N) latency is only applicable to a pre-filtered candidate list. Standard pattern: bi-encoder retrieves 100-1000 candidates, cross-encoder reranks to top-10. Applying a cross-encoder to a full corpus without a first stage is a common design mistake in search pipelines.
Cross-encoder training: pointwise (BCE loss, single document, binary relevance) or pairwise (margin ranking loss, two documents, comparison). Pairwise shows better results on the BEIR benchmark. The MS MARCO Passage Ranking dataset contains 8.8M passages and is the standard for training and evaluating both model types.
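To make the difference between the two objectives concrete, here is a minimal sketch of both losses on raw model scores. The function names and the `margin=1.0` default are choices made for this example, not values taken from any particular training recipe.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pointwise_bce(logit, label):
    """Binary cross-entropy for ONE (query, document) pair.
    label is 1 for relevant, 0 for non-relevant."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def pairwise_margin(score_pos, score_neg, margin=1.0):
    """Margin ranking loss for TWO documents of the same query:
    zero once the relevant one outscores the other by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

# Pointwise: each document is judged in isolation.
loss_confident_correct = pointwise_bce(logit=4.0, label=1)
loss_confident_wrong = pointwise_bce(logit=4.0, label=0)

# Pairwise: only the score difference between the two documents matters.
loss_well_ordered = pairwise_margin(score_pos=2.0, score_neg=0.5)
loss_mis_ordered = pairwise_margin(score_pos=0.5, score_neg=2.0)
```

The pointwise loss pushes each score toward an absolute label, while the pairwise loss is indifferent to absolute values as long as the ordering is right, which matches what a ranking metric actually measures.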
Why does the cross-encoder achieve higher quality than the bi-encoder with the same base model?
ColBERT: late interaction and per-token embeddings
The ColBERT paradox: the model stores an embedding for every token in a document, not a single vector. 128x more memory than a bi-encoder. Yet MaxSim matching beats a neural cross-encoder by 5 NDCG points at the same latency. How it works: query encoder produces matrix [Lq, 128], document encoder produces [Ld, 128]. Similarity score = MaxSim: for each query token, find the maximum cosine with any document token, then sum. This is late interaction - the interaction happens after the transformer, not inside it. PLAID index (ColBERT v2) solves the memory problem: centroid search plus 2-bit quantization, efficient MaxSim via IVFPQ. Stanford DSP (Demonstrate-Search-Predict) uses ColBERT for multi-hop reasoning in RAG pipelines.
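The MaxSim score described above fits in a few lines. This is a standard-library sketch on tiny 2-dimensional token vectors invented for the example; real ColBERT embeddings are 128-dimensional, and the function names here are not the library's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_sim(query_tokens, doc_tokens):
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Two query tokens, each pointing in a different direction.
query = [[1.0, 0.0], [0.0, 1.0]]

# Document A has a token that matches each query token exactly.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Document B has one "averaged" token, like a single pooled vector would.
doc_b = [[1.0, 1.0]]

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)
```

Document A scores higher because each query token finds its own exact match, while the single blended vector in document B can only partially match both, which is the information a pooled bi-encoder vector loses.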
ColBERT v2 (Santhanam et al., 2022) adds residual compression: each token embedding is approximated as centroid plus residual, stored at 2 bits per dimension. PLAID retrieval: centroids are sorted by query-centroid similarity, only the top-K clusters are fully decoded. Result: memory comparable to bi-encoder, quality above cross-encoder on CoT benchmarks.
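The centroid-plus-residual idea can be sketched as follows. This is a toy version under stated assumptions: two hand-picked centroids, 2-dimensional vectors, and fixed uniform bucket boundaries chosen for illustration. It shows the storage format (one centroid id plus a 2-bit code per dimension), not how the actual implementation learns its centroids or bucket boundaries.

```python
# 3 boundaries split the residual range into 4 buckets -> 2 bits per dimension.
BUCKET_EDGES = [-0.1, 0.0, 0.1]
# Value used to reconstruct each bucket (illustrative bucket midpoints).
BUCKET_VALUES = [-0.15, -0.05, 0.05, 0.15]

def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def compress(vec, centroids):
    """Store a token embedding as (centroid id, one 2-bit code per dimension)."""
    cid = nearest_centroid(vec, centroids)
    residual = [x - c for x, c in zip(vec, centroids[cid])]
    # The code is the number of boundaries the residual value exceeds (0..3).
    codes = [sum(r > edge for edge in BUCKET_EDGES) for r in residual]
    return cid, codes

def decompress(cid, codes, centroids):
    """Approximate reconstruction: centroid plus the bucket's stored value."""
    return [c + BUCKET_VALUES[code] for c, code in zip(centroids[cid], codes)]

centroids = [[0.0, 0.0], [1.0, 1.0]]      # hypothetical, normally learned by clustering
token_embedding = [0.93, 1.12]

centroid_id, codes = compress(token_embedding, centroids)
approx = decompress(centroid_id, codes, centroids)
max_error = max(abs(a - b) for a, b in zip(token_embedding, approx))
```

The full-precision vector is replaced by one small integer and a handful of 2-bit codes, at the cost of a bounded reconstruction error per dimension.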
What advantage does ColBERT's MaxSim operation have over the bi-encoder dot product?
Reranking pipeline: from BM25 to neural top-10
Bi-encoder fast retrieval and cross-encoder reranking are not competitors. They form a pipeline: the cheap stages narrow the candidate set in a few milliseconds, and the cross-encoder spends its ~200ms only on the last 100 candidates. Only together do they deliver quality. Full production pipeline: BM25 -> top-1000 (1ms, exact keyword match, recall@1000=0.85) -> bi-encoder rerank -> top-100 (5ms, semantic filter) -> cross-encoder rerank -> top-10 (200ms, precision). Final result: ~206ms end-to-end (1 + 5 + 200), NDCG@10 approximately 0.42 (vs 0.28 for BM25 alone). BM25 at the first stage is critical: it is cheap and has high recall - it does not miss documents with exact keyword matches. Cohere Rerank API provides a cross-encoder as a service. The BEIR benchmark evaluates zero-shot retrieval quality across 18 datasets from different domains.
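The cascade structure can be sketched independently of any model. In the example below the three scorers are toy stand-ins (word overlap for BM25, Jaccard similarity for the bi-encoder, shared word bigrams for the cross-encoder), and the corpus, cut-offs, and function names are invented; only the shape of the pipeline and the per-stage latencies are taken from the text.

```python
def keyword_overlap(query, doc):
    """Toy stand-in for BM25: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    """Toy stand-in for a bi-encoder similarity."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def bigram_overlap(query, doc):
    """Toy stand-in for a cross-encoder: shared ordered word pairs."""
    def bigrams(text):
        words = text.split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))

def cascade(query, corpus, stages):
    """Each stage rescores the survivors of the previous one and keeps fewer."""
    candidates = list(corpus)
    for scorer, keep in stages:
        candidates.sort(key=lambda doc: scorer(query, doc), reverse=True)
        candidates = candidates[:keep]
    return candidates

corpus = [
    "best medicine for cough",
    "cough medicine side effects for children",
    "best pizza in town",
    "how to fix a car engine",
    "medicine for cough and cold relief",
    "history of medicine",
]
# (scorer, how many candidates survive); real cut-offs would be 1000 / 100 / 10.
stages = [(keyword_overlap, 4), (jaccard, 2), (bigram_overlap, 1)]
top = cascade("best medicine for cough", corpus, stages)

# Latency budget from the per-stage figures quoted above.
stage_latency_ms = {"bm25": 1, "bi_encoder": 5, "cross_encoder": 200}
total_latency_ms = sum(stage_latency_ms.values())
```

The expensive scorer only ever sees the handful of candidates that survived the cheaper stages, so almost the entire latency budget goes to the final, most precise step.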
Critical misconception: 'adding a reranker always improves search'. If the first stage missed a relevant document, reranking cannot recover it. Recall@100 after the first two stages is the hard ceiling of the entire pipeline quality. A common mistake: tuning the cross-encoder reranker when the real problem is low recall in the first stage.
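The ceiling effect can be shown with a small example. The document ids and relevance labels below are made up, and the "perfect reranker" is simulated by sorting on the true labels, which is the best any reranker could possibly do.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

relevant = {"d1", "d2", "d7"}                   # three relevant documents exist
first_stage = ["d3", "d1", "d5", "d2", "d4"]    # "d7" was never retrieved

first_stage_recall = recall_at_k(first_stage, relevant, k=5)

# An oracle reranker: relevant candidates first, original order otherwise.
perfect_rerank = sorted(first_stage, key=lambda doc_id: doc_id not in relevant)
reranked_recall = recall_at_k(perfect_rerank, relevant, k=5)
reranked_recall_at_2 = recall_at_k(perfect_rerank, relevant, k=2)
```

Even the oracle tops out at the first-stage recall: it moves "d1" and "d2" to the front, but "d7" is not among the candidates, so no amount of reordering can surface it.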
LlamaIndex and LangChain provide ready integrations: CohereRerank, SentenceTransformerRerank, BGERerank. Cohere Rerank API v3 achieves NDCG@10 = 0.553 average across 18 BEIR datasets - best among ready-to-use APIs without fine-tuning.
Misconception: adding a reranker always improves search; placing a cross-encoder after a bi-encoder is sufficient.
Reality: a reranker improves precision only if the first stage delivers sufficient recall. Recall@K of the first stage is the hard ceiling of the entire pipeline quality.
Why: a reranker reorders candidates but adds no new ones. If a relevant document did not reach the top-K at the first stage, no reranker can recover it.
Recall@100 after bi-encoder retrieval is 0.70. The cross-encoder reranker shows strong metrics on the training set. What will the actual NDCG@10 be?
Key ideas
- **Bi-encoder**: separate encoding of query and doc, precomputed embeddings + ANN - O(log N) retrieval, MRR@10 approx 0.33 on MS MARCO
- **Cross-encoder**: full self-attention over query+doc, scalar score, O(N) latency - only practical for reranking candidates, MRR@10 approx 0.39
- **ColBERT**: per-token embeddings + MaxSim - captures exact-match and semantic signals simultaneously, beats cross-encoder by 5 NDCG points
- **Pipeline**: BM25 -> top-1000 -> bi-encoder -> top-100 -> cross-encoder -> top-10; recall of the first stage is the hard ceiling of the full pipeline
Related topics
Neural ranking connects IR with NLP, ML, and production search systems:
- Learning to Rank — Preceding approach - feature-based LtR that neural ranking extends
- BM25 and classical IR — First pipeline stage - without BM25 recall, neural ranking loses effectiveness
- Qdrant Vector Search — Practical implementation of bi-encoder retrieval with HNSW index
- AI Engineering: API Integration — Cohere Rerank API and embedding APIs in production RAG pipelines
- ML Evaluation — NDCG, MRR, MAP - metrics for evaluating neural rankers on the BEIR benchmark
Questions for reflection
- ColBERT stores 128x more data than a bi-encoder yet beats a cross-encoder in quality at the same latency - what architectural principle explains this and where are its limits?
- BM25 remains the first stage even in the most modern neural pipelines - why does neural retrieval not fully replace it at this stage?
- Recall@100 of the first stage is a hard ceiling for the whole pipeline. How would the improvement strategy differ when recall@100 is 0.65 versus 0.90?