Natural Language Processing

Sentence and Document Embeddings

2019. Darmstadt. Nils Reimers and Iryna Gurevych publish one paper - and semantic search becomes accessible to every developer. Before SBERT: about 65 GPU hours to compare 10,000 sentences pairwise with a BERT cross-encoder. After: about 5 seconds. A roughly 47,000x difference - not from better hardware, but from one architectural idea.

Every RAG pipeline at Perplexity, Notion AI, and Linear AI runs on sentence embeddings - SBERT or its direct descendants
GitHub Copilot uses code embeddings on the same principles for semantic search over repositories
OpenAI text-embedding-3-small - 1536 dim, trained with Matryoshka loss, USD 0.02 per million tokens
Qdrant, Pinecone, Weaviate - the entire vector database market was built to store sentence embeddings
CLIP (OpenAI) uses an NT-Xent-style (InfoNCE) contrastive loss to link text and images in a shared embedding space

Prerequisites

Word embeddings: dense vectors and cosine similarity
Contextual representations and the BERT architecture
The semantic search task: ranking by vector proximity

From doc2vec to Sentence-BERT

Representing a whole sentence or document as a single vector grew in three steps. In 2014 Quoc Le and Tomas Mikolov proposed Paragraph Vector, better known as doc2vec, which extended word2vec with a trainable vector for an entire paragraph. In 2018 Google released the Universal Sentence Encoder, producing general-purpose sentence embeddings that transferred across tasks. The decisive step came from Nils Reimers and Iryna Gurevych in 2019 with Sentence-BERT: they reshaped BERT into a siamese network so that comparing sentences by vector proximity took milliseconds instead of hours. That turned semantic search from an expensive research trick into a tool any developer could reach for.

SBERT: Sentence-BERT and Siamese Networks

2019. Nils Reimers and Iryna Gurevych publish SBERT. Before it, semantic search over 10,000 sentences with a BERT cross-encoder took about 65 hours - every pair had to go through BERT individually. After: about 5 seconds. Quality barely changed. A roughly 47,000x speedup.

The problem with original BERT: it can compare two sentences, but only when both are fed together in a single forward pass. The format `[CLS] A [SEP] B [SEP]` means every pair needs its own forward pass: $O(n^2)$ passes for $n$ sentences. For 10,000 sentences that is $n(n-1)/2 \approx 50$ million pairs. SBERT breaks that constraint.
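The pair count and the headline speedup follow from simple arithmetic. A quick check, assuming the 65-hour and 5-second figures quoted above:

```python
from math import comb

n = 10_000
pairs = comb(n, 2)  # unordered sentence pairs: n * (n - 1) / 2
print(pairs)  # 49995000, i.e. roughly 50 million cross-encoder passes

cross_encoder_seconds = 65 * 3600  # 65 hours, as quoted in the text
bi_encoder_seconds = 5             # 5 seconds, as quoted in the text
speedup = cross_encoder_seconds / bi_encoder_seconds
print(speedup)  # 46800.0, which the text rounds to 47,000x
```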

The key idea of SBERT: siamese architecture. Two identical BERTs (with shared weights) process sentences independently. Each sentence collapses into a single vector. Comparison is cosine similarity between those vectors. That takes $O(n)$ expensive BERT passes instead of $O(n^2)$; the pairwise cosine comparisons that remain are cheap vector operations.

Siamese - named after Siamese twins. Two neural networks with identical weights, looking at different inputs. After the BERT encoder, pooling is applied - the sentence is compressed into a single fixed-size vector (typically 768 dimensions). That vector is the semantic fingerprint of the sentence.
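A minimal sketch of the siamese setup, in plain Python. The `encode` function here is a hypothetical stand-in (bag-of-words counts over a toy vocabulary), not BERT; the point is only that one shared function encodes each sentence separately and the comparison happens afterwards on the vectors.

```python
import math
from collections import Counter

# Toy vocabulary (an assumption for illustration only).
VOCAB = ["cat", "dog", "sat", "mat", "stock", "market"]

def encode(sentence):
    """Stand-in encoder: word counts over VOCAB.
    A real SBERT tower would be BERT followed by pooling."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Siamese: the SAME encode function (shared weights) sees each input alone.
u = encode("the cat sat on the mat")
v = encode("a dog sat on a mat")
w = encode("stock market")

print(cosine(u, v))  # about 0.67: two of three content words overlap
print(cosine(u, w))  # 0.0: no shared vocabulary
```

Because each vector depends on one sentence only, it can be computed once and reused for every later comparison.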

SBERT is trained on Natural Language Inference (NLI) data: sentence pairs labeled entailment, contradiction, or neutral. The loss is a 3-class softmax over the concatenation of the two sentence vectors and their element-wise absolute difference, $(u, v, |u - v|)$. The model is then fine-tuned on Semantic Textual Similarity (STS) with a regression loss on cosine similarity.
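The classification objective can be sketched as follows. In the SBERT paper the classifier input is the concatenation $(u, v, |u - v|)$ of the two sentence vectors and their element-wise absolute difference. The weights `W`, the toy dimension, and the random inputs below are assumptions for illustration; nothing here is trained.

```python
import math
import random

def nli_features(u, v):
    """Concatenate (u, v, |u - v|): the input to the softmax classifier."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(u, v, weights):
    """Linear layer over the pair features, then softmax over 3 NLI labels."""
    feats = nli_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    return softmax(logits)

random.seed(0)
dim = 4  # toy size; BERT-base sentence vectors would have 768 dimensions
u = [random.gauss(0, 1) for _ in range(dim)]
v = [random.gauss(0, 1) for _ in range(dim)]
# Hypothetical untrained weights: 3 classes x (3 * dim) features.
W = [[random.gauss(0, 0.1) for _ in range(3 * dim)] for _ in range(3)]

probs = classify(u, v, W)
print(probs)  # probabilities for entailment / contradiction / neutral
```

At inference time this classifier head is discarded; only the pooled sentence vectors are kept and compared by cosine similarity.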

The result: `sentence-transformers` on PyPI, 350+ pretrained models, 30 million downloads per month. Every RAG pipeline in 2024-2025 uses SBERT or its direct descendants. `all-MiniLM-L6-v2` is 80 MB, runs on CPU, returns 384-dimensional vectors in 5 ms.

Why is original BERT impractical for semantic search over a large document base?

Bi-encoder vs Cross-encoder: Speed vs Quality

Two worlds, two tradeoffs. Bi-encoder (SBERT) - precomputes each document vector independently, stores it in an index, and at search time does one matrix multiply. Cross-encoder - feeds the pair (query, document) into BERT together, sees token-level interactions. One is fast. The other is accurate.

Real systems use both. First, the bi-encoder retrieves the top 100 from millions of documents in 10 ms - ANN search via HNSW in Qdrant or Pinecone. Then the cross-encoder reranks those 100 in 200 ms - examining each pair in detail. Result: close to cross-encoder quality at close to bi-encoder latency. This is called retrieve-and-rerank.
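The two stages can be sketched in a few lines. Everything named here is hypothetical: `doc_vecs` stands in for precomputed bi-encoder vectors, `toy_pair_score` stands in for a cross-encoder (it just counts shared words), and the exhaustive sort replaces the ANN index a production system would use.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k):
    """Stage 1 (bi-encoder): rank documents by dot product with the query
    vector and keep the top k. Exhaustive here; production would use ANN."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

def rerank(query, docs, candidates, pair_score):
    """Stage 2 (cross-encoder): rescore each (query, document) pair jointly."""
    return sorted(candidates,
                  key=lambda i: pair_score(query, docs[i]),
                  reverse=True)

def toy_pair_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = [
    "how to reset a password",
    "password strength guidelines",
    "router factory settings",
    "cooking pasta at home",
]
# Toy precomputed document vectors (assumed to come from a bi-encoder).
doc_vecs = [[0.7, 0.1], [0.9, 0.2], [0.8, 0.3], [0.0, 1.0]]
query = "reset my password"
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, k=3)
ranking = rerank(query, docs, candidates, toy_pair_score)
print(candidates)  # [1, 2, 0]: cheap first-stage order
print(ranking)     # [0, 1, 2]: order after pairwise rescoring
```

The expensive scorer runs only on the `k` candidates, so its cost stays fixed no matter how large the collection grows.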

Cross-encoder outperforms because it sees token-level interactions between query and document through self-attention. Bi-encoder compresses meaning into one vector - information is lost. On long or ambiguous texts the quality gap reaches 10-15% NDCG@10.

Popular bi-encoders: `all-MiniLM-L6-v2` (384 dim, fast), `all-mpnet-base-v2` (768 dim, more accurate), `multilingual-e5-large` (100+ languages). Popular cross-encoders: `cross-encoder/ms-marco-MiniLM-L-6-v2` - trained on 500K query-document pairs from MS MARCO.

There are hybrids too. Late interaction - ColBERT. Instead of one vector per sentence - a matrix of token vectors. MaxSim between query and document matrices. Quality close to cross-encoder, speed closer to bi-encoder. Used in RAGatouille and enterprise search.
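MaxSim itself is short enough to write out. A minimal sketch, assuming toy 2-dimensional token vectors in place of real ColBERT token embeddings: for each query token, take the best-matching document token, then sum those maxima.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the maximum
    dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token matrices (assumptions): one row per token.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.6, 0.2]]              # covers only the first

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
print(score_a)  # about 1.7 (0.9 + 0.8)
print(score_b)  # about 1.1 (0.9 + 0.2)
```

Document token matrices can still be precomputed and indexed, which is where the bi-encoder-like speed comes from; only the cheap max-and-sum runs at query time.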

In a real search system, bi-encoder and cross-encoder are used in sequence. What does each do?

Contrastive Learning: SimCSE and NT-Xent

The goal: teach the model that similar sentences must be close in embedding space, and dissimilar ones far apart. Without labeled pairs. Contrastive learning solves this with provocative simplicity: take one sentence, run it through the model twice with different dropout masks - get two slightly different vectors. That is the positive pair.

SimCSE (2021, Gao et al.) - Simple Contrastive Learning of Sentence Embeddings. One forward pass with dropout 0.1, another with a different dropout mask. One sentence, two augmented views. All other sentences in the batch are negatives. Loss: NT-Xent (Normalized Temperature-scaled Cross-Entropy), also known as InfoNCE.
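Written out for a batch of $N$ sentences, the loss for sentence $i$ takes the following form, where $h_i$ and $h_i^{+}$ are the two dropout views of sentence $i$, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature:

$$
\ell_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+}) / \tau\right)}
$$

The numerator rewards agreement between the two views of the same sentence; the denominator sums over all views in the batch, so every other sentence acts as a negative.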

Temperature $\tau$ (typically 0.05 in SimCSE) is critical. Low temperature - sharp cluster boundaries, faster discrimination. High - softer, more noise-tolerant. In CLIP (OpenAI), $\tau$ is a learned parameter. In SimCSE it is fixed.
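The effect of temperature can be seen on a toy batch. This is a minimal sketch of an NT-Xent / InfoNCE loss in plain Python; the 2-dimensional vectors are made-up stand-ins for the two dropout views of each sentence.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nt_xent(views_a, views_b, tau):
    """Mean InfoNCE loss. views_a[i] and views_b[i] form a positive pair;
    every views_b[j] with j != i is an in-batch negative for views_a[i]."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / tau for other in views_b]
        top = max(logits)  # log-sum-exp trick for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch (assumption): two sentences, two slightly different views each.
views_a = [[1.0, 0.1], [0.1, 1.0]]
views_b = [[0.9, 0.2], [0.2, 0.9]]

for tau in (0.05, 0.5, 1.0):
    print(tau, nt_xent(views_a, views_b, tau))
```

On this easy batch the loss is close to zero at $\tau = 0.05$ and grows as $\tau$ increases: dividing by a small temperature stretches the gap between the positive and the negatives before the softmax.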

Why does this work? Dropout as data augmentation for text. Two forward passes of the same sentence produce semantically identical but numerically different vectors. The model learns an invariance: meaning must not depend on dropout noise. The same principle, pulling two views of one item together while pushing other items away, drives contrastive pretraining in CLIP, ALIGN, and CoCa.

Supervised SimCSE does even better: the positive pair is an entailment from NLI, the hard negative is a contradiction. The model learns from human logic - what it means to say the same thing and to say the opposite. Averaged over the seven standard STS tasks, Spearman correlation with BERT-base rises from 76.3% (unsupervised) to 81.6% (supervised). Averaged BERT embeddings without fine-tuning sit in the mid-50s.

Document embeddings - same story, different scale. For long documents, mean pooling over all tokens often beats CLS. The CLS token in BERT is designed for sentence-level tasks during fine-tuning - without fine-tuning it carries little semantic signal. Mean pooling averages information across all tokens. SBERT ablations found mean pooling the strongest of the three strategies by STS correlation, with CLS close behind and max pooling weakest.
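The three pooling strategies differ only in how they collapse the token matrix. A minimal sketch on a made-up 4-token encoder output, assuming an attention mask that marks the last position as padding:

```python
def mean_pool(token_vecs, mask):
    """Average token vectors where mask == 1, so padding is ignored."""
    kept = [vec for vec, m in zip(token_vecs, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

def max_pool(token_vecs, mask):
    """Element-wise maximum over non-padding token vectors."""
    kept = [vec for vec, m in zip(token_vecs, mask) if m]
    return [max(col) for col in zip(*kept)]

def cls_pool(token_vecs):
    """Take the vector at position 0, where BERT places [CLS]."""
    return token_vecs[0]

# Toy encoder output (assumption): 4 positions x 3 dimensions.
token_vecs = [
    [0.0, 1.0, 2.0],  # [CLS]
    [2.0, 1.0, 0.0],
    [4.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding, must not leak into the pooled vector
]
mask = [1, 1, 1, 0]

print(mean_pool(token_vecs, mask))  # [2.0, 2.0, 1.0]
print(max_pool(token_vecs, mask))   # [4.0, 4.0, 2.0]
print(cls_pool(token_vecs))         # [0.0, 1.0, 2.0]
```

Forgetting the mask is a common bug: padded positions would then drag the mean toward whatever the encoder emits for padding.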

Matryoshka Representation Learning (MRL, 2022) - one model, multiple dimensions on demand. The model is trained so that the first 64 dimensions already carry semantic signal, the first 128 carry more, and so on. `text-embedding-3-small` from OpenAI is trained with MRL - truncate to 256 dim and lose only ~3% quality on MTEB.
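On the consumer side, using a Matryoshka embedding at a smaller size amounts to keeping a prefix and renormalizing. A minimal sketch with a made-up 4-dimensional vector; the helper name `truncate_and_normalize` is hypothetical, and the trick is only meaningful for models trained with an MRL-style objective:

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` components and rescale to unit length, so that
    dot product equals cosine similarity again after truncation."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 0.0, 12.0]  # toy embedding (assumption)
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

With the OpenAI `text-embedding-3` models the same effect is available server-side through the `dimensions` request parameter, so the truncation does not have to be done by hand.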

How does unsupervised SimCSE create positive pairs for contrastive learning?

Summary

SBERT: siamese BERT with shared weights, mean pooling, cosine similarity - $O(n)$ encoder passes instead of $O(n^2)$
Bi-encoder is fast (precomputed index); cross-encoder is accurate (sees token interactions) - production systems use both in sequence
SimCSE: dropout as augmentation, NT-Xent loss, temperature $\tau = 0.05$ - sentence embeddings that were state of the art at publication, without labeled data
Mean pooling beats CLS for document embeddings - CLS without fine-tuning carries little semantic signal
Matryoshka loss (MRL) - one model for all dimensions; OpenAI text-embedding-3 is trained with it

Questions for reflection

SBERT produces one vector per sentence through mean pooling. What information does that vector lose compared to the full BERT output (a matrix of token vectors)? In which scenarios does that loss become critical?

Related lessons

nlp-05 — ELMo and contextual embeddings - the predecessor to BERT/SBERT
nlp-12 — SBERT builds on BERT with a siamese architecture
aie-09-embeddings — sentence-transformers in production - direct application of SBERT
aie-12-rag-fundamentals — RAG retrieval runs on sentence embeddings via bi-encoder
it-03 — NT-Xent / InfoNCE loss is a special case of KL minimization via mutual info
la-15-svd