Natural Language Processing
Sentence and Document Embeddings
2019. Darmstadt. Nils Reimers and Iryna Gurevych publish one paper, and semantic search becomes accessible to every developer. Before SBERT: finding the most similar pair among 10,000 sentences with a BERT cross-encoder took about 65 GPU hours. After: about 5 seconds. A roughly 47,000x difference, not from better hardware, but from one architectural idea.
- RAG pipelines in products such as Perplexity, Notion AI, and Linear AI typically rely on sentence embeddings - SBERT or its direct descendants
- GitHub Copilot uses code embeddings on the same principles for semantic search over repositories
- OpenAI text-embedding-3-small - 1536 dim, trained with Matryoshka loss, USD 0.02 per million tokens
- Qdrant, Pinecone, Weaviate - the entire vector database market was built to store sentence embeddings
- CLIP (OpenAI) uses a symmetric contrastive loss from the same family as NT-Xent to link text and images in a shared embedding space
Prerequisites
- Word embeddings: dense vectors and cosine similarity
- Contextual representations and the BERT architecture
- The semantic search task: ranking by vector proximity
From doc2vec to Sentence-BERT
Representing a whole sentence or document as a single vector grew in three steps. In 2014 Quoc Le and Tomas Mikolov proposed Paragraph Vector, better known as doc2vec, which extended word2vec with a trainable vector for an entire paragraph. In 2018 Google released the Universal Sentence Encoder, producing general-purpose sentence embeddings that transferred across tasks. The decisive step came from Nils Reimers and Iryna Gurevych in 2019 with Sentence-BERT: they reshaped BERT into a siamese network so that comparing sentences by vector proximity took milliseconds instead of hours. That turned semantic search from an expensive research trick into a tool any developer could reach for.
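To make the gap between the two setups concrete, here is a minimal sketch of the arithmetic, not the SBERT implementation itself. It counts the forward passes each approach needs for 10,000 sentences and then finds the closest pair among a few precomputed vectors by cosine similarity. The helper names and the toy vectors are hypothetical and stand in for real encoder output.

```python
import math
from itertools import combinations


def cross_encoder_passes(n: int) -> int:
    """A cross-encoder scores each unordered sentence pair in its own forward pass."""
    return n * (n - 1) // 2


def bi_encoder_passes(n: int) -> int:
    """A siamese (bi-encoder) setup encodes each sentence once, independently."""
    return n


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return the index pair (i, j) with the highest cosine similarity."""
    return max(
        combinations(range(len(embeddings)), 2),
        key=lambda pair: cosine(embeddings[pair[0]], embeddings[pair[1]]),
    )


# Hypothetical toy vectors standing in for sentence embeddings.
toy = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

print(cross_encoder_passes(10_000))  # 49995000 transformer passes
print(bi_encoder_passes(10_000))     # 10000 transformer passes
print(most_similar_pair(toy))        # (0, 1)
print(round(65 * 3600 / 5))          # 46800, the "roughly 47,000x" speedup
```

The pairwise comparison still happens in the siamese setup, but it runs over cheap vector operations rather than full transformer passes, which is where the speedup comes from.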