AI Engineering

Embeddings: Turning Text into Vectors for Search and Comparison

Lesson Objectives

Understand what embeddings are and how text is transformed into a vector of numbers
Compare embedding models and choose the right one for the task
Master cosine similarity and other distance metrics between vectors
Generate embeddings via API with batch optimization
Apply embeddings for search, deduplication, classification, and anomaly detection

Netflix stores every film as a vector of 128 numbers. Spotify - every song. Tinder - every person. Same math, three domains, billions of matches. 'king - man + woman = queen' doesn't come from rules - it emerges from compressing 300 billion words into 1536 numbers. That's not magic. It's a side effect of linear algebra at scale.

GitHub Copilot uses embeddings to find relevant code in a repository - every file becomes a vector, search works via cosine similarity
Notion AI builds a semantic index of all user pages via embeddings - knowledge base search without keywords
Intercom classifies 500K+ tickets per day via embedding comparison (USD 0.00002/ticket) - 500x cheaper than GPT-4o
Stack Overflow uses embeddings to find duplicate questions - 'already asked' matches by meaning, not by words

Evolution of Embedding Technology

**Word2Vec (Mikolov, Google, 2013)** - the model that popularized the idea that words can be represented as vectors with structure. That's where the famous 'king - man + woman ≈ queen' came from - not programmed explicitly, but emerging from geometry. **GloVe (Stanford, 2014)** improved the approach through global co-occurrence statistics. **BERT embeddings (Devlin, Google, 2018)** - the shift from words to context: one token gets a different vector depending on the sentence. **OpenAI text-embedding-ada-002 (2022)** - commercialization: high quality via a simple API call. **text-embedding-3-small/large (2024)** - flexible dimensions and better quality than ada-002, with text-embedding-3-small also 5x cheaper.

Prerequisites

How LLMs Work: Tokens, Embeddings, Attention

What Are Embeddings and Why They Matter

A vector of numbers - and suddenly an algorithm knows that 'The Godfather' is closer to 'Once Upon a Time in America' than to 'Shrek 2'. How? Not from explicit rules - from the structure of data. That's an embedding: **compressing meaning into coordinates of a high-dimensional space**.

Each of the 1536 dimensions encodes some aspect of meaning. Not 'topic' or 'sentiment' directly - these are abstract features learned by the model from hundreds of billions of words. The effect is clear: semantically similar texts receive similar coordinates.

Word2Vec (Mikolov, Google, 2013) was the model that made this property widely known. That's where the famous **king - man + woman ≈ queen** came from. Not programmed explicitly - it emerged from the geometry of vector space. GloVe (2014, Stanford) solidified the approach. BERT embeddings (2018) moved it to full sentence context. OpenAI text-embedding-ada-002 (2022) made all of this available through a single API call.

**GPS analogy:** latitude and longitude encode position on a map using two numbers. An embedding encodes a text's position in 'meaning space' - except there are not 2 dimensions, but 1536. And the distance between points is semantic closeness.

For a backend developer, embeddings are the foundation for an entire class of tasks:

**Semantic Search** - finding documents by meaning, not keywords
**RAG (Retrieval-Augmented Generation)** - feeding relevant context into an LLM
**Deduplication** - finding semantic duplicates in a database
**Clustering** - automatically grouping tickets, reviews, messages
**Recommendations** - 'similar articles', 'similar products' based on description meaning
**Anomaly Detection** - identifying texts that are significantly different from the rest

What is a text embedding?

Models for Creating Embeddings

Embedding models are a separate class. They don't generate text - they **compress meaning into a vector**. Trained differently: the goal isn't 'predict the next token', it's 'make similar texts close in space'. Model choice affects search quality, speed, and cost.

| Model | Dimensions | Price / 1M tokens | Features |
| --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | USD 0.02 | Best price/quality ratio, flexible dimensions |
| text-embedding-3-large (OpenAI) | 3072 | USD 0.13 | Maximum OpenAI quality, supports dimensions param |
| embed-v4.0 (Cohere) | 1024 | USD 0.10 | Multilingual, input_type for different tasks |
| voyage-3 (Voyage AI) | 1024 | USD 0.06 | Best for code retrieval |
| nomic-embed-text (open-source) | 768 | free | Run locally, Ollama-compatible |
| BGE-M3 (BAAI, open-source) | 1024 | free | Multilingual, dense + sparse embeddings |

Model quality is usually compared on the **MTEB benchmark** (Massive Text Embedding Benchmark) - the industry standard. text-embedding-3-small has ranked well for its price, but leaderboard positions shift as new models are released, so check the current standings before choosing. BGE-M3 competes with paid solutions for multilingual tasks.

OpenAI text-embedding-3 supports **flexible dimensions**: request a shorter vector (256, 512) with minimal quality loss - saving memory and speeding up search in Qdrant or pgvector.
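With the OpenAI API, a shorter vector is normally requested through the `dimensions` parameter. The sketch below shows a client-side version of the same idea, assuming a model trained to tolerate truncation (such as text-embedding-3): keep the first components, then rescale to unit length. The function name `shorten_embedding` and the 4-number toy vector are illustrative only.

```python
import math
from typing import List


def shorten_embedding(vector: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale the result to length 1."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # zero vector: nothing to rescale
    return [x / norm for x in head]


# Toy unit-length "embedding" with 4 dimensions (real ones have 1536 or 3072).
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, 2)
print([round(x, 4) for x in short])  # [0.7071, 0.7071]
```

Re-normalizing matters because truncation shrinks the vector's length, and dot-product search assumes unit-length vectors.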

**Critically important:** the embedding model cannot be changed after building an index. If the database is built on text-embedding-3-small, searching with a vector from text-embedding-3-large won't work - the spaces are incompatible. Migration = recalculating all embeddings.

For most backend tasks, text-embedding-3-small is the optimal choice. Processing 1 million documents of average length (500 tokens) costs ~`USD 10` for embeddings. This is a one-time operation - a vector is only recalculated when the text changes.
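The arithmetic behind that estimate, as a small sketch. The helper name `embedding_cost_usd` is hypothetical; the numbers are the ones used in the paragraph above.

```python
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       price_per_million_tokens: float) -> float:
    """Estimate the one-time cost of embedding a document collection."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens


# 1,000,000 documents x 500 tokens at USD 0.02 per 1M tokens
cost = embedding_cost_usd(1_000_000, 500, 0.02)
print(f"USD {cost:.2f}")  # USD 10.00
```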

Why can't the embedding model be swapped after the index is built without recalculating?

Cosine Similarity: How to Measure Vector Closeness

Embeddings turn text into a vector. The next step is to **compare** two vectors. The intuitive answer: compute the distance between points. That's a trap.

Many embedding models return vectors of different magnitudes for different texts. With such models, a long document can have a larger norm than a short phrase even if the meaning is identical. Euclidean distance on the raw vectors doesn't account for this and would treat the two texts as far apart.

**Cosine similarity** doesn't measure distance - it measures the **angle between vectors**. Vector magnitude doesn't matter. Only direction does. Result: a number from -1 to 1:

**1.0** - vectors point in the same direction (identical meaning)
**0.0** - vectors are orthogonal (no relationship)
**-1.0** - vectors point in opposite directions (practically never seen with text embeddings)

| Metric | Formula | When to Use |
| --- | --- | --- |
| Cosine Similarity | cos(θ) = dot(A,B) / (‖A‖ × ‖B‖) | Standard choice for text embeddings |
| Euclidean (L2) | ‖A - B‖₂ | When absolute vector magnitude matters |
| Dot Product | A · B | For normalized vectors (equivalent to cosine) |
| Manhattan (L1) | Σ\|Aᵢ - Bᵢ\| | Sparse data, less sensitive to outliers |
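A minimal pure-Python sketch of the four metrics in the table, using two toy vectors that point in the same direction but differ 10x in magnitude. It illustrates why cosine similarity ignores magnitude while the distance metrics do not. The vectors are made up for the example.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]  # same direction as a, 10x the magnitude

print(round(cosine_similarity(a, b), 6))  # 1.0   (direction is identical)
print(round(euclidean(a, b), 2))          # 33.67 (magnitude dominates)
print(manhattan(a, b))                    # 54.0
```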

**Practical note:** OpenAI text-embedding-3 returns already **normalized** vectors (length = 1). For normalized vectors both norms equal 1, so cosine similarity = dot product: the two norm computations and the division can be skipped. That's why Qdrant, pgvector, and other vector databases can use dot product internally when working with normalized vectors.
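The equivalence can be checked on toy data. In this sketch the two vectors are normalized by hand (real API output would already be normalized), and the plain dot product matches the full cosine formula.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = normalize([4.0, 3.0])  # -> [0.8, 0.6]

print(round(dot(a, b), 6))                # 0.96
print(round(cosine_similarity(a, b), 6))  # 0.96
```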

Cosine similarity between two embeddings is 0.92. What does this mean?

Generating Embeddings: API, Batch Processing, and Optimization

In production, embeddings need to be generated for thousands or millions of documents. The naive approach is one request per document: 10,000 documents × 200 ms ≈ 33 minutes. That's not production - that's waiting.

**Batch processing** - up to 2048 texts in one request. The same 10,000 documents = 5 requests ≈ 5 seconds (assuming roughly 1 second per batched request). A 400x difference.
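A minimal batching sketch. To keep it runnable without network access, the actual embedding call is passed in as a function; `fake_embed_batch` is a stand-in invented for this example and returns dummy one-number vectors, not real embeddings.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 2048,
) -> List[Vector]:
    """Embed all texts, sending at most `batch_size` texts per call."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(result)  # order of results follows order of inputs
    return vectors


# Stand-in for a real API call (assumption): records each batch size.
calls: List[int] = []


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


docs = [f"doc {i}" for i in range(10_000)]
vectors = embed_in_batches(docs, fake_embed_batch)
print(len(vectors), len(calls))  # 10000 5
```

With the official OpenAI Python client, `embed_batch` could wrap `client.embeddings.create(model="text-embedding-3-small", input=batch)` and return `[item.embedding for item in response.data]`. Retry and rate-limit handling are left out of this sketch.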

| Parameter | text-embedding-3-small | text-embedding-3-large |
| --- | --- | --- |
| Max tokens per input | 8191 | 8191 |
| Max texts per batch | 2048 | 2048 |
| Rate limit (tier 1) | 3,000 RPM / 1M TPM | 3,000 RPM / 1M TPM |
| Vector dimensions | 1536 (default) | 3072 (default) |
| Minimum dimensions | 256 | 256 |

For clustering and deduplication (comparing texts against each other), 256 or 512 dimensions are usually enough. For semantic search and RAG, full dimensions are recommended: with reduced dimensions, quality drops visibly on short texts.

**Text longer than 8191 tokens cannot be embedded as a single input.** Depending on the provider and its settings, an over-length input is either rejected with an error or truncated, in which case the tail is silently ignored. Check the length before sending and split the text into chunks if needed (chunking is a separate topic covered in the chunking strategies lesson).
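A rough sketch of the length check and split. It works on a list of tokens; `naive_tokenize` is a whitespace stand-in used only to keep the example self-contained. Real token counts come from the model's own tokenizer (for OpenAI models, the `tiktoken` library) and will differ from word counts.

```python
from typing import List, Sequence


def naive_tokenize(text: str) -> List[str]:
    # Assumption for illustration: one whitespace-separated word = one token.
    return text.split()


def split_into_chunks(tokens: Sequence[str], max_tokens: int = 8191) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [list(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]


text = "word " * 20_000            # toy document, 20,000 "tokens"
tokens = naive_tokenize(text)
chunks = split_into_chunks(tokens, max_tokens=8191)
print([len(c) for c in chunks])    # [8191, 8191, 3618]
```

Fixed-size splitting ignores sentence and paragraph boundaries, so it is a safety net rather than a chunking strategy.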

To generate embeddings for 10,000 documents via OpenAI API with batch size 2048, what is the minimum number of API requests required?

Practical Applications of Embeddings in Backend

Embeddings are not an academic concept. They're a practical tool that already solves real problems cheaper and faster than LLMs. Five patterns - with code.

1. Semantic Search

Keyword search matches words. Semantic search finds documents by **meaning** - even when not a single word in the query matches the document. Query 'app is slow on older phones' finds 'Performance optimization for low-end devices' - no shared words whatsoever.
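A minimal brute-force search sketch. The 3-number vectors are hand-made stand-ins for real embeddings (an assumption for illustration), chosen so that the query lands near the performance document. The document titles other than the one quoted above are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec: Sequence[float], index: Dict[str, List[float]],
           top_k: int = 3) -> List[Tuple[float, str]]:
    """Score every document against the query and return the best top_k."""
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]


# Toy index: document title -> pretend embedding.
index = {
    "Performance optimization for low-end devices": [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.0, 0.2, 0.9],
    "Dark mode release notes": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "app is slow on older phones"

results = search(query_vec, index, top_k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Scanning every vector is fine for a few thousand documents. Beyond that, a vector database with an approximate nearest neighbor index does the same job faster.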

2. Content Deduplication
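Two texts can be treated as semantic duplicates when the cosine similarity of their embeddings exceeds a threshold. The sketch below compares every pair in a small collection. The vectors are toy values and the 0.92 threshold is illustrative, since a real threshold has to be calibrated on the actual data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(vectors: Dict[str, List[float]],
                    threshold: float = 0.92) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    ids = list(vectors)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:  # each unordered pair is checked once
            score = cosine_similarity(vectors[first], vectors[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Toy embeddings: q1 and q2 are near-identical, q3 is unrelated.
vectors = {
    "q1": [1.0, 0.0, 0.1],
    "q2": [0.98, 0.05, 0.12],
    "q3": [0.0, 1.0, 0.0],
}
duplicates = find_duplicates(vectors)
print([(a, b) for a, b, _ in duplicates])  # [('q1', 'q2')]
```

The pairwise loop is quadratic in the number of texts. For large collections, the usual approach is to query a vector index for each item's nearest neighbors instead.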

3. Anomaly Detection

If a new text's embedding is far from all others in the collection - it's an anomaly. Useful for spam filtering, detecting irrelevant content, and monitoring.
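One simple way to express "far from all others" is to look at the best similarity the new vector achieves against the collection. The sketch below flags a text when even its closest neighbor falls below a cutoff. The vectors and the 0.5 cutoff are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_anomaly(new_vec: Sequence[float], collection: List[List[float]],
               min_similarity: float = 0.5) -> Tuple[bool, float]:
    """Flag new_vec if its nearest neighbor is less similar than min_similarity."""
    if not collection:
        raise ValueError("collection must not be empty")
    best = max(cosine_similarity(new_vec, vec) for vec in collection)
    return best < min_similarity, best


# Toy collection of embeddings that all point roughly the same way.
collection = [[1.0, 0.1], [0.9, 0.2], [0.95, 0.0]]

typical_flag, typical_best = is_anomaly([1.0, 0.05], collection)
outlier_flag, outlier_best = is_anomaly([-0.1, 1.0], collection)
print(typical_flag, round(typical_best, 2))  # False 1.0
print(outlier_flag, round(outlier_best, 2))  # True 0.12
```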

4. Clustering

Embeddings allow automatic grouping of content without manual rules. Simply cluster the vectors - texts on the same topic will end up in the same cluster. The same idea is commonly cited behind features such as Spotify mood playlists and YouTube's related videos.
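A deliberately simple sketch of the idea, using greedy "leader" clustering: each item joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. This is a stand-in chosen for brevity. Production systems more commonly use algorithms such as k-means or HDBSCAN. The ticket ids, vectors, and the 0.8 threshold are invented.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors: Dict[str, List[float]],
                   threshold: float = 0.8) -> List[List[str]]:
    """Group ids; the first member of each cluster acts as its leader."""
    clusters: List[List[str]] = []
    for item_id, vec in vectors.items():
        for cluster in clusters:
            leader_vec = vectors[cluster[0]]
            if cosine_similarity(vec, leader_vec) >= threshold:
                cluster.append(item_id)
                break
        else:
            clusters.append([item_id])  # no similar leader found: new cluster
    return clusters


# Toy embeddings for four support tickets on two topics.
tickets = {
    "billing-1": [1.0, 0.1],
    "billing-2": [0.9, 0.2],
    "login-1": [0.1, 1.0],
    "login-2": [0.2, 0.9],
}
clusters = greedy_cluster(tickets)
print(clusters)  # [['billing-1', 'billing-2'], ['login-1', 'login-2']]
```

The result of leader clustering depends on input order and on the threshold, which is one reason it is only a teaching device here.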

5. Classification by Examples (Few-shot)

**Embedding classification vs LLM classification:** The embedding approach costs `USD 0.00002` per ticket and runs in 50ms. The LLM approach (GPT-4o) costs `USD 0.01` per ticket and runs in 500ms. For high-volume tasks (thousands of tickets per day), embeddings are 500x more cost-effective.
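One way to classify by examples is a nearest-centroid rule: average the embeddings of a few labeled examples per category, then assign a new ticket to the category whose centroid is most similar. The sketch below assumes that approach. The labels and 3-number vectors are toy values, not real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def classify(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine_similarity(vec, centroids[label]))


# A few labeled example embeddings per category (toy values).
examples = {
    "billing": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "bug": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "feature_request": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in examples.items()}

ticket_vec = [0.2, 0.8, 0.1]  # pretend embedding of a new ticket
label = classify(ticket_vec, centroids)
print(label)  # bug
```

Centroids are computed once and reused, so classifying a ticket costs one embedding call plus a handful of dot products.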

What is the optimal approach for classifying 50,000 support tickets per day into 5 categories?

**Myth:** Cosine similarity = distance between vectors

**Reality:** Cosine similarity reflects the angle between vectors, not distance. A high cosine doesn't guarantee semantic closeness in out-of-domain tasks

Cosine similarity measures cos(theta) - the cosine of the angle between directions. Two vectors can be 'close in direction' (score 0.85), yet represent texts from different domains where that closeness is meaningless. A score of 0.8 for cooking texts and a score of 0.8 for legal documents are completely different things. Thresholds must be calibrated per domain.

**Myth:** The embedding model can be chosen and swapped later - it's just an API

**Reality:** Changing the model = recalculating the entire index. Different models create incompatible vector spaces

text-embedding-3-small and text-embedding-3-large are different spaces, even if dimensions match after the dimensions parameter. The same text gets different coordinates in different spaces. Comparing a vector from one model with a vector from another is like comparing GPS coordinates with screen pixels.

Key Takeaways

Embedding - a vector of numbers (1536 for text-embedding-3-small) encoding meaning. Traces back to Word2Vec 2013 - same principle, incomparably better quality
text-embedding-3-small: USD 0.02/1M tokens, 1536 dim - optimal for most backend tasks
Cosine similarity - the cosine of the angle between vectors, not distance. Thresholds need to be calibrated per domain
Batch processing up to 2048 texts per request - 400x speedup vs naive approach
Embedding model is chosen once. Changing it means recalculating the entire index
Netflix, Spotify, Tinder - same math, different domains. Embedding classification is 500x cheaper than LLM for high-volume tasks

Questions for Reflection

Netflix stores a film as a vector of 128 numbers. What 'dimensions' might be encoded? Genre? Pace? Mood? What else?
Cosine similarity returns 0.87 for two support tickets. The deduplication threshold is 0.92. What domain context is needed before adjusting the threshold?
1 million documents x 500 tokens x USD 0.02/1M tokens = how much does it cost to build an index? Is this a one-time cost or recurring?

What's Next

Embedding generation is covered. Now a place to store and quickly search through millions of vectors is needed. A regular SQL database without a vector index can't handle nearest neighbor search in a 1536-dimensional space in milliseconds - a specialized tool or extension is required.

Vector Databases — Storing and searching millions of embeddings - pgvector, Pinecone, Qdrant
Document Processing — Extracting text from PDF, DOCX, HTML before generating embeddings

Related Lessons

aie-03-llm-fundamentals — Embeddings come from the same transformer internals
aie-10-vector-databases — Embeddings need a vector store to scale search
aie-12-rag-fundamentals — Embeddings are the retrieval backbone of RAG
ml-35-word-embeddings — Modern text embeddings extend word2vec ideas
la-02-dot-product — Cosine similarity is a normalized dot product
alg-10-binary-search
db-30-vector