AI Engineering
Embeddings: Turning Text into Vectors for Search and Comparison
Lesson Objectives
- Understand what embeddings are and how text is transformed into a vector of numbers
- Compare embedding models and choose the right one for the task
- Master cosine similarity and other distance metrics between vectors
- Generate embeddings via API with batch optimization
- Apply embeddings for search, deduplication, classification, and anomaly detection
Netflix stores every film as a vector of 128 numbers. Spotify - every song. Tinder - every person. Same math, three domains, billions of matches. 'king - man + woman = queen' doesn't come from rules - it emerges from compressing 300 billion words into 1536 numbers. That's not magic. It's a side effect of linear algebra at scale.
- GitHub Copilot uses embeddings to find relevant code in a repository - every file becomes a vector, search works via cosine similarity
- Notion AI builds a semantic index of all user pages via embeddings - knowledge base search without keywords
- Intercom classifies 500K+ tickets per day via embedding comparison (USD 0.00002/ticket) - 500x cheaper than GPT-4o
- Stack Overflow uses embeddings to find duplicate questions - 'already asked' matches by meaning, not by words
Evolution of Embedding Technology
**Word2Vec (Mikolov, Google, 2013)** - the first model to show that words can be represented as vectors with structure. That's where the famous 'king - man + woman ≈ queen' came from - not programmed explicitly, but emerging from geometry. **GloVe (Stanford, 2014)** improved the approach through global co-occurrence statistics. **BERT embeddings (Devlin, Google, 2018)** - the shift from words to context: one token gets a different vector depending on the sentence. **OpenAI text-embedding-ada-002 (2022)** - commercialization: high quality via a simple API call. **text-embedding-3-small/large (2024)** - flexible dimensions and better quality than ada-002; the small model is also 5x cheaper.
Prerequisites
What Are Embeddings and Why They Matter
A vector of numbers - and suddenly an algorithm knows that 'The Godfather' is closer to 'Once Upon a Time in America' than to 'Shrek 2'. How? Not from explicit rules - from the structure of data. That's an embedding: **compressing meaning into coordinates of a high-dimensional space**.
Each of the 1536 dimensions encodes some aspect of meaning. Not 'topic' or 'sentiment' directly - these are abstract features learned by the model from hundreds of billions of words. The effect is clear: semantically similar texts receive similar coordinates.
Word2Vec (Mikolov, Google, 2013) was the first model to show this property. That's where the famous **king - man + woman ≈ queen** came from. Not programmed explicitly - it emerged from the geometry of vector space. GloVe (2014, Stanford) solidified the approach. BERT embeddings (2018) moved it to full sentence context. OpenAI text-embedding-ada-002 (2022) made all of this available through a single API call.
**GPS analogy:** latitude and longitude encode position on a map using two numbers. An embedding encodes a text's position in 'meaning space' - except there are not 2 dimensions, but 1536. And the distance between points is semantic closeness.
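The vector arithmetic behind 'king - man + woman ≈ queen' can be illustrated with a toy sketch. The 2-D vectors below are hand-made for illustration (they are not output of Word2Vec or any real model), and `analogy` is a hypothetical helper name.

```python
import math

# Hand-made 2-D toy vectors (NOT real model output):
# axis 0 loosely means "royalty", axis 1 loosely means "gender".
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy))  # queen
```

In a real model the result is only approximately 'queen': the nearest neighbor of the computed point, not an exact match.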
For a backend developer, embeddings are the foundation for an entire class of tasks:
- **Semantic Search** - finding documents by meaning, not keywords
- **RAG (Retrieval-Augmented Generation)** - feeding relevant context into an LLM
- **Deduplication** - finding semantic duplicates in a database
- **Clustering** - automatically grouping tickets, reviews, messages
- **Recommendations** - 'similar articles', 'similar products' based on description meaning
- **Anomaly Detection** - identifying texts that are significantly different from the rest
Models for Creating Embeddings
Embedding models are a separate class. They don't generate text - they **compress meaning into a vector**. Trained differently: the goal isn't 'predict the next token', it's 'make similar texts close in space'. Model choice affects search quality, speed, and cost.
| Model | Dimensions | Price / 1M tokens | Features |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | USD 0.02 | Best price/quality ratio, flexible dimensions |
| text-embedding-3-large (OpenAI) | 3072 | USD 0.13 | Maximum OpenAI quality, supports dimensions param |
| embed-v4.0 (Cohere) | 1024 | USD 0.10 | Multilingual, input_type for different tasks |
| voyage-3 (Voyage AI) | 1024 | USD 0.06 | Best for code retrieval |
| nomic-embed-text (open-source) | 768 | free | Run locally, Ollama-compatible |
| BGE-M3 (BAAI, open-source) | 1024 | free | Multilingual, dense + sparse embeddings |
Model quality is measured via the **MTEB benchmark** (Massive Text Embedding Benchmark) - the industry standard. text-embedding-3-small consistently ranks in the top 15. BGE-M3 competes with paid solutions for multilingual tasks.
OpenAI text-embedding-3 supports **flexible dimensions**: request a shorter vector (256, 512) with minimal quality loss - saving memory and speeding up search in Qdrant or pgvector.
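A minimal sketch of what a shorter vector means in practice, assuming full-size vectors are already stored: keep the first components and rescale to unit length. `shorten` is a hypothetical helper and the 4-number vector is a made-up stand-in for a 1536-dimensional embedding. When calling the API directly, the shorter vector can be requested with the `dimensions` parameter instead.

```python
import math

def shorten(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Made-up unit vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = shorten(full, 2)
print(len(short), round(math.sqrt(sum(x * x for x in short)), 6))  # 2 1.0

# Requesting the short vector from the API instead (requires the openai package
# and an API key, so it is left as a comment here):
#   client.embeddings.create(model="text-embedding-3-small", input=texts, dimensions=256)
```

Re-normalizing matters because a truncated vector no longer has length 1, which would break the "dot product equals cosine" shortcut.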
**Critically important:** the embedding model cannot be changed after building an index. If the database is built on text-embedding-3-small, searching with a vector from text-embedding-3-large won't work - the spaces are incompatible. Migration = recalculating all embeddings.
For most backend tasks, text-embedding-3-small is the optimal choice. Processing 1 million documents of average length (500 tokens) costs ~`USD 10` for embeddings. This is a one-time operation - a vector is only recalculated when the text changes.
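The cost estimate above is simple arithmetic; a small sketch makes it easy to re-run with other volumes or prices. `embedding_cost_usd` is a hypothetical helper, and the price is the text-embedding-3-small figure from the table.

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million_tokens):
    """Estimate the one-time cost of embedding a collection."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

cost = embedding_cost_usd(1_000_000, 500, 0.02)
print(f"USD {cost:.2f}")  # USD 10.00
```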
Why can't the embedding model be swapped after the index is built without recalculating?
Cosine Similarity: How to Measure Vector Closeness
Embeddings turn text into a vector. The next step is to **compare** two vectors. The intuitive answer: compute the distance between points. That's a trap.
Not every embedding model returns unit-length vectors. With un-normalized output, two texts with the same meaning can have different norms (for example, a long document versus a short phrase), and Euclidean distance then reports a gap that has nothing to do with meaning.
**Cosine similarity** doesn't measure distance - it measures the **angle between vectors**. Vector magnitude doesn't matter. Only direction does. Result: a number from -1 to 1:
- **1.0** - vectors point in the same direction (identical meaning)
- **0.0** - vectors are orthogonal (no relationship)
- **-1.0** - vectors point in opposite directions (practically never seen with text embeddings)
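A short sketch of why magnitude does not affect cosine similarity. The two vectors are made up: they point in the same direction, but one is 10 times longer. Function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_text = [1.0, 2.0, 3.0]
long_text = [10.0, 20.0, 30.0]   # same direction, 10x the magnitude

print(round(cosine_similarity(short_text, long_text), 6))   # 1.0
print(round(euclidean_distance(short_text, long_text), 3))  # 33.675
```

Cosine reports a perfect match, while Euclidean distance reports a large gap caused purely by vector length.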
| Metric | Formula | When to Use |
|---|---|---|
| Cosine Similarity | cos(θ) = dot(A,B) / (‖A‖ × ‖B‖) | Standard choice for text embeddings |
| Euclidean (L2) | ‖A - B‖₂ | When absolute vector magnitude matters |
| Dot Product | A · B | For normalized vectors (equivalent to cosine) |
| Manhattan (L1) | Σ \|Aᵢ - Bᵢ\| | Sparse data, less sensitive to outliers |
**Practical note:** OpenAI text-embedding-3 returns already **normalized** vectors (length = 1). For normalized vectors, cosine similarity = dot product. Dot product is one operation instead of three. That's why Qdrant, pgvector, and other vector databases use dot product internally when working with normalized vectors.
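The equivalence is easy to check on a small example. The sketch below uses made-up 2-D vectors; `normalize` and `dot` are illustrative helper names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

print(round(cosine_similarity(a, b), 6))  # 0.96
print(round(dot(a_unit, b_unit), 6))      # 0.96
```

Normalizing once at write time moves the two norm computations out of every comparison, leaving only the dot product at query time.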
Generating Embeddings: API, Batch Processing, and Optimization
In production, embeddings need to be generated for thousands and millions of documents. The naive approach: one request per document. 10,000 documents x 200ms = 33 minutes. That's not production - that's waiting.
**Batch processing** - up to 2048 texts in one request. The same 10,000 documents = 5 requests ≈ 5 seconds. A 400x difference.
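A minimal batching sketch, under a few assumptions: `embed_all` and `make_openai_embedder` are hypothetical helper names, the client is an already configured `openai.OpenAI` instance, and the fake embedder exists only so the example runs offline. Retries and rate-limit handling are left out.

```python
import math

def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(texts, embed_batch, batch_size=2048):
    """Embed `texts` in batches; `embed_batch` maps list[str] -> list[list[float]]."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

def make_openai_embedder(client, model="text-embedding-3-small"):
    """Wrap an `openai.OpenAI` client (assumed to be installed and configured)."""
    def embed_batch(batch):
        response = client.embeddings.create(model=model, input=batch)
        return [item.embedding for item in response.data]
    return embed_batch

# Offline stand-in so the sketch runs without network access.
calls = []
def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

docs = [f"document {i}" for i in range(10_000)]
vectors = embed_all(docs, fake_embed_batch)
print(len(vectors), len(calls), math.ceil(len(docs) / 2048))  # 10000 5 5
```

Passing the embedding function in as a parameter keeps the batching logic testable and makes it straightforward to swap providers.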
| Parameter | text-embedding-3-small | text-embedding-3-large |
|---|---|---|
| Max tokens per input | 8191 | 8191 |
| Max texts per batch | 2048 | 2048 |
| Rate limit (tier 1) | 3,000 RPM / 1M TPM | 3,000 RPM / 1M TPM |
| Vector dimensions | 1536 (default) | 3072 (default) |
| Minimum dimensions | 256 | 256 |
For clustering and deduplication (comparing texts against each other), 256 or 512 dimensions are usually enough. For semantic search and RAG, full dimensions are recommended: quality drops noticeably on short texts when dimensions are reduced.
**Text longer than 8191 tokens is not truncated for you.** The OpenAI API rejects an over-limit input with an invalid-request error, and one oversized text fails the whole batch request. Some other providers and local runtimes do truncate silently, dropping the tail. In both cases, check the length before sending and split the text into chunks if needed (chunking is a separate topic covered in the chunking strategies lesson).
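A length check before sending can be sketched as a split over token ids. `split_tokens` is a hypothetical helper; the commented lines assume the `tiktoken` package is installed and that `cl100k_base` is the right encoding for the model in use.

```python
def split_tokens(tokens, max_tokens=8191):
    """Split a token-id list into consecutive chunks of at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# With tiktoken installed (an assumption), real token ids could come from:
#   enc = tiktoken.get_encoding("cl100k_base")
#   tokens = enc.encode(text)
#   chunks = [enc.decode(chunk) for chunk in split_tokens(tokens)]
tokens = list(range(20_000))          # stand-in for real token ids
chunks = split_tokens(tokens)
print([len(c) for c in chunks])       # [8191, 8191, 3618]
```

A fixed-size split like this ignores sentence and paragraph boundaries, so it is a safety net rather than a chunking strategy.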
To generate embeddings for 10,000 documents via OpenAI API with batch size 2048, what is the minimum number of API requests required?
Practical Applications of Embeddings in Backend
Embeddings are not an academic concept. They're a practical tool that already solves real problems cheaper and faster than LLMs. Five patterns - with code.
1. Semantic Search
Keyword search matches words. Semantic search finds documents by **meaning** - even when not a single word in the query matches the document. Query 'app is slow on older phones' finds 'Performance optimization for low-end devices' - no shared words whatsoever.
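A minimal in-memory sketch of semantic search: score every document against the query and keep the best matches. The 3-number vectors are made up for illustration (real ones would come from the embedding model), and `search` is a hypothetical helper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, index, top_k=3):
    """Return the `top_k` (score, doc_id) pairs, best match first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]

# Made-up vectors standing in for document embeddings.
index = {
    "Performance optimization for low-end devices": [0.9, 0.1, 0.0],
    "Billing and invoices FAQ":                     [0.0, 0.9, 0.2],
    "Resetting your password":                      [0.1, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]   # stand-in for the embedding of 'app is slow on older phones'

for score, doc_id in search(query, index, top_k=2):
    print(f"{score:.3f}  {doc_id}")
```

A linear scan like this is fine for a few thousand documents; beyond that, an approximate nearest-neighbor index in a vector database takes over the same job.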
2. Content Deduplication
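A minimal sketch of deduplication, assuming embeddings are already computed: compare every pair and report those at or above a similarity threshold. The vectors are made up, `find_duplicates` is a hypothetical helper, and 0.92 is an illustrative threshold that would need calibration per domain.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_duplicates(embeddings, threshold=0.92):
    """Return (id_a, id_b, score) for every pair scoring at or above `threshold`."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs

# Made-up vectors; comments show the kind of text each might represent.
embeddings = {
    "q1": [0.9, 0.1, 0.1],    # 'How do I reset my password?'
    "q2": [0.88, 0.12, 0.1],  # 'Forgot password, how to change it?'
    "q3": [0.1, 0.9, 0.2],    # 'Where can I download invoices?'
}
print(find_duplicates(embeddings))
```

Pairwise comparison grows quadratically with collection size, so for large datasets the usual alternative is one nearest-neighbor query per new item against a vector index.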
3. Anomaly Detection
If a new text's embedding is far from all others in the collection - it's an anomaly. Useful for spam filtering, detecting irrelevant content, and monitoring.
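One way to sketch this rule: find the best match in the collection and flag the text when even that match is weak. The vectors are made up, `is_anomaly` is a hypothetical helper, and the 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def is_anomaly(new_vec, collection, threshold=0.5):
    """Flag `new_vec` if its best match in `collection` scores below `threshold`."""
    best = max(cosine(new_vec, vec) for vec in collection)
    return best < threshold

# Made-up embeddings of 'normal' texts that all point in a similar direction.
collection = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.85, 0.15, 0.05],
]

print(is_anomaly([0.88, 0.1, 0.05], collection))  # False
print(is_anomaly([0.0, 0.1, 0.95], collection))   # True
```

Comparing against the nearest neighbor rather than the collection average keeps the check meaningful when the collection covers several distinct topics.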
4. Clustering
Embeddings allow automatic grouping of content without manual rules. Simply cluster the vectors - texts on the same topic will end up in the same cluster. That's how Spotify mood playlists work, and YouTube's related videos.
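A minimal clustering sketch. It uses greedy 'leader' clustering with a cosine threshold, swapped in here as a simple stand-in for k-means or HDBSCAN; the text above does not say which algorithm any of the named products use. The vectors are made up, and the 0.8 threshold is illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose leader is similar enough."""
    leaders, labels = [], []
    for vec in vectors:
        for label, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                labels.append(label)
                break
        else:
            # No existing leader is close enough: this vector starts a new cluster.
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels

vectors = [
    [0.9, 0.1],  # 'refund not received'
    [0.1, 0.9],  # 'app crashes on login'
    [0.8, 0.2],  # 'charged twice'
    [0.2, 0.8],  # 'crash after update'
]
print(cluster(vectors))  # [0, 1, 0, 1]
```

The result depends on input order and on the threshold, which is the price of not having to choose the number of clusters up front.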
5. Classification by Examples (Few-shot)
**Embedding classification vs LLM classification:** The embedding approach costs `USD 0.00002` per ticket and runs in 50ms. The LLM approach (GPT-4o) costs `USD 0.01` per ticket and runs in 500ms. For high-volume tasks (thousands of tickets per day), embeddings are 500x more cost-effective.
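Classification by examples can be sketched as nearest-centroid matching: average the embeddings of a few labeled examples per category, then assign a new ticket to the closest centroid. The category names and vectors are made up, and `centroid` and `classify` are hypothetical helpers.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify(vec, centroids):
    """Return the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# A few made-up example embeddings per category.
examples = {
    "billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "bug":     [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0]],
    "account": [[0.0, 0.1, 0.9], [0.1, 0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in examples.items()}

ticket = [0.15, 0.85, 0.05]   # stand-in for the embedding of a new ticket
print(classify(ticket, centroids))  # bug
```

Centroids are computed once, so each new ticket costs one embedding call plus a handful of comparisons, which is where the cost and latency advantage over an LLM call comes from.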
**Myth:** Cosine similarity = distance between vectors
**Reality:** Cosine similarity is the angle between vectors, not distance. High cosine doesn't guarantee semantic closeness in out-of-domain tasks
Cosine similarity measures cos(theta) - the angle between directions. Two vectors can be 'close in direction' (score 0.85), yet represent texts from different domains where that closeness is meaningless. A score of 0.8 for cooking texts and a score of 0.8 for legal documents are completely different things. Thresholds must be calibrated per domain.
**Myth:** The embedding model can be chosen and swapped later - it's just an API
**Reality:** Changing the model = recalculating the entire index. Different models create incompatible vector spaces
text-embedding-3-small and text-embedding-3-large are different spaces, even if dimensions match after the dimensions parameter. The same text gets different coordinates in different spaces. Comparing a vector from one model with a vector from another is like comparing GPS coordinates with screen pixels.
Key Takeaways
- Embedding - a vector of numbers encoding meaning (1536 for text-embedding-3-small). Traces back to Word2Vec 2013 - same principle, incomparably better quality
- text-embedding-3-small: USD 0.02/1M tokens, 1536 dim - optimal for most backend tasks
- Cosine similarity - angle between vectors, not distance. High scores need to be calibrated per domain
- Batch processing up to 2048 texts per request - 400x speedup vs naive approach
- Embedding model is chosen once. Changing it means recalculating the entire index
- Netflix, Spotify, Tinder - same math, different domains. Embedding classification is 500x cheaper than LLM for high-volume tasks
Questions for Reflection
- Netflix stores a film as a vector of 128 numbers. What 'dimensions' might be encoded? Genre? Pace? Mood? What else?
- Cosine similarity returns 0.87 for two support tickets. The deduplication threshold is 0.92. What domain context is needed before adjusting the threshold?
- 1 million documents x 500 tokens x USD 0.02/1M tokens = how much does it cost to build an index? Is this a one-time cost or recurring?
What's Next
Embedding generation is covered. Now a place to store and quickly search through millions of vectors is needed. A regular SQL database without a vector index can't handle nearest neighbor search in a 1536-dimensional space in milliseconds - a specialized tool or extension is required.
- Vector Databases — Storing and searching millions of embeddings - pgvector, Pinecone, Qdrant
- Document Processing — Extracting text from PDF, DOCX, HTML before generating embeddings
Related Lessons
- aie-03-llm-fundamentals — Embeddings come from the same transformer internals
- aie-10-vector-databases — Embeddings need a vector store to scale search
- aie-12-rag-fundamentals — Embeddings are the retrieval backbone of RAG
- ml-35-word-embeddings — Modern text embeddings extend word2vec ideas
- la-02-dot-product — Cosine similarity is a normalized dot product
- alg-10-binary-search
- db-30-vector