Qdrant - Vector Database
Matryoshka Embeddings (MRL)
1 million documents in Qdrant: direct 1536-dim search = 180ms, 6 GB RAM. Two-stage with MRL (256→1536-dim): 35ms, 98% recall. You get nearly the same accuracy at 5x the speed. GPT-4 would take longer to respond than you spent on the search.
- **Enterprise knowledge base:** 10M documents, SLA < 100ms. Two-stage MRL: 256-dim in 15ms (candidates) + 1536-dim rerank in 20ms = 35ms total at 98% recall
- **Mobile/edge search:** 256-dim MRL vectors are 1KB instead of 6KB - critical for on-device storage
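- **Cost optimisation:** storing 256-dim instead of 1536-dim vectors means 6x less vector storage and 6x less RAM for the vectors held by the HNSW index. OpenAI bills per embedding token regardless of the output dimension, so the saving is in storage and RAM, not in the embedding API bill.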
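The size figures above follow from simple arithmetic, sketched here under the assumption that vectors are stored as float32 (4 bytes per dimension). The helper names are illustrative, and the estimate covers raw vector data only, not HNSW graph links or payload.

```typescript
// Raw float32 vector storage, excluding HNSW graph links and payload.
const BYTES_PER_FLOAT32 = 4;

// Size of one vector in bytes.
function vectorBytes(dims: number): number {
  return dims * BYTES_PER_FLOAT32;
}

// Size of all vectors in a collection, in decimal gigabytes.
function collectionGigabytes(points: number, dims: number): number {
  return (points * vectorBytes(dims)) / 1e9;
}

console.log(vectorBytes(256)); // 1024 bytes = 1 KB
console.log(vectorBytes(1536)); // 6144 bytes = 6 KB
console.log(collectionGigabytes(1e6, 1536)); // 6.144
console.log(collectionGigabytes(1e6, 256)); // 1.024
```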
Prerequisites
Matryoshka Representation Learning: the idea and supported models
**Matryoshka Representation Learning (MRL)** is a training technique for embedding models in which shorter prefixes of a vector retain semantic information. Like a Russian nesting doll, a smaller, meaningful vector is nested inside a larger one.

- **Ordinary model:** 1536-dim vector; the first 256 numbers carry little usable meaning without the rest.
- **MRL model:** the first 256 numbers already carry meaning (lower quality than 1536, but usable), the first 512 are better, and all 1536 give maximum quality.

Models that support MRL:

- `text-embedding-3-small` and `text-embedding-3-large` from OpenAI (the `dimensions` parameter)
- `nomic-embed-text-v1.5` from Nomic AI
- `mxbai-embed-large-v1` from Mixedbread
- `jina-embeddings-v3` from Jina AI
**Why MRL works:** during MRL training a special loss function forces the model to encode the most important semantic information in the first dimensions. Standard models (ada-002, all-MiniLM) were not trained this way - truncating their vectors significantly degrades quality. Before using MRL, verify that the model explicitly supports it.
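If you truncate an MRL vector yourself rather than asking the model for a shorter one, the prefix of a unit-length vector is generally no longer unit length, so it is worth rescaling it before using it with dot-product or cosine comparisons. A minimal sketch follows; `truncateAndNormalize` is an illustrative helper, not part of any library, and it only makes sense for vectors from an MRL-trained model.

```typescript
// Euclidean (L2) length of a vector.
function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// Keep the first `dims` components and rescale the prefix to unit length.
function truncateAndNormalize(full: number[], dims: number): number[] {
  if (dims <= 0 || dims > full.length) {
    throw new RangeError(`dims must be in 1..${full.length}, got ${dims}`);
  }
  const prefix = full.slice(0, dims);
  const norm = l2Norm(prefix);
  if (norm === 0) return prefix; // all-zero prefix: nothing to rescale
  return prefix.map((x) => x / norm);
}

// Toy 4-dim "embedding" truncated to 2 dims.
const small = truncateAndNormalize([3, 4, 12, 84], 2);
console.log(small); // [0.6, 0.8]
```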
You are using text-embedding-ada-002. Can you truncate its 1536-dim vector to 256-dim to save memory, using the same technique as for text-embedding-3-small?
MRL in Qdrant: named vectors and collections
Qdrant does not have built-in MRL support as a separate index type. MRL is implemented through existing mechanisms:

- **Approach 1: Named vectors** - store multiple truncated vectors of different dimensions in one point. Searching at a given dimension = searching the corresponding named vector.
- **Approach 2: Separate collections** - `docs-256`, `docs-512`, `docs-1536`. Each collection stores vectors of one dimension. Search first in `docs-256`, then re-rank against `docs-1536`.

Each approach has its own trade-offs.
| Parameter | Named vectors | Separate collections |
|---|---|---|
| Storage | 256+1536 = 1,792 floats/point | 256 in one + 1536 in another |
| Consistency | Automatic - one upsert | Manual - two upserts |
| Search by dim | `vector: { name: 'small', ... }` | `client.search('docs-256', ...)` |
| Quantization | Different per named vector | Different per collection |
| Recommendation | Simplicity + less code | Flexibility + separate configs |
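For the named-vectors approach, the collection configuration and a point might look like the sketch below. The object shapes follow Qdrant's collection-creation and upsert request bodies as I understand them; the names `small` and `full`, the helper `buildPoint`, and the payload are illustrative assumptions.

```typescript
const SMALL_DIMS = 256;
const FULL_DIMS = 1536;

interface VectorParams {
  size: number;
  distance: "Cosine" | "Dot" | "Euclid";
}

// Body for collection creation: one entry per named vector.
const collectionConfig: { vectors: { small: VectorParams; full: VectorParams } } = {
  vectors: {
    small: { size: SMALL_DIMS, distance: "Cosine" },
    full: { size: FULL_DIMS, distance: "Cosine" },
  },
};

interface MrlPoint {
  id: number;
  vector: { small: number[]; full: number[] };
  payload: Record<string, unknown>;
}

// Build one point that carries both the MRL prefix and the full embedding.
// The prefix is a plain slice here; this assumes Cosine distance, where the
// vector length does not affect the ranking.
function buildPoint(id: number, embedding: number[], payload: Record<string, unknown>): MrlPoint {
  if (embedding.length !== FULL_DIMS) {
    throw new RangeError(`expected ${FULL_DIMS} dims, got ${embedding.length}`);
  }
  return {
    id,
    vector: { small: embedding.slice(0, SMALL_DIMS), full: embedding },
    payload,
  };
}
```

With the `@qdrant/js-client-rest` client, these objects would typically be passed to `client.createCollection('docs', collectionConfig)` and `client.upsert('docs', { points: [...] })`, where `'docs'` is a placeholder collection name. Because both vectors travel in one point, a single upsert keeps them consistent.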
You are using named vectors ('small' 256-dim and 'full' 1536-dim). You want to apply scalar quantization to the 'full' vector but NOT to 'small'. How do you do this?
Two-stage retrieval: candidates + reranking
**Two-stage retrieval** with MRL is the key pattern for production systems:

1. **Stage 1 (Candidate Retrieval):** fast search using small dimensionality (256-dim). Retrieves top-N candidates. Speed: ~6x faster than searching with 1536-dim.
2. **Stage 2 (Reranking):** refine the ranking of the top-N candidates using full dimensionality (1536-dim). High accuracy, but only over a small set.

Result: speed close to 256-dim, accuracy close to 1536-dim. **~5x speedup at ~98% recall** compared to direct 1536-dim search (180ms vs 35ms in the example above).
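The two stages can be illustrated without a database at all. The sketch below uses brute-force cosine similarity over in-memory vectors instead of an HNSW index, so it shows the ranking logic only, not the latency behaviour; all names (`Doc`, `topK`, `twoStageSearch`) are illustrative.

```typescript
interface Doc {
  id: number;
  full: number[]; // full-dimension embedding; its prefix is the MRL vector
}

// Dot product over the first `dims` components.
function dot(a: number[], b: number[], dims: number): number {
  let s = 0;
  for (let i = 0; i < dims; i++) s += a[i] * b[i];
  return s;
}

// Cosine similarity computed over the first `dims` components only.
function prefixCosine(a: number[], b: number[], dims: number): number {
  const denom = Math.sqrt(dot(a, a, dims) * dot(b, b, dims));
  return denom === 0 ? 0 : dot(a, b, dims) / denom;
}

// Brute-force top-k by prefix cosine similarity, best first.
function topK(docs: Doc[], query: number[], dims: number, k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: prefixCosine(doc.full, query, dims) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.doc);
}

function twoStageSearch(
  docs: Doc[],
  query: number[],
  smallDims: number,
  limit: number,
  candidateMultiplier = 5,
): Doc[] {
  // Stage 1: cheap search on the short prefix, over-fetching candidates.
  const candidates = topK(docs, query, smallDims, limit * candidateMultiplier);
  // Stage 2: rerank only the candidates using all dimensions.
  return topK(candidates, query, query.length, limit);
}
```

In this toy setup, two documents that look identical on the prefix are only separated in Stage 2, which is why Stage 1 has to over-fetch: a candidate that is cut in Stage 1 can never be recovered by the rerank.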
**Memory savings + speed:** for Stage 1 (256-dim), apply scalar int8 quantization for an additional 4x memory saving on the candidate index. For Stage 2 (1536-dim), use scalar quantization with `always_ram: false` so that reranking adds as little RAM consumption as possible. Combining MRL two-stage retrieval with quantization is a strong fit for production systems with constrained resources.
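Quantization can be set per named vector. The sketch below shows one way the configuration described above might look, using the `quantization_config` field of Qdrant's per-vector parameters. The `always_ram: true` setting for `small` is my assumption (the text only specifies `always_ram: false` for the 1536-dim vector), and `bytesPerVector` is an illustrative helper.

```typescript
interface ScalarQuantizationConfig {
  scalar: { type: "int8"; always_ram: boolean };
}

interface QuantizedVectorParams {
  size: number;
  distance: "Cosine";
  quantization_config?: ScalarQuantizationConfig;
}

// `vectors` section of a collection config with per-vector quantization.
const quantizedVectors: { small: QuantizedVectorParams; full: QuantizedVectorParams } = {
  small: {
    size: 256,
    distance: "Cosine",
    // Assumption: keep the small quantized vectors in RAM for fast Stage 1.
    quantization_config: { scalar: { type: "int8", always_ram: true } },
  },
  full: {
    size: 1536,
    distance: "Cosine",
    quantization_config: { scalar: { type: "int8", always_ram: false } },
  },
};

// Approximate bytes per vector: float32 uses 4 bytes per dimension, int8 uses 1.
function bytesPerVector(dims: number, quantized: boolean): number {
  return dims * (quantized ? 1 : 4);
}

console.log(bytesPerVector(256, false)); // 1024
console.log(bytesPerVector(256, true)); // 256
```

To quantize only one of the vectors, omit `quantization_config` on the other.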
'MRL is just truncating a vector - any model will work'
MRL requires a specially trained model (matryoshka training). Truncating a regular vector (ada-002, all-MiniLM) causes significant quality loss. Only models with explicit MRL support (for example `text-embedding-3-*`, `nomic-embed-text-v1.5`) work correctly with truncated vectors.
In standard embedding training, semantic information is spread across all dimensions: every dimension matters, and their order is not meaningful. In MRL training, a special loss function forces the model to encode the most important information in the first N dimensions. This is a fundamental difference in the training process, not a post-processing step.
Two-stage search: Stage 1 returns 50 candidates, Stage 2 reranks to top-10. Recall@10 = 98%. This means...
Summary
- **MRL (Matryoshka Representation Learning):** the model is trained so that the first N dimensions of the vector are meaningful. Supported by: `text-embedding-3-small`/`text-embedding-3-large` (`dimensions` parameter), `nomic-embed-text-v1.5`, `mxbai-embed-large-v1`, `jina-embeddings-v3`.
- **256-dim:** 93.4% quality at 6x smaller size. 512-dim: 96.7%. The speed/quality balance depends on your use case.
- **Named vectors in Qdrant:** one point with 'small' (256-dim) and 'full' (1536-dim) vectors. One collection, two vectors per point.
- **Two-stage retrieval:** Stage 1 - fast search for top-N × 5 candidates at 256-dim; Stage 2 - rerank at 1536-dim with filter: has_id. Result: ~5x speedup, ~98% recall.
- **candidateMultiplier trade-off:** 3 = faster, 5 = optimal, 10 = more accurate. Choose based on your latency vs recall SLA.
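Against a collection with `small` and `full` named vectors, the two stages translate into two search requests. The sketch below only builds the request bodies, in the shape accepted by Qdrant's points search endpoint (named vector plus a `has_id` filter condition); the function names and the `candidateMultiplier` default are illustrative.

```typescript
interface NamedVectorSearch {
  vector: { name: string; vector: number[] };
  limit: number;
  filter?: { must: Array<{ has_id: number[] }> };
  with_payload: boolean;
}

// Stage 1: search the 256-dim named vector and over-fetch candidates.
function stage1Request(
  query: number[],
  limit: number,
  candidateMultiplier = 5,
  smallDims = 256,
): NamedVectorSearch {
  return {
    vector: { name: "small", vector: query.slice(0, smallDims) },
    limit: limit * candidateMultiplier,
    with_payload: false, // only ids are needed from Stage 1
  };
}

// Stage 2: search the full vector, restricted to the Stage 1 candidate ids.
function stage2Request(query: number[], candidateIds: number[], limit: number): NamedVectorSearch {
  return {
    vector: { name: "full", vector: query },
    limit,
    filter: { must: [{ has_id: candidateIds }] },
    with_payload: true,
  };
}
```

With the JS client, each body would be passed as the second argument of `client.search(collectionName, body)`, and the ids returned by the first call feed `candidateIds` in the second. Recent Qdrant versions also provide a Query API with prefetch that can express both stages in a single request; it is not shown here.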
What's next
You have completed the full advanced Qdrant course. MRL + two-stage retrieval is the pinnacle of vector search optimisation. Next step: apply everything you have learned in a comprehensive production project.
- Multi-vector Search - Named vectors are the mechanism that MRL is built on in Qdrant. A deeper understanding of multi-vector search will improve your use of MRL.
- Quantization - MRL + quantization for Stage 1 candidates = maximum speed at minimum RAM.
- Production RAG Pipeline - Embed MRL two-stage retrieval into a RAG pipeline for the optimal quality/speed balance.
Questions for reflection
- How do you measure the recall of your two-stage search on real data? How do you build ground truth for comparison against direct 1536-dim search?
- candidateMultiplier = 5 means 5x more Stage 2 compute compared to direct search. At what value does two-stage become slower than direct 1536-dim search?
- MRL with separate collections (docs-256 and docs-1536) vs named vectors - how do you implement zero-downtime indexing of new documents into both stores atomically?
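As a starting point for the first question, recall@k can be computed by comparing the ids returned by two-stage search against the ids returned by an exact search over the full vectors for the same queries. A minimal sketch follows; the function names are illustrative, and it assumes you have already collected both id lists per query (for example with Qdrant's `exact: true` search parameter for the ground truth).

```typescript
// Fraction of the exact top-k ids that also appear in the retrieved top-k.
function recallAtK(groundTruth: number[], retrieved: number[], k: number): number {
  const truth = new Set(groundTruth.slice(0, k));
  if (truth.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => truth.has(id)).length;
  return hits / truth.size;
}

// Average recall@k over a set of evaluation queries.
function meanRecallAtK(
  pairs: Array<{ exact: number[]; twoStage: number[] }>,
  k: number,
): number {
  if (pairs.length === 0) return 0;
  const total = pairs.reduce((sum, p) => sum + recallAtK(p.exact, p.twoStage, k), 0);
  return total / pairs.length;
}

console.log(recallAtK([1, 2, 3, 4], [1, 2, 9, 4], 4)); // 0.75
```

Running this over a few hundred real queries at several `candidateMultiplier` values gives a recall/latency curve for your own data rather than a single headline number.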