Qdrant - Vector Database

Matryoshka Embeddings (MRL)

1 million documents in Qdrant: direct 1536-dim search takes 180 ms and 6 GB of RAM. Two-stage search with MRL (256→1536-dim) takes 35 ms at 98% recall. You get nearly the same accuracy at about 5x the speed, and the search finishes well before GPT-4 would return a response.

  • **Enterprise knowledge base:** 10M documents, SLA < 100ms. Two-stage MRL: 256-dim in 15ms (candidates) + 1536-dim rerank in 20ms = 35ms total at 98% recall
  • **Mobile/edge search:** 256-dim MRL vectors are 1KB instead of 6KB - critical for on-device storage
  • **Cost optimisation:** OpenAI bills per input token regardless of the output dimension, so the savings come from storage: 256-dim instead of 1536-dim means 6x less vector storage and 6x less RAM for the vectors held by the HNSW index
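The size figures above follow from simple arithmetic. The sketch below assumes float32 vectors (4 bytes per component) and counts only the raw vectors, not the HNSW graph or payloads; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def raw_vector_bytes(num_points: int, dim: int) -> int:
    """Raw storage for the vectors alone (no index graph, no payload)."""
    return num_points * dim * BYTES_PER_FLOAT32


full = raw_vector_bytes(1_000_000, 1536)
small = raw_vector_bytes(1_000_000, 256)

print(f"per vector: {raw_vector_bytes(1, 256)} B vs {raw_vector_bytes(1, 1536)} B")
print(f"1M points:  {small / 1e9:.2f} GB vs {full / 1e9:.2f} GB")
print(f"ratio:      {full // small}x")
```

This reproduces the 1 KB vs 6 KB per vector and the roughly 6 GB for one million 1536-dim points quoted above.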

Prerequisites

  • Named Vectors: Multiple Embeddings
  • Vector Quantization

Matryoshka Representation Learning: the idea and supported models

**Matryoshka Representation Learning (MRL)** is a training technique for embedding models where smaller vector prefixes retain semantic information. Like a Russian nesting doll: a smaller, meaningful vector is nested inside a larger one.

  • Ordinary model: in a 1536-dim vector, the first 256 numbers carry no meaning without the rest.
  • MRL model: the first 256 numbers already carry meaning (lower quality than 1536, but usable), the first 512 are better, and all 1536 give maximum quality.

Models that support MRL:

  • `text-embedding-3-small` and `text-embedding-3-large` from OpenAI (the `dimensions` parameter)
  • `nomic-embed-text-v1.5` from Nomic AI
  • `mxbai-embed-large-v1` from Mixedbread
  • `jina-embeddings-v3` from Jina AI

**Why MRL works:** during MRL training a special loss function forces the model to encode the most important semantic information in the first dimensions. Standard models (ada-002, all-MiniLM) were not trained this way - truncating their vectors significantly degrades quality. Before using MRL, verify that the model explicitly supports it.
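For an MRL model, producing the smaller vector on the client side amounts to taking a prefix and re-normalizing it. The sketch below is a minimal illustration with a hypothetical helper name; with OpenAI models the `dimensions` parameter does the shortening server-side, so this is only needed when you already hold the full vector.

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 length."""
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be in 1..{len(vector)}, got {dim}")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # all-zero prefix: nothing to rescale
    return [x / norm for x in prefix]


# Toy 3-dim "embedding" shortened to 2 dims.
short = truncate_and_normalize([3.0, 4.0, 12.0], 2)
print(short)
```

The re-normalization matters because a prefix of a unit-length vector is generally shorter than unit length, which would skew dot-product scores. The code runs on any vector, but the result is only useful for models trained with MRL.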

You are using text-embedding-ada-002. Can you truncate its 1536-dim vector to 256-dim to save memory, using the same technique as for text-embedding-3-small?

MRL in Qdrant: named vectors and collections

Qdrant does not have built-in MRL support as a separate index type. MRL is implemented through existing mechanisms:

  • **Approach 1: Named vectors** - store multiple truncated vectors of different dimensions in one point. Searching at a given dimension = searching the corresponding named vector.
  • **Approach 2: Separate collections** - `docs-256`, `docs-512`, `docs-1536`. Each collection stores vectors of one dimension. Search first in `docs-256`, then re-rank against `docs-1536`.

Each approach has its own trade-offs.

| Parameter | Named vectors | Separate collections |
|---|---|---|
| Storage | 256 + 1536 = 1,792 floats/point | 256 in one + 1536 in another |
| Consistency | Automatic - one upsert | Manual - two upserts |
| Search by dim | `vector: { name: 'small', ... }` | `client.search('docs-256', ...)` |
| Quantization | Different per named vector | Different per collection |
| Recommendation | Simplicity + less code | Flexibility + separate configs |
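As a rough illustration of the named-vectors approach, the sketch below builds the request bodies as plain Python dicts, shaped after Qdrant's REST API for creating a collection and upserting a point. The vector names `small` and `full`, the Cosine distance, and the helper `build_point` are assumptions for this example; check the field names against the Qdrant version you run.

```python
SMALL_DIM = 256
FULL_DIM = 1536

# Body for creating one collection with two named vectors per point.
collection_config = {
    "vectors": {
        "small": {"size": SMALL_DIM, "distance": "Cosine"},
        "full": {"size": FULL_DIM, "distance": "Cosine"},
    }
}


def build_point(point_id: int, full_vector: list[float], payload: dict) -> dict:
    """Derive the 'small' vector as a prefix of the full MRL vector."""
    if len(full_vector) != FULL_DIM:
        raise ValueError(f"expected {FULL_DIM} dims, got {len(full_vector)}")
    return {
        "id": point_id,
        # With Cosine distance Qdrant normalizes vectors on upload,
        # so the prefix is not re-normalized here.
        "vector": {"small": full_vector[:SMALL_DIM], "full": full_vector},
        "payload": payload,
    }


point = build_point(1, [0.01] * FULL_DIM, {"title": "example"})
print(len(point["vector"]["small"]), len(point["vector"]["full"]))
```

Because both vectors travel in the same point, a single upsert keeps them consistent, which is the "Automatic - one upsert" row in the comparison above.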

You are using named vectors ('small' 256-dim and 'full' 1536-dim). You want to apply scalar quantization to the 'full' vector but NOT to 'small'. How do you do this?

Two-stage retrieval: candidates + reranking

**Two-stage retrieval** with MRL is the key pattern for production systems:

  1. **Stage 1 (Candidate Retrieval):** fast search using small dimensionality (256-dim). Retrieves top-N candidates. Speed: ~6x faster than searching with 1536-dim.
  2. **Stage 2 (Reranking):** refine the ranking of the top-N candidates using full dimensionality (1536-dim). High accuracy, but only over a small set.

Result: speed close to 256-dim, accuracy close to 1536-dim. **~5x speedup at ~98% recall** compared to direct 1536-dim search (180 ms → 35 ms in the opening example).
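The two stages can be sketched as two search requests against a collection with named vectors `small` (256-dim) and `full` (1536-dim). The bodies below are shaped after Qdrant's REST search endpoint, with a `has_id` filter restricting Stage 2 to the Stage 1 candidates. The function names and the `candidate_multiplier` default are assumptions for illustration, and no request is actually sent.

```python
def stage1_request(query_full: list[float], top_k: int,
                   candidate_multiplier: int = 5, small_dim: int = 256) -> dict:
    """Stage 1: search the 'small' named vector for top_k * multiplier candidates."""
    return {
        "vector": {"name": "small", "vector": query_full[:small_dim]},
        "limit": top_k * candidate_multiplier,
        "with_payload": False,  # only ids are needed at this stage
    }


def stage2_request(query_full: list[float], candidate_ids: list[int],
                   top_k: int) -> dict:
    """Stage 2: re-score only the candidates using the 'full' named vector."""
    return {
        "vector": {"name": "full", "vector": query_full},
        "filter": {"must": [{"has_id": candidate_ids}]},
        "limit": top_k,
        "with_payload": True,
    }


query = [0.001 * i for i in range(1536)]  # placeholder query embedding
s1 = stage1_request(query, top_k=10)
s2 = stage2_request(query, candidate_ids=[3, 17, 42], top_k=10)
print(s1["limit"], s2["limit"])
```

In a real client you would send the first body, collect the returned ids, and pass them into the second. Newer Qdrant versions also offer a query API with prefetch that can express both stages in one request; whether that fits depends on the version you run.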

**Memory savings + speed:** for Stage 1 (256-dim), apply scalar int8 quantization for an additional 4x memory saving on the candidate index (float32 → int8). For Stage 2 (1536-dim), use scalar quantization with `always_ram: false` to keep reranking accurate with minimal RAM consumption. Combining MRL two-stage retrieval with quantization is a strong fit for production systems with constrained resources.
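One way to express this is a per-vector `quantization_config` inside the collection's named-vector parameters. The sketch below builds such a config as a Python dict shaped after Qdrant's REST API. Setting `always_ram` to `True` for the small vector is an assumption of this example (the text above only specifies `always_ram: false` for the full vector), and the helper name is illustrative.

```python
def scalar_int8(always_ram: bool) -> dict:
    """Scalar int8 quantization settings for one named vector."""
    return {"scalar": {"type": "int8", "always_ram": always_ram}}


quantized_collection_config = {
    "vectors": {
        "small": {
            "size": 256,
            "distance": "Cosine",
            # Assumption: keep the Stage 1 quantized vectors in RAM for speed.
            "quantization_config": scalar_int8(always_ram=True),
        },
        "full": {
            "size": 1536,
            "distance": "Cosine",
            "quantization_config": scalar_int8(always_ram=False),
        },
    }
}

# int8 uses 1 byte per component instead of 4 for float32.
small_int8_bytes = 256 * 1
small_float32_bytes = 256 * 4
print(small_float32_bytes // small_int8_bytes)  # 4
```

To quantize only one of the two vectors, leave `quantization_config` off the other vector's parameters.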

'MRL is just truncating a vector - any model will work'

MRL requires a specially trained model (matryoshka training). Truncating a regular vector (ada-002, all-MiniLM) causes significant quality loss. Only models with explicit MRL support (for example text-embedding-3-*, nomic-embed-text-v1.5) work correctly with truncated vectors.

In standard embedding training: vector dimensions distribute semantic information evenly - every dimension matters, order is not meaningful. In MRL training: a special loss function forces the model to encode the most important information in the first N dimensions. This is a fundamental difference in the training process, not a post-processing step.
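The loss can be written compactly. As a sketch of the general form used in matryoshka training (the notation here is illustrative, and the base loss depends on the model: classification in the original formulation, typically a contrastive loss for text embedding models):

```latex
\mathcal{L}_{\mathrm{MRL}}
  = \sum_{m \in \mathcal{M}} c_m \,
    \mathcal{L}_{\mathrm{base}}\!\left(z_{1:m},\, y\right),
\qquad
\mathcal{M} = \{256,\ 512,\ 1536\}
```

Here $z_{1:m}$ is the first $m$ dimensions of the embedding $z$, $y$ is the training target, $c_m$ is a weight for each nested size, and $\mathcal{M}$ is the set of prefix lengths (the values shown are an example matching this lesson). Because every prefix is penalized on the same task, the model is pushed to place the most useful information in the earliest dimensions.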

Two-stage search: Stage 1 returns 50 candidates, Stage 2 reranks to top-10. Recall@10 = 98%. This means...

Summary

  • **MRL (Matryoshka Representation Learning):** the model is trained so that the first N dimensions of the vector are meaningful. Supported by: text-embedding-3-small/large (dimensions parameter), nomic-embed-text-v1.5, mxbai-embed-large-v1, jina-embeddings-v3.
  • **256-dim:** 93.4% quality at 6x smaller size. 512-dim: 96.7%. The speed/quality balance depends on your use case.
  • **Named vectors in Qdrant:** one point with 'small' (256-dim) and 'full' (1536-dim) vectors. One collection, two vectors per point.
  • **Two-stage retrieval:** Stage 1 - fast search for top-N × 5 candidates at 256-dim; Stage 2 - rerank at 1536-dim with filter: has_id. Result: ~5x speedup, ~98% recall.
  • **candidateMultiplier trade-off:** 3 = faster, 5 = optimal, 10 = more accurate. Choose based on your latency vs recall SLA.
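The candidateMultiplier trade-off can be explored offline before touching a cluster. The sketch below is a brute-force, in-memory version of two-stage retrieval with a recall@k measurement against exact full-dimension search. The data is synthetic: random vectors whose earlier dimensions have larger scale stand in for MRL embeddings, which is an assumption of this example, so the printed recall values say nothing about a real model.

```python
import random


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_ids(query, vectors, limit, dim=None, ids=None):
    """Brute-force top-`limit` ids by dot product over the first `dim` dims."""
    pool = range(len(vectors)) if ids is None else ids
    d = len(query) if dim is None else dim
    ranked = sorted(pool, key=lambda i: dot(query[:d], vectors[i][:d]), reverse=True)
    return ranked[:limit]


def two_stage(query, vectors, top_k, candidate_multiplier, small_dim):
    """Stage 1 on the prefix, Stage 2 re-scores the candidates at full dim."""
    candidates = top_ids(query, vectors, top_k * candidate_multiplier, dim=small_dim)
    return top_ids(query, vectors, top_k, ids=candidates)


def recall_at_k(approx: list[int], exact: list[int]) -> float:
    return len(set(approx) & set(exact)) / len(exact)


random.seed(42)
DIM, SMALL_DIM, N, TOP_K = 32, 8, 200, 5
# ASSUMPTION: decaying per-dimension scale mimics "important information first".
scales = [1.0 / (1 + j) for j in range(DIM)]
vectors = [[random.gauss(0, s) for s in scales] for _ in range(N)]
query = [random.gauss(0, s) for s in scales]

exact = top_ids(query, vectors, TOP_K)  # ground truth: direct full-dim search
for multiplier in (3, 5, 10):
    approx = two_stage(query, vectors, TOP_K, multiplier, SMALL_DIM)
    print(multiplier, recall_at_k(approx, exact))
```

On real data the ground truth would come from exact (non-approximate) full-dimension search over a sample of queries, and recall would be averaged over that sample rather than measured on a single query.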

What's next

You have completed the full advanced Qdrant course. MRL + two-stage retrieval is the pinnacle of vector search optimisation. Next step: apply everything you have learned in a comprehensive production project.

  • **Multi-vector Search:** named vectors are the mechanism that MRL is built on in Qdrant. A deeper understanding of multi-vector search will improve your use of MRL
  • **Quantization:** MRL + quantization for Stage 1 candidates = maximum speed at minimum RAM
  • **Production RAG Pipeline:** embed MRL two-stage retrieval into a RAG pipeline for the optimal quality/speed balance

Questions for reflection

  • How do you measure the recall of your two-stage search on real data? How do you build ground truth for comparison against direct 1536-dim search?
  • candidateMultiplier = 5 means Stage 2 re-scores 5x as many candidates as the final top-N. At what value does two-stage become slower than direct 1536-dim search?
  • MRL with separate collections (docs-256 and docs-1536) vs named vectors - how do you implement zero-downtime indexing of new documents into both stores atomically?

Related lessons

  • la-15-svd