Qdrant - Vector Database
Performance Tuning
Two projects, identical collections, identical servers. First: loading 1M vectors takes 2 hours, search P99 = 150ms. Second: loading takes 8 minutes, search P99 = 12ms. The difference is entirely in settings: indexing_threshold, batch size, ef, memmap. Performance tuning isn't a trick - it's understanding how HNSW works under the hood.
- **Nightly re-indexing:** disable HNSW builds (`indexing_threshold: 0`) during the load → roughly 6× speedup. 10M documents in 20 minutes instead of 2 hours
- **RAG with high recall:** ef=256 + oversampling=4 for final queries, ef=64 for candidate retrieval - 99.9% accuracy at acceptable latency
- **Resource savings:** Binary Quantization + on_disk float32 - 50M vectors on a 16 GB RAM server instead of the ~300 GB needed to keep the float32 vectors in RAM
Prerequisites
indexing_threshold and Batch Upsert: Maximizing Ingestion Speed
**Ingestion performance** depends on two factors: how you write points (batch vs single inserts) and when Qdrant builds the HNSW index (`indexing_threshold`). Wrong settings can make loading 10-100× slower than necessary.
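| indexing_threshold | Search before threshold | Search after | When to use |
|---|---|---|---|
| 0 | Flat (exact, slow on large data) | HNSW is never built; search stays flat | Small collections (< 10k points) where exact search is fast enough |
| 10000 | Flat while a segment is below the threshold | HNSW once the segment exceeds it | Balance for frequent small writes |
| 20000 (default) | Flat while a segment is below the threshold | HNSW once the segment exceeds it | Standard for most workloads |
| 0 (during bulk load) | Flat (slow) | HNSW after re-enabling | Initial bulk load - max write throughput |

Note that Qdrant measures the threshold in kilobytes of vector data per segment, not in points, so the point count at which HNSW kicks in depends on the vector size.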
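The bulk-load row above can be sketched as a sequence of REST requests: switch indexing off, upsert in batches without waiting, then restore the threshold. This is a minimal sketch that only builds the request bodies with the Python standard library. The collection name `docs`, the helper names `batched` and `bulk_load_requests`, and the 4-dimensional dummy vectors are illustrative assumptions, and actually sending the requests is left out.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(points: Iterable[dict], size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` points."""
    it = iter(points)
    while batch := list(islice(it, size)):
        yield batch


def bulk_load_requests(collection: str, points: Iterable[dict],
                       batch_size: int = 200,
                       restore_threshold: int = 20000) -> Iterator[tuple]:
    """Yield (method, path, json_body) tuples for one bulk load."""
    base = f"/collections/{collection}"
    # 1. Disable HNSW builds so writes do not compete with indexing.
    yield ("PATCH", base, {"optimizers_config": {"indexing_threshold": 0}})
    # 2. Upsert in batches; wait=false returns before the batch is applied.
    for batch in batched(points, batch_size):
        yield ("PUT", f"{base}/points?wait=false", {"points": batch})
    # 3. Restore the threshold so the optimizer builds the index afterward.
    yield ("PATCH", base,
           {"optimizers_config": {"indexing_threshold": restore_threshold}})


# Hypothetical data: 450 tiny points, so 3 batches of 200, 200 and 50.
points = ({"id": i, "vector": [0.0, 0.1, 0.2, 0.3]} for i in range(450))
requests = list(bulk_load_requests("docs", points))
print(len(requests))  # → 5 (1 PATCH + 3 upserts + 1 PATCH)
```

Keeping the plan as plain data makes it easy to log or replay; any HTTP client or the official Qdrant client could execute the same three steps.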
You're loading 10M vectors into a new collection. Batch size = 100, wait = false, indexing_threshold = 20000 (default). Loading is slower than expected. What helps?
Memmap: Disk Storage with Memory-Mapped Access
**Memmap (memory-mapped files)** lets Qdrant store vectors on SSD while accessing them through virtual memory. The OS caches 'hot' pages. Result: collections larger than RAM become possible. Queries that hit cached pages stay fast, while every cache miss pays the latency of a disk read.
**Disk type is critical for memmap.** NVMe SSD: latency ~0.1ms, excellent performance. SATA SSD: latency ~0.5ms, acceptable. HDD: latency ~5-10ms, unacceptable for search. If Qdrant is on HDD - disable memmap or replace the disk.
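To see why the disk type matters so much, here is a rough back-of-the-envelope model using the per-read latencies quoted above. It is a sketch under loud assumptions: the number of vector reads per query (200) and the cache miss ratio (10%) are made-up illustrative values, and reads are assumed to happen one after another with no overlap, which real systems may improve on.

```python
# Per-read latencies from the text, in milliseconds (HDD uses the lower bound).
DISK_READ_MS = {"nvme": 0.1, "sata_ssd": 0.5, "hdd": 5.0}


def added_latency_ms(disk: str, reads_per_query: int, miss_ratio: float) -> float:
    """Extra latency caused by page-cache misses, assuming serial reads."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    return reads_per_query * miss_ratio * DISK_READ_MS[disk]


# Hypothetical query: 200 vector reads, 10% of which miss the page cache.
for disk in DISK_READ_MS:
    print(f"{disk}: +{added_latency_ms(disk, 200, 0.10):.1f} ms")
```

Under these assumptions the penalty is about 2 ms on NVMe, 10 ms on SATA SSD, and 100 ms on HDD, which is why the same miss ratio is tolerable on one disk and unusable on another.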
Collection: 20M vectors, memmap enabled. Queries for 'popular' documents are fast (2ms), queries for 'rare' documents are slow (50ms). Why?
ef and hnsw_ef: Tuning the Speed vs Recall Trade-off
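**HNSW has two groups of parameters** that govern search quality and speed. Build-time parameters (`m`, `ef_construct`) live in the collection's `hnsw_config`. Search-time ef, the candidate queue size during search, is passed per request as `hnsw_ef` in the search `params`; if it is omitted, Qdrant falls back to a default derived from the collection's HNSW settings. Higher ef = better recall, higher latency.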
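As an illustration of setting ef per request, the sketch below builds request bodies for Qdrant's `points/search` endpoint, where search-time ef is passed as `params.hnsw_ef` and quantization rescoring as `params.quantization`. The helper name `search_body` and the two-stage values (ef=64 for candidates, ef=256 with oversampling 4 for final queries) are taken from the examples in this lesson and are assumptions, not tuned recommendations.

```python
from typing import Optional


def search_body(vector: list, limit: int = 10,
                hnsw_ef: Optional[int] = None,
                oversampling: Optional[float] = None) -> dict:
    """Build a JSON body for POST /collections/{name}/points/search."""
    body = {"vector": vector, "limit": limit}
    params = {}
    if hnsw_ef is not None:
        # Larger candidate queue: better recall, higher latency.
        params["hnsw_ef"] = hnsw_ef
    if oversampling is not None:
        # Fetch extra quantized candidates, then rescore with original vectors.
        params["quantization"] = {"rescore": True, "oversampling": oversampling}
    if params:
        body["params"] = params
    return body


query = [0.12, 0.34, 0.56, 0.78]  # dummy 4-dimensional query vector
fast = search_body(query, limit=50, hnsw_ef=64)
precise = search_body(query, limit=10, hnsw_ef=256, oversampling=4.0)
print(fast["params"])
print(precise["params"])
```

Because ef is a per-request setting, the same collection can serve both the fast and the precise profile without any reconfiguration.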
**Quick production tuning recipe:**

1. Binary Quantization + `always_ram: true` - the main RAM saver.
2. `on_disk: true` for float32 - original vectors on NVMe.
3. `indexing_threshold: 0` during bulk load → `20000` after.
4. `hnsw_ef: 128` by default; raise it for recall-sensitive queries, lower it for latency-sensitive ones.
5. Batch upsert 100-200, `wait: false`.

This covers most production use cases.
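The first three recipe items can be expressed as a collection-creation body, together with the arithmetic behind the RAM saving. This is a sketch, not a definitive configuration: the helper names are hypothetical, and the 1536-dimensional vector size is an assumption chosen because it reproduces the ~300 GB figure for 50M vectors mentioned earlier.

```python
def recipe_collection_body(dim: int, distance: str = "Cosine") -> dict:
    """Body for PUT /collections/{name}, following recipe items 1-3."""
    return {
        # Item 2: original float32 vectors are memory-mapped from disk.
        "vectors": {"size": dim, "distance": distance, "on_disk": True},
        # Item 1: 1-bit-per-dimension copies are kept in RAM.
        "quantization_config": {"binary": {"always_ram": True}},
        # Item 3: indexing is off for the initial load; restore it afterward.
        "optimizers_config": {"indexing_threshold": 0},
    }


def float32_bytes(n_vectors: int, dim: int) -> int:
    """Raw size of the original vectors: 4 bytes per dimension."""
    return n_vectors * dim * 4


def binary_bytes(n_vectors: int, dim: int) -> int:
    """Size of binary-quantized vectors: 1 bit per dimension, rounded up."""
    return n_vectors * ((dim + 7) // 8)


n, dim = 50_000_000, 1536  # dim is an assumption, see the lead-in
print(f"float32 on disk: {float32_bytes(n, dim) / 1e9:.1f} GB")
print(f"binary in RAM:   {binary_bytes(n, dim) / 1e9:.1f} GB")
```

The estimate covers vector data only; the HNSW graph, payload, and OS page cache need additional headroom, so treat the numbers as a lower bound when sizing a server.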
"The higher the m parameter in HNSW the better - set m=64 for maximum recall"
m=16 delivers 99%+ recall in most workloads. Increasing m to 32-64 marginally improves recall but doubles or quadruples the RAM for the HNSW graph and slows down builds. The first lever for improving recall is ef_construct and search-time ef - not m.
With m=16, each node keeps up to 16 links on the upper layers and up to 32 (2×m) on the bottom layer. Recall at ef=128 is typically ~98-99%. m=32 gives 99%+ but the graph is about 2× heavier. Rule of thumb: m=16 by default; move to m=32 only if Scalar/Binary quantization with ef=256 still doesn't deliver the required recall.
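The claim that graph memory scales with m can be checked with a rough estimate. The sketch below assumes that the bottom layer dominates, that each node there holds up to 2×m links, and that each link costs 4 bytes; these are simplifying assumptions about HNSW in general, not Qdrant's exact storage format, which may be more compact. The collection size of 5M vectors is illustrative.

```python
def hnsw_links_bytes(n_vectors: int, m: int, bytes_per_link: int = 4) -> int:
    """Approximate upper bound for graph links, counting only the bottom layer."""
    links_per_node = 2 * m  # assumed bottom-layer limit
    return n_vectors * links_per_node * bytes_per_link


n = 5_000_000  # illustrative collection size
baseline = hnsw_links_bytes(n, 16)
for m in (16, 32, 64):
    size = hnsw_links_bytes(n, m)
    print(f"m={m}: ~{size / 1e9:.2f} GB ({size // baseline}x of m=16)")
```

Under these assumptions the graph grows from about 0.64 GB at m=16 to 1.28 GB at m=32 and 2.56 GB at m=64, matching the 2× and 4× scaling described above.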
Search returns 10 results in 8ms (p50) but recall@10 = 92% (you need 99%+). Collection: 5M vectors, Binary Quantization, ef=64. What should you change?
Key Takeaways
- **Batch upsert** (100-200 points, wait: false) is 50-100× faster than single inserts
- **indexing_threshold: 0** during bulk load disables HNSW rebuild. Restore to 20000 afterward
- **Memmap** (on_disk: true) - vectors on SSD, OS caches hot pages. Use NVMe where possible (SATA SSD is acceptable); avoid HDD
- **ef at search time** - the primary recall vs latency lever: ef=64 (fast), ef=128 (default), ef=256 (precise)
- **Binary Quantization + on_disk float32** - the standard pattern for collections larger than available RAM
What's Next
You've completed the entire Production section. You now know how to deploy, scale, monitor, and optimize Qdrant.
- Quantization — Binary and Scalar Quantization are the foundation of performance optimization
- Monitoring — pending_optimizations and latency metrics are the inputs for tuning decisions
- HNSW: How the Index Works — Understanding HNSW internals explains exactly why ef and m behave the way they do
Questions for Reflection
- Why does indexing_threshold: 0 speed up bulk loading but slow down search during loading? Describe what happens to data with indexing_threshold = 0 vs 20000.
- Your collection has m=16, ef_construct=100. You measured recall@10 = 95% and want 99%. Which parameter should you change and why - m, ef_construct, or search-time ef? Which one requires recreating the collection?
- Design the optimal Qdrant configuration for: 100M vectors at 1536-dim, 64 GB RAM, NVMe SSD 2 TB, required recall 99%, P99 latency < 50ms. List all settings and justify each choice.