Qdrant - Vector Database

Performance Tuning

Two projects, identical collections, identical servers. First: loading 1M vectors takes 2 hours, search P99 = 150ms. Second: loading takes 8 minutes, search P99 = 12ms. The difference is entirely in settings: indexing_threshold, batch size, ef, memmap. Performance tuning isn't a trick - it's understanding how HNSW works under the hood.

  • **Nightly re-indexing:** disable HNSW building (indexing_threshold: 0) during the load → roughly 5-6× speedup. 10M documents in 20 minutes instead of 2 hours
  • **RAG with high recall:** ef=256 + oversampling=4 for final queries, ef=64 for candidate retrieval - 99.9% accuracy at acceptable latency
  • **Resource savings:** Binary Quantization + on_disk float32 - 50M vectors on a 16 GB RAM server instead of the required 300 GB
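The RAM figures in the last bullet can be checked with simple arithmetic. The sketch below assumes 1536-dimensional vectors; the text does not state the dimension, but that value reproduces the quoted ~300 GB. It counts only raw vector storage and ignores the HNSW graph, payloads, and other overhead.

```python
def float32_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of float32 vectors: 4 bytes per component."""
    return num_vectors * dim * 4


def binary_quantized_bytes(num_vectors: int, dim: int) -> int:
    """Binary quantization keeps 1 bit per component (rounded up to whole bytes)."""
    return num_vectors * ((dim + 7) // 8)


GB = 10**9  # decimal gigabytes, to match the rounded figures in the text

n, dim = 50_000_000, 1536  # dim is an assumption, not stated in the text
print(f"float32: {float32_bytes(n, dim) / GB:.1f} GB")          # 307.2 GB
print(f"binary:  {binary_quantized_bytes(n, dim) / GB:.1f} GB")  # 9.6 GB
```

Under these assumptions the quantized copy fits in 16 GB of RAM with room to spare, while the float32 originals have to live on disk.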

Prerequisites

  • Vector Quantization
  • Monitoring and Metrics

indexing_threshold and Batch Upsert: Maximizing Ingestion Speed

**Ingestion performance** depends on two factors: how you write points (batch vs single inserts) and when Qdrant builds the HNSW index (`indexing_threshold`). Wrong settings can make loading 10-100× slower than necessary.
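A minimal sketch of the batching side, using only the standard library. It groups points into batches and builds the request each batch would be sent with; the path and body follow the Qdrant HTTP points-upsert endpoint as commonly documented, the collection name `docs` and the sample points are hypothetical, and nothing is sent over the network.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunked(points: Iterable[dict], batch_size: int = 200) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `batch_size` points."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(points)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def upsert_request(collection: str, batch: List[dict]) -> Tuple[str, str, dict]:
    """Return (method, path, json_body) for one batched upsert.

    wait=false asks the server to acknowledge without waiting for the
    write to be fully applied.
    """
    path = f"/collections/{collection}/points?wait=false"
    return "PUT", path, {"points": batch}


# Hypothetical data: 1,000 tiny 4-dimensional points.
points = ({"id": i, "vector": [0.0, 0.1, 0.2, 0.3]} for i in range(1000))
requests = [upsert_request("docs", b) for b in chunked(points, 200)]
print(len(requests))  # 5 requests instead of 1,000 single inserts
```

Because `chunked` consumes an iterator lazily, the same helper works when the points are streamed from a file rather than held in memory.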

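| indexing_threshold | Search before threshold | Search after | When to use |
|---|---|---|---|
| 0 | Flat (exact, slow) | No HNSW (indexing disabled) | Small collections (< 10k points) where exact search is fast enough |
| 10000 | Flat below ~10k points | HNSW above ~10k | Balance for frequent small writes |
| 20000 (default) | Flat below ~20k points | HNSW above ~20k | Standard for most workloads |
| 0 (during bulk load) | Flat (slow) | HNSW after re-enabling | Initial bulk load - max write throughput |

Note: Qdrant measures `indexing_threshold` in kilobytes of vector data per segment, not in points. The point counts above hold for 256-dimensional float32 vectors (1 KB each) and shrink proportionally for larger vectors. A value of 0 disables HNSW building until the threshold is raised again.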
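The bulk-load pattern amounts to two collection updates around the upload. The sketch below only builds the requests (method, path, JSON body) with the standard library; the `PATCH /collections/{name}` shape with `optimizers_config.indexing_threshold` follows the Qdrant HTTP API as commonly documented, and the collection name `docs` is hypothetical.

```python
import json
from typing import Tuple


def set_indexing_threshold(collection: str, threshold: int) -> Tuple[str, str, str]:
    """Return (method, path, json_body) that updates the optimizer setting."""
    if threshold < 0:
        raise ValueError("threshold must be >= 0")
    body = {"optimizers_config": {"indexing_threshold": threshold}}
    return "PATCH", f"/collections/{collection}", json.dumps(body)


# 1) Before the bulk load: turn HNSW building off.
before = set_indexing_threshold("docs", 0)
# 2) ... upload all batches here ...
# 3) After the load: restore the threshold so the optimizer builds HNSW.
after = set_indexing_threshold("docs", 20000)

for method, path, body in (before, after):
    print(method, path, body)
```

Until step 3 has been applied and the optimizer has finished, searches fall back to a flat scan, so this pattern suits loads that happen before the collection serves traffic.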

You're loading 10M vectors into a new collection. Batch size = 100, wait = false, indexing_threshold = 20000 (default). Loading is slower than expected. What helps?

Memmap: Disk Storage with Memory-Mapped Access

**Memmap (memory-mapped files)** lets Qdrant store vectors on SSD while accessing them through virtual memory. The OS caches 'hot' pages. Result: collections larger than RAM become possible, with only a slight latency increase on cache misses.

**Disk type is critical for memmap.** NVMe SSD: latency ~0.1ms, excellent performance. SATA SSD: latency ~0.5ms, acceptable. HDD: latency ~5-10ms, unacceptable for search. If Qdrant is on HDD - disable memmap or replace the disk.
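To see why the disk type matters so much, here is a rough back-of-the-envelope model using the per-read latencies quoted above. The number of uncached page reads per query (`COLD_READS`) is a hypothetical value chosen for illustration, and the model pessimistically assumes the reads happen one after another.

```python
from typing import Tuple

# Per-read latencies from the text, in milliseconds (HDD uses the 5-10 ms range).
DISK_LATENCY_MS = {
    "NVMe SSD": (0.1, 0.1),
    "SATA SSD": (0.5, 0.5),
    "HDD": (5.0, 10.0),
}


def cold_read_penalty_ms(disk: str, cold_reads: int) -> Tuple[float, float]:
    """Return the (low, high) extra latency if `cold_reads` page reads miss the cache.

    Assumes sequential reads, which is a simplification; real I/O may overlap.
    """
    low, high = DISK_LATENCY_MS[disk]
    return low * cold_reads, high * cold_reads


COLD_READS = 50  # hypothetical number of uncached page reads for one query
for disk in DISK_LATENCY_MS:
    low, high = cold_read_penalty_ms(disk, COLD_READS)
    print(f"{disk}: {low:.0f} to {high:.0f} ms extra")
```

Even under this crude model, the same cache-miss pattern costs a few milliseconds on NVMe and hundreds of milliseconds on HDD.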

Collection: 20M vectors, memmap enabled. Queries for 'popular' documents are fast (2ms), queries for 'rare' documents are slow (50ms). Why?

ef and hnsw_ef: Tuning the Speed vs Recall Trade-off

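**HNSW has two key knobs** that govern search quality and speed: the graph settings fixed at build time and the search-time `ef`. `ef` is the candidate queue size during search; in Qdrant it is set per request as `hnsw_ef` in the search `params`. The collection-level `hnsw_config` holds the build-time settings (`m`, `ef_construct`), and a request that omits `hnsw_ef` typically falls back to `ef_construct`. Higher ef = better recall, higher latency.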
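One way to tune this trade-off is to compare approximate results at several ef values against an exact (brute-force) search for the same query. The sketch below builds search request bodies and computes recall@k with the standard library only. The body shape (`params.hnsw_ef`, `params.exact`) follows the Qdrant HTTP search API as commonly documented; the query vector and result ids are hypothetical stand-ins for real responses.

```python
from typing import List, Optional


def search_body(vector: List[float], limit: int = 10,
                hnsw_ef: Optional[int] = None, exact: bool = False) -> dict:
    """JSON body for POST /collections/{name}/points/search."""
    params: dict = {"exact": exact}
    if hnsw_ef is not None:
        params["hnsw_ef"] = hnsw_ef
    return {"vector": vector, "limit": limit, "params": params}


def recall_at_k(approx_ids: List[int], exact_ids: List[int], k: int) -> float:
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


query = [0.1, 0.2, 0.3, 0.4]                   # hypothetical query vector
fast = search_body(query, hnsw_ef=64)          # lower latency, lower recall
precise = search_body(query, hnsw_ef=256)      # higher recall, higher latency
ground_truth = search_body(query, exact=True)  # brute-force baseline

# Hypothetical result ids, standing in for real responses:
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))  # 0.75
```

Averaging `recall_at_k` over a few hundred representative queries gives a more trustworthy picture than any single query.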

**Quick production tuning recipe:** 1) Binary Quantization + `always_ram: true` - the main RAM saver. 2) `on_disk: true` for float32 - original vectors on NVMe. 3) `indexing_threshold: 0` during bulk load → `20000` after. 4) `hnsw_ef: 128` by default; raise it for recall-sensitive queries, lower it for latency-sensitive ones. 5) Batch upsert 100-200, `wait: false`. This covers 90% of production use cases.
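As a sketch, steps 1-4 of the recipe map onto a collection-creation body and per-request search params roughly as follows. Field names follow the Qdrant HTTP API as commonly documented; the `Cosine` distance and the 1536 dimension are assumptions, since the recipe names neither.

```python
import json


def recipe_collection_body(dim: int, bulk_loading: bool = True) -> dict:
    """JSON body for PUT /collections/{name}, following steps 1-3 of the recipe."""
    return {
        "vectors": {
            "size": dim,
            "distance": "Cosine",  # assumption: the text names no metric
            "on_disk": True,       # step 2: original float32 vectors on disk
        },
        "quantization_config": {
            "binary": {"always_ram": True},  # step 1: quantized copy in RAM
        },
        "optimizers_config": {
            # step 3: 0 while bulk loading, 20000 afterwards
            "indexing_threshold": 0 if bulk_loading else 20000,
        },
    }


def recipe_search_params(hnsw_ef: int = 128) -> dict:
    """Per-request search params (step 4); raise hnsw_ef when recall matters more."""
    return {"hnsw_ef": hnsw_ef}


print(json.dumps(recipe_collection_body(1536), indent=2))
print(json.dumps(recipe_search_params()))
```

Keeping the recipe in one function makes it easy to diff the intended configuration against what the collection actually reports.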

"The higher the m parameter in HNSW the better - set m=64 for maximum recall"

m=16 delivers 99%+ recall in most workloads. Increasing m to 32-64 marginally improves recall but doubles or quadruples the RAM for the HNSW graph and slows down builds. The first lever for improving recall is ef_construct and search-time ef - not m.

With m=16, each node keeps up to 16 links on the upper layers (HNSW implementations, Qdrant included, typically allow 2×m = 32 on the bottom layer). Recall at ef=128 is ~98-99%. m=32 gives 99%+ but the graph is 2× heavier. Rule of thumb: m=16 by default, m=32 only if Scalar/Binary quantization with ef=256 still doesn't deliver the required recall.
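The claim that doubling m roughly doubles the graph's footprint can be checked with a rough estimate. This sketch assumes the textbook HNSW layout (up to 2×m link slots on the bottom layer, m on each upper layer, a node reaching upper layer l with probability m^-l) and 4-byte neighbor ids; Qdrant's actual in-memory layout may differ, and the collection size is hypothetical.

```python
def hnsw_links_per_node(m: int) -> float:
    """Rough expected link slots per node.

    Assumes 2*m slots on the bottom layer and m slots on each upper layer,
    with a node reaching upper layer l with probability m**-l, so the expected
    number of upper layers per node is 1 / (m - 1).
    """
    if m < 2:
        raise ValueError("m must be >= 2")
    return 2 * m + m / (m - 1)


def hnsw_graph_bytes(num_vectors: int, m: int, bytes_per_link: int = 4) -> float:
    """Upper-bound estimate of link storage, assuming 4-byte neighbor ids."""
    return num_vectors * hnsw_links_per_node(m) * bytes_per_link


n = 5_000_000  # hypothetical collection size
for m in (16, 32, 64):
    print(f"m={m}: ~{hnsw_graph_bytes(n, m) / 1e9:.2f} GB")
```

The bottom layer dominates the total, which is why the estimate scales almost linearly with m.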

Search returns 10 results in 8ms (p50) but recall@10 = 92% (you need 99%+). Collection: 5M vectors, Binary Quantization, ef=64. What should you change?

Key Takeaways

  • **Batch upsert** (100-200 points, wait: false) is 50-100× faster than single inserts
  • **indexing_threshold: 0** during bulk load disables HNSW rebuild. Restore to 20000 afterward
  • **Memmap** (on_disk: true) - vectors on SSD, OS caches hot pages. Needs an SSD (NVMe preferred, SATA acceptable), not HDD
  • **ef at search time** (`hnsw_ef`) - the primary recall vs latency lever: ef=64 (fast), ef=128 (the baseline recommended here), ef=256 (precise)
  • **Binary Quantization + on_disk float32** - the standard pattern for collections larger than available RAM

What's Next

You've completed the entire Production section. You now know how to deploy, scale, monitor, and optimize Qdrant.

  • Quantization — Binary and Scalar Quantization are the foundation of performance optimization
  • Monitoring — pending_optimizations and latency metrics are the inputs for tuning decisions
  • HNSW: How the Index Works — Understanding HNSW internals explains exactly why ef and m behave the way they do

Questions for Reflection

  • Why does indexing_threshold: 0 speed up bulk loading but slow down search during loading? Describe what happens to data with indexing_threshold = 0 vs 20000.
  • Your collection has m=16, ef_construct=100. You measured recall@10 = 95% and want 99%. Which parameter should you change and why - m, ef_construct, or search-time ef? Which one requires recreating the collection?
  • Design the optimal Qdrant configuration for: 100M vectors at 1536-dim, 64 GB RAM, NVMe SSD 2 TB, required recall 99%, P99 latency < 50ms. List all settings and justify each choice.

Related Lessons

  • alg-12-bfs