Qdrant - Vector Database

Performance Tuning

Two projects, identical collections, identical servers. First: loading 1M vectors takes 2 hours, search P99 = 150ms. Second: loading takes 8 minutes, search P99 = 12ms. The difference is entirely in settings: indexing_threshold, batch size, ef, memmap. Performance tuning isn't a trick - it's understanding how HNSW works under the hood.

**Nightly re-indexing:** disable HNSW indexing (indexing_threshold: 0) during load → roughly 6× speedup: 10M documents in 20 minutes instead of 2 hours
**RAG with high recall:** ef=256 + oversampling=4 for final queries, ef=64 for candidate retrieval - 99.9% recall at acceptable latency
**Resource savings:** Binary Quantization + on_disk float32 - 50M vectors on a 16 GB RAM server instead of the required 300 GB
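The RAM figures in the last bullet can be checked with simple arithmetic. This is a minimal sketch. It assumes 1536-dim vectors, which is the dimension the ~300 GB figure implies. It counts only raw vector data, not the HNSW graph, payloads, or Qdrant's own overhead. The helper names are hypothetical.

```python
def float32_bytes(n_vectors, dim):
    """Bytes for full-precision vectors: 4 bytes per component."""
    return n_vectors * dim * 4


def binary_quantized_bytes(n_vectors, dim):
    """Bytes for Binary Quantization codes: 1 bit per component, padded to whole bytes in this sketch."""
    return n_vectors * ((dim + 7) // 8)


if __name__ == "__main__":
    n, dim = 50_000_000, 1536  # dim is an assumption implied by the ~300 GB figure
    print(f"float32: {float32_bytes(n, dim) / 1e9:.1f} GB, "
          f"binary codes: {binary_quantized_bytes(n, dim) / 1e9:.1f} GB")
```

The script prints about 307.2 GB for float32 and 9.6 GB for the binary codes. The codes fit in 16 GB of RAM with room left for the graph. The float32 originals stay on disk and are read only when candidates are rescored.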

Prerequisites

indexing_threshold and Batch Upsert: Maximizing Ingestion Speed

**Ingestion performance** depends on two factors: how you write points (batch vs single inserts) and when Qdrant builds the HNSW index (`indexing_threshold`). Wrong settings can make loading 10-100× slower than necessary.

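indexing_threshold (KB of vectors per segment)	Segment below threshold	Segment above threshold	When to use
0 (during bulk load)	Plain index (exact, slow); no HNSW is built at any size	Not indexed until the threshold is restored	Initial bulk load - max write throughput
10000	Plain index below ~10 MB of vectors	HNSW	Balance for frequent small writes
20000 (default)	Plain index below ~20 MB of vectors	HNSW	Standard for most workloads
Note: the threshold counts kilobytes of vector data per segment, not points. 1 KB is one 256-dim float32 vector, so 20000 KB holds about 20k such vectors but only about 3.3k 1536-dim vectors.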
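The bulk-load pattern can be written out as a sequence of REST calls: set the threshold to 0 while writing, send batched upserts with `wait=false`, then restore the threshold. This is a minimal sketch under a few assumptions. The collection is assumed to exist already. You would send each `(method, path, body)` tuple with any HTTP client to Qdrant's REST port, which is 6333 by default. `bulk_load_plan` and `batched` are hypothetical helpers, not part of a Qdrant client.

```python
from itertools import islice


def batched(points, batch_size=200):
    """Yield successive lists of at most batch_size points."""
    it = iter(points)
    while batch := list(islice(it, batch_size)):
        yield batch


def bulk_load_plan(collection, points, batch_size=200):
    """Yield (method, path, body) tuples describing a bulk load over the Qdrant REST API."""
    # 1) Turn off HNSW indexing: segments stay as plain (flat) storage while writing.
    yield ("PATCH", f"/collections/{collection}",
           {"optimizers_config": {"indexing_threshold": 0}})
    # 2) Batched upserts; wait=false returns once the batch is accepted, not once it is applied.
    for batch in batched(points, batch_size):
        yield ("PUT", f"/collections/{collection}/points?wait=false", {"points": batch})
    # 3) Restore the default threshold so the optimizer builds HNSW once, after all data is in.
    yield ("PATCH", f"/collections/{collection}",
           {"optimizers_config": {"indexing_threshold": 20000}})


if __name__ == "__main__":
    points = ({"id": i, "vector": [0.0] * 4, "payload": {"n": i}} for i in range(450))
    for method, path, body in bulk_load_plan("docs", points, batch_size=200):
        print(method, path, len(body.get("points", [])))
```

Both helpers are generators, so only one batch is held in memory at a time. That matters when the source is 10M points.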

You're loading 10M vectors into a new collection. Batch size = 100, wait = false, indexing_threshold = 20000 (default). Loading is slower than expected. What helps?

Memmap: Disk Storage with Memory-Mapped Access

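**Memmap (memory-mapped files)** lets Qdrant store vectors on SSD while accessing them through virtual memory; in Qdrant it is enabled per vector with `on_disk: true`. The OS keeps 'hot' pages in its page cache. Result: collections larger than RAM become possible. Latency stays close to in-memory speed while a query touches cached pages and rises on cache misses, when pages must be read from disk.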

**Disk type is critical for memmap.** NVMe SSD: latency ~0.1ms, excellent performance. SATA SSD: latency ~0.5ms, acceptable. HDD: latency ~5-10ms, unacceptable for search. If Qdrant runs on HDD, keep vectors in RAM (`on_disk: false`) or replace the disk.
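The hot/cold page behavior described above can be illustrated with a toy page cache. This is a simplified LRU model, not how Qdrant or Linux actually manage memory; the Linux page cache uses a more elaborate LRU approximation. The 0.1 ms miss cost is the NVMe figure from the text. The hit cost is an assumed placeholder.

```python
from collections import OrderedDict

# Assumed per-page access costs in milliseconds (illustrative, not measured).
RAM_HIT_MS = 0.0001   # placeholder: page already in the page cache (~100 ns)
NVME_MISS_MS = 0.1    # NVMe read latency quoted in the text


class PageCache:
    """Toy LRU page cache: holds at most `capacity` pages and evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> None, ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def access(self, page):
        """Touch one page and return the simulated latency of that access in ms."""
        if page in self.pages:
            self.pages.move_to_end(page)     # mark as most recently used
            self.hits += 1
            return RAM_HIT_MS
        self.misses += 1                     # cache miss: simulated disk read
        self.pages[page] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return NVME_MISS_MS


if __name__ == "__main__":
    cache = PageCache(capacity=1_000)
    popular = range(100)                     # pages a frequently repeated query keeps touching
    for _ in range(50):                      # repeated popular queries warm the cache
        for p in popular:
            cache.access(p)
    hot = sum(cache.access(p) for p in popular)
    cold = sum(cache.access(p) for p in range(10_000, 10_100))  # pages nobody read recently
    print(f"popular query: {hot:.2f} ms, rare query: {cold:.2f} ms")
```

Pages that repeated queries keep touching stay cached and pay only the hit cost. A query that walks pages nobody has read recently pays the disk cost on every access.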

Collection: 20M vectors, memmap enabled. Queries for 'popular' documents are fast (2ms), queries for 'rare' documents are slow (50ms). Why?

ef and hnsw_ef: Tuning the Speed vs Recall Trade-off

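**HNSW has two ef parameters** that govern search quality and speed. `ef_construct` (in the collection's `hnsw_config`, default 100) is the candidate list size used while building the graph. `ef` is the search-time candidate queue size; in Qdrant it is passed per request as `hnsw_ef` in the search params, and if a request omits it, Qdrant uses the collection's `ef_construct` as the search ef. Higher ef = better recall, higher latency.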
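Search-time settings are carried in each request. The sketch below builds the JSON body for Qdrant's `POST /collections/{name}/points/search` endpoint. It uses the two profiles from the RAG example at the top: ef=64 for candidates, and ef=256 with oversampling=4 for final queries. `search_body` is a hypothetical helper, and the query vector is a placeholder.

```python
def search_body(vector, limit=10, hnsw_ef=None, oversampling=None, exact=False):
    """Build a JSON body for POST /collections/{name}/points/search."""
    params = {"exact": exact}  # exact=True means brute-force search, bypassing HNSW
    if hnsw_ef is not None:
        # Search-time ef: size of the candidate queue. Higher = better recall, higher latency.
        params["hnsw_ef"] = hnsw_ef
    if oversampling is not None:
        # With quantization: fetch limit * oversampling candidates using the compressed
        # vectors, then rescore them with the original float32 vectors.
        params["quantization"] = {"rescore": True, "oversampling": oversampling}
    return {"vector": list(vector), "limit": limit, "params": params}


# The two profiles from the RAG example (values taken from the text).
CANDIDATE_PROFILE = {"hnsw_ef": 64}
FINAL_PROFILE = {"hnsw_ef": 256, "oversampling": 4.0}

if __name__ == "__main__":
    query = [0.12, -0.03, 0.44]  # placeholder; a real query vector has the collection's dimension
    print(search_body(query, limit=10, **FINAL_PROFILE))
```

Leaving `hnsw_ef` out of a request falls back to the collection-level default. Setting it explicitly makes each query's recall/latency trade-off visible in the code.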

**Quick production tuning recipe:** 1) Binary Quantization + `always_ram: true` - the main RAM saver. 2) `on_disk: true` for float32 - original vectors on NVMe. 3) `indexing_threshold: 0` during bulk load → `20000` after. 4) `hnsw_ef: 128` as the default search parameter; raise it for recall-critical queries and lower it for latency-sensitive ones. 5) Batch upsert 100-200 points, `wait: false`. This covers most production use cases.
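Steps 1-3 of the recipe live in the collection definition; steps 4-5 are per-request settings. Below is a sketch of the `PUT /collections/{name}` body under those assumptions. Cosine distance and the 1536 dimension are placeholders for whatever your embedding model produces. `create_collection_body` is a hypothetical helper.

```python
def create_collection_body(dim, bulk_load=True):
    """Body for PUT /collections/{name}, following steps 1-3 of the recipe."""
    return {
        "vectors": {
            "size": dim,
            "distance": "Cosine",   # assumption: use the metric your embeddings were trained for
            "on_disk": True,        # step 2: float32 originals are memory-mapped from disk
        },
        "quantization_config": {
            "binary": {"always_ram": True},  # step 1: compact binary codes stay in RAM
        },
        "hnsw_config": {"m": 16, "ef_construct": 100},  # Qdrant's defaults, stated explicitly
        "optimizers_config": {
            # step 3: 0 disables indexing during the bulk load; 20000 is the default
            "indexing_threshold": 0 if bulk_load else 20000,
        },
    }


if __name__ == "__main__":
    print(create_collection_body(1536))
```

Once the bulk load is done, send `{"optimizers_config": {"indexing_threshold": 20000}}` via `PATCH /collections/{name}` so the optimizer starts building the HNSW index.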

"The higher the m parameter in HNSW the better - set m=64 for maximum recall"

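That is a misconception. m=16 typically delivers ~98-99% recall once search-time ef is high enough. Graph RAM grows roughly linearly with m, so raising m to 32 or 64 roughly doubles or quadruples the memory for the HNSW graph and slows down index builds, while improving recall only marginally. The first levers for improving recall are ef_construct and search-time ef, not m.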

With m=16, each node keeps up to 16 links on the upper layers and up to 32 (2×m) on the base layer. Recall at ef=128 is ~98-99%. m=32 gives 99%+ but the graph is 2× heavier. Rule of thumb: m=16 by default; m=32 only if Scalar/Binary quantization with ef=256 still doesn't deliver the required recall.
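Graph memory scales with m, and HNSW's structure gives a rough way to estimate how much. This sketch makes several assumptions: 4-byte neighbor ids, up to 2×m links on the base layer and m on each upper layer, and the standard level distribution from the HNSW paper. It ignores link compression and allocator overhead, so treat the result as an order-of-magnitude figure, not Qdrant's exact footprint. `hnsw_links_bytes` is a hypothetical helper.

```python
def hnsw_links_bytes(n_vectors, m, bytes_per_link=4):
    """Rough size of HNSW neighbor lists in bytes (order-of-magnitude estimate).

    Assumptions: up to 2*m links on the base layer, up to m links on each upper layer,
    a node reaches layer l with probability m**-l (standard HNSW level assignment),
    4-byte neighbor ids, no compression or allocator overhead.
    """
    expected_upper_layers = 1 / (m - 1)  # sum over l >= 1 of m**-l
    links_per_node = 2 * m + m * expected_upper_layers
    return n_vectors * links_per_node * bytes_per_link


if __name__ == "__main__":
    for m in (16, 32, 64):
        gb = hnsw_links_bytes(5_000_000, m) / 1e9
        print(f"m={m}: ~{gb:.2f} GB of graph links for 5M vectors")
```

For 5M vectors this gives roughly 0.66 GB at m=16, 1.3 GB at m=32 and 2.6 GB at m=64. That matches the "2× heavier" and "quadruples" rule of thumb.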

Search returns 10 results in 8ms (p50) but recall@10 = 92% (you need 99%+). Collection: 5M vectors, Binary Quantization, ef=64. What should you change?
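Recall figures like the 92% above come from comparing approximate results with ground truth. One way to do this is to run a sample of queries twice. The first run is a normal search; the second adds `"exact": true` to the search params, which forces brute-force search. Collect the returned point ids from both runs and average recall@k. `recall_at_k` and `mean_recall` are hypothetical helper names.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k (from exact search) that the ANN search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must contain at least one id")
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs from a sample of queries."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    return sum(scores) / len(scores)
```

A few hundred representative queries are usually enough to see whether a change to ef or oversampling moves recall in the right direction.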

Key Takeaways

**Batch upsert** (100-200 points, wait: false) is 50-100× faster than single inserts
**indexing_threshold: 0** during bulk load disables HNSW index building. Restore to 20000 (the default) afterward
**Memmap** (on_disk: true) - vectors on SSD (ideally NVMe), OS caches hot pages. Not suitable for HDD
**ef at search time** - the primary recall vs latency lever: ef=64 (fast), ef=128 (balanced starting point), ef=256 (precise)
**Binary Quantization + on_disk float32** - the standard pattern for collections larger than available RAM

What's Next

You've completed the entire Production section. You now know how to deploy, scale, monitor, and optimize Qdrant.

Quantization - Binary and Scalar Quantization are the foundation of performance optimization
Monitoring - pending_optimizations and latency metrics are the inputs for tuning decisions
HNSW: How the Index Works - Understanding HNSW internals explains exactly why ef and m behave the way they do

Questions for Reflection

Why does indexing_threshold: 0 speed up bulk loading but slow down search during loading? Describe what happens to data with indexing_threshold = 0 vs 20000.
Your collection has m=16, ef_construct=100. You measured recall@10 = 95% and want 99%. Which parameter should you change and why - m, ef_construct, or search-time ef? Which one takes effect per query, and which require rebuilding the HNSW index?
Design the optimal Qdrant configuration for: 100M vectors at 1536-dim, 64 GB RAM, NVMe SSD 2 TB, required recall 99%, P99 latency < 50ms. List all settings and justify each choice.
