Qdrant - Vector Database
Payload Indexes
1M documents. Semantic search runs in 5ms. But add a filter for 'published articles from 2024 only' and the query takes 300ms. The reason: a full payload scan. One `createPayloadIndex` call brings it back to 5ms.
- **E-commerce:** filters by category + price range + availability - all three fields must be indexed
- **News aggregator:** filtering by publication date (datetime) + language (keyword) + topic (keyword) - standard setup
- **Geo service:** searching for similar places within 5 km radius - geo index is mandatory
Prerequisites
Why payload indexes exist
A **payload index** is a data structure that lets Qdrant quickly filter points by payload field values. Without an index, every filtering query requires a **full scan** - iterating over every point in the collection.
**When to create indexes:** for any field used in regular filters. Qdrant does not index payload fields automatically; indexes are created explicitly through the API, so rarely queried fields stay unindexed.
**Indexes do not block requests.** Creating an index on a populated collection triggers a background build. Until the build finishes, filters fall back to a full scan (slow); once it is done, they use the index (fast). Progress can be tracked via the `/collections/{name}` API.
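The difference between a full scan and an index lookup can be illustrated with a toy model. This is only a sketch: `ToyPoint`, `filterByScan`, `buildKeywordIndex`, and `filterByIndex` are hypothetical names, and the `Map` stands in for an index in spirit only; it is not how Qdrant stores its indexes internally.

```typescript
// Toy model of payload filtering. NOT Qdrant's real data structures.
type ToyPoint = { id: number; payload: { status: string } };

// Full scan: inspect every point's payload on every query, O(N).
function filterByScan(points: ToyPoint[], status: string): number[] {
  const ids: number[] = [];
  for (const p of points) {
    if (p.payload.status === status) ids.push(p.id);
  }
  return ids;
}

// Keyword-style index: value -> ids of matching points, built once up front.
function buildKeywordIndex(points: ToyPoint[]): Map<string, number[]> {
  const index = new Map<string, number[]>();
  for (const p of points) {
    const ids = index.get(p.payload.status) ?? [];
    ids.push(p.id);
    index.set(p.payload.status, ids);
  }
  return index;
}

// Indexed filter: a single lookup instead of touching every point.
function filterByIndex(index: Map<string, number[]>, status: string): number[] {
  return index.get(status) ?? [];
}

// Ten points; every fifth one is archived.
const toyPoints: ToyPoint[] = Array.from({ length: 10 }, (_, i) => ({
  id: i,
  payload: { status: i % 5 === 0 ? "archived" : "active" },
}));
const statusIndex = buildKeywordIndex(toyPoints);

console.log(filterByScan(toyPoints, "archived"));    // [ 0, 5 ]
console.log(filterByIndex(statusIndex, "archived")); // [ 0, 5 ]
```

Both functions return the same ids; the scan does work proportional to the collection size on every query, while the index pays that cost once at build time.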
Collection: 500k points. A filter on 'status' (values: 'active', 'archived') is used in 80% of queries. No index exists. What happens?
Index types: keyword, integer, float, geo, text, datetime
**This lesson covers six payload index types**, each optimized for its data type and query pattern. Qdrant also provides `bool` and `uuid` indexes, which are created the same way.
| Index type | Data type | Supported filters | Example fields |
|---|---|---|---|
| keyword | string | match, is_null, is_empty | category, status, language, tag |
| integer | int64 | match, range (gte/lte/gt/lt) | year, user_id, view_count, priority |
| float | float64 | range (gte/lte/gt/lt) | price, score, latitude, longitude |
| geo | { lon, lat } | geo_bounding_box, geo_radius, geo_polygon | location, coordinates |
| text | string (full-text) | match.text (full-text search) | description, content, title |
| datetime | RFC3339 string | range (gte/lte/gt/lt) | created_at, published_at, expires_at |
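To make the table concrete, here is one filter condition per index type, written as the JSON objects that go inside a filter's `must` array in the REST API. This is a sketch: the field names and values are illustrative assumptions taken from the "Example fields" column, not a real schema.

```typescript
// One illustrative filter condition per index type from the table above.
// Field names and values are made up for the example.
const conditionByIndexType = {
  keyword: { key: "category", match: { value: "electronics" } },
  integer: { key: "year", range: { gte: 2024 } },
  float: { key: "price", range: { gte: 10.0, lte: 50.0 } },
  geo: {
    key: "location",
    // radius is given in meters: 5000 = 5 km
    geo_radius: { center: { lon: 13.4, lat: 52.52 }, radius: 5000 },
  },
  text: { key: "description", match: { text: "wireless headphones" } },
  datetime: {
    key: "published_at",
    range: { gte: "2024-01-01T00:00:00Z", lt: "2025-01-01T00:00:00Z" },
  },
};

// "Published articles from 2024 only": a keyword condition plus a datetime range.
// The `status` field is an assumption; it would need its own keyword index.
const publishedIn2024 = {
  must: [
    { key: "status", match: { value: "published" } },
    conditionByIndexType.datetime,
  ],
};

console.log(JSON.stringify(publishedIn2024, null, 2));
```

Each condition is only fast when the field it names has an index of the matching type.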
Task: search restaurants by cuisine (semantically, via vector), within a 3 km radius, with rating >= 4.5, currently open (is_open = true). Which indexes are needed?
Creating indexes: a practical example
**Full workflow:** creating a collection, adding points, creating indexes, filtered search. Indexes are created after adding data - Qdrant builds them in the background.
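The workflow above can be sketched as a sequence of REST calls. To keep the example runnable without a server, it only builds the requests rather than sending them. Assumptions: the collection name `articles`, the 4-dimensional vectors, the sample payloads, and the helper names `RestCall` and `buildWorkflow` are all hypothetical; the paths and body fields follow Qdrant's REST API as I understand it and should be checked against the API reference for your server version.

```typescript
// A REST request described as data; nothing is sent over the network here.
type RestCall = { method: "PUT" | "POST"; path: string; body: unknown };

function buildWorkflow(collection: string): RestCall[] {
  const base = `/collections/${collection}`;
  return [
    // 1. Create the collection (tiny 4-dimensional vectors for the demo).
    { method: "PUT", path: base, body: { vectors: { size: 4, distance: "Cosine" } } },
    // 2. Add points with payload.
    {
      method: "PUT",
      path: `${base}/points?wait=true`,
      body: {
        points: [
          { id: 1, vector: [0.1, 0.2, 0.3, 0.4], payload: { category: "tech", price: 19.9 } },
          { id: 2, vector: [0.4, 0.3, 0.2, 0.1], payload: { category: "news", price: 5.0 } },
        ],
      },
    },
    // 3. Create payload indexes; the server builds them in the background.
    { method: "PUT", path: `${base}/index`, body: { field_name: "category", field_schema: "keyword" } },
    { method: "PUT", path: `${base}/index`, body: { field_name: "price", field_schema: "float" } },
    // 4. Filtered vector search over the indexed fields.
    {
      method: "POST",
      path: `${base}/points/query`,
      body: {
        query: [0.1, 0.2, 0.3, 0.4],
        filter: {
          must: [
            { key: "category", match: { value: "tech" } },
            { key: "price", range: { lte: 20.0 } },
          ],
        },
        limit: 3,
        with_payload: true,
      },
    },
  ];
}

for (const call of buildWorkflow("articles")) {
  console.log(call.method, call.path);
}
```

Each entry can be sent with any HTTP client to a running Qdrant instance; the JS client wraps the index step in the `createPayloadIndex` method mentioned earlier.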
**Index only what is needed.** Each index takes additional memory (~50-200MB per 1M points) and slows down writes. A good rule: index fields that appear in filters in more than 20% of queries. Rarely used filters can fall back to a full scan.
**Common mistake:** creating indexes on all payload fields 'just to be safe'.
**Better:** index only the fields that appear in filters. Excess indexes cost memory and write speed and give zero benefit for search.
**Why:** every index must be maintained on every write (upsert), so write cost grows with the number of indexes: going from 5 indexes to 20 can make upserts several times slower. Each index also uses RAM.
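The 20% rule of thumb can be applied mechanically to filter-usage statistics. The sketch below assumes you already count how often each field appears in query filters (for example, from application logs); the function name `fieldsWorthIndexing` and the sample numbers are hypothetical.

```typescript
// Pick the payload fields whose filter usage exceeds a share of all queries.
// filterUsage: field name -> number of queries that filtered on that field.
function fieldsWorthIndexing(
  filterUsage: Record<string, number>,
  totalQueries: number,
  threshold = 0.2, // the ">20% of queries" rule of thumb
): string[] {
  return Object.entries(filterUsage)
    .filter(([, count]) => count / totalQueries > threshold)
    .map(([field]) => field)
    .sort();
}

// Made-up statistics for 1000 logged queries.
const usage = { status: 800, category: 450, author_id: 120, legacy_flag: 15 };

console.log(fieldsWorthIndexing(usage, 1000)); // [ 'category', 'status' ]
```

Here `author_id` (12%) and `legacy_flag` (1.5%) stay unindexed and fall back to a full scan when they are used.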
1M points were added, then an index on 'category' was created. At that moment a query with a category filter arrives. What happens?
Key Ideas
- **Without index = full scan:** every filter iterates over all points, O(N). With an index the lookup is typically O(log N) or better
- **6 types:** keyword (categories), integer/float (numbers/ranges), geo (geolocation), text (full-text), datetime (time)
- **createPayloadIndex** can be called after the data is loaded; the index is built online and doesn't block queries
- **Index conservatively:** only fields used in filters. Each extra index = memory + slower writes
- **Check indexes:** `getCollection` → `payload_schema`
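Checking `payload_schema` can be automated. The sketch below compares the fields your filters use against the indexes reported by the collection info. The `sampleInfo` object is a hand-written, trimmed approximation of a `GET /collections/{name}` response, and `missingIndexes` is a hypothetical helper; the real response contains more fields.

```typescript
// Trimmed, assumed shape of the collection info response.
type PayloadIndexInfo = { data_type: string; points: number };
type CollectionInfoResponse = {
  result: {
    status: string;
    points_count: number;
    payload_schema: Record<string, PayloadIndexInfo>;
  };
};

// Return the filter fields that have no payload index yet.
function missingIndexes(info: CollectionInfoResponse, filterFields: string[]): string[] {
  const indexed = new Set(Object.keys(info.result.payload_schema));
  return filterFields.filter((field) => !indexed.has(field));
}

// Hand-written sample: only `category` is indexed.
const sampleInfo: CollectionInfoResponse = {
  result: {
    status: "green",
    points_count: 1000000,
    payload_schema: {
      category: { data_type: "keyword", points: 1000000 },
    },
  },
};

console.log(missingIndexes(sampleInfo, ["category", "published_at"])); // [ 'published_at' ]
```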
What's next
Payload indexes speed up filtering. The next step - sparse vectors for lexical search, which works alongside semantic search.
- Sparse Vectors: BM42 and SPLADE — Lexical search complements semantic - together this is Hybrid Search
- Vector Quantization — Memory compression - important when many indexes take up RAM
- HNSW: How the Index Works — HNSW + payload indexes work together for filtered vector search
Questions for reflection
- How does Qdrant decide whether to use a payload index or HNSW first in filtered vector search? What happens when a filter is highly selective?
- The collection stores JSON with nested objects (article.author.country). How is an index created on a nested field?
- If 95% of queries filter by 'is_active: true' but only 10% of points are active - is it still worth creating a keyword index?