Qdrant - Vector Database

Scroll and Batch Operations

You need to update the embedding model for 5 million points. The naive approach is getPoint → generate embedding → upsert (payload included) for every point. The better approach is scroll → updateVectors, which leaves the payload untouched. The difference: roughly 3x less data over the network and about twice the speed, since the payload is never sent back to the server.

**Full collection export:** scroll over the entire collection to create a data dump or verify data after snapshot recovery
**Bulk re-embedding:** switching the embedding model without recreating the collection - scroll + updateVectors preserves payload and minimises network traffic
**Batch payload updates:** bulk status changes, tag updates, metadata patches via batch set_payload without touching vectors

Prerequisites

Points, Vectors, Payloads

Scroll API: iterating over the entire collection

The **Scroll API** is for paginated full traversal of a collection. Unlike `search` (ANN search), scroll does not require a query vector and returns points in order of their IDs. Use it when you need *all* points, not just the top-N most similar ones:

- Export the entire collection
- Verification after snapshot recovery
- Batch re-embedding (update vectors for all points)
- Analytics over payload fields

Pagination in scroll works through an **offset**, not `skip`/`page`. Each response includes `next_page_offset`, the point ID at which the next page starts; pass it as `offset` in the next request. A null `next_page_offset` means the last page has been reached.
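The pagination loop can be sketched as follows. This is a minimal illustration, not a client implementation: `scroll_all` and `fake_fetch_page` are hypothetical names, and `fetch_page` stands in for whatever performs `POST /collections/{collection_name}/points/scroll` and returns the `result` object of the response. The sketch assumes each page carries `points` and a `next_page_offset` that is null on the last page.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def scroll_all(
    fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
    limit: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield every point by following next_page_offset until it is None."""
    offset: Optional[Any] = None
    while True:
        # Request body for the scroll endpoint; vectors are skipped to save traffic.
        body: Dict[str, Any] = {"limit": limit, "with_payload": True, "with_vector": False}
        if offset is not None:
            body["offset"] = offset
        page = fetch_page(body)
        yield from page["points"]
        offset = page.get("next_page_offset")
        if offset is None:  # last page reached
            break


def fake_fetch_page(body: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for the HTTP call: serves IDs 1..7 in ID order."""
    ids = [1, 2, 3, 4, 5, 6, 7]
    start = body.get("offset", ids[0])
    remaining = [i for i in ids if i >= start]
    page_ids = remaining[: body["limit"]]
    rest = remaining[body["limit"]:]
    return {
        "points": [{"id": i, "payload": {}} for i in page_ids],
        "next_page_offset": rest[0] if rest else None,
    }


all_ids = [point["id"] for point in scroll_all(fake_fetch_page, limit=3)]
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP or client library, so the same traversal logic can be exercised against a fake before it touches a real collection.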

**Scroll vs Search:** `search` uses the HNSW index (ANN) - fast, but returns only the top-N closest vectors. `scroll` does a full scan - slower, but guaranteed to return every point. Scroll cannot sort by score - only by ID. To sort by a payload field, use a filter + scroll with client-side sorting, or the `order_by` parameter (Qdrant 1.8+).
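A filtered scroll request might be built like this. The helper name `build_scroll_body` and the payload keys `userId` and `created_at` are assumptions for illustration; the `filter`, `offset`, and `order_by` fields follow the scroll request format. As far as the Qdrant documentation describes it, `order_by` needs a payload index on the field and disables offset-based pagination, so the sketch refuses to combine the two.

```python
from typing import Any, Dict, Optional


def build_scroll_body(
    match_key: str,
    match_value: Any,
    limit: int = 100,
    offset: Optional[Any] = None,
    order_by: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a scroll request body that returns only points matching one payload value."""
    if order_by is not None and offset is not None:
        # Assumption based on the Qdrant docs: offset pagination is unavailable with order_by.
        raise ValueError("offset cannot be combined with order_by")
    body: Dict[str, Any] = {
        "limit": limit,
        "with_payload": True,
        "filter": {"must": [{"key": match_key, "match": {"value": match_value}}]},
    }
    if offset is not None:
        body["offset"] = offset
    if order_by is not None:
        body["order_by"] = {"key": order_by, "direction": "asc"}
    return body


# Default ordering by ID, resumable through offset.
by_id = build_scroll_body("userId", "u-123", limit=500)
# Server-side ordering by a payload field (Qdrant 1.8+).
by_date = build_scroll_body("userId", "u-123", order_by="created_at")
```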

You want to find all documents belonging to userId='u-123' and recalculate their embeddings. The collection has 10 million points, ~5,000 of which belong to this user. What should you use?

Batch API: multiple operations in one request

The **Batch API** lets you combine multiple point-modification operations into a single HTTP request. Instead of 10 separate requests - one batch. Qdrant supports the following operation types in a batch:

- `upsert` - create or fully replace points
- `delete` - delete points by ID or filter
- `set_payload` - add/update payload fields (merge)
- `overwrite_payload` - replace the entire payload
- `delete_payload` - remove specific payload keys
- `clear_payload` - clear all payload
- `update_vectors` - update vectors without recreating the point
- `delete_vectors` - delete named vectors
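As a rough sketch, a batch request body combining three of these operations could look like the following. The helper name `build_batch_body`, the `status` payload field, and the sample points are made up for illustration; the body is meant for `POST /collections/{collection_name}/points/batch`, where each entry in `operations` is an object keyed by the operation type.

```python
import json
from typing import Any, Dict, List


def build_batch_body(
    new_points: List[Dict[str, Any]],
    archive_ids: List[int],
    stale_ids: List[int],
) -> Dict[str, Any]:
    """Combine upsert, set_payload and delete into one batch request body."""
    return {
        "operations": [
            # 1. Create or fully replace points.
            {"upsert": {"points": new_points}},
            # 2. Merge a payload field into existing points; vectors are untouched.
            {"set_payload": {"payload": {"status": "archived"}, "points": archive_ids}},
            # 3. Delete points by ID.
            {"delete": {"points": stale_ids}},
        ]
    }


batch_body = build_batch_body(
    new_points=[{"id": 101, "vector": [0.1, 0.2, 0.3], "payload": {"status": "new"}}],
    archive_ids=[7, 8],
    stale_ids=[1, 2, 3],
)
batch_json = json.dumps(batch_body)
```

The operations appear in the list in the order they should run, which matters because a later operation may depend on points written by an earlier one.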

**Batch != transaction.** Operations in a batch execute sequentially on the server but are not atomic. If the second operation in a batch fails, the first one is already applied. For atomic updates of multiple points use a single `upsert` with an array - it is atomic (either all points are written or none are).

You need to atomically: add 50 new points AND delete 30 old ones. If one operation fails, the other must also roll back. How do you do this?

Update Vectors: update a vector without recreating the point

The **Update Vectors API** lets you update only the vector(s) of a point without touching the payload. This is important for:

1. **Re-embedding** - switching models (text-embedding-ada-002 → text-embedding-3-small)
2. **Named vectors** - update only one vector out of several (e.g., only `title_vector`, leaving `body_vector` intact)
3. **Performance** - upsert fully replaces a point (requires passing all data); update_vectors updates only the vector
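To illustrate the re-embedding case, the sketch below turns one page of scrolled points into an update-vectors request body that carries only IDs and one named vector. The names `build_update_vectors_body` and `toy_embed`, the vector name `dense`, and the payload key `text` are assumptions; `toy_embed` is a placeholder for a real embedding model. The body is meant for `PUT /collections/{collection_name}/points/vectors`.

```python
from typing import Any, Callable, Dict, List


def build_update_vectors_body(
    points: List[Dict[str, Any]],
    embed: Callable[[str], List[float]],
    vector_name: str = "dense",
    text_key: str = "text",
) -> Dict[str, Any]:
    """Re-embed the text in each point's payload and send back only id + vector."""
    return {
        "points": [
            # Only the named vector is listed; other vectors and the payload stay as they are.
            {"id": p["id"], "vector": {vector_name: embed(p["payload"][text_key])}}
            for p in points
        ]
    }


def toy_embed(text: str) -> List[float]:
    """Placeholder for a real embedding model: returns a 2-dimensional vector."""
    return [float(len(text)), float(text.count(" "))]


scrolled_page = [
    {"id": 1, "payload": {"text": "hello world", "lang": "en"}},
    {"id": 2, "payload": {"text": "qdrant", "lang": "en"}},
]
update_body = build_update_vectors_body(scrolled_page, toy_embed)
```

The payload is read once (to obtain the text) but never sent back, which is where the traffic saving over a full upsert comes from.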

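**Re-embedding in production:** if the new model has a different vector size, the collection must be recreated, because the vector size is fixed when a collection is created. Even when the sizes match (text-embedding-ada-002 and text-embedding-3-small both produce 1536 dimensions), vectors from different models live in different embedding spaces and are not comparable, so old and new vectors should not be mixed during search. Zero-downtime strategy: create a new collection → re-embed all points → atomically switch the alias (Qdrant Collection Aliases API) → delete the old collection.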
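The alias switch step could be expressed as a single request body like the one below. The helper name `build_alias_switch`, the alias `docs`, and the collection name `docs_v2` are hypothetical; the body is meant for `POST /collections/aliases`, where the actions in one request are applied together, so readers of the alias never see a moment without a target.

```python
from typing import Any, Dict


def build_alias_switch(alias: str, new_collection: str) -> Dict[str, Any]:
    """Repoint an existing alias to a new collection in one request."""
    return {
        "actions": [
            # Drop the alias from the old collection...
            {"delete_alias": {"alias_name": alias}},
            # ...and attach it to the freshly re-embedded one.
            {"create_alias": {"collection_name": new_collection, "alias_name": alias}},
        ]
    }


switch_body = build_alias_switch(alias="docs", new_collection="docs_v2")
```

Application code that always queries the alias rather than a concrete collection name needs no change when the switch happens.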

You have a collection with named vectors 'sparse' and 'dense'. You need to update only the 'dense' vector for 1,000 points. Which method is most efficient?

Summary

**Scroll API** - paginated full traversal of the collection using an offset (point ID). Supports filters. Scroll vs Search: scroll = all points, search = top-N by similarity.
**Batch API** - multiple operations in one HTTP request: upsert, delete, set_payload, overwrite_payload, delete_payload, clear_payload, update_vectors, delete_vectors.
**Batch != transaction** - operations execute sequentially, not atomically. A single upsert of an array is atomic.
**Update Vectors API** - update only vector(s), payload is preserved. For named vectors: update only the desired vector out of several.
**Re-embedding flow:** scroll(with_payload) → generate new embeddings → updateVectors (ID + vector only). No need to pass payload.

What's next

You've learned bulk data operations. Next step - security: API keys, JWT, and TLS for production deployments.

Security & Auth — Protect Scroll and Batch APIs from unauthorized access using API keys and JWT
Snapshots — Scroll + snapshot - a combination for backup and data verification
Points and Vectors — The foundational concepts of points and vectors that Scroll and Batch operate on

Questions for reflection

How would you implement zero-downtime re-embedding: the old collection keeps serving requests while the new one is being populated. When do you switch?
scroll with limit: 1000 is slower than limit: 100 × 10 pages. Why? How do you choose the optimal page size?
batch set_payload updates 10,000 points. How do you verify that all updates were applied? How would you write a verification step?

Related lessons

db-09-indexes-btree
