Qdrant - Vector Database

Replication and Fault Tolerance

A node goes down at 3 AM. Are your users seeing errors or not? The answer is not determined by how fast the on-call engineer responds - it's determined by how replication_factor was configured when the collection was created. In distributed systems, fault tolerance is an architectural decision made upfront.

**Production readiness:** replication_factor: 2, write_consistency: 1 - the baseline. Survives simultaneous failure of 1 node without losing availability
**E-commerce with transactions:** write_consistency: quorum guarantees that a cart item won't disappear after a page refresh
**Multi-region HA:** replication_factor: 3 across 3 data centers - survives a full data center outage

Предварительные знания

Distributed Mode and Sharding

Replication Factor: Shard Copies Across Multiple Nodes

**Replication** in Qdrant means copying each shard to multiple nodes. `replication_factor: 2` means every shard exists on 2 nodes. If one node goes down, the data remains available on the other.

**Scenarios with different replication_factor values:**

replication_factor	Nodes in cluster	Survives failure of	Write overhead
1	1+	0 nodes (no fault tolerance)	Minimal
2	2+	1 of 2 nodes	×2 on writes
3	3+	1 of 3 nodes (or 2 with strong consistency)	×3 on writes
3 + write_consistency: 2	3+	1 node, writes confirmed by 2 replicas	×3 writes, quorum

Cluster: 3 nodes, shard_number: 3, replication_factor: 2. One node (node2) fails. How many shards now have only 1 replica?

Raft: Consensus for Cluster Metadata

**Raft** is the consensus algorithm Qdrant uses to synchronize cluster metadata: node list, collections, shard configuration. Raft guarantees all nodes share the same view of cluster state.

**Quorum and availability:** when quorum is lost (> N/2 nodes fail), Qdrant cannot change metadata (create collections, modify config). However **data reads and writes** continue working on the remaining nodes - depending on `write_consistency_factor`.

**Node count and Raft fault tolerance:** 2 nodes - quorum is 2, if 1 fails - no leader (no schema changes). 3 nodes - tolerates 1 node failure. 5 nodes - tolerates 2 node failures. An odd number of nodes is always better than even for Raft: 4 nodes tolerate only 1 failure (same as 3 nodes) - why pay for the 4th?

A 4-node cluster. 2 nodes fail simultaneously. What happens to Qdrant?

Consistency vs Availability: CAP in Qdrant

**The CAP theorem** applies to distributed Qdrant: you cannot simultaneously guarantee both Consistency and Availability during a network partition. Qdrant gives you control through `write_consistency_factor` and `read_consistency`.

Scenario	write_consistency_factor	read consistency	Trade-off
Similar image cache	1	weak	Maximum speed, minor inconsistencies possible
Document search engine	1	medium	Good balance for most use cases
Read-after-write guarantee	2	strong	Higher latency, no stale reads
Financial / medical data	quorum	quorum	Maximum consistency, reduced availability

**Data loss scenario:** replication_factor: 2, write_consistency_factor: 1. Write confirmed by replica-A. Replica-B has not yet received the data. Replica-A crashes. Data is lost - replica-B doesn't know about this write. If this is unacceptable - use write_consistency_factor: 2 (or 'quorum').

"replication_factor protects against data loss"

replication_factor protects against availability loss (node down - others respond). But with write_consistency_factor: 1 data loss is possible if a node fails BEFORE replication completes. For data durability, use write_consistency_factor: quorum.

Replication and durability are different guarantees. Replication: 'data is on multiple nodes'. Durability: 'data is reliably saved'. With write_consistency: 1, a write can be confirmed before propagating to all replicas - this is asynchronous replication.

A user adds a document and then immediately searches. With write_consistency: 1 and read consistency: 'weak' - the user might not find the just-added document. What is the least costly fix?

Key Takeaways

**replication_factor** - how many copies of each shard. 2+ = fault tolerance. Set at collection creation time
**Raft** manages cluster metadata via consensus. Odd node counts (3, 5) are better than even
**write_consistency_factor:** 1 = high availability, quorum = data durability. Choose based on requirements
**read consistency:** weak (fast) / strong (freshness guarantee). Can be set per request
**Data loss** is possible with write_consistency: 1 + node fails before replication. Use quorum for critical data

What's Next

Replication is configured - now you need visibility into what's happening inside the cluster. Metrics and alerts.

Monitoring and Prometheus — Track replica and shard health through metrics
Distributed Mode — Sharding is the foundation on top of which replication works
Performance Tuning — Replication affects write latency - account for this during optimization

Вопросы для размышления

Explain the difference between replication_factor and write_consistency_factor. Can write_consistency_factor exceed replication_factor? What happens if you try?
In CAP theorem terms, is Qdrant with write_consistency: quorum a CP or AP system? Justify your answer. How does it change with write_consistency: 1?
How do you verify replication is working correctly? Describe a test: what to check, how to simulate a node failure, how to verify data integrity after recovery.

Связанные уроки

dist-11-replication

Replication Factor: Shard Copies Across Multiple Nodes

**Replication** in Qdrant means copying each shard to multiple nodes. `replication_factor: 2` means every shard exists on 2 nodes. If one node goes down, the data remains available on the other.

**Scenarios with different replication_factor values:**

replication_factor

Nodes in cluster

Survives failure of

Write overhead

0 nodes (no fault tolerance)

Minimal

1 of 2 nodes

×2 on writes

1 of 3 nodes (or 2 with strong consistency)

×3 on writes

3 + write_consistency: 2

1 node, writes confirmed by 2 replicas

×3 writes, quorum

Cluster: 3 nodes, shard_number: 3, replication_factor: 2. One node (node2) fails. How many shards now have only 1 replica?

Raft: Consensus for Cluster Metadata

**Raft** is the consensus algorithm Qdrant uses to synchronize cluster metadata: node list, collections, shard configuration. Raft guarantees all nodes share the same view of cluster state.

A 4-node cluster. 2 nodes fail simultaneously. What happens to Qdrant?

Consistency vs Availability: CAP in Qdrant

Scenario

write_consistency_factor

read consistency

Trade-off

Similar image cache

weak

Maximum speed, minor inconsistencies possible

Document search engine

medium

Good balance for most use cases

Read-after-write guarantee

strong

Higher latency, no stale reads

Financial / medical data

quorum

Maximum consistency, reduced availability

"replication_factor protects against data loss"

A user adds a document and then immediately searches. With write_consistency: 1 and read consistency: 'weak' - the user might not find the just-added document. What is the least costly fix?

Key Takeaways

**replication_factor** - how many copies of each shard. 2+ = fault tolerance. Set at collection creation time

**Raft** manages cluster metadata via consensus. Odd node counts (3, 5) are better than even

**write_consistency_factor:** 1 = high availability, quorum = data durability. Choose based on requirements

**read consistency:** weak (fast) / strong (freshness guarantee). Can be set per request

**Data loss** is possible with write_consistency: 1 + node fails before replication. Use quorum for critical data