System Design
Message Queue
Lesson Objectives
- Understand why Message Queues are needed
- Distinguish between Queue (point-to-point) and Pub/Sub (fan-out)
- Know delivery guarantees: at-most-once, at-least-once, exactly-once
- Be able to design idempotent consumers
- Understand DLQ and retry strategies
- Know when to use RabbitMQ vs Kafka
Prerequisites
- Understanding of asynchronous programming
- Basic knowledge of distributed systems
Why Message Queues Are Needed
**Message Queue** - an intermediate layer for asynchronous communication. A producer sends a message → the queue stores it → the consumer processes it when ready.
A user registers and you need to send a welcome email. Sent synchronously, the request waits about 2 seconds on SMTP. With a queue, the API responds almost immediately and the email goes out in the background.
- Decoupling: services don't know about each other, they communicate through a queue
- Buffering: a queue smooths out traffic spikes (Black Friday)
- Async: heavy tasks (video encoding) run in the background
Email on registration
Synchronous vs asynchronous approach
WITHOUT A QUEUE:
- User registers → API sends email → 200 OK
- Latency: 100ms (API) + 2000ms (SMTP) = 2100ms
- Problem: SMTP is down = registration fails
WITH A QUEUE:
- User registers → API puts job in queue → 200 OK
- Worker picks up job → sends email → retry on error
- Latency: 100ms + 5ms (enqueue) = 105ms
- SMTP is down = job retries later, user doesn't wait
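The asynchronous flow above can be sketched with Python's standard library. This is a minimal in-process illustration, not a real broker: `register_user`, `send_email`, and `worker` are hypothetical names, the SMTP call is simulated with a short sleep, and an in-memory `queue.Queue` stands in for the message queue.

```python
import queue
import threading
import time

email_jobs = queue.Queue()  # in-memory stand-in for a real message queue
sent = []                   # addresses the worker has "emailed"

def send_email(address):
    """Hypothetical slow SMTP call, simulated with a sleep."""
    time.sleep(0.05)
    sent.append(address)

def register_user(address):
    """Request handler: enqueue the email job and respond right away."""
    email_jobs.put(address)  # cheap enqueue replaces the slow SMTP call
    return 200

def worker():
    """Background worker: process jobs until the None sentinel arrives."""
    while True:
        address = email_jobs.get()
        if address is None:
            break
        send_email(address)

def demo():
    """Run one registration and return (status, handler latency, sent list)."""
    thread = threading.Thread(target=worker)
    thread.start()
    start = time.perf_counter()
    status = register_user("alice@example.com")
    elapsed = time.perf_counter() - start
    email_jobs.put(None)  # tell the worker to stop after draining the queue
    thread.join()
    return status, elapsed, list(sent)
```

Calling `demo()` returns the 200 status before the email is sent, since the handler's latency covers only the enqueue. A real deployment would replace the in-memory queue with a durable broker so that jobs survive a process restart.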
When you need a queue
- **Heavy tasks**: email, PDF, video transcoding
- **Traffic spikes**: flash sale, queue buffers the load
- **Decoupling**: OrderService doesn't wait for InventoryService
- **Retry logic**: a failed job stays in the queue and can be retried
- **Event-driven**: one event → many consumers
Which problem does a Message Queue NOT solve?
Queue vs Pub/Sub
Two main patterns: **Queue** (point-to-point) and **Pub/Sub** (publish-subscribe). Different use cases.
Queue (Point-to-Point)
Pub/Sub (Fan-out)
| Criterion | Queue | Pub/Sub |
|---|---|---|
| Recipients | One (load balanced) | All subscribers |
| Coupling | Producer knows the queue | Publisher doesn't know subscribers |
| Use case | Job processing | Event broadcasting |
| Example | Send email job | Order created event |
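The difference in the "Recipients" row can be sketched in a few lines. This is a toy in-memory model under simplifying assumptions: the class names `PointToPointQueue` and `PubSubTopic` are hypothetical, messages are pushed straight to handler functions, and nothing is stored or acknowledged as it would be in a real broker.

```python
from collections import defaultdict
from itertools import cycle

class PointToPointQueue:
    """Each message is handed to exactly one consumer (round-robin here)."""
    def __init__(self, consumers):
        self._next_consumer = cycle(consumers)

    def send(self, message):
        consumer = next(self._next_consumer)
        consumer(message)

class PubSubTopic:
    """Each message is delivered to every subscriber."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)

def demo():
    """Return a dict mapping each receiver name to the messages it got."""
    received = defaultdict(list)

    def make_handler(name):
        return lambda message: received[name].append(message)

    # Queue: three email jobs are split between two workers.
    jobs = PointToPointQueue([make_handler("worker-1"), make_handler("worker-2")])
    for job in ["email-1", "email-2", "email-3"]:
        jobs.send(job)

    # Pub/Sub: one event reaches all three subscribers.
    topic = PubSubTopic()
    for name in ["warehouse", "email", "analytics"]:
        topic.subscribe(make_handler(name))
    topic.publish("order-created:42")
    return dict(received)
```

In the queue case each job appears in exactly one worker's list. In the pub/sub case the same event appears in every subscriber's list, and the publisher never references the subscribers directly.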
An order is created. You need to: notify the warehouse, send an email, record in analytics. What to use?
Delivery Guarantees
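Delivery guarantees are a trade-off between reliability and performance. Three levels:
- **At-most-once**: a message is delivered zero or one times. It may be lost, but it is never redelivered.
- **At-least-once**: a message is never lost, but it may be delivered more than once.
- **Exactly-once**: each message takes effect exactly once. This is the most expensive level to achieve.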
**At-least-once** is the industry standard. Because a message can be redelivered, the consumer MUST be idempotent: processing the same message twice must produce the same result as processing it once.
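**Idempotency Key**: Store the IDs of processed messages in Redis with a TTL, and check for the ID before processing. The TTL should outlast the longest period in which the broker could redeliver the message.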
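A minimal sketch of the idempotency-key check, under stated assumptions: `ProcessedStore` is a hypothetical in-memory stand-in for Redis, the message shape (`id`, `amount`) is invented for illustration, and the clock is injectable so expiry can be exercised without waiting.

```python
import time

class ProcessedStore:
    """In-memory stand-in for Redis: remembers message IDs until their TTL expires."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # message ID -> expiry timestamp

    def mark_if_new(self, message_id):
        """Record the ID and return True if it was not seen within the TTL."""
        now = self._clock()
        expiry = self._expires_at.get(message_id)
        if expiry is not None and expiry > now:
            return False  # seen recently: this is a duplicate delivery
        self._expires_at[message_id] = now + self._ttl
        return True

class IdempotentConsumer:
    """Applies each payment message at most once, keyed by message ID."""
    def __init__(self, store):
        self._store = store
        self.balance = 0

    def handle(self, message):
        """Process a new message; skip (but still accept) a duplicate."""
        if not self._store.mark_if_new(message["id"]):
            return "duplicate"
        self.balance += message["amount"]
        return "processed"
```

Delivering `{"id": "m1", "amount": 100}` twice changes the balance only once. Two caveats apply to this sketch. With several consumers the check-and-record step has to be atomic, which in Redis is commonly done with `SET key value NX EX seconds`. Also, recording the ID before processing means a crash between the two steps leaves the message marked but not applied, so some designs record the ID in the same transaction as the side effect instead.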
With at-least-once delivery, why is an idempotent consumer needed?
DLQ and Retry Logic
What to do if a message fails to process? **Retry** with exponential backoff. If it still fails after the maximum number of attempts, move it to a **Dead Letter Queue (DLQ)**.
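The retry-then-DLQ policy can be sketched as follows. The function names, the default of 4 attempts, the 0.5-second base delay, and the 30-second cap are all illustrative assumptions. The DLQ is a plain list, and the `sleep` function is injectable so the delays can be inspected without waiting.

```python
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap, base * 2 ** (attempt - 1))

def process_with_retry(message, handler, dead_letters, max_attempts=4, sleep=time.sleep):
    """Run handler up to max_attempts times; on the final failure, dead-letter the message.

    Returns True if the handler succeeded, False if the message went to the DLQ.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception as error:
            if attempt == max_attempts:
                # Out of attempts: keep the message and the failure reason for inspection.
                dead_letters.append(
                    {"message": message, "error": repr(error), "attempts": attempt}
                )
                return False
            sleep(backoff_delay(attempt))  # wait longer after each failure
    return False
```

With the defaults, a handler that always fails is retried after 0.5, 1, and 2 seconds and then dead-lettered on the fourth failure. Production systems often add random jitter to each delay so that many failing consumers do not retry in lockstep, and many brokers implement the delay through redelivery rather than an in-process sleep.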
Visibility Timeout
Dead Letter Queue
**DLQ monitoring** is mandatory! Alert when size > 0 or growth rate increases. A growing DLQ = something is broken.
A message ended up in the DLQ. What to do?
Backpressure
**Backpressure** - a mechanism to protect against overload. If a consumer can't keep up, you need to slow down the producer or scale consumers.
Backpressure strategies
| Strategy | How it works | When |
|---|---|---|
| Bounded Queue | Size limit → reject on overflow | Critical systems |
| Rate Limiting | Producer is limited to N msg/sec | Controlled input |
| Auto-scaling | Queue grows → add workers | AWS Lambda + SQS |
| Sampling | Drop a fraction of messages | Metrics, logs |
**Best Practice**: Monitor Queue Depth and Age of Oldest Message. If they grow = consumers can't keep up. Set up auto-scaling or alerts.
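As one concrete instance, the Bounded Queue strategy from the table can be sketched with the standard library's `queue.Queue`. The `BoundedBuffer` wrapper and its method names are hypothetical, and the `depth` method stands in for the Queue Depth metric mentioned above.

```python
import queue

class BoundedBuffer:
    """Fixed-capacity buffer that rejects new messages instead of growing without limit."""
    def __init__(self, max_size):
        self._queue = queue.Queue(maxsize=max_size)
        self.rejected = 0  # count of messages refused because the buffer was full

    def offer(self, message):
        """Enqueue without blocking; return False and count the rejection when full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def poll(self):
        """Return the next message, or None when the buffer is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Current number of buffered messages (the value to monitor and alert on)."""
        return self._queue.qsize()
```

A producer that gets `False` from `offer` knows immediately that it must slow down, retry later, or drop the message. That explicit signal is the point of the strategy: the overload becomes visible at the producer instead of accumulating silently in the queue.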
The queue is growing faster than consumers can process. What is the first action?
RabbitMQ vs Kafka
Two major players: **RabbitMQ** (traditional message broker) and **Kafka** (distributed event log). Different models, different use cases.
RabbitMQ
Kafka
| Criterion | RabbitMQ | Kafka |
|---|---|---|
| Model | Message Broker | Event Log |
| Delivery | Push | Pull |
| After consume | Message deleted once acknowledged | Message retained until the retention policy expires it |
| Throughput | 10K-100K msg/s | Millions msg/s |
| Replay | Hard | Easy |
| Use case | Task queue, RPC | Event streaming, Analytics |
**Choosing**: RabbitMQ for job queues, RPC, routing. Kafka for event streaming, high-volume, when replay is needed.
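The "After consume" and "Replay" rows come down to where the read position lives. The following toy model illustrates that idea only; it does not use or imitate the RabbitMQ or Kafka client APIs, and the class and method names (`TaskQueue`, `EventLog`, `poll`, `seek`) are chosen for illustration.

```python
from collections import deque

class TaskQueue:
    """Broker-style queue: a consumed message is removed and cannot be read again."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Remove and return the oldest message, or None when empty."""
        return self._messages.popleft() if self._messages else None

class EventLog:
    """Log-style storage: records stay; each consumer group tracks its own offset."""
    def __init__(self):
        self._records = []
        self._offsets = {}  # group name -> index of the next record to read

    def append(self, record):
        self._records.append(record)

    def poll(self, group):
        """Return the group's next record and advance its offset, or None at the end."""
        offset = self._offsets.get(group, 0)
        if offset >= len(self._records):
            return None
        self._offsets[group] = offset + 1
        return self._records[offset]

    def seek(self, group, offset):
        """Move a group's read position; seeking to 0 replays the log from the start."""
        self._offsets[group] = offset
```

In the `TaskQueue`, a message is gone after `consume`. In the `EventLog`, two groups read the same records independently, and `seek("analytics", 0)` replays everything for that group without affecting the others. This is why replay is listed as easy for a log and hard for a broker queue.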
You need to process 1M events/sec with the ability to replay. What to choose?
Key Takeaways
- **Queue** - point-to-point, each message goes to one consumer. **Pub/Sub** - fan-out, each message goes to all subscribers
- **At-least-once + idempotent consumers** - industry standard
- **DLQ** - for failed messages, monitor its size
- **Backpressure** - slow down producers or auto-scale consumers as the queue grows
- **RabbitMQ** for task queues. **Kafka** for event streaming
Related Topics
Queues are the foundation of async architecture
- Microservices — Async communication between services
- Event Sourcing — Kafka as an event log
- CQRS — Commands via queue, events via pub/sub