Backend Transport

Dead Letter Queue, Retry, and Circuit Breaker

Black Friday, 11:59 PM. The Payment service starts slowing down under DB load. Hundreds of microservices retry their failed requests, which overloads it even further. Two minutes later the entire cluster stalls: thread pools are exhausted waiting for payments. Netflix calls this a 'distributed system death spiral'. Circuit Breaker, DLQ, and Bulkhead are the barriers against this scenario.

  • **AWS SQS** has a built-in Dead Letter Queue with redrive policy. Every SQS consumer in Amazon production is configured with a DLQ and an alert on any message appearing there.
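  • **Netflix Hystrix** (now in maintenance mode, with Resilience4j as the recommended successor) protects 500+ Netflix services. With default settings, the circuit breaker opens at a 50%+ error rate over a 10-second window, stays OPEN for 5 seconds, then moves to HALF-OPEN and lets a trial request through; it closes only if that request succeeds.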
  • **Google SRE** documents exponential backoff as a mandatory requirement for all RPC calls. Without jitter and backoff, a retry storm in 2012 caused several hours of downtime for Google App Engine.

Retry Strategies

Most network errors in distributed systems are transient: a timeout due to a GC pause, an overloaded DB, a network blip. Retry allows the system to survive brief failures automatically. The key question is when and how often to retry without overloading the failing service.
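The "when to retry" half of that question can be sketched as a small classifier. This is a minimal illustration, not a standard: the name `is_retryable` and the exact set of status codes are assumptions made for this example, and a real client would tune them per API.

```python
from typing import Optional

# Assumption for this sketch: these status codes are treated as transient.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status: Optional[int], timed_out: bool = False) -> bool:
    """Return True when a failed call is probably worth retrying."""
    if timed_out or status is None:
        # No response at all (timeout or connection error): likely transient.
        return True
    return status in RETRYABLE_STATUS
```

Even when an error is classified as retryable, retrying is only safe if the operation is idempotent or carries an idempotency key; otherwise a retry may apply the same change twice.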

AWS SDK, gRPC-js, and axios-retry all implement retry out of the box. In Kafka Consumer, retry is typically implemented via dedicated topics: `orders.events.retry-1`, `orders.events.retry-2`, `orders.events.dlq`.
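A rough sketch of how a consumer might pick the next topic for a failed message is shown here. The retry and DLQ topic names come from the text; the main topic name `orders.events`, the chain order, and the helper `next_topic` are assumptions made for illustration, and publishing to Kafka itself is left out.

```python
from typing import List

# Retry and DLQ names follow the text; "orders.events" as the main topic is assumed.
TOPIC_CHAIN: List[str] = [
    "orders.events",
    "orders.events.retry-1",
    "orders.events.retry-2",
    "orders.events.dlq",
]


def next_topic(current: str) -> str:
    """Return the topic a message should be republished to after a failure."""
    idx = TOPIC_CHAIN.index(current)
    # The DLQ is terminal: there is nowhere further to route a message.
    return TOPIC_CHAIN[min(idx + 1, len(TOPIC_CHAIN) - 1)]
```

Each retry topic would typically be read by a consumer that waits longer before processing than the previous one, so the delay grows as a message moves down the chain.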

HTTP 400 Bad Request from an external API call - should it be retried?

Exponential Backoff with Jitter

Naive retry with a fixed interval creates a thundering herd: thousands of clients retry simultaneously after a server crash. Exponential backoff increases the delay exponentially; jitter (a random offset) prevents clients from synchronizing.

The AWS 2015 article 'Exponential Backoff And Jitter' compared exponential backoff without jitter against the full, equal, and decorrelated jitter variants, and found that the jittered variants complete the same work with substantially fewer calls to the recovering server. Jitter is critical with 1000+ clients.
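Two of the jitter variants can be sketched in a few lines. The function names and the default values for `base` and `cap` (in seconds) are assumptions chosen for this example; the formulas follow the commonly cited "full jitter" and "decorrelated jitter" definitions.

```python
import random


def full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    # The exponential ceiling doubles on each attempt and is clamped at `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the ceiling.
    return random.uniform(0.0, ceiling)


def decorrelated_jitter(prev_delay: float, base: float = 0.1, cap: float = 10.0) -> float:
    """Next delay derived from the previous delay rather than the attempt count."""
    return min(cap, random.uniform(base, prev_delay * 3))
```

For decorrelated jitter, the first call would pass `base` as `prev_delay`, and each later call passes the delay returned by the previous one.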

Why add jitter (random offset) to exponential backoff?

Dead Letter Queue

A Dead Letter Queue (DLQ) is a special queue for messages that could not be processed after all retries. Instead of losing the message or blocking the queue, it is moved to the DLQ for manual analysis or later reprocessing.

The DLQ must be visible: alert on any message appearing in the DLQ. This is not a normal situation - it is a signal of a bug in the consumer or bad data. Amazon configures DLQ for every SQS consumer in production.
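The mechanics can be illustrated with an in-memory sketch. This is not how SQS or any specific broker implements it; the function `consume`, the `attempts` and `error` fields, and the default of three attempts are all assumptions for this example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def consume(
    queue: Deque[Dict],
    handler: Callable[[Dict], None],
    dlq: List[Dict],
    max_attempts: int = 3,
) -> None:
    """Drain `queue`, moving messages that keep failing into `dlq`."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception as exc:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= max_attempts:
                # Keep the failure reason with the message for later analysis.
                msg["error"] = repr(exc)
                dlq.append(msg)
            else:
                # Requeue at the back so one bad message does not block the rest.
                queue.append(msg)
```

In a setup like this, the point where `dlq.append` is called is the natural place to emit the alert, since any message arriving there signals a consumer bug or bad data.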

What should happen when messages appear in the DLQ?

Circuit Breaker

Circuit Breaker is a pattern for protection against cascading failures. When a dependency (DB, external API) degrades, constant retries exhaust the calling service's resources. The Circuit Breaker tracks error rate and, when the threshold is exceeded, 'opens the circuit' - returning an error without actually calling the dependency.

Netflix began building Hystrix (a Circuit Breaker library) in 2011 and open-sourced it in 2012. Hystrix is now in maintenance mode, and its README recommends Resilience4j for new projects. Envoy Proxy (used by Istio) implements circuit breaking at the service-mesh level, with no application code changes required.
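The CLOSED, OPEN, and HALF-OPEN states can be sketched as a small class. This is a simplified illustration, not the Hystrix or Resilience4j implementation: it counts consecutive failures instead of tracking an error rate over a window, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are assumptions for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "OPEN":
            raise CircuitOpenError("dependency call skipped")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial in HALF_OPEN, or too many failures, (re)opens the circuit.
            if state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the counter.
        self.failures = 0
        self.opened_at = None
        return result
```

The `clock` parameter exists only so the timeout can be controlled in tests; production code would leave it at the default.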

Why does the Circuit Breaker transition to HALF-OPEN after a timeout?

Bulkhead Pattern

Bulkhead (ship compartment) - isolation of resources for different request types. If one type of request consumes all thread pool slots or connections, others cannot work either. Bulkhead allocates a separate resource pool for each critical function.

Netflix applies bulkhead at the Hystrix thread-pool level: each downstream service has its own thread pool. If the Recommendations API hangs, the Payment API continues working with a separate thread pool.
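A minimal way to illustrate the idea is a semaphore per dependency. Note that this sketch limits concurrency with a semaphore rather than a dedicated thread pool as described above; the class names `Bulkhead` and `BulkheadFullError` and the fail-fast behavior are assumptions for this example.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Fail fast instead of queueing, so callers do not pile up
        # behind a hung dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots in this compartment")
        try:
            return fn()
        finally:
            self._slots.release()
```

With one `Bulkhead` instance per downstream service, a hung Recommendations call can exhaust only its own slots, while calls guarded by the Payment instance still find capacity.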

Misconception: Circuit Breaker and retry are competing patterns, and you must choose one.

In practice they complement each other: retry handles brief failures, while the circuit breaker protects during prolonged degradation.

The correct combination: retry with backoff for the first 3 attempts, circuit breaker opens if the service degrades for minutes. DLQ stores messages when the circuit is open. Bulkhead prevents propagation to other components.

What problem does the Bulkhead Pattern solve?

Key Ideas

  • **Retry only for transient errors** (5xx, timeout). Permanent errors (most 4xx) should not be retried, because the same result will come back; 408 Request Timeout and 429 Too Many Requests are the usual exceptions.
  • **Exponential backoff with jitter** prevents thundering herd: clients spread retries over time instead of a synchronized storm.
  • **DLQ + Circuit Breaker + Bulkhead** - three protection layers: DLQ preserves unprocessed messages, CB stops the storm, Bulkhead isolates degradation.

Related Topics

Retry, DLQ, and Circuit Breaker form a reliability pattern on top of any transport:

  • RabbitMQ and Queues — DLQ is a native RabbitMQ mechanism via dead-letter-exchange; retry topics are a pattern from the Kafka ecosystem
  • Backpressure and Flow Control — Backpressure prevents overload proactively; Circuit Breaker reacts to degradation that has already occurred

Questions for Reflection

  • How do you distinguish a transient error (retryable) from a permanent one in the context of an external API that does not return HTTP status codes?
  • At what error rate should the Circuit Breaker open? How does this threshold depend on the criticality of the service?
  • How do you implement replay from DLQ back to the main queue without breaking event ordering?

Related Lessons

  • db-03-acid
  • alg-22-backtracking