Backpressure
During a 2016 Amazon sale, the order-processing consumer started lagging. With no backpressure the producer kept dumping events. The in-memory queue grew to several gigabytes, the JVM hit OOM, the service died. With backpressure the producer would have slowed down, the system would have degraded gracefully instead of crashing.
- **Kafka consumer lag**: Uber monitors lag on the payment consumer. When lag > 30 seconds, Kubernetes HPA spins up an extra consumer instance automatically.
- **Netflix Hystrix / Resilience4j**: when the profile service is unreachable, the circuit breaker flips OPEN and the page serves a cached profile instead of a 503 for 200M+ users.
- **TCP receive window**: a browser downloads a video over HTTP/1.1. The receive window shrinks when the video decoder can't read the buffer fast enough. TCP auto-throttles the server without a single line of application code.
- **RxJava onBackpressureDrop** on Android: the accelerometer fires 100 events/sec, the UI updates at 60fps. Extra events drop, battery and CPU saved without losing smoothness.
Backpressure Concept
Backpressure is the mechanism where a **slow consumer signals the producer to slow down**. Without that signal the producer accumulates events in memory or loses them when the buffer overflows.
The classical illustration is the TCP receive window. The receiver announces in the TCP header how many bytes it can take right now (rwnd). The sender is not allowed to push more without acknowledgement. If the receiving application reads slowly, rwnd shrinks to zero and the sender stops entirely. Built-in backpressure at the transport layer.
Reactive Streams (a standard for Java/Kotlin: RxJava 2+, Project Reactor, Akka Streams) formalizes backpressure through `request(n)`: the subscriber explicitly signals how many items it can process. The publisher is forbidden from sending more than that.
TCP receive window (rwnd) equals 0. What happens to the sender?
Buffering
Buffering is the first and most intuitive answer to backpressure: queue events while the consumer catches up. Kafka consumer lag is the live metric for this buffer: the gap between the last written offset and the last read offset shows how far behind the consumer is.
A buffer has three critical parameters: **size** (how many events fit), **overflow policy** (what to do when full), and **time-to-live** (when an event becomes useless). Kafka stores events on disk, so the buffer can be huge, but it isn't free: bigger lag means processing delay and risk of out-of-order processing.
- **In-memory queue**: minimal latency, lost on restart (RxJava `onBackpressureBuffer`)
- **Disk-backed queue**: Kafka, RabbitMQ. Survives restart but requires serialization
- **Bounded queue**: capped size. On overflow the dropping policy kicks in
- **Unbounded queue**: grows to OOM. Dangerous in prod without lag/heap monitoring
Kafka consumer lag is monitored through `records-lag-max` (JMX) or the Prometheus kafka-exporter. Alert on lag > N seconds (not absolute, but lag / consumption rate) to get warning before the queue overflows.
Kafka consumer lag grew from 1,000 to 500,000 over 10 minutes. Producer rate is 10,000 msg/s. What does this mean?
Dropping
When the buffer is full and slowing the producer is not an option, the system has to choose what to drop. A dropping policy is an explicit choice of which data is least valuable. Implicit dropping (say, the OOM killer terminating the process) is worse than any explicit policy.
- **DROP_OLDEST**: drops the oldest events. Fits telemetry, where fresh data matters more
- **DROP_LATEST**: drops new events. Protects already accumulated data from loss
- **DROP_ALL**: clears the entire buffer at once. Used for state-based data (latest GPS position)
- **SAMPLE**: forwards every Nth element. Fits high-frequency metrics where intermediate values are redundant
UDP is the archetypal DROP_LATEST without buffer: if the app didn't read the datagram from the socket, the kernel simply overwrites it with the next one. That is why UDP is used for video conferencing (a stale frame is useless) and DNS (cheaper to retry than to buffer).
A system receives GPS coordinate updates 50 times per second, but the UI renders at 30 fps. Which dropping policy is optimal?
Flow Control and Circuit Breakers
Flow control is active management of producer rate based on signals from the consumer. A circuit breaker is the extreme case: when downstream degrades, upstream fully stops sending requests to give the system time to recover.
Netflix Hystrix (now Resilience4j) implements a circuit breaker with three states. CLOSED: normal operation, requests flow through. OPEN: tripped, all calls return a fallback immediately without touching downstream. HALF-OPEN: periodic probe requests check whether the service has recovered.
Reactive Streams `request(n)` is pull-based flow control: the consumer explicitly pulls data as it is ready. The opposite of push, where the producer sends without regard for the consumer. Project Reactor (Spring WebFlux) builds its entire reactive chain on this mechanism.
A Kafka consumer controls flow through `max.poll.records` and the `pause()/resume()` TopicPartition API. When downstream is overwhelmed, the consumer calls `consumer.pause(partitions)`: polling continues (heartbeat stays alive, no rebalance triggered), but no data returns until `resume()`.
Backpressure is just adding a bigger buffer
A buffer only delays the problem. Real backpressure is a signal to the producer to slow down, or a mechanism that protects the system from cascading failure (circuit breaker, dropping policy, pull-model).
An unbounded buffer with a constantly overloaded consumer will hit OOM. The right move is either to bound the producer to consumer speed (flow control) or to explicitly choose what to drop (dropping policy).
A circuit breaker just entered OPEN. What happens to incoming requests?
Key ideas
- Backpressure is a signal from consumer to producer: slow down. TCP rwnd, Reactive Streams `request(n)`, Node.js `readable.pause()`: the same idea at different layers.
- A buffer postpones the problem; it does not solve it. Kafka consumer lag measures the consumer's 'debt' to the producer. An unbounded buffer leads to OOM.
- Dropping policy is an explicit choice of what to lose: DROP_OLDEST for fresh data (telemetry), DROP_LATEST for accumulated data (transactions), onBackpressureLatest for state (GPS, UI).
- A circuit breaker protects downstream from overload. The OPEN state returns a fallback instantly, giving the service time to recover without a thundering herd on reopen.
Related topics
Backpressure runs through every layer of real-time architecture, from protocols to resilience patterns:
- Rate Limiting — Rate limiting caps the producer at the system's edge. Backpressure is a signal from inside the system, from consumer to producer
- Message Queues — Queues (Kafka, RabbitMQ) implement buffering for backpressure. Consumer lag is a direct measurement of backpressure in the system
- Circuit Breaker — A circuit breaker is the extreme case of backpressure: a full halt of the flow on downstream degradation
Вопросы для размышления
- A service processes payments and consumes events from Kafka. Consumer lag has started growing. Which three steps, in priority order, should you take before scaling out consumer instances?
- What is the core difference between `onBackpressureDrop` and `onBackpressureLatest` in RxJava? In which scenarios is each one dangerous?
- A circuit breaker is in HALF-OPEN. The first of 5 probe requests failed. How should the breaker behave, and why?