Stream Processing

Designing Streaming Systems at Scale

Netflix processes 2.5 trillion events per day. Uber handles 1 million rides per day with real-time pricing. Twitter serves 500 million tweets per day. Each of these systems was designed around backpressure, ordering, and fault tolerance - the three pillars of production streaming. Without them the system works in a demo but crashes under load.

**LinkedIn Samza**: streaming system on Kafka. 7 trillion messages per day. Backpressure via Kafka pull model. Fault tolerance via Kafka offset checkpoints
**Uber Flink**: real-time surge pricing. Backpressure at peak load (NYE, concerts). Fault tolerance: checkpoint every 30 sec, RTO < 1 min
**Cloudflare**: 100B+ DNS queries/day through streaming pipeline. Hot keys: popular domains. Solution: two levels of aggregation with salting

Backpressure: When the Consumer Is Slower Than the Producer

2016. Twitter moves Heron to production. The main problem with Storm: no backpressure mechanism - fast producer, slow consumer = OOM and cascading failures. Backpressure is a mechanism where a slow downstream signals the upstream to slow production. Without it, buffers overflow and the system crashes.

Strategies: (1) Pull-based: the consumer requests data when ready (Kafka: consumer poll). (2) Credit-based: upstream gets a 'credit' (N messages) and stops when exhausted (Flink). (3) Drop + retry: drop messages under load, consumer is idempotent (real-time metrics). (4) Blocking: producer blocks - simplest, but deadlock risk.

ML parallel: backpressure in a streaming pipeline is like gradient accumulation in training. When the GPU cannot keep up with a batch - increase the accumulation step. Same mechanisms: downstream signals upstream (memory pressure -> reduce batch_size). Reactive Streams (Java 9+) standardizes backpressure through Publisher.request(N).

Kafka consumers use a pull-based model. What is the advantage over push-based?

Ordering: Global vs Per-Partition

Global event ordering in a distributed system is impossible without coordination (which creates a bottleneck). The practical solution: guarantee ordering only within a partition/key. Kafka: all messages with the same key go to one partition, which maintains order. User transactions are ordered by user_id, not globally.

Out-of-order events: in reality events arrive out of order (network, retries, multiple sources). Flink watermarks solve this: watermark(t) means 'all events before t have been received'. A window waits for the watermark before computing. Trade-off: larger allowed_lateness -> more accurate result, but higher output latency.

Why is global ordering in Kafka unachievable without losing performance?

Partitioning: Scaling Through Sharding

Partitioning is horizontal scaling for streaming. A Kafka topic with 12 partitions = 12 parallel consumers. Strategies: (1) Hash partition by key - even distribution, ordering per key. (2) Round-robin - maximum parallelism, no ordering. (3) Range partition - key ranges, locality for window operations. (4) Custom - by business logic (geo, tenant).

Hot keys are the primary problem in streaming. Twitter: one trending hashtag -> 90% of events in one partition. Solutions: salting (append suffix 0-9), two-stage aggregation, adaptive partitioning (Flink automatic), custom partitioner. ML parallel: class imbalance in streaming training - same patterns of oversampling/undersampling.

What is a 'hot key' in a streaming system and how is it solved?

Fault Tolerance: Checkpoints and Recovery

A streaming system runs 24/7. A node fails - what happens? Flink uses distributed snapshots (Chandy-Lamport algorithm). Every N seconds: a synchronous checkpoint of all state (operator state + Kafka offsets) into durable storage (S3, HDFS). On failure: restart from the last checkpoint, replay events from Kafka from the checkpoint offset.

Exactly-once vs at-least-once: exactly-once in Flink requires 2-phase commit with Kafka (KafkaSink with transactional producer). Cost: latency +checkpoint interval, throughput -20-30%. For most analytics, at-least-once is sufficient (idempotent aggregation). Exactly-once is critical for financial transactions.

Exactly-once guarantees that an event is physically processed exactly one time

Exactly-once means the processing result is visible exactly once, even if the event is physically processed multiple times during recovery

During checkpoint replay events are reprocessed, but state already contains their results - reprocessing is idempotent. This is a semantic guarantee, not a physical constraint

After recovery from a Flink checkpoint, events are replayed from Kafka. Why does this not cause duplicates?

Key Ideas

**Backpressure**: pull-based (Kafka) or credit-based (Flink). A slow consumer must not kill the system - lag grows, not OOM.
**Ordering**: global ordering = bottleneck. Per-partition ordering + watermarks for out-of-order events.
**Hot keys**: salting (key+suffix) + two-stage aggregation. Trending hashtag in one partition = disaster.
**Fault tolerance**: Chandy-Lamport checkpoint = operator state + Kafka offset in S3. Restart -> replay from checkpoint offset.
**Exactly-once**: semantic guarantee via 2PC + idempotency. Physically an event may be processed multiple times.

Вопросы для размышления

Flink checkpoints every 5 seconds with RocksDB state. With 100GB state per checkpoint to S3 - what retention is needed and how does checkpointing affect throughput?
A hot key is solved with 10-bucket salting. But events for one user_id are now in 10 partitions - how to aggregate with ordering guarantees?
Recovery from checkpoint takes 3 minutes for Kafka replay. Downstream systems are waiting. How to minimize RTO without reducing checkpoint interval?

Связанные уроки

stream-14 — Stream-Table Duality is the foundation for understanding system design
stream-09 — Exactly-Once is the key property of fault tolerant streaming
stream-08 — Windowing and time affect ordering and fault tolerance
stream-16 — Design patterns are the basis for streaming system design interviews
ds-08-vector-clocks — Kafka/Kinesis as distributed queue: same principles of backpressure and ordering
dist-14-sharding