Big Data

Apache Flink: Stream Processing

Alibaba, Singles' Day 2015

November 2015, Singles' Day. Alibaba processes logs of one billion events per second - clicks, carts, payments. Spark Streaming with micro-batching: 500ms latency. Sufficient for logging, but not for real-time fraud detection - fraudulent transactions need to be blocked immediately. Alibaba switched to Flink: latency dropped to 10ms. The secret is true streaming, where each event is processed instantly, not in batches. Today Alibaba is the largest contributor to Apache Flink, processing exabytes of data through it annually. The pressure of Singles' Day turned an academic project from Berlin Technical University (TU Berlin, 2011) into a production-grade system.

Micro-batch is a lie. Spark Streaming is not streaming - it is batch processing with small batches. Every 100ms a batch forms, an RDD is created, a Spark job launches. Latency is determined by batch size, not processing speed. Flink was the first to be honest: real streaming means an operator receives an event and processes it immediately. No waiting for the end of a batch.

**Alibaba**: Flink processes real-time logs of 1B events/sec during Singles' Day - fraud detection with 10ms latency
**Netflix**: Flink pipeline for real-time personalization - recommendation updates at the moment of viewing, not 5 minutes later
**Lyft**: stateful ride sessionization on Flink - complex driver behavior patterns without an external database

Event Time vs Processing Time

Singles' Day 2015, Alibaba. One billion events per second. Spark Streaming with micro-batching delivers 500ms latency - every half second a mini-batch forms, gets processed, results written. Flink with true streaming delivers 10ms. The difference is not in hardware, not in network speed. The difference is architectural: Spark waits, Flink processes each event the moment it arrives.

But any streaming system faces a fundamental problem: **event time** and **processing time** are different things. A user taps a button in a mobile app at 14:00:00.000. The event arrives at the server at 14:00:00.350 (350ms network delay). Flink processes it at 14:00:00.360. Three different timestamps for one fact.

**Processing time** - the system clock of the Flink operator when it processes the record. Fast, no waiting, but non-deterministic: run the same stream twice and windows differ. **Event time** - the timestamp embedded in the event itself (JSON field, Kafka record timestamp). Results are deterministic, but require a mechanism for handling late arrivals.

The choice between event time and processing time is a trade-off between correctness and latency. Real-time dashboards with sub-millisecond latency choose processing time - there is no time to wait for late events. Billing, transaction analytics, SLA metric calculation - event time is mandatory: events from different time zones must land in the correct temporal windows.

A team is building a billing system: sum user transactions per hour for invoice generation. Which time mode is required?

Watermarks: Time Progress in the Stream

Event time creates a problem: Flink receives events out of chronological order. An event timestamped 14:00:00 may arrive after one timestamped 14:00:05. How long to wait before closing a time window? Wait forever - results never arrive. Do not wait - lose late events. A watermark is the answer: "after receiving this marker, all events with timestamp <= T have already arrived".

Watermark W(t) is a special marker flowing through the data stream alongside events. When an operator receives watermark W(t), it knows: windows with end_time <= t can be closed. Late events (event time < current watermark) are handled by a separate strategy.

**Late event strategies**: `allowedLateness(Duration)` - window stays alive longer, updated on each late arrival. `sideOutputLateData(OutputTag)` - late events routed to a side stream for separate handling. Combining watermark + allowedLateness allows balancing latency and completeness.

Parallel streams add complexity: Flink takes the minimum watermark across all incoming streams. One slow partition or idle source blocks watermark progress for the entire operator. This is why `withIdleness()` is critical in production - without it, one empty Kafka partition freezes the entire pipeline.

A Flink job reads from a Kafka topic with 8 partitions. Partition 3 stops receiving events for 5 minutes. What happens to the watermark without `withIdleness()`?

Stateful Operators: State as a First-Class Citizen

Most streaming systems pretend each event is processed in isolation. Flink is honest: state is a first-class construct. An operator can store arbitrary state that survives restarts and is persisted in checkpoints. This enables a class of tasks impossible for stateless systems: sessionization, exact-once aggregations, ML inference with accumulating context.

Keyed state is the most common pattern: the stream is partitioned by key (`keyBy(UserEvent::getUserId)`), each key gets isolated state. Data races are impossible by design - Flink guarantees all events for one key are processed strictly sequentially in a single task. This makes fraud detection, sessionization, and anomaly detection achievable in dozens of lines of code, no external database required.

**State types**: `ValueState<T>` - single value; `ListState<T>` - list; `MapState<K,V>` - dictionary; `ReducingState<T>` - incremental aggregation; `AggregatingState<IN,OUT>` - custom aggregation. **State backends**: MemoryStateBackend (dev only), FsStateBackend (HDFS/S3), RocksDBStateBackend (production, terabytes of state).

RocksDB state backend: state lives on local disk, not JVM heap - GC pauses are independent of state size. The only reasonable choice for state > a few GB. Downside: read/write operations are slower than heap (~10 us vs ~1 us). For hot-path operations with small objects, HashMapStateBackend (in-heap) can be faster.

A Flink job counts purchases per user for fraud detection. State backend: MemoryStateBackend. The job processes 100K unique users and state grows. What problem does this create?

Checkpointing: Chandy-Lamport in Production

1985. California Institute of Technology. Mani Chandy and Leslie Lamport publish the distributed snapshot algorithm. The problem: capture a consistent global state of a distributed system without stopping computation. The solution is elegant: special markers (barriers) flow through data channels, operators snapshot their state upon receiving a barrier. Thirty years later, this algorithm became the foundation of Flink checkpointing.

Flink Checkpoint works as follows: the JobManager periodically injects a barrier into every source. A barrier is a special marker saying "checkpoint N starts here". The source records its offset (position in Kafka), forwards the barrier. When an operator receives the barrier from all incoming channels (barrier alignment), it saves its state to distributed storage (HDFS, S3) and sends the barrier downstream. The checkpoint completes when all sinks confirm receipt.

Exactly-once semantics require barrier alignment: an operator buffers data from fast input channels while waiting for barriers from slow ones. This adds latency. Alternative: at-least-once mode without alignment - barriers do not wait for each other, events may enter state before the snapshot is taken. On recovery, some events are reprocessed.

**Unaligned checkpoints** (Flink 1.11+) - a hybrid: barriers do not wait for alignment, but data in operator buffers is included in the checkpoint. Latency like at-least-once, semantics of exactly-once. Solves the backpressure-induced checkpoint timeout: under backpressure, barriers get stuck in queues for a long time, and aligned checkpoints miss their deadline.

A savepoint is a manually triggered checkpoint with compatibility guarantees across job versions. A checkpoint is automatic and may be incompatible after topology changes. Pipeline upgrade procedure: stop the job with a savepoint (`flink cancel --withSavepoint`), deploy the new version, restart from that savepoint. Zero data loss, minimal downtime.

Exactly-once in Flink means each event is processed exactly one time

Exactly-once means each event has exactly one effect on state and output - even if it is physically reprocessed during recovery

During recovery from a checkpoint, events after the last checkpoint are replayed from Kafka. They are processed a second time, but the state already holds the snapshot taken before those events - the net effect is as if they were processed once. This is end-to-end exactly-once, not event-level exactly-once.

A Flink job with EXACTLY_ONCE semantics experiences backpressure: one operator cannot keep up. How does this affect checkpointing?

Key Ideas

**True streaming vs micro-batching**: Flink processes each event immediately - 10ms latency vs 500ms in Spark Streaming
**Event time + Watermarks**: deterministic results for out-of-order events - watermark W(t) signals all events before t have arrived
**Keyed state**: state is isolated per key, survives restarts, scales to terabytes with RocksDB backend
**Chandy-Lamport checkpointing**: barriers in the data stream enable exactly-once without stopping processing; unaligned checkpoints resolve the backpressure timeout problem

Вопросы для размышления

Why is watermark determined by the minimum across all Kafka partitions when reading in parallel, and how does this affect idle partitions?
What is the difference between a checkpoint and a savepoint when upgrading a pipeline without data loss?
When is exactly-once semantics in Flink insufficient and an idempotent sink is required - which storage systems support this natively?

Связанные уроки

bd-04 — Spark RDD and lazy DAG - baseline for understanding true streaming differences
bd-05 — Spark SQL micro-batching - the reference point for comparing architectures
bd-10 — Kafka as event source for Flink - the standard production pairing
bd-11 — Stream processing patterns are implemented through Flink windowing and state
ds-02-cap-theorem — Chandy-Lamport checkpoint is a distributed snapshot - applied distributed systems theory
ds-01-intro