Stream Processing
Batch vs Stream Processing
A card transaction - 100 milliseconds to check for fraud. Miss the window and the money is gone. A monthly spending report - hours. Kafka at LinkedIn processes 7 trillion messages per day. Netflix runs Apache Flink on 2 million events per second. One company - batch for overnight ML model training, stream for 'Trending Now' right now. The architect's job is not to pick one, but to know what each task actually needs.
- **Kafka (LinkedIn)** - 7 trillion messages per day; Kafka Streams = stream processing without a separate cluster, running inside a Java application
- **Flink (Netflix)** - 2 million events per second, single-digit millisecond latency; personalization, monitoring, A/B tests in real time
- **Spark Structured Streaming vs Flink**: Spark - micro-batch (100 ms - 2 s, simple API); Flink - true stream (1-10 ms, full control, more complex)
Batch Processing
Every night Amazon recalculates recommendations for millions of users. Uber tallies up the day's rides. Netflix updates movie ratings. All of this is **batch processing**: data accumulates, gets processed as a whole, and results come out. Apache Spark runs these jobs on clusters handling petabytes of data - batch mode is exactly what trains ML models at LinkedIn, Meta, and Airbnb.
**Batch processing** - processing a finite, fixed dataset. Data is already collected (bounded dataset). The system reads all input, performs computations, writes all output. Latency ranges from minutes to hours.
| Characteristic | Value |
|---|---|
| Data | Bounded dataset |
| Latency | Minutes - hours |
| Throughput | Very high |
| Fault tolerance | Restart task from checkpoint |
| Tools | Hadoop MapReduce, Spark, Hive, Presto |
| Use cases | ETL, reports, ML training, data warehouse |
The main advantage of batch: **simplicity and reliability**. Data doesn't change during processing, the result is deterministic. You can restart the job on failure. The main drawback: results are stale by the time they're ready.
MapReduce (Google, 2004) → Spark (2012) → modern engines. Spark replaced MapReduce thanks to in-memory processing: data is kept in RAM between steps instead of written to disk. Speedup 10-100x. Spark Structured Streaming - micro-batch on top of the same engine - is used by Kafka while processing 7 trillion messages per day at LinkedIn.
What is the main drawback of batch processing?
Micro-batch Processing
Running batch every 2 seconds yields **micro-batch** - a compromise between batch and stream. Data accumulates in small portions and is processed as mini-batches. Latency drops to seconds.
**Spark Structured Streaming** is the main representative of the micro-batch approach. Every N seconds (default 100 ms – 2 s) incoming data is combined into a mini-DataFrame and processed with the same API as batch Spark. This simplifies the transition from batch to 'near-stream'.
Micro-batch inherits all batch advantages: exactly-once semantics (process-exactly-once guarantee), simple API, excellent fault tolerance. Drawback - latency cannot be lower than the micro-batch duration (typically 100 ms – a few seconds).
| Property | Batch | Micro-batch |
|---|---|---|
| Latency | Minutes - hours | Seconds |
| Throughput | Maximum | High |
| Semantics | Exactly-once | Exactly-once |
| API complexity | Simple | Same as batch |
| Tools | Spark batch | Spark Streaming |
For many tasks micro-batch is the ideal choice. Dashboards updating every 5 seconds? Micro-batch. Aggregating metrics for the last minute? Micro-batch. Spark Structured Streaming holds latency at 100 ms - 2 s. But Netflix processes 2 million events per second through Apache Flink with latency in single-digit milliseconds - that requires true stream.
What is the main advantage of micro-batch over true stream processing?
Stream Processing
**Stream processing** - processing each event as it arrives, without waiting. Event arrives - gets processed - result is ready. Latency is measured in milliseconds. Apache Flink leads: Netflix runs 2 million events per second on it for personalization and monitoring. Kafka Streams is Flink's embedded counterpart inside Kafka itself - at LinkedIn it handles 7 trillion messages per day.
The key difference from batch: data is **unbounded** - the stream has no end. There's no 'whole dataset'. You need to define what 'the last 5 minutes' means (windowing), what to do with late-arriving data (watermarks), and how to guarantee correctness on failures.
**Windowing** - the primary aggregation mechanism in streams. Tumbling window (non-overlapping windows: [0-5s], [5-10s], ...), Sliding window (overlapping: [0-5s], [2-7s], ...), Session window (by user activity, with gap timeout).
| Property | Micro-batch | True Stream |
|---|---|---|
| Latency | 100ms - seconds | Milliseconds |
| Model | Mini-DataFrame | Event by event |
| Windowing | Limited | Full (tumbling, sliding, session) |
| Watermarks | No | Yes - managing late-arriving data |
| Exactly-once | Built-in | Via checkpointing |
| Tools | Spark Streaming | Flink, Kafka Streams, ksqlDB |
What are watermarks used for in stream processing?
Lambda and Kappa Architectures
Combining the advantages of batch (accuracy, completeness) and stream (speed) is the goal of **Lambda Architecture**, proposed by Nathan Marz: two parallel layers - batch (accurate, slow) and speed (fast, approximate). Results are merged at query time.
Jay Kreps (creator of Kafka) proposed an alternative - **Kappa Architecture**: only one stream pipeline. Kafka stores historical data (retention), so the stream can be 'rewound' and re-read from scratch when needed. One pipeline, one logic.
| Characteristic | Lambda | Kappa |
|---|---|---|
| Number of pipelines | 2 (batch + speed) | 1 (stream only) |
| Code duplication | High | None |
| Complexity | High (merge views) | Medium |
| Reprocessing | Batch restart | Re-read Kafka from the start |
| Batch accuracy | Guaranteed | Depends on stream engine |
| When to choose | ML + real-time; legacy batch | New projects; Flink + Kafka |
Modern trend: Kappa is winning. Apache Flink provides exactly-once semantics, Kafka stores data for years (tiered storage). The need for a separate batch layer is shrinking. But for ML training and complex analytics, batch remains optimal.
In practice many systems use a hybrid approach: Kafka for ingestion, Flink for real-time processing, Spark for heavy analytics and ML. It's important to choose the architecture based on specific requirements for latency, accuracy, and complexity.
Batch processing is outdated and unnecessary
Batch remains optimal for tasks that don't require low latency: ETL to a data warehouse, ML model training, report generation, full reindexing. Batch throughput is higher, implementation is simpler, and exactly-once is easier to guarantee
Spark Structured Streaming delivers 100 ms - 2 s latency; Flink delivers single-digit milliseconds. But Flink requires managing state, watermarks, and out-of-order data. PyTorch model training, BigQuery ETL, Uber's nightly reports - all of these are batch, and that's the correct choice. Not 'stream beats batch', but 'different tools for different latency requirements'
The main problem with Lambda Architecture that Kappa solves:
Key Takeaways
- **Batch** - bounded dataset, maximum throughput, minutes-hours latency; Spark trains ML models at LinkedIn, Meta, and Airbnb
- **Micro-batch** - Spark Structured Streaming, 100 ms - 2 s latency, same DataFrame API as batch; exactly-once built in
- **True Stream** - Flink processes 2M events/sec at Netflix with 1-10 ms latency; windowing and watermarks handle out-of-order data
- **Lambda** = batch + stream (duplicated logic); **Kappa** = Kafka + Flink only (one pipeline, re-read from start when logic changes)
- **Kafka** - 7 trillion messages/day at LinkedIn; not just a broker but persistent storage for the Kappa architecture
Related Topics
Batch vs Stream - the foundation for data processing architectures:
- Event-Driven Architecture — Events and commands - the building blocks of stream systems
- Message Brokers — Kafka, RabbitMQ, NATS - transport for streaming data
Вопросы для размышления
- Kafka at LinkedIn handles 7 trillion messages per day. Which tasks from that stream require true stream processing (Flink, ms latency), and which can be handled by micro-batch (Spark, 2 s latency) - and why does the separation matter?
- Netflix keeps batch Spark for overnight ML model retraining and Flink for real-time events. Why not go fully Kappa - what makes batch irreplaceable for ML training?
- Backpressure is a mechanism where an overloaded consumer signals the producer to slow down. How does Flink implement this in the context of watermarks and windowing?
Связанные уроки
- stream-02 — Event-driven architectures are built on stream processing - the natural next step.
- stream-03 — Kafka and other brokers are the transport layer for all streaming architectures.
- bd-01 — Big Data Volume and Velocity are exactly the environment batch and stream models were designed for.
- st-01-feedback-loops — Lambda/Kappa architectures are feedback-loop systems: the batch layer corrects the stream layer, like negative feedback in systems theory.
- prob-03-conditional — Watermarks are probabilistic estimates of stream completeness; conditional probability intuition makes them concrete.