Stream Processing

Batch vs Stream Processing

A card transaction - 100 milliseconds to check for fraud. Miss the window and the money is gone. A monthly spending report - hours. Kafka at LinkedIn processes 7 trillion messages per day. Netflix runs Apache Flink on 2 million events per second. One company - batch for overnight ML model training, stream for 'Trending Now' right now. The architect's job is not to pick one, but to know what each task actually needs.

**Kafka (LinkedIn)** - 7 trillion messages per day; Kafka Streams = stream processing without a separate cluster, running inside a Java application
**Flink (Netflix)** - 2 million events per second, single-digit millisecond latency; personalization, monitoring, A/B tests in real time
**Spark Structured Streaming vs Flink**: Spark - micro-batch (100 ms - 2 s, simple API); Flink - true stream (1-10 ms, full control, more complex)

Batch Processing

Every night Amazon recalculates recommendations for millions of users. Uber tallies up the day's rides. Netflix updates movie ratings. All of this is **batch processing**: data accumulates, gets processed as a whole, and results come out. Apache Spark runs these jobs on clusters handling petabytes of data - batch mode is exactly what trains ML models at LinkedIn, Meta, and Airbnb.

**Batch processing** - processing a finite, fixed dataset. Data is already collected (bounded dataset). The system reads all input, performs computations, writes all output. Latency ranges from minutes to hours.

Characteristic	Value
Data	Bounded dataset
Latency	Minutes - hours
Throughput	Very high
Fault tolerance	Restart task from checkpoint
Tools	Hadoop MapReduce, Spark, Hive, Presto
Use cases	ETL, reports, ML training, data warehouse

The main advantage of batch: **simplicity and reliability**. Data doesn't change during processing, the result is deterministic. You can restart the job on failure. The main drawback: results are stale by the time they're ready.

MapReduce (Google, 2004) → Spark (2012) → modern engines. Spark replaced MapReduce thanks to in-memory processing: data is kept in RAM between steps instead of written to disk. Speedup 10-100x. Spark Structured Streaming - micro-batch on top of the same engine - is used by Kafka while processing 7 trillion messages per day at LinkedIn.

What is the main drawback of batch processing?

Micro-batch Processing

Running batch every 2 seconds yields **micro-batch** - a compromise between batch and stream. Data accumulates in small portions and is processed as mini-batches. Latency drops to seconds.

**Spark Structured Streaming** is the main representative of the micro-batch approach. Every N seconds (default 100 ms – 2 s) incoming data is combined into a mini-DataFrame and processed with the same API as batch Spark. This simplifies the transition from batch to 'near-stream'.

Micro-batch inherits all batch advantages: exactly-once semantics (process-exactly-once guarantee), simple API, excellent fault tolerance. Drawback - latency cannot be lower than the micro-batch duration (typically 100 ms – a few seconds).

Property	Batch	Micro-batch
Latency	Minutes - hours	Seconds
Throughput	Maximum	High
Semantics	Exactly-once	Exactly-once
API complexity	Simple	Same as batch
Tools	Spark batch	Spark Streaming

For many tasks micro-batch is the ideal choice. Dashboards updating every 5 seconds? Micro-batch. Aggregating metrics for the last minute? Micro-batch. Spark Structured Streaming holds latency at 100 ms - 2 s. But Netflix processes 2 million events per second through Apache Flink with latency in single-digit milliseconds - that requires true stream.

What is the main advantage of micro-batch over true stream processing?

Stream Processing

**Stream processing** - processing each event as it arrives, without waiting. Event arrives - gets processed - result is ready. Latency is measured in milliseconds. Apache Flink leads: Netflix runs 2 million events per second on it for personalization and monitoring. Kafka Streams is Flink's embedded counterpart inside Kafka itself - at LinkedIn it handles 7 trillion messages per day.

The key difference from batch: data is **unbounded** - the stream has no end. There's no 'whole dataset'. You need to define what 'the last 5 minutes' means (windowing), what to do with late-arriving data (watermarks), and how to guarantee correctness on failures.

**Windowing** - the primary aggregation mechanism in streams. Tumbling window (non-overlapping windows: [0-5s], [5-10s], ...), Sliding window (overlapping: [0-5s], [2-7s], ...), Session window (by user activity, with gap timeout).

Property	Micro-batch	True Stream
Latency	100ms - seconds	Milliseconds
Model	Mini-DataFrame	Event by event
Windowing	Limited	Full (tumbling, sliding, session)
Watermarks	No	Yes - managing late-arriving data
Exactly-once	Built-in	Via checkpointing
Tools	Spark Streaming	Flink, Kafka Streams, ksqlDB

What are watermarks used for in stream processing?

Lambda and Kappa Architectures

Combining the advantages of batch (accuracy, completeness) and stream (speed) is the goal of **Lambda Architecture**, proposed by Nathan Marz: two parallel layers - batch (accurate, slow) and speed (fast, approximate). Results are merged at query time.

Jay Kreps (creator of Kafka) proposed an alternative - **Kappa Architecture**: only one stream pipeline. Kafka stores historical data (retention), so the stream can be 'rewound' and re-read from scratch when needed. One pipeline, one logic.

Characteristic	Lambda	Kappa
Number of pipelines	2 (batch + speed)	1 (stream only)
Code duplication	High	None
Complexity	High (merge views)	Medium
Reprocessing	Batch restart	Re-read Kafka from the start
Batch accuracy	Guaranteed	Depends on stream engine
When to choose	ML + real-time; legacy batch	New projects; Flink + Kafka

Modern trend: Kappa is winning. Apache Flink provides exactly-once semantics, Kafka stores data for years (tiered storage). The need for a separate batch layer is shrinking. But for ML training and complex analytics, batch remains optimal.

In practice many systems use a hybrid approach: Kafka for ingestion, Flink for real-time processing, Spark for heavy analytics and ML. It's important to choose the architecture based on specific requirements for latency, accuracy, and complexity.

Batch processing is outdated and unnecessary

Batch remains optimal for tasks that don't require low latency: ETL to a data warehouse, ML model training, report generation, full reindexing. Batch throughput is higher, implementation is simpler, and exactly-once is easier to guarantee

Spark Structured Streaming delivers 100 ms - 2 s latency; Flink delivers single-digit milliseconds. But Flink requires managing state, watermarks, and out-of-order data. PyTorch model training, BigQuery ETL, Uber's nightly reports - all of these are batch, and that's the correct choice. Not 'stream beats batch', but 'different tools for different latency requirements'

The main problem with Lambda Architecture that Kappa solves:

Key Takeaways

**Batch** - bounded dataset, maximum throughput, minutes-hours latency; Spark trains ML models at LinkedIn, Meta, and Airbnb
**Micro-batch** - Spark Structured Streaming, 100 ms - 2 s latency, same DataFrame API as batch; exactly-once built in
**True Stream** - Flink processes 2M events/sec at Netflix with 1-10 ms latency; windowing and watermarks handle out-of-order data
**Lambda** = batch + stream (duplicated logic); **Kappa** = Kafka + Flink only (one pipeline, re-read from start when logic changes)
**Kafka** - 7 trillion messages/day at LinkedIn; not just a broker but persistent storage for the Kappa architecture

Вопросы для размышления

Kafka at LinkedIn handles 7 trillion messages per day. Which tasks from that stream require true stream processing (Flink, ms latency), and which can be handled by micro-batch (Spark, 2 s latency) - and why does the separation matter?
Netflix keeps batch Spark for overnight ML model retraining and Flink for real-time events. Why not go fully Kappa - what makes batch irreplaceable for ML training?
Backpressure is a mechanism where an overloaded consumer signals the producer to slow down. How does Flink implement this in the context of watermarks and windowing?

Связанные уроки

stream-02 — Event-driven architectures are built on stream processing - the natural next step.
stream-03 — Kafka and other brokers are the transport layer for all streaming architectures.
bd-01 — Big Data Volume and Velocity are exactly the environment batch and stream models were designed for.
st-01-feedback-loops — Lambda/Kappa architectures are feedback-loop systems: the batch layer corrects the stream layer, like negative feedback in systems theory.
prob-03-conditional — Watermarks are probabilistic estimates of stream completeness; conditional probability intuition makes them concrete.

Stream Processing

Batch vs Stream Processing

**Kafka (LinkedIn)** - 7 trillion messages per day; Kafka Streams = stream processing without a separate cluster, running inside a Java application
**Flink (Netflix)** - 2 million events per second, single-digit millisecond latency; personalization, monitoring, A/B tests in real time
**Spark Structured Streaming vs Flink**: Spark - micro-batch (100 ms - 2 s, simple API); Flink - true stream (1-10 ms, full control, more complex)

Batch Processing

Characteristic	Value
Data	Bounded dataset
Latency	Minutes - hours
Throughput	Very high
Fault tolerance	Restart task from checkpoint
Tools	Hadoop MapReduce, Spark, Hive, Presto
Use cases	ETL, reports, ML training, data warehouse

What is the main drawback of batch processing?

Micro-batch Processing

Running batch every 2 seconds yields **micro-batch** - a compromise between batch and stream. Data accumulates in small portions and is processed as mini-batches. Latency drops to seconds.

Property	Batch	Micro-batch
Latency	Minutes - hours	Seconds
Throughput	Maximum	High
Semantics	Exactly-once	Exactly-once
API complexity	Simple	Same as batch
Tools	Spark batch	Spark Streaming

What is the main advantage of micro-batch over true stream processing?

Stream Processing

Property	Micro-batch	True Stream
Latency	100ms - seconds	Milliseconds
Model	Mini-DataFrame	Event by event
Windowing	Limited	Full (tumbling, sliding, session)
Watermarks	No	Yes - managing late-arriving data
Exactly-once	Built-in	Via checkpointing
Tools	Spark Streaming	Flink, Kafka Streams, ksqlDB

What are watermarks used for in stream processing?

Lambda and Kappa Architectures

Characteristic	Lambda	Kappa
Number of pipelines	2 (batch + speed)	1 (stream only)
Code duplication	High	None
Complexity	High (merge views)	Medium
Reprocessing	Batch restart	Re-read Kafka from the start
Batch accuracy	Guaranteed	Depends on stream engine
When to choose	ML + real-time; legacy batch	New projects; Flink + Kafka

Batch processing is outdated and unnecessary

The main problem with Lambda Architecture that Kappa solves:

Key Takeaways

**Batch** - bounded dataset, maximum throughput, minutes-hours latency; Spark trains ML models at LinkedIn, Meta, and Airbnb
**Micro-batch** - Spark Structured Streaming, 100 ms - 2 s latency, same DataFrame API as batch; exactly-once built in
**True Stream** - Flink processes 2M events/sec at Netflix with 1-10 ms latency; windowing and watermarks handle out-of-order data
**Lambda** = batch + stream (duplicated logic); **Kappa** = Kafka + Flink only (one pipeline, re-read from start when logic changes)
**Kafka** - 7 trillion messages/day at LinkedIn; not just a broker but persistent storage for the Kappa architecture

Вопросы для размышления

Kafka at LinkedIn handles 7 trillion messages per day. Which tasks from that stream require true stream processing (Flink, ms latency), and which can be handled by micro-batch (Spark, 2 s latency) - and why does the separation matter?
Netflix keeps batch Spark for overnight ML model retraining and Flink for real-time events. Why not go fully Kappa - what makes batch irreplaceable for ML training?
Backpressure is a mechanism where an overloaded consumer signals the producer to slow down. How does Flink implement this in the context of watermarks and windowing?

Связанные уроки

stream-02 — Event-driven architectures are built on stream processing - the natural next step.
stream-03 — Kafka and other brokers are the transport layer for all streaming architectures.
bd-01 — Big Data Volume and Velocity are exactly the environment batch and stream models were designed for.
st-01-feedback-loops — Lambda/Kappa architectures are feedback-loop systems: the batch layer corrects the stream layer, like negative feedback in systems theory.
prob-03-conditional — Watermarks are probabilistic estimates of stream completeness; conditional probability intuition makes them concrete.

Batch vs Stream Processing

Batch Processing

Micro-batch Processing

Stream Processing

Lambda and Kappa Architectures

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки

Batch vs Stream Processing

Batch Processing

Micro-batch Processing

Stream Processing

Lambda and Kappa Architectures

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки