Big Data

Stream Processing Patterns

During the 2013 Super Bowl finale, Twitter processed 24 million tweets per hour. The real question was not 'how many tweets total' but 'how many per second right now, and is that rising or falling'. Answering that requires sliding windows over a live stream - batch processing the night's logs gives the right number 8 hours too late.

  • **Uber** calculates surge pricing every 5 minutes through tumbling windows: requests / available drivers per geo-hexagon - when the ratio crosses a threshold, the price rises
  • **LinkedIn** uses windowed stream joins: profile views join with current profile data for real-time feed analytics capturing the exact profile snapshot at view time
  • **Confluent** (Kafka creators) estimates that up to 30% of production incidents in stream processing trace back to incorrect late-data handling - watermarks are most often configured too aggressively

Windowing

An infinite data stream cannot be aggregated without a boundary - there must be a 'frame' inside which averages, sums, and maximums are calculated. A window is exactly that frame: a bounded subset of the stream over which an aggregation runs. The windowing paradox: Uber calculates surge pricing every 5 minutes across all rides within a 2 km radius - a tumbling time window and a spatial window simultaneously, 100,000 parallel windows in real time.

Three window types: **Tumbling** - fixed non-overlapping intervals (0:00-1:00, 1:00-2:00). Each event falls into exactly one window. **Sliding** - windows with a step smaller than their size (10-min window, 1-min step): an event falls into multiple windows, computationally expensive. **Session** - windows defined by activity: close after N seconds of silence. Ideal for user sessions. Fourth type: **Global window** - all events in one window with a custom trigger (by event count or external signal).

The goal is counting errors per 5-minute interval, with each error falling into exactly one bucket. Which window type fits?

Stream Joins

JOINing two infinite streams is impossible without constraints: there is no way to wait for 'all' records from the second stream because they arrive indefinitely. Frameworks address this with windowed joins: combine records from two streams that land in the same time window. LinkedIn uses stream joins in real time: a stream of profile views (left) joins with updated profile data (right) so that analytics captures the exact profile version at the time of the view.

Stream join types: **Windowed join** - both sides within the same time window; both are bounded. **Stream-table join** (enrichment) - event stream + slowly changing table (lookup); the table is cached in state. **Temporal join** - a stream record joins the version of a table that was current at event time (point-in-time join). Kafka Streams supports all three. Cost increases left to right: windowed is cheap, temporal is expensive.

A transaction stream needs enrichment with customer data (name, email) from a database. The database changes rarely; enrichment latency is not critical. Which join?

Deduplication

At-least-once delivery - the standard for most streaming systems - guarantees a message will arrive but not that it arrives exactly once. For financial transactions or ad impression counters, duplication is unacceptable. Deduplication on a stream is harder than in batch: the system must track 'have we seen this event before', but cannot store infinite history - only a reasonable time window.

Stream deduplication strategies: **Windowed dedup** - keep a set of IDs for the last N minutes, drop repeats (works if duplicates arrive close in time). **Bloom filter** - probabilistic structure: 0 false negatives, low false positives; saves memory at the cost of rare 'misses' (accepting a duplicate). **RocksDB state store** - persistent K/V store in Flink/Kafka Streams for seen IDs with TTL. **Idempotent write** - not dedup in the stream, but an idempotent operation at the sink (upsert instead of insert).

A system counts ad impressions. Duplicates must be eliminated but errors below 0.1% are acceptable (rare duplicate slips through). Which data structure is optimal?

Late Data and Watermarks

Events in a stream have two timestamps: event time (when something happened) and processing time (when the broker received it). A mobile user goes underground with no signal - clicks accumulate in the app buffer for 20 minutes. When connectivity returns, 50 events arrive simultaneously with event times 20 minutes in the past. A system running on processing time counts them in the 'wrong' window. A watermark is the system's declaration: 'all events with event time before T have been received - windows before T can close'.

**Watermark** advances as events arrive: `watermark = max_event_time_seen - max_lateness_tolerance`. A larger tolerance means windows close later, more late data is accepted, and result latency is higher. **Allowed lateness** (Flink) - additional delay after the watermark before final window closure: accept events for N more seconds after 'closing'. **Side output** - events that arrive after allowed lateness go to a separate stream for handling or alerting.

A larger watermark tolerance is always better - it accepts more late data

Watermark tolerance is a direct tradeoff between completeness (more data accepted) and latency (windows close later and results arrive later)

A 5-minute tolerance means results appear 5 minutes late. In real-time systems this is often unacceptable. The right tolerance equals the maximum delay the business can absorb.

Watermark tolerance is 30 seconds. An event arrives with an event time 45 seconds behind the current watermark. What happens to it?

Key Ideas

  • **Windowing** turns an infinite stream into finite aggregates: tumbling for exact periods, sliding for rolling metrics, session for user activity - the business requirement determines the window type
  • **Stream joins** are bounded by time windows or state-cached tables; stream-table join is the gold standard for enriching events with reference data without temporal constraints
  • **Watermarks** drive event-time progress in a stream: tolerance is a direct latency-vs-completeness tradeoff; events arriving after allowed lateness are permanently lost without side output

Related Topics

Stream processing patterns depend on broker infrastructure and analytical storage:

  • Real-time Analytics — Windowing and join results land in OLAP engines (Druid, ClickHouse) for interactive queries
  • Message Brokers: Why and When — Kafka is the typical source and sink for stream processing; Kafka's at-least-once semantics require deduplication at the processing layer

Вопросы для размышления

  • A session window closes after N minutes of silence. But what if a user pauses for 35 minutes in the middle of a real visit? How do you distinguish the end of a session from temporary inactivity?
  • A Bloom filter for deduplication gives 0.1% false positives. At one billion ad impressions per day that is millions of accepted duplicates - how do you weigh the memory cost of a HashSet against the revenue impact of missed duplicates for a specific business?
  • A watermark tolerance of 30 seconds means results arrive 30 seconds late. Under what conditions would it make sense to run two parallel pipelines - a fast one with an aggressive watermark and a slower one with a larger tolerance for correctness?

Связанные уроки

  • ds-01-intro
Stream Processing Patterns

0

1

Sign In