Stream Processing
Stream-Table Duality
In 2016 LinkedIn processed 4.5 trillion Kafka messages per day. They needed to power live counters ("123 people are viewing this job right now") without a separate database. The answer was to build a TABLE directly on top of the stream. Out of that came Kafka Streams and the stream-table duality concept.
- **LinkedIn (2016)**: 4.5T Kafka messages per day, live counters powered by KTable
- **Confluent ksqlDB**: SQL layer over Kafka Streams, used by Lyft, Walmart, Bosch
- **Pinterest**: Apache Flink for materialized views over Kafka, 100+ TB state
- **Uber Athenadb**: TABLE-based materialized views for real-time pricing
- **Yelp Joinery**: stream-stream joins matching orders and payments in real time
Materialized views: tables from streams
Stream-table duality is the central idea behind Kafka Streams and similar systems. Any table can be represented as a stream of its changes, and any key-value stream can be folded into a table by aggregating to the latest value per key. A materialized view is exactly that: a table materialized out of a stream.
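**Formally:** table T = aggregate(stream S, by_key, last_value). Stream S = changelog(table T). As long as the updates for each key are applied in their original order, the round trip table -> stream -> table reproduces the original table. The opposite round trip is lossy: stream -> table -> stream keeps only the latest value per key, not the full history.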
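The two directions can be sketched in plain Python. This is a toy model rather than the Kafka Streams API: the names `to_table` and `to_changelog` and the sample job counters are assumptions made up for illustration, and a value of `None` stands in for a deletion.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (key, value); a value of None means "delete this key"
Record = Tuple[str, Optional[int]]


def to_table(stream: Iterable[Record]) -> Dict[str, int]:
    """Fold a keyed stream into a table: the last value per key wins."""
    table: Dict[str, int] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)  # a deletion removes the row
        else:
            table[key] = value
    return table


def to_changelog(table: Dict[str, int]) -> List[Record]:
    """Emit one upsert per row: a stream that rebuilds the same table."""
    return [(key, value) for key, value in table.items()]


stream = [
    ("job-1", 1),
    ("job-2", 5),
    ("job-1", 2),
    ("job-2", None),
    ("job-3", 7),
    ("job-1", 3),
]

table = to_table(stream)
print(table)  # {'job-1': 3, 'job-3': 7}

# Round trip: table -> stream -> table gives back the same table.
assert to_table(to_changelog(table)) == table
```

The six input records collapse into two rows, which is the sense in which the table is an aggregate of the stream.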
What is a materialized view in Kafka Streams?
Changelog streams
A changelog stream is a stream of table updates: each message carries a key and a new value (or null for deletion). Kafka log compaction keeps only the latest message per key, so the changelog topic becomes a physical table. Its size is bounded by the number of distinct keys rather than by time.
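**Properties:** idempotency (replaying the changelog a second time leaves the table unchanged, because every replay ends on the same latest value per key), recoverability (a table can be rebuilt from zero by replaying the changelog), and tombstones (a record of the form (key, null) marks a deletion; compaction removes the tombstone itself once its retention period, delete.retention.ms, has passed).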
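A small simulation can make compaction and tombstones concrete. It is a simplified model, not broker behavior: the `compact` and `materialize` helpers are hypothetical, compaction here runs in one pass over the whole log, and the `drop_tombstones` flag stands in for the tombstone retention period having elapsed.

```python
from typing import Dict, List, Optional, Tuple

# (key, value); a value of None is a tombstone
Record = Tuple[str, Optional[str]]


def compact(log: List[Record], drop_tombstones: bool = False) -> List[Record]:
    """Keep only the latest record per key, in original log order."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    kept = [rec for offset, rec in enumerate(log) if last_offset[rec[0]] == offset]
    if drop_tombstones:
        # Models the state after tombstone retention has passed.
        kept = [rec for rec in kept if rec[1] is not None]
    return kept


def materialize(log: List[Record]) -> Dict[str, str]:
    """Rebuild the table by replaying a log from the beginning."""
    table: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


log = [
    ("alice", "online"),
    ("bob", "online"),
    ("alice", "away"),
    ("bob", None),  # tombstone: bob is deleted
    ("carol", "online"),
]

compacted = compact(log)
print(compacted)
# [('alice', 'away'), ('bob', None), ('carol', 'online')]

# Compaction does not change the table the log represents.
assert materialize(compacted) == materialize(log)

# Replaying the same records twice is a no-op.
assert materialize(compacted + compacted) == materialize(compacted)
```

The compacted log holds one record per key however long the original log was, which is why its size tracks the number of distinct keys.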
Why does a changelog topic use cleanup.policy=compact instead of delete?
ksqlDB: SQL over streams
ksqlDB is a layer above Kafka Streams that exposes SQL syntax. It allows defining STREAMs and TABLEs, performing joins, aggregations, and windows without writing Java. Internally it compiles to a Kafka Streams topology.
**Pull vs Push query:** a pull query reads the current snapshot from the state store, behaving like a normal SELECT. A push query streams changes in real time. Both run against the same KTable underneath.
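The difference between the two query styles can be modeled with a toy in-memory table. Nothing here is ksqlDB code: the `CountTable` class and its `apply`, `pull`, and `push` methods are hypothetical names, a dictionary stands in for the state store, and a callback stands in for the client connection of a push query.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CountTable:
    """Toy model of a table that counts events per key."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}  # stands in for the state store
        self._subscribers: List[Callable[[str, int], None]] = []

    def apply(self, key: str) -> None:
        """Process one stream event and notify push subscribers."""
        self._store[key] = self._store.get(key, 0) + 1
        for notify in self._subscribers:
            notify(key, self._store[key])

    def pull(self, key: str) -> Optional[int]:
        """Pull query: read the current value once and return."""
        return self._store.get(key)

    def push(self, callback: Callable[[str, int], None]) -> None:
        """Push query: register to receive every later change."""
        self._subscribers.append(callback)


views = CountTable()
seen: List[Tuple[str, int]] = []
views.push(lambda key, count: seen.append((key, count)))

for job in ["job-1", "job-2", "job-1"]:
    views.apply(job)

print(views.pull("job-1"))  # 2
print(seen)  # [('job-1', 1), ('job-2', 1), ('job-1', 2)]
```

The pull query answers with a single number, while the push subscriber saw every intermediate state. In ksqlDB itself, a push query is the one written with `EMIT CHANGES`.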
Are STREAM and TABLE physically the same thing in ksqlDB?
Streaming joins: stream-stream and stream-table
A stream-table join enriches every event in a stream with data from a table looked up by key. This is the natural fit for enriching transactions with a user profile: the current TABLE value is attached to each event.
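A minimal sketch of that lookup, assuming profile updates and transactions arrive interleaved on one timeline. The function `stream_table_join`, the record layout, and the user tiers are made up for illustration; the sketch behaves like an inner join, so events with no matching row are dropped.

```python
from typing import Dict, List, Tuple, Union

# (kind, user_id, payload): kind is "profile" (table update) or "txn" (stream event)
TimelineEntry = Tuple[str, str, Union[str, int]]


def stream_table_join(timeline: List[TimelineEntry]) -> List[Tuple[str, int, str]]:
    """Attach the table value that is current at the moment each event arrives."""
    profiles: Dict[str, str] = {}  # the table side, updated in place
    enriched: List[Tuple[str, int, str]] = []
    for kind, user, payload in timeline:
        if kind == "profile":
            profiles[user] = str(payload)
        elif user in profiles:
            enriched.append((user, int(payload), profiles[user]))
        # A transaction with no profile row is dropped (inner join).
    return enriched


timeline: List[TimelineEntry] = [
    ("profile", "u1", "basic"),
    ("txn", "u1", 30),
    ("profile", "u1", "gold"),
    ("txn", "u1", 45),
    ("txn", "u2", 10),  # no profile for u2 yet
]

print(stream_table_join(timeline))
# [('u1', 30, 'basic'), ('u1', 45, 'gold')]
```

The two transactions for `u1` carry different tiers because the table changed between them. Only the latest row per key is held in memory, so no window is needed on the table side.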
A stream-stream join is trickier: it must pair events from two streams that do not arrive simultaneously. The fix is a time window: "if event B arrives within N minutes after event A with the same key, emit a pair." Without a window, the join would be an unbounded growing table.
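The window rule can be sketched as a buffer with expiry. This is a simplified model under several assumptions: events are processed in timestamp order, only payments that arrive after an order are matched, and the names `windowed_join`, `orders`, and `payments` are hypothetical. The function also reports the largest buffer size it reached, to show that the window keeps state bounded.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (timestamp_seconds, key, payload)


def windowed_join(
    orders: List[Event], payments: List[Event], window: int
) -> Tuple[List[Tuple[str, str, str]], int]:
    """Pair each payment with same-key orders at most `window` seconds older."""
    # Merge both streams by timestamp; side 0 = order, side 1 = payment.
    merged = sorted(
        [(ts, 0, key, payload) for ts, key, payload in orders]
        + [(ts, 1, key, payload) for ts, key, payload in payments]
    )
    buffer: Dict[str, List[Tuple[int, str]]] = {}  # buffered orders per key
    pairs: List[Tuple[str, str, str]] = []
    max_buffered = 0

    for ts, side, key, payload in merged:
        # Expire orders that have fallen out of the window.
        for k in list(buffer):
            buffer[k] = [(t, p) for t, p in buffer[k] if ts - t <= window]
            if not buffer[k]:
                del buffer[k]

        if side == 0:
            buffer.setdefault(key, []).append((ts, payload))
        else:
            for _, order in buffer.get(key, []):
                pairs.append((key, order, payload))

        max_buffered = max(max_buffered, sum(len(v) for v in buffer.values()))

    return pairs, max_buffered


orders = [(0, "o1", "order-1"), (10, "o2", "order-2"), (500, "o3", "order-3")]
payments = [(30, "o1", "pay-1"), (400, "o2", "pay-2"), (520, "o3", "pay-3")]

pairs, max_buffered = windowed_join(orders, payments, window=60)
print(pairs)  # [('o1', 'order-1', 'pay-1'), ('o3', 'order-3', 'pay-3')]
print(max_buffered)  # 2
```

The payment for `o2` arrives 390 seconds after its order, outside the 60-second window, so no pair is emitted. Without the expiry step every order would stay buffered forever, which is the unbounded table the paragraph above warns about.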
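**Pitfall:** joins require co-partitioning. Both inputs must be partitioned on the same key, with the same partition count and the same partitioning strategy. If the join column is not the message key, ksqlDB inserts an automatic repartition step, which is an expensive shuffle through an intermediate topic. A partition-count mismatch is not repaired automatically: ksqlDB rejects the join.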
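The sketch below illustrates why co-partitioning matters and what a repartition step does. It is a model, not Kafka's partitioner: `zlib.crc32` is used as a stand-in hash (Kafka's default partitioner uses a different hash function), and `partition_for`, `repartition`, and the sample order records are hypothetical.

```python
import zlib
from typing import Callable, Dict, List, Set, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Same key, different partition counts: the two sides can disagree on
# where a key lives, so a worker reading partition p of both topics
# would not see both halves of the join.
keys = [f"user-{i}" for i in range(100)]
mismatched = [k for k in keys if partition_for(k, 4) != partition_for(k, 6)]
print(f"{len(mismatched)} of {len(keys)} keys land in different partitions")


def repartition(
    records: List[Dict[str, str]],
    key_fn: Callable[[Dict[str, str]], str],
    num_partitions: int,
) -> List[List[Tuple[str, Dict[str, str]]]]:
    """Model of a repartition step: rekey every record and shuffle it."""
    partitions: List[List[Tuple[str, Dict[str, str]]]] = [
        [] for _ in range(num_partitions)
    ]
    for record in records:
        new_key = key_fn(record)
        partitions[partition_for(new_key, num_partitions)].append((new_key, record))
    return partitions


# Orders are keyed by order_id, but the join is on user_id.
order_records = [
    {"order_id": f"order-{i}", "user_id": f"user-{i % 5}"} for i in range(20)
]
by_user = repartition(order_records, lambda r: r["user_id"], num_partitions=4)

# After the shuffle, all records for one user sit in a single partition.
homes: Dict[str, Set[int]] = {}
for index, partition in enumerate(by_user):
    for user, _ in partition:
        homes.setdefault(user, set()).add(index)
assert all(len(partition_ids) == 1 for partition_ids in homes.values())
```

Every record passes through the shuffle once, which in a real deployment means writing it to and reading it from an intermediate topic. That is the cost the pitfall refers to.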
**Common misconception:** a TABLE in ksqlDB is a database table, separate from Kafka.
**In reality:** a TABLE is a view backed by a changelog topic. Data lives in Kafka (for durability) and in RocksDB (for query latency). When a worker dies, state is rebuilt from the changelog topic.
There is no external database and no second system of record to keep in sync: that is stream-table duality in action. An engineer who models a TABLE as a PostgreSQL table will start looking for ETL jobs between Kafka and a database that do not exist. Recognizing that the changelog topic is the table's storage changes the entire system design.
Why does a stream-stream join require a time window?
Key ideas
- Stream-table duality: table = aggregate(stream, by key, last value). Stream = changelog(table). Table -> stream -> table reproduces the table; the stream side keeps history that the table discards
- Materialized view = table-from-stream plus a local state store (RocksDB) plus a changelog topic for recovery
- ksqlDB layers SQL syntax over Kafka Streams. STREAM and TABLE are two interpretations of the same mechanism (a topic plus a state store); they differ in semantics, not in storage
- Stream-table join = key lookup. Stream-stream join needs a time window or state grows unbounded. Both require co-partitioning
Related topics
Stream-table duality is the central mental model behind Kafka Streams and similar streaming systems; it ties CDC, event sourcing, CQRS, and Kafka Streams into a single architectural picture:
- Change Data Capture (CDC) — the reverse direction of duality: table -> changelog
- Kafka Streams — primary implementation of stream-table duality
- Event Sourcing — stream as source of truth, table as projection
- CQRS — stream for commands, materialized view for queries
Questions for reflection
- How would you build an "online users" counter as a KTable? What does the changelog topic look like?
- Given two streams - clicks and impressions - how would you pair them up to compute CTR?
- Why is a stream-stream join without a window a broken design?
- When does ksqlDB stop being enough and the team has to drop down to Kafka Streams in Java?
Related lessons
- stream-13 — CDC is the key pattern for understanding stream-table duality
- db-02-relational-model — Table as a special case of a stream with a single state
- prob-17 — Markov chains: stream = Markov transitions, table = stationary state
- stream-15 — Understanding duality opens the path to windowing functions
- aie-08-streaming — LLM streaming is an application of stream-table duality for AI systems