Real-Time Backend
Kafka for real-time
How does LinkedIn push a trillion messages per day without losing any, and what does a plain text file on disk have to do with it?
- LinkedIn handles 1T+ messages per day through Kafka: user activity, clicks, feed updates. Everything flows through a single event log
- Uber uses Kafka Streams to compute surge pricing in real time: trip events → supply/demand aggregation by area → dynamic tariff coefficient in seconds
- Cloudflare logs 10M requests per second through Kafka: edge nodes write events while WAF and analytics read independently without blocking
- Netflix uses Kafka to track watch progress. Even if a consumer group crashes, data is preserved. The service catches up on missed events from retention
Kafka Realtime
Apache Kafka is a distributed event log that stores messages on disk and delivers them to consumers in write order. Unlike RabbitMQ or Google Pub/Sub, messages are not removed after read. They live as long as `retention.ms` says. This lets several independent services read the same data stream at different times.
LinkedIn, where Kafka was born, pushes more than 1 trillion messages per day through it: user activity, clicks, feed updates. Cloudflare uses Kafka to log 10 million requests per second: each edge node writes events to a topic, analytics and the WAF engine read independently without blocking each other.
Key difference from Pub/Sub: **retention**. Kafka keeps history. You can rewind to any offset and replay events. Critical for audit, replay-testing of new services, and recovery after incidents.
Cloudflare runs a WAF engine and a new ML anomaly detector through Kafka. Both read the `requests` topic. If the ML detector goes down for 2 hours and then comes back, what happens?
Consumer Lag
Consumer lag is the gap between the last written offset (log end offset) and the offset already committed by the consumer group. If the producer writes 50,000 events/sec and the consumer processes 40,000, lag grows by 10,000 events per second. When this happens at Uber during surge pricing, the delay between a trip and the tariff recalculation becomes user-visible.
Partition 1 is behind by 35,000 events. Most likely the consumer assigned to it is overloaded or hung. Monitoring lag through Prometheus + the `kafka_consumer_group_lag` metric lets you alert: if lag stays above N for K seconds, page the on-call engineer.
- **Slow processing**: the consumer spends more time per message than the producer takes to generate the next one
- **Consumer crash**: one instance died, rebalance took several seconds, lag grew in the meantime
- **Traffic spike**: a burst of incoming events exceeded the consumer group's throughput
- **GC pause**: a long Stop-the-World pause in a JVM consumer froze processing for hundreds of milliseconds
Uber monitors the lag of consumer group `surge-pricing`. At 18:00 lag grew from 500 to 80,000 in 5 minutes and keeps rising. What does this mean?
Partitioning
A partition is the unit of parallelism in Kafka. One partition is read by exactly one consumer of a group at a time. If topic `ride-events` has 12 partitions, a consumer group of 12 instances reads in parallel, one partition each. A 13th instance sits idle: there are no partitions left.
The partition key decides which partition a message lands in. Uber shards by `driver_id`: every event of one driver is guaranteed to land in one partition, so they are processed in arrival order by one consumer. Critical for correct route and trip-cost computation.
**Partition count can only go up, never down.** Increasing partitions can move the same partition key to a different partition (hash % new_count changes), which breaks ordering guarantees. Plan partitioning at topic creation, with headroom.
Uber uses `driver_id` as the partition key. Why is that specific choice critical for surge pricing?
Kafka Tuning
Kafka exposes three producer durability options through the `acks` parameter. With `acks=0` the producer waits for no acknowledgement: max throughput, zero guarantees. With `acks=1` the leader broker confirms the write, but if it dies before replication, the message is gone. With `acks=all` every ISR (in-sync replica) confirms the write: full durability, higher latency.
- **`linger.ms`**: time to wait before sending a batch. Raising from 0 to 5ms boosts throughput 5-10x with minimal latency cost
- **`batch.size`**: max batch size in bytes. Default 16KB; for high-load topics, 64-256KB
- **`compression.type`**: snappy cuts volume by 50-70% at minimal CPU cost; lz4 is faster; gzip is denser but slower
- **`min.insync.replicas`**: minimum replicas needed for ack at acks=all. 2 of 3 replicas is the production standard
More partitions always means better performance
Each partition is a file on disk and a TCP connection. Too many partitions increase failover latency and load on ZooKeeper/KRaft
On broker failure Kafka reassigns leaders for every partition on that broker. 10,000 partitions per broker means 10,000 reassignment operations. Failover can take tens of seconds. LinkedIn recommends no more than 4,000 partitions per broker and 200,000 across the cluster.
Stripe processes payment events through Kafka. Which combination of `acks` and `idempotent` should the `payment-transactions` topic use?
Summary
- Kafka stores messages as an append-only log with a retention period. Multiple consumer groups read one topic independently, each with its own offset
- Consumer lag is the gap between what is written and what the group has read. Growing lag means the consumer can't keep up with the producer
- Partition count determines max parallelism: one consumer per group, one partition. The partition key guarantees ordering for one entity's events
- For financial data reliability: acks=all + idempotent=true. For analytics logs: acks=1 + large batches + compression
Related topics
Kafka fits into the wider real-time architecture patterns
- Horizontal scaling — Kafka partitioning is a concrete implementation of sharding for event streams
- Event-driven architecture — Kafka is the main transport layer for event sourcing and CQRS patterns
- Monitoring and observability — Consumer lag is a key health-check metric for a real-time pipeline
Вопросы для размышления
- If you had to choose between Kafka and Redis Pub/Sub for an analytics system with 10+ independent data consumers, what factor would be decisive?
- Uber shards events by driver_id. What happens to ordering guarantees if tomorrow you add partitions to the topic without data migration?
- In which scenario is a high consumer lag normal rather than alert-worthy?