Backend Transport

Batching, Compression, and Zero-Copy

Kafka processes more than 1 trillion messages per day at LinkedIn, not on supercomputers but on standard hardware, using batching, zero-copy, and OS-level optimizations. A single broker can handle on the order of 1M messages/sec thanks to linger.ms, sendfile(), and the page cache. Once the I/O stack is optimized, application code is rarely the bottleneck.

  • **Kafka** uses sendfile() (via Java NIO FileChannel.transferTo) to serve consumer fetches. Log data goes from the page cache to the NIC without being copied into the JVM heap - the key to 1M+ messages/sec on a single broker.
  • **Cloudflare** uses XDP/eBPF for DDoS mitigation: processes 26 million packets per second on a single CPU core, blocking attacks before they reach the kernel network stack.
  • **AWS** ML training clusters use EFA, an OS-bypass interface with RDMA-style semantics, for gradient synchronization between GPUs: 1-2 microsecond latency vs roughly 50 microseconds over TCP - critical when training LLMs across thousands of GPUs.

Batching and Linger

Batching groups multiple messages into a single network request. Instead of paying one syscall and one round of protocol overhead per message, the producer accumulates messages and sends them together. In Kafka, linger.ms sets the maximum time the producer waits for a batch to fill before sending it, even if batch.size has not been reached.

A high-throughput Kafka producer might use linger.ms=20 and batch.size=1MB (the defaults are much smaller). The trade-off is explicit: throughput increases, and P99 latency increases by up to linger.ms. For interactive UIs, use linger.ms=0; for analytics pipelines, linger.ms=100 or more.
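The interaction between a size limit and a linger timeout can be sketched in a few lines. This is a toy model, not Kafka's producer: `LingerBatcher`, its `send` callback, and the injected `clock` are hypothetical names, and the batch limit here counts messages, whereas Kafka's batch.size is in bytes.

```python
import time


class LingerBatcher:
    """Toy batcher: flush when the batch is full or the oldest message has waited linger_s."""

    def __init__(self, send, batch_size=4, linger_s=0.020, clock=time.monotonic):
        self.send = send              # callable taking a list of messages (one "request")
        self.batch_size = batch_size  # max messages per batch (assumption: count, not bytes)
        self.linger_s = linger_s      # max time the oldest message may wait
        self.clock = clock            # injectable so the behavior can be tested
        self.buffer = []
        self.first_at = None          # arrival time of the oldest buffered message

    def add(self, msg):
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()              # size limit reached: send immediately

    def poll(self):
        """Call periodically: flushes a partial batch once the linger time has expired."""
        if self.buffer and self.clock() - self.first_at >= self.linger_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
            self.first_at = None
```

With linger_s=0 every call to poll() flushes whatever is buffered, which mirrors the latency-first setting; a larger value lets more messages share one request at the cost of waiting.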

Kafka producer with linger.ms=20 vs linger.ms=0. What changes?

Compression: Algorithms and Trade-offs

Compression reduces bytes transferred over the network at the cost of CPU. The choice of algorithm depends on the data type (text vs binary), latency requirements, and whether the service is CPU-bound or network-bound.

nginx gzip_static serves pre-compressed .gz files, so no CPU is spent compressing at request time. Cloudflare compresses at the edge, freeing origin servers. For Kafka JSON events, lz4 typically achieves 3-5x compression with <1ms overhead per batch.
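Pre-compression for gzip_static happens at build or deploy time: nginx looks for a sibling file with a .gz suffix next to the original. A minimal sketch of that step, where the `precompress` helper is a hypothetical name and not part of nginx:

```python
import gzip
import shutil
from pathlib import Path


def precompress(path, level=9):
    """Write '<path>.gz' next to the original file and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Maximum compression level is affordable here because this runs once,
    # ahead of time, rather than on every request.
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Running it over a static asset directory as part of a deploy keeps the original files in place for clients that do not accept gzip.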

Misconception: compression always improves performance, so enable it everywhere.

Reality: compression is a CPU vs network trade-off. For CPU-bound services or already-compressed data (JPEG, MP4), compression will hurt performance.

Streaming video: nginx serves mp4 files. Enable gzip? No - MP4 is already compressed. gzip would burn CPU without reducing size. sendfile without compression is optimal here.
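The difference between compressible and already-compressed data is easy to measure. The sketch below uses zlib from the standard library as a stand-in for lz4 or zstd, and random bytes as a stand-in for JPEG/MP4 payloads (both are high-entropy); the event fields are made up for illustration.

```python
import json
import os
import zlib


def ratio(data, level=6):
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


# A batch of similar JSON events: repeated keys and values compress well.
events = [{"user_id": i, "event": "page_view", "path": "/products/42"} for i in range(200)]
json_batch = json.dumps(events).encode()

# Assumption: random bytes approximate the entropy of already-compressed media.
random_blob = os.urandom(len(json_batch))

print(f"JSON batch:  {ratio(json_batch):.1f}x")
print(f"random blob: {ratio(random_blob):.2f}x")
```

The JSON batch shrinks several times over, while the high-entropy blob stays at roughly its original size, so the CPU spent compressing it buys nothing.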

A Kafka service transmits JSON events (~10KB each). Which compression algorithm is preferred?

Zero-Copy and sendfile()

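Zero-copy eliminates redundant data copies between kernel space and user space. Traditional file serving with read() and write() copies data 4 times: disk -> kernel page cache -> user-space buffer -> kernel socket buffer -> NIC, with two of those copies done by the CPU. sendfile() keeps the data in the kernel and, on NICs with scatter-gather DMA, reduces this to 2 DMA copies: disk -> page cache -> NIC.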

Kafka's ability to serve consumer reads at close to disk or page-cache speed comes from sendfile(). The same log segment can be sent to multiple consumers without being copied into the JVM heap. This is why Kafka brokers can sustain 1M+ messages/sec with a modest JVM heap.
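The same system call is reachable from most languages. As a minimal sketch in Python, socket.sendfile() uses os.sendfile() where the platform supports it and falls back to ordinary send() calls otherwise; the `send_file_zero_copy` wrapper is a hypothetical helper, not Kafka code.

```python
import socket


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected, blocking stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        # On platforms with sendfile(2) the file contents are not read into
        # this process: the kernel moves them from the page cache to the socket.
        return sock.sendfile(f)
```

A server would call this with an accepted TCP connection in place of a read()/send() loop; the application never touches the file's bytes.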

Why does Kafka critically depend on sendfile() for high throughput?

io_uring: Asynchronous I/O

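io_uring (Linux 5.1, 2019) is an asynchronous I/O interface built on two ring buffers shared between kernel and user space: a submission queue and a completion queue. With epoll, the application learns which sockets are ready and then still issues a read() or write() syscall for each one. With io_uring, it queues many operations and submits them with a single io_uring_enter() call, then reads completions from shared memory - approaching zero syscall overhead for high-throughput I/O.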

io_uring shines at 100K+ connections where epoll syscall overhead becomes measurable. For typical microservices with <10K concurrent connections, the benefit is marginal. Cloudflare reports 15% CPU reduction using io_uring for their edge proxies.
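The syscall accounting behind the shared-ring design can be illustrated with a toy model. This is purely conceptual: `ToyRing` and `readiness_model_syscalls` are hypothetical names, nothing here talks to the kernel, and real io_uring is used through liburing or the raw syscalls.

```python
from collections import deque


class ToyRing:
    """Conceptual model of a submission/completion ring pair (not real io_uring)."""

    def __init__(self):
        self.sq = deque()   # submission queue: filled by the application
        self.cq = deque()   # completion queue: filled by the "kernel"
        self.syscalls = 0

    def submit(self, op, *args):
        # In io_uring this is a write into shared memory, not a syscall.
        self.sq.append((op, args))

    def enter(self):
        # Models one io_uring_enter(): the kernel consumes every queued entry.
        self.syscalls += 1
        while self.sq:
            op, args = self.sq.popleft()
            self.cq.append(op(*args))

    def reap(self):
        # Completions are also read from shared memory without a syscall.
        done = list(self.cq)
        self.cq.clear()
        return done


def readiness_model_syscalls(n_ops):
    # epoll-style loop: one epoll_wait() plus one read()/write() per ready socket.
    return 1 + n_ops


ring = ToyRing()
for i in range(64):
    ring.submit(pow, i, 2)   # stand-in for a read/write request
ring.enter()
results = ring.reap()
print(ring.syscalls, readiness_model_syscalls(64))  # prints "1 65"
```

The gap grows with the number of operations per loop iteration, which is why the benefit shows up mainly at very high connection counts.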

What is the main architectural advantage of io_uring over epoll?

Kernel Bypass: DPDK and RDMA

Kernel bypass skips the Linux network stack entirely. Instead of the kernel processing each packet, the application receives packets directly from the NIC through userspace poll-mode drivers. DPDK (Data Plane Development Kit) can achieve sub-microsecond latency and 100Gbps throughput on commodity hardware.

DPDK is used in network appliances (routers, firewalls), HFT (high-frequency trading), and 5G packet processing. AWS EFA (an OS-bypass interface with RDMA-style semantics) reduces ML training collective communication latency from roughly 50us to 1-2us - critical when synchronizing 10,000+ GPUs.
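A back-of-the-envelope calculation shows why the kernel stack becomes the limit at these speeds. The sketch assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap); `packet_budget` is a hypothetical helper.

```python
def packet_budget(link_gbps, frame_bytes=64, overhead_bytes=20):
    """Return (packets per second, nanoseconds per packet) at line rate."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    pps = link_gbps * 1e9 / bits_on_wire
    return pps, 1e9 / pps


pps, ns = packet_budget(100)
print(f"{pps / 1e6:.1f} Mpps, {ns:.2f} ns per packet")  # 148.8 Mpps, 6.72 ns per packet
```

At under 7 ns per packet there is no room for a syscall, an interrupt, or a buffer allocation per packet, which is the case for polling the NIC from user space and spreading the load across cores.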

For which workload is DPDK (kernel bypass) justified?

Summary

  • **Batching** amortizes syscall and network overhead: linger.ms in Kafka, bufferTime in RxJS. Trade-off is explicit: latency increases, throughput increases.
  • **Compression** is beneficial for compressible data (JSON and other text; protobuf to a lesser degree) on network-bound services. lz4/zstd for real-time, brotli/gzip for HTTP. Do not compress already-compressed data (JPEG, MP4).
  • **Zero-copy via sendfile()** eliminates user-space copying: disk -> kernel -> NIC without touching the JVM heap. io_uring goes further: it removes most syscall overhead via shared rings.

Related Topics

Batching and zero-copy optimizations work at all levels of the transport stack:

  • Kafka: Producer and Consumer — Kafka batching (batch.size, linger.ms) and sendfile for consumers - practical applications of these techniques
  • Transport Benchmarking — Measuring the effect of batching and compression requires proper benchmarking with P99 latency and throughput metrics

Questions for Reflection

  • How would you determine the optimal linger.ms for a Kafka producer in a specific application?
  • In which cases does io_uring provide no advantage over epoll?
  • How does batching affect latency percentiles (P50 vs P99) differently?

Related Lessons

  • net-67-latency-numbers