Backend Transport

Transport Benchmarking: Latency vs Throughput

Twitter, 2009: the 'Fail Whale' appeared every few hours because the infrastructure could not handle the load. Engineers were watching average response times, and everything seemed fine. P99 told a different story: 1% of users got 10-second responses. Without percentile monitoring, the problem stayed invisible until complete failure.

  • **LinkedIn** publicly described how wrk2 benchmarks found a problem in a Kafka producer: P99.9 latency jumped from 2ms to 200ms because of an incorrectly set linger.ms. Plain wrk did not catch it.
  • **Google** requires a P99 SLO for every production service. Under its 'error budget' policy, each breach of the P99 target consumes the budget, and once the budget is exhausted the team must freeze new features until reliability recovers.
  • **Netflix** uses Gatling to load test every deployment. Chaos Engineering (Chaos Monkey) complements this by killing instances in production to verify resilience.

Latency vs Throughput: Different Metrics

Latency measures how long a single request takes. Throughput measures how many requests complete per second. They are related but not simply inversely proportional. At low utilization, adding load barely changes latency and can even lower it slightly (warm caches, connection reuse). Near saturation, requests start to queue, so latency climbs steeply while throughput plateaus.

The 'knee of the curve' (typically 60-80% utilization) is where latency starts growing non-linearly. Google's SRE book recommends capacity planning to keep utilization below 70% to maintain headroom for traffic spikes without P99 degradation.
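
The shape of that curve can be illustrated with a textbook M/M/1 queue. This is only a sketch under strong assumptions (a single server, random arrivals, exponentially distributed service times), and the 5 ms service time is a hypothetical value, not a measurement from any real system. Real services will show a different knee, but the same non-linear growth.

```python
import math


def mm1_mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


def mm1_p99_latency_ms(service_time_ms: float, utilization: float) -> float:
    """P99 time in system; in M/M/1 this time is exponentially distributed."""
    return mm1_mean_latency_ms(service_time_ms, utilization) * math.log(100.0)


SERVICE_TIME_MS = 5.0  # hypothetical mean service time, for illustration only

for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
    mean_ms = mm1_mean_latency_ms(SERVICE_TIME_MS, rho)
    p99_ms = mm1_p99_latency_ms(SERVICE_TIME_MS, rho)
    print(f"utilization={rho:.0%}  mean={mean_ms:6.1f} ms  p99={p99_ms:6.1f} ms")
```

In this model, latency scales with 1 / (1 - utilization): moving from 70% to 90% utilization triples it, and moving from 90% to 95% doubles it again, while throughput gains only a few percent.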

A system at 70% load has P99=15ms. At 90% - P99=300ms. What is the conclusion?

Percentiles: P50, P95, P99, P99.9

P99 latency means 99% of requests complete within that time. P99=500ms means 1 in 100 requests takes 500ms or longer. Mean latency hides tail latency: if 98 requests complete in 1ms and 2 take 1000ms, mean=20.98ms but P99=1000ms. Users experience P99, not the mean.
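
The gap between the mean and the tail is easy to check by hand. The sketch below uses the nearest-rank definition of a percentile, which is one common convention (monitoring tools may interpolate differently), and a made-up sample of 100 latencies. Note that with exactly 100 samples, nearest-rank P99 is the 99th smallest value, so at least two slow samples are needed before P99 itself moves.

```python
import math
import statistics


def percentile_nearest_rank(samples, p):
    """Smallest sample such that at least p percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical data: 98 fast requests and 2 slow ones.
latencies_ms = [1.0] * 98 + [1000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p99_ms = percentile_nearest_rank(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands near 21 ms, a value that no request actually experienced, while P50 and P99 describe the typical and the slow request directly.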

With 100 concurrent users each sending about one request per second, roughly one request every second falls into the slowest 1%. For a frequently visited service, that means a constant stream of complaints from about 1% of active users. Set SLOs on P99 or P99.9, never on the mean.

Mean latency = 10ms, P99 = 500ms. Which metric to use for the SLO?

Benchmarking Tools

wrk2 is the right tool for latency benchmarking: it generates load at a constant rate and corrects for coordinated omission. wrk (without the 2) uses closed-loop load, where each connection waits for a response before sending the next request, and this hides queuing latency. For gRPC, use ghz. For Kafka, use kafka-producer-perf-test.sh.

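Coordinated omission was identified by Gil Tene (Azul Systems). It occurs when a tool times each request from the moment it was actually sent rather than from the moment it should have been sent under the intended schedule, so requests held back behind a slow response look fast. Most APM systems (Datadog, New Relic) measure response time this way, which systematically understates P99 under load. wrk2 and HdrHistogram correct for this.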
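
A toy simulation makes the effect concrete. It is a sketch, not a model of how wrk or wrk2 are implemented: it assumes a single connection, a target rate of 100 req/s, a 1 ms service time, and a server that freezes for one second in the middle of the run. All of these numbers and the function names are hypothetical. Each request is timed twice: from when it was actually sent (naive) and from when it was scheduled to be sent (corrected).

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with >= p percent at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def simulate(n_requests, interval_ms, service_ms, stall_start_ms, stall_ms):
    """Single-connection client: the next request waits for the previous response."""
    naive, corrected = [], []
    stall_end_ms = stall_start_ms + stall_ms
    free_at_ms = 0  # time at which the connection becomes free again
    for i in range(n_requests):
        intended_ms = i * interval_ms           # when the schedule says to send
        sent_ms = max(intended_ms, free_at_ms)  # when the client can actually send
        if stall_start_ms <= sent_ms < stall_end_ms:
            done_ms = stall_end_ms + service_ms  # server is frozen until stall end
        else:
            done_ms = sent_ms + service_ms
        naive.append(done_ms - sent_ms)          # timed from actual send
        corrected.append(done_ms - intended_ms)  # timed from intended send
        free_at_ms = done_ms
    return naive, corrected


naive, corrected = simulate(
    n_requests=1000, interval_ms=10, service_ms=1,
    stall_start_ms=5000, stall_ms=1000,
)
print("naive P99 (ms):    ", percentile_nearest_rank(naive, 99))
print("corrected P99 (ms):", percentile_nearest_rank(corrected, 99))
```

In the naive series only the single request caught by the stall is slow, so P99 stays at the 1 ms service time. Measured against the intended schedule, every request that queued behind the stall is late, and P99 rises to several hundred milliseconds.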

wrk shows P99=5ms at 1000 req/s. wrk2 with the same rate shows P99=200ms. Why the difference?

Profiling Bottlenecks

After benchmarking identifies a P99 regression, the next step is finding the bottleneck. Flame graphs show how CPU time is distributed across call stacks. async-profiler (Java), py-spy (Python), and perf (Linux) generate flame graphs from running processes without a restart.

async-profiler is designed for low-overhead production use: it combines perf_events with AsyncGetCallTrace, which avoids safepoint bias. Be wary of traditional JVM sampling profilers in production unless you understand safepoint bias: they capture stacks only at JVM safepoints, so time spent in hot code between safepoints is attributed to the wrong place.

A flame graph shows 60% of time in JSON.parse(). The next step is:

Capacity Planning and SLO

Capacity planning determines how many instances are needed to serve peak load while maintaining the SLO. Formula: instances = ceil(peak_rps * headroom_factor / max_rps_per_instance) + 1, where headroom_factor >= 1.3 (30% buffer) and the +1 is one extra instance for failover.
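
The calculation can be written as a small helper. This sketch assumes the headroom is applied to peak traffic before rounding up and that the failover instance is added at the end. The function name and the sample numbers are hypothetical.

```python
import math


def instances_needed(peak_rps, max_rps_per_instance,
                     headroom_factor=1.3, failover_spares=1):
    """Instances required to serve peak load with headroom plus failover spares."""
    if peak_rps < 0 or max_rps_per_instance <= 0:
        raise ValueError("peak_rps must be >= 0 and max_rps_per_instance > 0")
    if headroom_factor < 1.0:
        raise ValueError("headroom_factor must be >= 1.0")
    required = peak_rps * headroom_factor / max_rps_per_instance
    # Round away floating-point noise so exact multiples are not bumped up by one.
    serving = math.ceil(round(required, 9))
    return serving + failover_spares


# Hypothetical service: each instance holds its SLO up to 450 req/s,
# and the expected peak is 2000 req/s.
print(instances_needed(peak_rps=2000, max_rps_per_instance=450))
```

Here max_rps_per_instance should be the highest rate at which one instance still meets its P99 target in a benchmark, not the rate at which it stops responding.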

Google's SRE practice defines error budgets: the SLO leaves a small allowance for failures, each breach of the P99 target consumes part of it, and once the budget is exhausted the team freezes new features until it recovers. This gives engineers an incentive to maintain reliability alongside feature velocity.
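
As a rough illustration of the bookkeeping, the sketch below assumes a request-based SLO of the form "99% of requests complete within the latency threshold over the window". Under that assumption the error budget is the 1% of requests that are allowed to be slow. The function names, the freeze rule, and the request counts are hypothetical and only show the arithmetic, not any particular team's policy.

```python
def error_budget_remaining(total_requests, slow_requests, slo_target=0.99):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be in (0, 1)")
    allowed_slow = total_requests * (1.0 - slo_target)
    return 1.0 - slow_requests / allowed_slow


def should_freeze_features(total_requests, slow_requests, slo_target=0.99):
    """Hypothetical policy: freeze once the budget is fully spent."""
    return error_budget_remaining(total_requests, slow_requests, slo_target) <= 0.0


# Hypothetical window: 1,000,000 requests, so 10,000 may exceed the threshold.
print(error_budget_remaining(1_000_000, 4_000))   # about 60% of the budget left
print(should_freeze_features(1_000_000, 12_000))  # budget overspent
```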

Misconception: mean latency is sufficient for monitoring; if the average is good, the system is working normally.

Reality: the mean hides tail latency. P99=500ms with mean=10ms is a real problem for 1% of requests.

With 100 concurrent users each sending about one request per second, roughly one request every second falls into that slow 1%. For frequently visited services this means constant complaints from about 1% of active users.

An instance maintains P99 < 100ms up to 700 req/s. Expected peak: 3500 req/s. How many instances are needed?

Summary

  • **Percentiles, not mean**: P99 shows the experience of the worst 1% of users - this is the foundation of SLO. Mean hides tail latency.
  • **wrk2, not wrk**: benchmarking at a fixed request rate corrects for coordinated omission and shows real percentiles.
  • **Utilization < 70%**: keep load below the knee of the curve, because above roughly 70% latency grows non-linearly. Plan capacity for peak load with 30% headroom (peak * 1.3).

Related Topics

Benchmarking measures the effectiveness of optimizations from other lessons:

  • Batching and Zero-Copy — Benchmarking shows the measurable effect of batching and compression - optimizations cannot be confirmed without measurement
  • Distributed Tracing — Tracing complements benchmarking: benchmarks show aggregate latency, traces show exactly where time is spent

Questions for Reflection

  • How do you choose the right workload for a benchmark: a synthetic test or a replay of production traffic?
  • How does coordinated omission affect results in most production APM systems?
  • Under what conditions does adding servers not improve P99 latency?

Related Lessons

  • net-67-latency-numbers