Real-Time Backend

Benchmarking

Discord holds millions of concurrent WebSocket connections. How do its engineers know the system will survive the next 10x growth, and where it will break first?

  • A k6 WebSocket test showed that p99 latency grew from 8ms to 1.2s at 5000 concurrent connections. The bottleneck was synchronous JSON.stringify on large payloads
  • TechEmpower Framework Benchmarks helped Discord pick Elixir/Phoenix Channels as the foundation for the realtime layer: p99 on multi-query was more stable than Node.js at the same resource level
  • uWebSockets.js claims 15M req/s, but a flame graph from a real chat showed 60% of CPU going to business logic and DB, not the WS transport
  • An Artillery.io scenario with Socket.io reproduced a production incident: at 2000 concurrent users a memory leak in an event listener built up 500MB/hour. Without load testing it would only have been spotted in production

C10K and C1M

C10K is the problem of serving 10,000 simultaneous connections on a single server. In 1999, Dan Kegel framed it as the engineering wall of the era: most servers of the time relied on the thread-per-connection model and collapsed at 1-2K connections. The shift to event-driven I/O (epoll, kqueue) removed that limit. Today C10K is considered solved.

C1M is the next bar: 1 million concurrent connections. uWebSockets.js claims 15M req/s on synthetic tests; production systems (Slack, Discord) hold millions of WebSocket connections through horizontal scaling and sticky sessions. One Node.js process realistically handles 100K-300K idle WS connections before running into memory limits (~32 KB per connection).
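The memory bound mentioned above can be checked with simple arithmetic. The sketch below assumes the ~32 KB per idle connection figure from the text; the function names are hypothetical, and real per-connection cost depends on the WS library, buffers, and application state.

```typescript
// Assumed figure from the text: ~32 KB of memory per idle WebSocket connection.
const BYTES_PER_IDLE_CONNECTION = 32 * 1024;
const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Estimate the memory (in GiB) needed to hold `connections` idle sockets.
function estimateIdleMemoryGiB(
  connections: number,
  bytesPerConnection: number = BYTES_PER_IDLE_CONNECTION,
): number {
  return (connections * bytesPerConnection) / BYTES_PER_GIB;
}

// Estimate how many idle connections fit into a memory budget.
function maxIdleConnections(
  budgetGiB: number,
  bytesPerConnection: number = BYTES_PER_IDLE_CONNECTION,
): number {
  return Math.floor((budgetGiB * BYTES_PER_GIB) / bytesPerConnection);
}

[100000, 300000, 1000000].forEach((n) => {
  console.log(`${n} connections: ~${estimateIdleMemoryGiB(n).toFixed(2)} GiB`);
});
```

Under this assumption 100K connections need about 3 GiB and 1M connections about 30 GiB, which is consistent with the text's point that C1M is usually reached by spreading connections across processes and nodes.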

  • C10K (1999): solved by event-loop + epoll/kqueue; relevant only for outdated blocking servers
  • C100K: achievable on a single node with tuned Linux (ulimit, tcp_tw_reuse, SO_REUSEPORT)
  • C1M: horizontal scaling; a single process is bounded by memory and file descriptors

Why is C10K no longer a problem for modern servers?

    Latency percentiles

    Average response time (avg) is a poor primary metric for realtime systems because it hides the distribution tails. If 99% of requests take about 5 ms and 1% take about 800 ms, the average comes out at roughly 13 ms and looks fine, until 1% of users start complaining about lag. In production you look at p99 (the 99th percentile) and p99.9 (the 99.9th percentile). They describe the slowest experience that 1 in 100 and 1 in 1,000 requests actually get.
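To make the gap between the average and the tail concrete, here is a minimal sketch that computes the mean and nearest-rank percentiles over a hypothetical sample set (985 fast responses at 5 ms, 15 slow ones at 800 ms). The distribution and function names are assumptions for illustration; load testing tools may use interpolated or histogram-based percentiles instead of nearest rank.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[Math.min(rank, sorted.length) - 1]!;
}

// Arithmetic mean of the samples.
function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical distribution: 1.5% of requests hit a slow path.
const latenciesMs: number[] = [
  ...Array.from({ length: 985 }, () => 5),
  ...Array.from({ length: 15 }, () => 800),
];

console.log("avg  :", mean(latenciesMs).toFixed(1), "ms");
console.log("p50  :", percentile(latenciesMs, 50), "ms");
console.log("p99  :", percentile(latenciesMs, 99), "ms");
console.log("p99.9:", percentile(latenciesMs, 99.9), "ms");
```

The average stays under 20 ms while p99 sits at the slow-path value, which is exactly the situation the average hides.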

    TechEmpower Framework Benchmarks publishes p99 numbers for hundreds of frameworks under standardized conditions. uWebSockets.js consistently lands in the top 5 by throughput and p99 among Node/JS environments. Comparing numbers produced by different tools directly is misleading: wrk2 and k6 use different load models, so their percentiles are not measured the same way.

    For WebSocket systems, latency is measured as round-trip time: from the moment the client sends a message until it receives the response. k6 lets you write scenarios with socket.send/socket.on inside ws.connect and aggregates percentiles over the metrics it collects; message round-trip time typically has to be recorded as a custom metric. Artillery.io is built for Socket.io and HTTP/2 and is convenient for mixed scenarios with think-time between requests.
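Round-trip measurement comes down to matching each response to the message that caused it. The sketch below is tool-independent and assumes the protocol carries a message id that the server echoes back; the class name and the injectable clock are hypothetical and exist only to keep the logic testable.

```typescript
// Tracks round-trip time per message id. The clock is injectable so the
// logic can be exercised without real timers or sockets.
class RoundTripTracker {
  private readonly sentAt = new Map<string, number>();
  private readonly now: () => number;
  readonly samplesMs: number[] = [];

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call right before the client sends a message.
  markSent(id: string): void {
    this.sentAt.set(id, this.now());
  }

  // Call when the matching response arrives. Returns the RTT in ms,
  // or undefined if the id is unknown (duplicate or unsolicited message).
  markReceived(id: string): number | undefined {
    const start = this.sentAt.get(id);
    if (start === undefined) return undefined;
    this.sentAt.delete(id);
    const rtt = this.now() - start;
    this.samplesMs.push(rtt);
    return rtt;
  }

  // Messages still waiting for a response (lost or timed out so far).
  get pending(): number {
    return this.sentAt.size;
  }
}

// Usage with a fake clock standing in for real socket events.
let fakeNowMs = 0;
const tracker = new RoundTripTracker(() => fakeNowMs);
tracker.markSent("m1");
fakeNowMs = 12;
console.log("rtt:", tracker.markReceived("m1"), "ms, pending:", tracker.pending);
```

In a real test the two calls would sit in the send path and in the message handler, and `samplesMs` would feed the percentile calculation. Messages that never get a response stay in `pending` and should be counted as errors rather than silently dropped from the latency data.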

    A service reports avg latency 10 ms and p99 latency 950 ms. What does this mean?

    Flame graphs

    A flame graph visualizes a CPU profile: the X axis is the share of total samples (sorted for merging, not chronological), the Y axis is call stack depth. Wide rectangles at the top are the functions that were on-CPU most often. Brendan Gregg created the format in 2011 to diagnose production performance issues without stopping the service, and later continued that work at Netflix.
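The data behind a flame graph is usually a list of "folded" stacks: semicolon-separated frames followed by a sample count. The sketch below, with hypothetical frame names and counts, computes each leaf function's share of samples, which corresponds to the width of the top-edge rectangles.

```typescript
// Parse folded stacks ("root;child;leaf count") and return, for each leaf
// function, its share of all samples (its self time on CPU).
function selfTimeShare(foldedStacks: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  let total = 0;
  for (const line of foldedStacks) {
    const sep = line.lastIndexOf(" ");
    if (sep < 0) continue; // no sample count on this line
    const count = Number(line.slice(sep + 1));
    if (!Number.isFinite(count) || count <= 0) continue;
    const frames = line.slice(0, sep).split(";");
    const leaf = frames[frames.length - 1] ?? "";
    counts.set(leaf, (counts.get(leaf) ?? 0) + count);
    total += count;
  }
  const shares = new Map<string, number>();
  counts.forEach((count, fn) => shares.set(fn, count / total));
  return shares;
}

// Hypothetical profile of a chat server under load.
const profile = [
  "main;onMessage;broadcast;JSON.stringify 40",
  "main;onMessage;saveMessage;db.query 35",
  "main;onMessage;validate 15",
  "main;gc 10",
];

selfTimeShare(profile).forEach((share, fn) => {
  console.log(`${fn}: ${(share * 100).toFixed(0)}%`);
});
```

Sorting the result by share gives the same ranking you would read off the widest top frames of the rendered graph.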

    On a realtime server's flame graph, look for: (1) wide JSON.parse/stringify blocks, a sign of excessive serialization; (2) synchronous crypto calls, which block the event loop; (3) unexpected GC frames, which call for --max-old-space-size tuning and fewer allocations; (4) wide blocks in libuv/poll, which are normal for an idle WS server.

    For WebSocket servers under load, an async flame graph is useful (collected via async_hooks). It shows the full chain from incoming message to response, including I/O waits. Tools: clinic.js flame (automatic) or 0x with --kernel-tracing for system calls.

    On a Node.js server's flame graph, a wide JSON.stringify block takes 40% of CPU. What is the conclusion?

    Load testing

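    Load testing tools for realtime systems differ in supported protocols and load model. wrk2 is the gold standard for HTTP: it holds a constant request rate and measures latency from the intended send time, which avoids coordinated omission and yields correct percentiles. Plain wrk sends the next request on a connection only after the previous response arrives, so server stalls are under-reported. k6 is script-based and supports WebSocket via its ws module; its summary reports p(90) and p(95) by default, and p(99) can be added through the summaryTrendStats option. Artillery.io specializes in Socket.io and HTTP/2 and is convenient for stateful realtime scenarios.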
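The difference between the two load models is easiest to see in a small simulation. The sketch below is a simplified single-connection model, not the actual wrk or wrk2 implementation: it compares latencies recorded by a closed loop (next request only after the previous response) with latencies measured from the intended send time at a constant rate. Function names and service times are hypothetical.

```typescript
// Closed loop: each request starts only after the previous one completes,
// so the recorded latency is just the service time of that request.
function closedLoopLatencies(serviceTimesMs: number[]): number[] {
  return [...serviceTimesMs];
}

// Constant rate: request i is intended to start at i * intervalMs.
// Latency is measured from the intended start, so time spent queued
// behind a stalled request is included.
function constantRateLatencies(serviceTimesMs: number[], intervalMs: number): number[] {
  const latencies: number[] = [];
  let connectionFreeAt = 0;
  serviceTimesMs.forEach((serviceMs, i) => {
    const intendedStart = i * intervalMs;
    const actualStart = Math.max(intendedStart, connectionFreeAt);
    connectionFreeAt = actualStart + serviceMs;
    latencies.push(connectionFreeAt - intendedStart);
  });
  return latencies;
}

// One 100 ms stall among otherwise fast responses, 10 ms between requests.
const serviceTimes = [1, 1, 100, 1, 1];
console.log("closed loop  :", closedLoopLatencies(serviceTimes));
console.log("constant rate:", constantRateLatencies(serviceTimes, 10));
```

The closed loop records a single slow request, while the constant-rate view shows that the requests scheduled during the stall were delayed too. This is why percentiles from the two models should not be compared directly.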

    TechEmpower Framework Benchmarks is the industry standard for comparing frameworks: it tests JSON serialization, single DB query, multiple DB queries, plaintext. uWebSockets.js consistently shows 15M+ req/s on plaintext, but that is synthetic. Under real conditions (DB, auth, business logic) the numbers drop by an order of magnitude. Benchmarks help pick a foundation but do not replace profiling under real load.

    1. Define success metrics BEFORE the test: p99 < X ms, throughput > Y req/s, error rate < 0.1%
    2. Warm up the system (warm-up phase): JIT and connection pools must stabilize
    3. Test with a realistic payload: message size, frequency, session patterns
    4. Collect a flame graph during load, not after, not in isolation
    5. Check p99 and p99.9, not just avg/p50
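Step 1 of the checklist above can be written down as an explicit pass/fail check that runs after every test. This is a minimal sketch: the type names, field names, and threshold values are hypothetical and would come from your own requirements.

```typescript
// Measured outcome of one load test run.
interface LoadTestResult {
  p99Ms: number;
  throughputRps: number;
  errorRate: number; // fraction of failed requests, 0.001 = 0.1%
}

// Success criteria agreed on before the test is run.
interface SuccessCriteria {
  maxP99Ms: number;
  minThroughputRps: number;
  maxErrorRate: number;
}

// Returns a description of every violated criterion; empty means the run passed.
function findViolations(result: LoadTestResult, criteria: SuccessCriteria): string[] {
  const violations: string[] = [];
  if (result.p99Ms >= criteria.maxP99Ms) {
    violations.push(`p99 ${result.p99Ms} ms is not below ${criteria.maxP99Ms} ms`);
  }
  if (result.throughputRps <= criteria.minThroughputRps) {
    violations.push(`throughput ${result.throughputRps} req/s is not above ${criteria.minThroughputRps} req/s`);
  }
  if (result.errorRate >= criteria.maxErrorRate) {
    violations.push(`error rate ${result.errorRate} is not below ${criteria.maxErrorRate}`);
  }
  return violations;
}

// Hypothetical targets and a hypothetical run.
const criteria: SuccessCriteria = { maxP99Ms: 200, minThroughputRps: 10000, maxErrorRate: 0.001 };
const run: LoadTestResult = { p99Ms: 250, throughputRps: 12000, errorRate: 0.0005 };
console.log(findViolations(run, criteria));
```

Fixing the thresholds in code before the run keeps the team from adjusting the definition of success after seeing the numbers.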

    Misconception: high throughput in a synthetic benchmark means the system is ready for production load

    Synthetic benchmarks (TechEmpower, uWebSockets.js 15M req/s) test isolated components without real business logic, DB, or auth. Production throughput is usually an order of magnitude lower.

    Synthetics exclude the most expensive operations: DB queries, complex object serialization, JWT validation, inter-service calls. A benchmark is for comparing frameworks under equal conditions, not for forecasting production performance.

    A team is choosing between wrk and k6 for load testing a WebSocket API. Which is preferable?

    Key takeaways

    • C10K is solved by event-driven I/O; C1M requires horizontal scaling. One Node.js process realistically holds 100K-300K idle WS connections
    • Watch p99 and p99.9, not avg: the distribution tails define real user experience in realtime systems
    • A flame graph under load reveals the CPU bottleneck. JSON.stringify, synchronous crypto, and GC pauses are the main optimization candidates
    • wrk2 for HTTP with constant throughput, k6 for WebSocket, Artillery.io for Socket.io scenarios. Pick the tool that matches the protocol

    Related topics

    Benchmarking intersects with several key areas of realtime architecture:

    • WebSocket scaling — C1M connections require horizontal scaling of WS servers and sticky sessions
    • Event Loop and non-blocking I/O — Flame graphs on a Node.js server directly reflect event-loop efficiency and the absence of blocking operations
    • Protocol selection — Benchmark results (TechEmpower) are one of the criteria for choosing between HTTP/2, WebSocket, and gRPC

    Questions for reflection

    • Which load testing metrics matter most for your system: throughput, p99 latency, or concurrent connection count?
    • How could a flame graph change your approach to optimization compared to code review or intuition?
    • If a synthetic benchmark shows 10x better numbers than production profiling, what does that say about the system's architecture?

    Related lessons

    • alg-01-big-o