Parallel Computing

Asynchronous Programming

In 1999, Dan Kegel wrote 'The C10K Problem' - arguing that web servers needed to handle 10,000 concurrent connections and the existing thread-per-connection model could not scale there. The solution was event-driven I/O multiplexing with epoll. Today, the bar is C10M: 10 million concurrent connections. Nginx handles this at 10 GB RAM with a single master and 8 worker processes. CloudFlare serves 5+ million concurrent WebSocket connections per server. Every high-performance web server, database, and messaging system is built on async I/O fundamentals covered in this lesson.

  • **Node.js's event loop** enabled JavaScript to become a server-side language - V8 with libuv's async I/O lets a single Node.js process serve 80,000 HTTP requests per second, making it the foundation of Netflix's API gateway, PayPal's REST services, and LinkedIn's mobile backend.
  • **Nginx** uses epoll-based event loops with one worker per CPU core to serve over 500 million websites globally, handling 10,000+ concurrent connections per worker process at ~2MB memory each - enabling static file serving at 400,000 req/s on a 4-core server.
  • **Redis's single-threaded async architecture** uses epoll to handle 1 million ops/second without any locking or thread coordination - proving that for I/O-bound and memory-bound workloads, a single event loop with O(1) I/O multiplexing outperforms multi-threaded designs.

Async/Await and Futures

Async/await syntax transforms asynchronous code into sequential-looking code. An async function returns a Future (Rust), Promise (JavaScript), or Task (Python/C#). When the function awaits a Future, the executor suspends the current task and schedules other work while waiting for the I/O operation to complete, without blocking an OS thread. This enables thousands of concurrent I/O operations on a single thread.

Rust's async model is zero-overhead: async functions compile to state machines that hold all local variables across await points. No heap allocation or dynamic dispatch is needed for simple async functions - the state machine is allocated once on the stack or in a single heap allocation for spawned tasks. This makes Rust async faster than Go goroutines for CPU-bound concurrent code.

asyncio.gather() is concurrent, not parallel: all coroutines run on a single thread (unless using asyncio.to_thread or ProcessPoolExecutor). CPU-bound work inside async functions blocks the event loop - for CPU-bound tasks, use separate processes or threads. asyncio excels exclusively at I/O-bound concurrency.

When an async function awaits an I/O operation, what happens to the OS thread running it?

Event Loops

An event loop is the runtime that drives async code execution. It maintains a list of pending tasks and uses OS notification mechanisms (epoll on Linux, kqueue on macOS) to sleep until at least one I/O event is ready, then wakes up and runs all tasks whose awaited operation completed. The cycle: register interest -> sleep (epoll_wait) -> wake on event -> execute ready tasks -> repeat.

Node.js's single-threaded event loop (libuv) is the canonical example: all I/O is non-blocking, CPU-bound work blocks the loop. This design serves 10,000+ concurrent connections on a single thread for I/O-bound workloads (HTTP APIs, database queries), but a single CPU-bound request (image processing, crypto) blocks all others for its duration.

Python asyncio uses a single-threaded event loop by default. uvloop (libuv-based Cython replacement) is 2-4x faster than asyncio's pure Python event loop for high-throughput servers. FastAPI with uvicorn+uvloop handles 50,000+ req/s on a single core for simple JSON endpoints.

Why does CPU-bound work inside an async function block a Node.js or Python asyncio server for all other requests?

io_uring: Next-Generation I/O

io_uring (Jens Axboe, Linux 5.1, 2019) is a high-performance asynchronous I/O interface that replaces epoll for file and network I/O on Linux. It uses two lock-free ring buffers shared between user space and kernel: a Submission Queue (SQ) for submitting I/O requests and a Completion Queue (CQ) for receiving results. System calls are batched - a single io_uring_enter() can submit and complete hundreds of I/O operations.

io_uring eliminates syscall overhead that limits epoll performance: epoll requires a syscall per wait (epoll_wait) and per registration (epoll_ctl), plus separate read()/write() calls. io_uring submits read requests into the SQ ring, then collects completions from CQ - eliminating the read() syscall entirely. Nginx with io_uring achieves 2x throughput vs. epoll for small files.

io_uring supports fixed buffers (registered with the kernel once, reused without per-operation registration overhead) and kernel poll mode (the kernel proactively polls without any user-space wakeup for extremely low-latency applications). Tokio (Rust async runtime) switched to io_uring on Linux for file I/O in tokio-uring crate.

What is the primary performance advantage of io_uring over epoll?

epoll and Scalable I/O Multiplexing

epoll (Linux 2.6) solves the C10K problem (10,000 concurrent connections): the original select() and poll() system calls require scanning all file descriptors every wake-up, giving O(n) cost for n connections. epoll maintains a kernel-side data structure and returns only the ready file descriptors, giving O(k) cost where k is the number of ready events - enabling 1 million concurrent connections on a single server.

Edge-triggered (EPOLLET) vs. level-triggered (default EPOLLLT) epoll: level-triggered fires whenever the fd is ready (safe but may fire multiple times); edge-triggered fires once per state change (requires non-blocking reads until EAGAIN). Nginx and Redis use edge-triggered mode for maximum throughput, requiring careful non-blocking I/O discipline.

Nginx uses one epoll loop per worker process, with NGINX's worker_processes = number of CPU cores. This avoids thread synchronization while fully utilizing all CPU cores - an event loop architecture that serves 10,000+ req/s per core for static files, scaling linearly with CPU count.

Async/await makes programs faster by parallelizing code execution

Async/await makes I/O-bound programs more efficient by overlapping waiting time - the code still runs sequentially on one thread; parallelism requires multiple threads or processes

A single-threaded event loop serves 1 million HTTP connections not because it runs them in parallel but because it overlaps their I/O wait times - a connection waiting for a database query does not block the thread from serving another request that is ready to respond

Why does epoll scale to 1 million connections when select() cannot?

Key Ideas

  • **Async/await** transforms asynchronous I/O into sequential-looking code; the executor suspends tasks at await points and runs other ready tasks on the same thread - enabling thousands of concurrent I/O operations without threads.
  • **epoll** solves C10K by returning only ready file descriptors in O(k) time vs. select()'s O(n) scan - the foundation of Nginx, Redis, and every high-performance server on Linux.
  • **io_uring** eliminates per-operation syscall overhead via ring buffer batching, achieving 2.3x more IOPS than epoll for NVMe-class storage and 30-50% higher HTTP throughput for small files.

Related Topics

Async I/O connects to concurrency models and system-level programming:

  • CSP and Channels: Go — Go's goroutine scheduler implements async I/O internally via epoll/kqueue, hiding event loop complexity behind goroutine abstraction - each blocking goroutine maps to a non-blocking I/O operation under the hood
  • Concurrency at Scale — Backpressure and rate limiting in distributed systems use async primitives (semaphores, bounded channels) to prevent event loop saturation under bursty load

Вопросы для размышления

  • A Python FastAPI service is slow under load. Profiling shows 80% of CPU time is in image resizing inside an async endpoint. What is the root cause and what are the correct fixes?
  • Design an async HTTP client that makes 10,000 API requests with a rate limit of 100 req/s. What async primitives would be used to implement the rate limiting?
  • When would a blocking thread-per-connection server outperform an async event loop server, despite the async server's lower per-connection overhead?

Связанные уроки

  • os-03-threads
Asynchronous Programming

0

1

Sign In