Parallel Computing

Asynchronous Programming

In 1999, Dan Kegel wrote 'The C10K Problem' - arguing that web servers needed to handle 10,000 concurrent connections and the existing thread-per-connection model could not scale there. The solution was event-driven I/O multiplexing with epoll. Today, the bar is C10M: 10 million concurrent connections. Nginx handles this at 10 GB RAM with a single master and 8 worker processes. CloudFlare serves 5+ million concurrent WebSocket connections per server. Every high-performance web server, database, and messaging system is built on async I/O fundamentals covered in this lesson.

**Node.js's event loop** enabled JavaScript to become a server-side language - V8 with libuv's async I/O lets a single Node.js process serve 80,000 HTTP requests per second, making it the foundation of Netflix's API gateway, PayPal's REST services, and LinkedIn's mobile backend.
**Nginx** uses epoll-based event loops with one worker per CPU core to serve over 500 million websites globally, handling 10,000+ concurrent connections per worker process at ~2MB memory each - enabling static file serving at 400,000 req/s on a 4-core server.
**Redis's single-threaded async architecture** uses epoll to handle 1 million ops/second without any locking or thread coordination - proving that for I/O-bound and memory-bound workloads, a single event loop with O(1) I/O multiplexing outperforms multi-threaded designs.

Async/Await and Futures

Async/await syntax transforms asynchronous code into sequential-looking code. An async function returns a Future (Rust), Promise (JavaScript), or Task (Python/C#). When the function awaits a Future, the executor suspends the current task and schedules other work while waiting for the I/O operation to complete, without blocking an OS thread. This enables thousands of concurrent I/O operations on a single thread.

Rust's async model is zero-overhead: async functions compile to state machines that hold all local variables across await points. No heap allocation or dynamic dispatch is needed for simple async functions - the state machine is allocated once on the stack or in a single heap allocation for spawned tasks. This makes Rust async faster than Go goroutines for CPU-bound concurrent code.

asyncio.gather() is concurrent, not parallel: all coroutines run on a single thread (unless using asyncio.to_thread or ProcessPoolExecutor). CPU-bound work inside async functions blocks the event loop - for CPU-bound tasks, use separate processes or threads. asyncio excels exclusively at I/O-bound concurrency.

When an async function awaits an I/O operation, what happens to the OS thread running it?

Event Loops

An event loop is the runtime that drives async code execution. It maintains a list of pending tasks and uses OS notification mechanisms (epoll on Linux, kqueue on macOS) to sleep until at least one I/O event is ready, then wakes up and runs all tasks whose awaited operation completed. The cycle: register interest -> sleep (epoll_wait) -> wake on event -> execute ready tasks -> repeat.

Node.js's single-threaded event loop (libuv) is the canonical example: all I/O is non-blocking, CPU-bound work blocks the loop. This design serves 10,000+ concurrent connections on a single thread for I/O-bound workloads (HTTP APIs, database queries), but a single CPU-bound request (image processing, crypto) blocks all others for its duration.

Python asyncio uses a single-threaded event loop by default. uvloop (libuv-based Cython replacement) is 2-4x faster than asyncio's pure Python event loop for high-throughput servers. FastAPI with uvicorn+uvloop handles 50,000+ req/s on a single core for simple JSON endpoints.

Why does CPU-bound work inside an async function block a Node.js or Python asyncio server for all other requests?

io_uring: Next-Generation I/O

io_uring (Jens Axboe, Linux 5.1, 2019) is a high-performance asynchronous I/O interface that replaces epoll for file and network I/O on Linux. It uses two lock-free ring buffers shared between user space and kernel: a Submission Queue (SQ) for submitting I/O requests and a Completion Queue (CQ) for receiving results. System calls are batched - a single io_uring_enter() can submit and complete hundreds of I/O operations.

io_uring eliminates syscall overhead that limits epoll performance: epoll requires a syscall per wait (epoll_wait) and per registration (epoll_ctl), plus separate read()/write() calls. io_uring submits read requests into the SQ ring, then collects completions from CQ - eliminating the read() syscall entirely. Nginx with io_uring achieves 2x throughput vs. epoll for small files.

io_uring supports fixed buffers (registered with the kernel once, reused without per-operation registration overhead) and kernel poll mode (the kernel proactively polls without any user-space wakeup for extremely low-latency applications). Tokio (Rust async runtime) switched to io_uring on Linux for file I/O in tokio-uring crate.

What is the primary performance advantage of io_uring over epoll?

epoll and Scalable I/O Multiplexing

epoll (Linux 2.6) solves the C10K problem (10,000 concurrent connections): the original select() and poll() system calls require scanning all file descriptors every wake-up, giving O(n) cost for n connections. epoll maintains a kernel-side data structure and returns only the ready file descriptors, giving O(k) cost where k is the number of ready events - enabling 1 million concurrent connections on a single server.

Edge-triggered (EPOLLET) vs. level-triggered (default EPOLLLT) epoll: level-triggered fires whenever the fd is ready (safe but may fire multiple times); edge-triggered fires once per state change (requires non-blocking reads until EAGAIN). Nginx and Redis use edge-triggered mode for maximum throughput, requiring careful non-blocking I/O discipline.

Nginx uses one epoll loop per worker process, with NGINX's worker_processes = number of CPU cores. This avoids thread synchronization while fully utilizing all CPU cores - an event loop architecture that serves 10,000+ req/s per core for static files, scaling linearly with CPU count.

Async/await makes programs faster by parallelizing code execution

Async/await makes I/O-bound programs more efficient by overlapping waiting time - the code still runs sequentially on one thread; parallelism requires multiple threads or processes

A single-threaded event loop serves 1 million HTTP connections not because it runs them in parallel but because it overlaps their I/O wait times - a connection waiting for a database query does not block the thread from serving another request that is ready to respond

Why does epoll scale to 1 million connections when select() cannot?

Key Ideas

**Async/await** transforms asynchronous I/O into sequential-looking code; the executor suspends tasks at await points and runs other ready tasks on the same thread - enabling thousands of concurrent I/O operations without threads.
**epoll** solves C10K by returning only ready file descriptors in O(k) time vs. select()'s O(n) scan - the foundation of Nginx, Redis, and every high-performance server on Linux.
**io_uring** eliminates per-operation syscall overhead via ring buffer batching, achieving 2.3x more IOPS than epoll for NVMe-class storage and 30-50% higher HTTP throughput for small files.

Вопросы для размышления

A Python FastAPI service is slow under load. Profiling shows 80% of CPU time is in image resizing inside an async endpoint. What is the root cause and what are the correct fixes?
Design an async HTTP client that makes 10,000 API requests with a rate limit of 100 req/s. What async primitives would be used to implement the rate limiting?
When would a blocking thread-per-connection server outperform an async event loop server, despite the async server's lower per-connection overhead?

Связанные уроки

os-03-threads

Parallel Computing

Asynchronous Programming

**Node.js's event loop** enabled JavaScript to become a server-side language - V8 with libuv's async I/O lets a single Node.js process serve 80,000 HTTP requests per second, making it the foundation of Netflix's API gateway, PayPal's REST services, and LinkedIn's mobile backend.
**Nginx** uses epoll-based event loops with one worker per CPU core to serve over 500 million websites globally, handling 10,000+ concurrent connections per worker process at ~2MB memory each - enabling static file serving at 400,000 req/s on a 4-core server.
**Redis's single-threaded async architecture** uses epoll to handle 1 million ops/second without any locking or thread coordination - proving that for I/O-bound and memory-bound workloads, a single event loop with O(1) I/O multiplexing outperforms multi-threaded designs.

Async/Await and Futures

When an async function awaits an I/O operation, what happens to the OS thread running it?

Event Loops

Why does CPU-bound work inside an async function block a Node.js or Python asyncio server for all other requests?

io_uring: Next-Generation I/O

What is the primary performance advantage of io_uring over epoll?

epoll and Scalable I/O Multiplexing

Async/await makes programs faster by parallelizing code execution

Async/await makes I/O-bound programs more efficient by overlapping waiting time - the code still runs sequentially on one thread; parallelism requires multiple threads or processes

Why does epoll scale to 1 million connections when select() cannot?

Key Ideas

**Async/await** transforms asynchronous I/O into sequential-looking code; the executor suspends tasks at await points and runs other ready tasks on the same thread - enabling thousands of concurrent I/O operations without threads.
**epoll** solves C10K by returning only ready file descriptors in O(k) time vs. select()'s O(n) scan - the foundation of Nginx, Redis, and every high-performance server on Linux.
**io_uring** eliminates per-operation syscall overhead via ring buffer batching, achieving 2.3x more IOPS than epoll for NVMe-class storage and 30-50% higher HTTP throughput for small files.

Вопросы для размышления

A Python FastAPI service is slow under load. Profiling shows 80% of CPU time is in image resizing inside an async endpoint. What is the root cause and what are the correct fixes?
Design an async HTTP client that makes 10,000 API requests with a rate limit of 100 req/s. What async primitives would be used to implement the rate limiting?
When would a blocking thread-per-connection server outperform an async event loop server, despite the async server's lower per-connection overhead?

Связанные уроки

os-03-threads

Asynchronous Programming

Async/Await and Futures

Event Loops

io_uring: Next-Generation I/O

epoll and Scalable I/O Multiplexing

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Asynchronous Programming

Async/Await and Futures

Event Loops

io_uring: Next-Generation I/O

epoll and Scalable I/O Multiplexing

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки