Operating Systems
Advanced I/O
Nginx can hold 100,000 concurrent connections on a single server. PostgreSQL can sustain tens of thousands of transactions per second. Redis answers requests with sub-millisecond latency. A large part of the secret is advanced I/O techniques: async I/O, zero-copy, direct I/O. This is the difference between an application that chokes at 1000 RPS and one that keeps scaling.
- **High-performance servers:** sendfile() for static files is a foundation of web servers and CDNs (Nginx, lighttpd). io_uring support in such servers is still largely experimental, but benchmarks report noticeably less CPU for the same throughput.
- **Databases:** MySQL InnoDB is commonly configured with O_DIRECT for data files, and PostgreSQL can use O_DIRECT for WAL in some configurations. This avoids double buffering (page cache + buffer pool) and gives more predictable latency.
- **Streaming and Big Data:** Kafka, for example, uses zero-copy (sendfile) to move data from on-disk log segments to consumer sockets. Savings: up to 70% CPU under high loads.
- **Cloud Storage:** storage systems such as MinIO and Ceph rely on Direct I/O and asynchronous I/O (libaio or io_uring) to get the most out of NVMe SSDs. Reaching 1M IOPS on a single server depends on doing I/O properly.
Lesson Objectives
- Distinguish select/poll/epoll/kqueue and their complexity in the number of FDs
- io_uring (Linux 5.1, 2019): shared ring buffers, batching, near-zero syscalls on the hot path (zero with SQPOLL)
- Zero-copy: sendfile, splice, MSG_ZEROCOPY for network transfer
- Direct I/O (O_DIRECT): bypassing the page cache, 512B/4KB alignment, DB use cases
- Apply async I/O in high-throughput servers (Nginx, Redis, ScyllaDB)
Asynchronous I/O
**Asynchronous I/O** (async I/O) allows an application to continue execution while the I/O operation is processed in the background. Unlike blocking I/O, the thread does not get stuck waiting for the operation to complete.
**Three I/O models:**
- **Blocking I/O** - the thread is blocked until the operation completes
- **Non-blocking I/O** - the call returns immediately, and the application must poll or wait for readiness
- **Async I/O** - the operation runs in the background, and the application is notified on completion
**Linux AIO (Asynchronous I/O)** - the older asynchronous I/O interface in Linux, driven by the `io_setup()`, `io_submit()`, and `io_getevents()` system calls. Its limitations: it is reliably asynchronous only for files opened with O_DIRECT, the API is awkward, and some operations can still block inside `io_submit()`.
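The difference between the blocking and non-blocking models can be seen with a few lines of Python. This is a minimal sketch, assuming a local `socketpair()` as a stand-in for a real network connection; the variable names are illustrative only.

```python
import socket

# A connected pair of sockets stands in for a client/server connection.
reader, writer = socket.socketpair()

# Non-blocking mode: recv() returns immediately instead of waiting for data.
reader.setblocking(False)
try:
    reader.recv(1024)
    would_block = False
except BlockingIOError:
    # No data yet: the caller must retry later (polling) or use a readiness API.
    would_block = True

writer.sendall(b"ping")
data = reader.recv(1024)  # data is already queued, so this returns at once

reader.close()
writer.close()
```

In blocking mode the first `recv()` would simply have parked the thread until `sendall()` ran, which in a single-threaded program would never happen.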
Why async I/O
**Web server with 10,000 connections:**
- Blocking I/O: one thread per connection means 10,000 threads, with heavy memory and context-switch overhead
- Non-blocking + epoll: one thread handles all connections, but every read and write is still a separate system call
- Async I/O (io_uring): submit requests in batches and reap results in batches, with minimal system calls
**Problem with Linux AIO:** buffered operations (those that go through the page cache) can still block inside `io_submit()`. Asynchronous behavior is dependable only for Direct I/O (O_DIRECT), which is inconvenient for many applications.
**Applications of async I/O:**
- **Databases** - parallel disk queries
- **Web servers** - handling thousands of connections with one thread
- **File servers** - simultaneous handling of multiple file operations
- **High-frequency trading** - minimizing latency is critical
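The "one thread, many connections" model can be sketched with Python's `selectors` module, which picks epoll on Linux and kqueue on BSD/macOS. The eight socket pairs below are an assumption standing in for client connections; a real server would register accepted sockets instead.

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

# Eight socket pairs stand in for eight client connections.
pairs = [socket.socketpair() for _ in range(8)]
for i, (server_side, _client) in enumerate(pairs):
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, data=i)

# Only three "clients" actually send something.
for i in (1, 4, 6):
    pairs[i][1].sendall(f"hello from {i}".encode())

received = {}
# One select() call reports every ready connection; idle ones are not returned.
for key, _events in sel.select(timeout=1.0):
    received[key.data] = key.fileobj.recv(1024).decode()

sel.close()
for a, b in pairs:
    a.close()
    b.close()
```

Note that each ready socket still needs its own `recv()` call, which is exactly the per-operation system call cost that batching interfaces aim to remove.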
What is the key advantage of async I/O over non-blocking I/O with epoll?
io_uring - Revolution in Linux I/O
**io_uring** - a modern asynchronous I/O mechanism in Linux (since kernel 5.1, 2019). It is a complete overhaul of the I/O approach: instead of separate system calls, **ring buffers** in shared memory between the kernel and application are used.
**Revolutionary features of io_uring:**
- **Zero system calls** in the hot path (with SQPOLL) - the application and kernel communicate through shared memory
- **Batch operations** - submit many requests at once, receive many results
- **Many operation types are asynchronous** - not only read/write, but also accept, connect, fsync, even openat (support has grown across kernel versions)
- **Linked operations** - operations can be chained without returning to userspace
**SQPOLL mode** - the kernel starts a dedicated thread that polls the Submission Queue. While that thread is awake, the application submits work just by writing to shared memory, with no system call. After an idle period the thread goes to sleep and has to be woken with `io_uring_enter()`, and the polling itself costs CPU time.
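The ring-buffer idea can be illustrated with a toy model. This is not io_uring and makes no kernel calls: `Ring`, `ToyUring`, `prepare`, and `enter` are hypothetical names, and the "kernel" is simply a method that drains the submission queue. The sketch only shows how free-running head/tail counters work and why one `enter()` can cover a whole batch.

```python
class Ring:
    """Single-producer/single-consumer ring with free-running head/tail counters."""

    def __init__(self, size):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            raise BufferError("ring full")
        self.slots[self.tail & self.mask] = item
        self.tail += 1

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


class ToyUring:
    def __init__(self, entries=8):
        self.sq = Ring(entries)      # submission queue: app -> "kernel"
        self.cq = Ring(entries * 2)  # completion queue: "kernel" -> app
        self.enter_calls = 0         # stands in for io_uring_enter() syscalls

    def prepare(self, user_data, func, *args):
        self.sq.push((user_data, func, args))  # plain memory write, no syscall

    def enter(self):
        """Pretend kernel side: drain the SQ and post one completion per request."""
        self.enter_calls += 1
        while (sqe := self.sq.pop()) is not None:
            user_data, func, args = sqe
            self.cq.push((user_data, func(*args)))

    def completions(self):
        while (cqe := self.cq.pop()) is not None:
            yield cqe


ring = ToyUring()
for i in range(5):
    ring.prepare(i, pow, 2, i)  # five "requests" queued in the SQ
ring.enter()                     # one call submits the whole batch
results = dict(ring.completions())
```

With SQPOLL, the role of `enter()` is played by a kernel thread that watches the SQ tail, so the application side would consist only of `prepare()` and reading completions.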
Real-world example
**Nginx with io_uring (experimental patches, not part of mainline Nginx):** Benchmarks reported a performance increase of up to 30-40% under high loads (100k+ req/s), attributed to:
- Fewer context switches
- Batch request processing
- Fewer system calls (from ~4 per request to ~1)
**Linked operations** - operations can be chained: open file → read → close. Each step starts only after the previous one completes, the whole chain runs in the kernel without returning to userspace between operations, and if one step fails the rest of the chain is cancelled.
**io_uring today:**
- **RocksDB** - can use io_uring for asynchronous and batched reads
- **ScyllaDB** - built on the Seastar framework, which was designed around asynchronous I/O (originally Linux AIO) and later gained an io_uring backend
- **QEMU** - virtualization with io_uring as a disk I/O backend
- **liburing** - the helper library maintained by the author of io_uring (Jens Axboe)
Why can io_uring operate without system calls in the hot path?
Zero-Copy I/O
**Zero-copy** - techniques for transferring data between files, sockets, and applications without copying to userspace. Regular path: disk → kernel buffer → user buffer → kernel socket buffer → network. Zero-copy: disk → kernel buffer → network.
**sendfile()** - a system call for zero-copy file transfer to a socket. Data is transferred within the kernel, bypassing userspace. Ideal for web servers serving static files.
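A minimal sketch of `sendfile()` from Python, assuming a Linux host. A temporary file plays the static file and a local `socketpair()` plays the client connection; both are stand-ins chosen to keep the example self-contained.

```python
import os
import socket
import tempfile

payload = b"static file contents\n" * 1000  # about 21 KB, fits in the socket buffer

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()

    receiver, sender = socket.socketpair()  # stands in for a client connection
    sent = 0
    while sent < len(payload):
        # The kernel moves bytes from the file to the socket; they never
        # pass through a userspace buffer of this process.
        sent += os.sendfile(sender.fileno(), f.fileno(), sent, len(payload) - sent)
    sender.close()

    chunks = []
    while chunk := receiver.recv(65536):
        chunks.append(chunk)
    receiver.close()

received = b"".join(chunks)
```

The loop matters: `sendfile()` may transfer fewer bytes than requested, for example when the socket buffer fills up, so a real server tracks the offset and retries.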
Web servers
**Nginx: static file 1 MB**
- Without sendfile: a read() + write() loop with a 2 KB buffer makes ~1000 system calls (fewer with a larger buffer) and copies the data twice, ~2 MB in total
- With sendfile: as little as 1 call, 0 copies in userspace
Performance gain: up to 70% under high loads.
**splice()** - a more flexible zero-copy mechanism. It moves data between two file descriptors, at least one of which must be a pipe. Complex data pipelines can be built within the kernel.
**mmap() + write()** - an alternative that reduces copying without being strictly zero-copy. Map the file into the process's memory (memory-mapped I/O), then write to the socket. Data is copied only once: page cache → socket buffer.
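The system call count for the read() + write() path depends on the buffer size, which a short calculation makes concrete. This is back-of-the-envelope arithmetic, not a measurement; `read_write_cost` is a hypothetical helper, and the final read() that returns 0 at end of file is ignored.

```python
import math


def read_write_cost(file_size, buf_size):
    """Syscalls and bytes copied by the CPU for a read()+write() loop."""
    iterations = math.ceil(file_size / buf_size)
    syscalls = 2 * iterations  # one read() and one write() per chunk
    copied = 2 * file_size     # page cache -> user buffer -> socket buffer
    return syscalls, copied


MiB = 1024 * 1024
for buf in (1024, 2048, 4096, 65536):
    calls, copied = read_write_cost(MiB, buf)
    print(f"buffer {buf:>6} B: {calls:>5} syscalls, {copied // MiB} MiB copied")
```

A 2 KiB buffer gives 1024 calls for a 1 MiB file and a 64 KiB buffer gives 32, but the amount of data copied through userspace stays at twice the file size either way.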
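A sketch of a file → pipe → file pipeline with `splice()`, assuming Linux and Python 3.10 or newer (where `os.splice` is available). On older Python versions the example falls back to an ordinary copy so that it still runs; that fallback is not zero-copy.

```python
import os
import tempfile

payload = os.urandom(16 * 1024)  # fits comfortably in a default 64 KiB pipe

with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
    src.write(payload)
    src.flush()
    pipe_r, pipe_w = os.pipe()
    try:
        if hasattr(os, "splice"):  # Python 3.10+ on Linux
            offset = 0
            while offset < len(payload):
                # file -> pipe: data is queued in the pipe without visiting userspace
                n = os.splice(src.fileno(), pipe_w, len(payload) - offset, offset)
                if n == 0:
                    break
                offset += n
                # pipe -> file: drain exactly what was just queued
                drained = 0
                while drained < n:
                    drained += os.splice(pipe_r, dst.fileno(), n - drained)
        else:
            # Fallback so the sketch still runs: an ordinary copy via userspace.
            os.write(dst.fileno(), os.pread(src.fileno(), len(payload), 0))
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
    result = os.pread(dst.fileno(), len(payload), 0)
```

The pipe acts as an in-kernel staging buffer, which is why each chunk is drained before the next one is queued.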
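The mmap-then-send approach might look like this in Python. It is a sketch under the same assumptions as before: a temporary file as the data source and a local `socketpair()` in place of a network client.

```python
import mmap
import socket
import tempfile

payload = b"0123456789abcdef" * 1024  # 16 KiB

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()
    receiver, sender = socket.socketpair()
    # Map the file read-only: the mapping exposes page-cache pages directly.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The socket reads straight from the mapped pages, so the only CPU
        # copy is page cache -> socket buffer.
        sender.sendall(mm)
    sender.close()

    chunks = []
    while chunk := receiver.recv(65536):
        chunks.append(chunk)
    receiver.close()

received = b"".join(chunks)
```

Compared with `sendfile()`, this keeps one copy but lets the application inspect or slice the data before sending it.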
**Scatter-Gather I/O (vectored I/O):** readv()/writev() allow reading/writing to multiple buffers with one call. Combined with zero-copy, efficient pipelines can be built without unnecessary copying.
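Vectored I/O is available in Python as `os.writev()` and `os.readv()`. The sketch below sends a header and a body from two separate buffers in one call and reads them back into two separate buffers; the HTTP-style header is only an illustration.

```python
import os

header = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n"
body = b"hello"

r, w = os.pipe()
# Gather write: two separate buffers go out in one system call, in order.
written = os.writev(w, [header, body])
os.close(w)

# Scatter read: one system call fills two preallocated buffers in turn.
head_buf = bytearray(len(header))
body_buf = bytearray(len(body))
read = os.readv(r, [head_buf, body_buf])
os.close(r)
```

Without `writev()` the application would either make two `write()` calls or first concatenate the buffers, which is an extra copy.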
File servers
**Samba file server:** When transferring files to Windows clients, Samba can use sendfile() for zero-copy (enabled with `use sendfile = yes`). On files >1GB, the reported CPU usage difference is ~80% → ~20%. Freed resources can be used to handle more clients.
What is the main advantage of sendfile() over classic read() + write()?
Direct I/O and Cache Bypass
**Direct I/O (O_DIRECT)** - a mode of file operation where data is transferred directly between the application and disk, bypassing the kernel's page cache. This gives the application full control over caching.
**When O_DIRECT is needed:**
- **Databases** - own buffer pool, page cache only interferes (double buffering)
- **Streaming** - data is read once, caching is pointless
- **Disk performance measurement** - page cache distorts results
- **Priority management** - the application decides what to cache
**PostgreSQL and O_DIRECT:** PostgreSQL uses buffered I/O for data files by default, but in some configurations it can open WAL (Write-Ahead Log) files with O_DIRECT. This keeps log data that is rarely read back from filling the page cache. O_DIRECT by itself does not make a write durable, because the data may still sit in the drive's volatile cache; WAL durability comes from O_SYNC/O_DSYNC or from fsync()/fdatasync().
Real-world case
**MySQL InnoDB:** Commonly configured to use O_DIRECT for data files (the default in recent MySQL releases, an explicit setting in older ones):
```ini
innodb_flush_method = O_DIRECT
```
Reasons:
- InnoDB has its own buffer pool (cache)
- The page cache only consumes memory and creates overhead
- With O_DIRECT: 100GB buffer pool + 0 page cache, versus buffered I/O: 100GB buffer pool + 50GB page cache holding duplicates of the same data
**O_SYNC vs O_DIRECT vs fdatasync() vs fsync():**
- **O_SYNC** - each write() waits until the data reaches the disk (slow)
- **O_DIRECT** - bypasses the page cache, but does not guarantee durability (data may still be in the disk's cache)
- **fdatasync()** - flushes file data to disk, plus only the metadata needed to read it back (such as file size), skipping the rest (such as timestamps)
- **fsync()** - flushes everything: data + metadata
**Alignment requirements for O_DIRECT:**
- **Offset** - multiple of the logical block size (usually 512 or 4096 bytes)
- **Size** - multiple of the logical block size
- **Buffer address** - aligned to the same boundary (512/4096 bytes)
Use `posix_memalign()` or `aligned_alloc()` for buffer allocation.
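The alignment rules can be tried out from Python. This is a sketch under several assumptions: a 4096-byte block size (real code should query the device), anonymous `mmap` memory as the page-aligned buffer in place of `posix_memalign()`, and a fallback to buffered I/O where the filesystem rejects O_DIRECT (as some temporary filesystems do). `direct_roundtrip` and the other helper names are hypothetical.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed logical block size; query the device in real code


def is_aligned(value, block=BLOCK):
    return value % block == 0


def round_up(size, block=BLOCK):
    """Smallest multiple of block that is >= size."""
    return (size + block - 1) // block * block


def direct_roundtrip(directory, payload):
    """Write and read payload, preferring O_DIRECT; returns (data, used_direct)."""
    length = round_up(len(payload))
    buf = mmap.mmap(-1, length)  # anonymous mmap memory is page-aligned
    buf[: len(payload)] = payload
    out = mmap.mmap(-1, length)
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        for flags in (os.O_RDWR | getattr(os, "O_DIRECT", 0), os.O_RDWR):
            try:
                fd = os.open(path, flags)
            except OSError:
                continue  # filesystem rejected O_DIRECT at open time
            try:
                os.pwrite(fd, buf, 0)    # offset 0 and length are block multiples
                os.preadv(fd, [out], 0)  # read back into an aligned buffer
                return bytes(out[: len(payload)]), flags != os.O_RDWR
            except OSError:
                continue  # rejected at I/O time; retry with buffered I/O
            finally:
                os.close(fd)
        raise OSError("could not write the test file")
    finally:
        os.unlink(path)


data, used_direct = direct_roundtrip(tempfile.gettempdir(), b"x" * 5000)
```

A 5000-byte payload has to be padded to 8192 bytes before it can be written with O_DIRECT, which is why databases lay out their files in fixed-size pages.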
Durability vs Performance
**Redis persistence:** Redis AOF (Append-Only File) can call fsync() after each write for maximum durability:
```
appendfsync always → fsync() after each write (slow but reliable)
appendfsync everysec → fsync() once a second (balance)
appendfsync no → OS decides (fast but risk of data loss)
```
For critical data, use `always`. Durability here comes from fsync(), not from O_DIRECT.
Misconception: O_DIRECT makes operations faster, so it should be used everywhere
Reality: O_DIRECT is useful only when the application manages caching itself. For regular cases, buffered I/O is more efficient
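The trade-off between the three policies can be sketched with a toy append-only log. This is not Redis code: `AppendLog` is a hypothetical class, it flushes inline with `os.fdatasync()`, and it counts sync calls only to show how the policies differ.

```python
import os
import tempfile
import time


class AppendLog:
    """Toy append-only log with Redis-like fsync policies (not Redis code)."""

    def __init__(self, path, policy="everysec"):
        assert policy in ("always", "everysec", "no")
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.policy = policy
        self.last_sync = time.monotonic()
        self.sync_calls = 0

    def append(self, record: bytes):
        os.write(self.fd, record)  # lands in the page cache first
        now = time.monotonic()
        if self.policy == "always" or (
            self.policy == "everysec" and now - self.last_sync >= 1.0
        ):
            os.fdatasync(self.fd)  # force the data (and file size) to disk
            self.sync_calls += 1
            self.last_sync = now

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as d:
    counts = {}
    for policy in ("always", "everysec", "no"):
        log = AppendLog(os.path.join(d, policy + ".aof"), policy)
        for i in range(100):
            log.append(b"SET key%d value\n" % i)
        log.close()
        counts[policy] = log.sync_calls
```

With `always`, 100 appends cost 100 flushes; with `everysec`, a crash can lose roughly the last second of writes; with `no`, the loss window is whatever the kernel's write-back timing allows.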
The kernel's page cache contains many optimizations: read-ahead, write coalescing, lazy write-back. O_DIRECT disables all this. For databases with their own buffer pool, this is a plus (no duplication), but for a regular application - a minus (loss of optimizations). Incorrect use of O_DIRECT can lead to a 10x performance drop due to loss of read-ahead and small unaligned requests.
Why do databases often use O_DIRECT instead of buffered I/O?
Key Ideas
- **Async I/O** (io_uring) lets an application keep working while I/O is in flight. Requests and results travel through ring buffers in shared memory instead of one system call per operation; with SQPOLL the hot path can need no syscalls at all.
- **io_uring** - a revolution in Linux I/O (kernel 5.1+). Submission Queue (SQ) for requests, Completion Queue (CQ) for results. Batch operations, SQPOLL mode, linked operations. Used in RocksDB, ScyllaDB, QEMU.
- **Zero-copy** (sendfile, splice, mmap) transfers data within the kernel without copying to userspace. Critical for web servers (static content), file servers, streaming. Savings: up to 70% CPU on high throughput.
- **Direct I/O (O_DIRECT)** bypasses the page cache, giving control to the application. Databases use it to avoid double buffering. Requires alignment of buffers, sizes, and offsets (512/4096 bytes). Combining O_DIRECT with io_uring is a common recipe for high-throughput storage engines.
Related Topics
Advanced I/O is part of the ecosystem of system performance and reliability:
- I/O Scheduling - I/O schedulers (mq-deadline, BFQ, Kyber, none on current kernels; CFQ and Deadline on older ones) determine the order of disk request execution. io_uring requests to block devices pass through the same block layer and scheduler.
- File Systems - File systems (ext4, XFS, Btrfs) affect the performance of Direct I/O. Journaling, COW, extent allocation - all interact with O_DIRECT.
- Memory Management - The page cache is part of the memory management subsystem. Direct I/O requires understanding of DMA, pinned pages, TLB.
- Concurrency - Async I/O allows building highly parallel systems. Event loops (epoll, io_uring) are the foundation for async/await, futures, coroutines.
Questions for Reflection
- Why can io_uring with SQPOLL mode operate without system calls? What is the trade-off of this approach?
- When is sendfile() inefficient? Provide an example scenario where classic read() + write() would be faster.
- A database needs to write 1000 records of 4KB each. What is more efficient: 1000 operations with O_DIRECT or one large write with buffered I/O? Why?
- How can io_uring linked operations help in implementing an HTTP server? What operations can be linked?