Operating Systems
System Calls
A running program can make thousands of system calls every second - even a simple printf() eventually calls write(). But how does a user program ask the kernel to perform an operation when it is not allowed to access hardware itself? How does the CPU switch between user mode and kernel mode? Why do some calls take 150 nanoseconds, while others take 20? Behind all of this lies the fundamental mechanism for interacting with the operating system.
- **Production optimization:** In high-load systems (NGINX, Redis, PostgreSQL), the number of syscalls is a key metric. strace reveals bottlenecks such as excessive open/close or unnecessary read/write calls. Batching I/O through io_uring can give a 2-3x boost in I/O-heavy workloads.
- **Observability:** strace, perf, eBPF - the foundation of debugging production issues. A 100ms slowdown can be caused by a single slow syscall (e.g., fsync()). Understanding syscall overhead is critical for latency analysis.
- **Security:** Seccomp (Secure Computing Mode) allows limiting the set of syscalls for a process - this is the basis of sandboxing (Docker, Chrome, systemd). Each syscall is a potential attack point, so the kernel must validate all arguments.
Lesson Objectives
- Explain the mode switch: `int 0x80` (dispatched via the IDT) and the `syscall` instruction (entry point taken from an MSR)
- Know the cost: ~100ns on modern CPUs post-Meltdown, ~50ns without KPTI
- Understand vDSO: gettimeofday/clock_gettime without a syscall via a mapped page
- Apply strace for syscall-level observability of a process
- Distinguish blocking vs non-blocking syscalls, errno, and EINTR
System Call Mechanism
A **system call (syscall)** is the only legal way for a user mode program to request the operating system kernel to perform a privileged operation. It is the bridge between the user and the kernel.
**Why are system calls needed?**
- **Isolation:** Direct access to hardware is prohibited in user mode - only the kernel can manage devices, memory, processes
- **Security:** The kernel checks access rights before performing an operation
- **Abstraction:** An application does not know hardware details - it works through a unified API
- **Stability:** An error in user mode cannot crash the system
Every time a program reads a file, allocates memory, creates a thread, or sends data over the network - it makes a system call. Even a simple `printf()` ultimately calls `write()`.
Example: cat file.txt
Running `cat file.txt` triggers the following sequence:
1. Shell does `fork()` - **syscall #57**
2. Child process does `execve("/bin/cat", ...)` - **syscall #59**
3. cat does `open("file.txt", O_RDONLY)` - **syscall #2**
4. cat reads data `read(fd, buf, count)` - **syscall #0**
5. cat outputs to stdout `write(1, buf, count)` - **syscall #1**
6. cat exits `exit(0)` - **syscall #60**

Each step is a transition from user mode to kernel mode and back. (On current systems strace typically shows `clone`, `openat`, and `exit_group` in place of `fork`, `open`, and `exit`, because libc implements the classic calls through these newer ones.)
In Linux x86-64, there are about **400+ system calls**. Each has a unique number (syscall number). The program places the syscall number in the `%rax` register, arguments in other registers, and executes the `syscall` instruction.
**Interesting fact:** In Linux, syscall numbers differ between architectures:
- x86-64: `write` = 1, `read` = 0, `exit` = 60
- x86-32: `write` = 4, `read` = 3, `exit` = 1
- ARM64: `write` = 64, `read` = 63, `exit` = 93

So even apart from the different instruction sets, a binary built for one architecture would issue the wrong syscall numbers on another.
Most programmers never call syscalls directly - they use **wrappers in libc** (glibc, musl). For example, `open()`, `read()`, `write()` in C are glibc functions that internally make syscalls.
Why can't applications in user mode directly access hardware?
User to Kernel Transition
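The transition from user mode to kernel mode is a **mode switch**: it resembles a context switch between processes but is much lighter, because the same process keeps running. The CPU switches the privilege level from Ring 3 to Ring 0 and transfers control to the kernel.
**Critical moment:** On x86-64, the `syscall` instruction itself does very little:
- Saves `RIP` (the address of the next user-mode instruction) into `RCX`
- Saves `RFLAGS` (processor flags) into `R11` and masks the flags using the `IA32_FMASK` MSR
- Loads the kernel code and stack segment selectors from the `IA32_STAR` MSR
- Jumps to the syscall handler address stored in the `IA32_LSTAR` MSR (Model-Specific Register)

The instruction does **not** touch `RSP`. The kernel entry code saves the user stack pointer and loads the kernel stack pointer itself, using `swapgs` to reach its per-CPU data.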
System Call Overhead
**Why is syscall slow?** The user → kernel transition includes:
- Saving user-space state (~10 registers)
- TLB flush (the TLB caches virtual-to-physical address translations) - in some cases
- Stack switch (user stack → kernel stack)
- Access rights check
- Reverse transition kernel → user

All this takes **50-200 ns** on modern CPUs. For comparison: a regular function call takes **1-2 ns**.
On 32-bit x86, older versions of Linux (before 2.6) used the **int 0x80** instruction (software interrupt) for syscalls. It was even slower (on the order of ~1000 ns) because the CPU had to go through an interrupt gate and save more state.
**Intel vs AMD:**
- Intel introduced the `sysenter/sysexit` instructions for fast syscalls
- AMD introduced `syscall/sysret`
- Linux x86-64 uses `syscall/sysret` (AMD-style)
- Windows uses `syscall` on x64; 32-bit Windows historically used `int 0x2e` and later `sysenter`

Modern Intel CPUs support both instruction pairs, with `syscall` available only in 64-bit mode.
strace example
**Real tracing:**
```bash
$ strace -c cat /etc/hostname
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.71    0.000050          50         1           execve
 21.43    0.000030          10         3           openat
 14.29    0.000020          10         2           read
 14.29    0.000020          10         2           close
  7.14    0.000010          10         1           write
  7.14    0.000010          10         1           fstat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000140                    10           total
```
For a simple `cat` - **10 system calls**, each ~10 µs (including kernel work).
What happens to the User Stack Pointer (RSP) when transitioning to kernel mode via syscall?
System Call Table
The **syscall table** is an array of pointers to handler functions in the kernel. When a program makes a syscall with number `N`, the kernel looks up `sys_call_table[N]` and calls that function.
**Syscall calling convention on x86-64:**
- `%rax` - syscall number (input) and return value (output)
- `%rdi` - arg1
- `%rsi` - arg2
- `%rdx` - arg3
- `%r10` - arg4 (not `%rcx`, because the CPU uses `rcx` to save RIP; `r11` is likewise overwritten with RFLAGS)
- `%r8` - arg5
- `%r9` - arg6

Maximum 6 arguments - if more are needed, a pointer to a structure is passed.
Each syscall handler in the kernel is conventionally named with the prefix `sys_` (on modern x86-64 kernels the actual symbols look like `__x64_sys_open`). For example, `open()` in libc ends up in `sys_open()` in the kernel. The kernel function checks access rights, works with the VFS (Virtual File System), and interacts with drivers.
glibc wrapper
**How glibc wraps a syscall:**
```c
// glibc: sysdeps/unix/sysv/linux/write.c (simplified)
ssize_t __write(int fd, const void *buf, size_t count)
{
    return INLINE_SYSCALL_CALL(write, fd, buf, count);
}

// On x86-64 this expands to roughly:
// asm volatile(
//     "syscall"
//     : "=a"(ret)
//     : "a"(1), "D"(fd), "S"(buf), "d"(count)
//     : "rcx", "r11", "memory"
// );
```
glibc adds:
- Error handling (converting negative return values to `errno`)
- Thread cancellation points
- Signal safety checks
- Buffering (in the stdio layer, e.g. `printf`, rather than in `write()` itself)
**Important note:** Syscall numbers are part of the **ABI (Application Binary Interface)**, not the API. They cannot change between kernel versions - this would break compatibility with existing binaries.
strace raw mode
**Tracing with raw arguments:**
```bash
$ strace -e trace=write -e raw=write cat /etc/hostname
write(0x1, 0x7ffde4b2e000, 0xd) = 13
      │    │               │      │
      │    │               │      └─ return value (13 bytes)
      │    │               └─ arg3: count = 13
      │    └─ arg2: buffer address
      └─ arg1: fd = 1 (stdout)
```
`raw=write` shows the syscall arguments as raw numbers (hex) instead of decoding them symbolically.
**Syscall hooking:** In the past, rootkits modified the `sys_call_table`, replacing pointers with their own functions. Modern Linux kernels are protected:
- The table is in read-only memory (Write Protection enabled)
- `sys_call_table` is no longer exported to modules
- Kernel lockdown mode blocks changes

But hooking is still possible through `/dev/mem` or through kernel modules that use kprobes.
Why is argument #4 passed in %r10, not in %rcx, in the x86-64 syscall convention?
vDSO - syscalls without kernel transition
**vDSO (virtual Dynamic Shared Object)** is a brilliant Linux optimization: the kernel injects a small shared library into each process's address space with implementations of **some syscalls in user mode**, without transitioning to the kernel.
**Why vDSO?** Some syscalls are very frequent and cheap:
- `gettimeofday()` - read current time
- `clock_gettime()` - read monotonic clock
- `getcpu()` - find out CPU number

Transitioning to kernel mode (50-200 ns) is more expensive than the operation itself (5-10 ns). vDSO performs them **directly in user space**, reading data from shared memory that the kernel updates.
The kernel **automatically maps** vDSO into the address space when a process is created via `execve()`. The program can use vDSO through the dynamic linker, but this is transparent - libc itself finds symbols in vDSO.
vDSO Performance
**Benchmark: syscall vs vDSO**
```c
#define _GNU_SOURCE
#include <sys/syscall.h>   // SYS_clock_gettime
#include <time.h>          // clock_gettime, struct timespec
#include <unistd.h>        // syscall

int main(void)
{
    struct timespec ts;

    // With vDSO (libc uses vDSO automatically)
    clock_gettime(CLOCK_MONOTONIC, &ts);                // ~20 ns

    // Without vDSO (direct syscall)
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);   // ~150 ns
    return 0;
}
```
With these figures, vDSO provides a **7-10 times** speedup for such operations!
**How does the kernel update vDSO data?** The kernel and user space see the same memory page (shared mapping). The kernel periodically updates the timestamp and coefficients, and vDSO functions read them without a syscall.
**Which syscalls are in vDSO (Linux x86-64)?**
- `__vdso_gettimeofday` - current time
- `__vdso_clock_gettime` - monotonic clock
- `__vdso_time` - seconds since epoch
- `__vdso_getcpu` - CPU number

On other architectures (ARM, PowerPC), the list may differ.
vDSO Limitations
**Why aren't all syscalls in vDSO?** Only syscalls that are:
1. **Read-only** - do not change kernel state
2. **Fast** - can be implemented by reading shared memory or CPU instructions (TSC)
3. **Frequent** - called very often (profiling, timing)

For example, `read()/write()` cannot be in vDSO - they change the state of files and buffers and require access checks.
vDSO uses **RDTSC (Read Time-Stamp Counter)** - a CPU instruction that reads the processor's time-stamp counter. It is fast (typically a few tens of cycles, far cheaper than a kernel transition), and the vDSO code scales the raw counter with kernel-provided coefficients to obtain monotonic time.
perf stat vDSO
**Real measurement:**
```bash
$ perf stat -e 'syscalls:sys_enter_*' date
Thu 25 Dec 2025 22:43:15 MSK

 Performance counter stats for 'date':

                 0      syscalls:sys_enter_clock_gettime    ← 0 syscalls!
```
`date` uses `clock_gettime()`, but **0 syscalls** - everything goes through vDSO!
A common misconception is that all libc functions that interact with the kernel make a system call with a transition to kernel mode. In fact, some frequent syscalls (gettimeofday, clock_gettime) are optimized through vDSO and executed in user mode without transitioning to the kernel.
vDSO (virtual Dynamic Shared Object) is a shared library that the kernel injects into the process's address space. It contains implementations of some syscalls that work with read-only data from shared memory. The kernel updates this memory (timestamp, TSC coefficients), and vDSO functions read it directly. No kernel transition is needed, providing a 7-10 times speedup (20 ns vs 150 ns). This is transparent to the program - libc automatically uses vDSO if available.
Why can gettimeofday() be executed WITHOUT transitioning to kernel mode via vDSO?
Key Ideas
- **System call** - the only legal way for a program to request the kernel to perform a privileged operation. The CPU switches from Ring 3 (user mode) to Ring 0 (kernel mode), executes the handler from the syscall table, and returns back.
- **User → kernel transition** includes: saving registers (RIP, RSP, RFLAGS), switching the stack to the kernel stack, finding the handler in sys_call_table[syscall_number]. Overhead: 50-200 ns on modern CPUs via syscall/sysret (versus ~1000 ns via int 0x80).
- **Syscall table** - an array of pointers to kernel functions (sys_read, sys_write, ...). Each syscall has a unique number (part of ABI). On x86-64: number in %rax, arguments in %rdi, %rsi, %rdx, %r10, %r8, %r9. glibc wraps syscalls in convenient functions (open, read, write).
- **vDSO** - optimization for frequent read-only syscalls (gettimeofday, clock_gettime). The kernel injects a shared library into user space with implementations of syscalls that read data from shared memory. Speedup of 7-10 times (20 ns vs 150 ns), without transitioning to kernel mode.
Related Topics
System calls are a fundamental interaction with the OS, related to many concepts:
- Context Switch — Switching between processes is similar to syscall transition - saving/loading registers, changing address space. But context switch is more expensive (1-10 µs vs 50-200 ns) because it changes Page Tables.
- Virtual Memory - With KPTI enabled, a syscall switches not only the stack but also the page tables (CR3 register on x86); without KPTI the page tables stay the same. Kernel space occupies the upper half of the address space (0xFFFF...), user space the lower half.
- File Systems — Syscalls open, read, write, close - the basis of file operations. The kernel implements VFS (Virtual File System) for abstraction over different FS (ext4, btrfs, NFS).
- Interrupts — Hardware interrupts use a similar mechanism to transition to kernel mode via IDT (Interrupt Descriptor Table). The old way of syscall (int 0x80) was a software interrupt.
Questions for Reflection
- Why is there a maximum of 6 arguments for a syscall on x86-64? How can more parameters be passed (e.g., for a call like clone that needs many of them)?
- How does the kernel protect the syscall table from modification (e.g., from rootkits)? What mechanisms exist in modern Linux (W^X, KASLR)?
- Why can't vDSO implement write() or open() in user mode, even if data is in shared memory? What fundamentally distinguishes read-only syscalls from modifying ones?
- How do Spectre/Meltdown attacks use speculative execution to bypass user/kernel mode isolation? Why did syscalls become slower after these attacks (due to KPTI - Kernel Page Table Isolation)?
Related Lessons
- os-02-processes — Processes use syscalls to access kernel resources
- os-07-memory — mmap and brk are syscalls for memory management
- ca-12 — Trap/interrupt mechanism at CPU level implements kernel switch
- net-15-tcp-basics - Network operations go through send/recv syscalls
- net-13-ports