Operating Systems

Debugging and Profiling

A production system crashes once a week without explanation. Logs are clean. Metrics are normal. It can't be reproduced locally. This is every engineer's nightmare. Large-scale engineering teams rely on a well-known toolkit for exactly this situation: strace, perf, eBPF, ASan. These tools find bugs that ordinary methods miss. Several of them can be attached to a live service without stopping it, and they reveal what a debugger alone cannot show. Master them, and even very complex systems become debuggable.

  • **Netflix serves 200+ million users.** Any performance regression = millions of dollars in losses. They use perf + flame graphs to find CPU bottlenecks. Result: they find that 25% of CPU is spent on an outdated codec, switch to a new one → save $10M/year on servers.
  • **Cloudflare defends against DDoS attacks at 46 million requests/sec.** They use eBPF/XDP for packet filtering directly in the kernel. Speed: 24 million packets/sec on one core - 10x faster than iptables. eBPF protects half the internet from attacks.
  • **Facebook runs ASan on 1% of production servers (canary).** They catch use-after-free and buffer overflow BEFORE wide rollout. One caught bug prevented a potential RCE exploit that could have compromised millions of accounts.

Lesson Objectives

  • strace for tracing process syscalls in real time
  • perf: CPU profiling, flame graphs (Brendan Gregg), hardware counters
  • eBPF (BCC, bpftrace): observability one-liners without kernel modules
  • Valgrind for memory errors (slow, 10-50x slowdown)
  • AddressSanitizer (ASan, ~2x slowdown), TSan, UBSan for production hardening

Strace and Ltrace: X-ray of System Calls

**Strace** - a tool that shows ALL system calls (syscalls) of a program in real-time. It's like an X-ray of a process: every interaction with the kernel is visible - opening files, network requests, memory allocation, signals.

Production Case: Why is the Service Slow?

The API service started responding in 2 seconds instead of 50ms. Logs show nothing. Running `strace -p <PID>` on production shows:

```
open("/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1 localhost\n...", 4096) = 180
close(3) = 0
```

This sequence repeats 1000 times per second! It turned out the DNS resolving library was reading `/etc/hosts` on every request instead of caching the result. One strace command saved an hour of debugging.
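Spotting a repeated syscall by eye works for a short trace, but a saved log can be tallied automatically. The following is a minimal Python sketch, not part of strace itself: `SAMPLE` is a hypothetical log shaped like the excerpt above, and `count_opens` is an illustrative helper that counts how often each path is opened.

```python
import re
from collections import Counter

# Hypothetical strace output, shaped like the excerpt in the text.
SAMPLE = """\
open("/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1 localhost\\n...", 4096) = 180
close(3) = 0
open("/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1 localhost\\n...", 4096) = 180
close(3) = 0
open("/etc/resolv.conf", O_RDONLY) = 3
close(3) = 0
"""

# Capture the path argument of open()/openat() calls.
OPEN_RE = re.compile(r'^(?:open|openat)\((?:AT_FDCWD, )?"([^"]+)"')


def count_opens(trace: str) -> Counter:
    """Count how many times each path is opened in a strace log."""
    counts = Counter()
    for line in trace.splitlines():
        match = OPEN_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


hot_paths = count_opens(SAMPLE)
for path, n in hot_paths.most_common():
    print(f"{n:5d}  {path}")
```

A path that dominates this tally, as `/etc/hosts` does in the case above, is a strong hint that something is being re-read instead of cached.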

**Ltrace** - similar to strace, but shows library function calls (libc, libssl, etc) instead of system calls. Use it when the problem is not in the kernel but in userspace libraries.

**Difference strace vs ltrace:**

  • **strace** → system calls (kernel): `open()`, `read()`, `socket()`, `fork()`
  • **ltrace** → library functions (userspace): `malloc()`, `strlen()`, `SSL_connect()`

Example chain:

```
program → fopen() [ltrace sees] → internally calls open() [strace sees] → kernel opens the file
```

Debugging: Program Hangs on Start

Running `./myserver` causes it to hang without output. GDB doesn't help (no symbols). Strace reveals the cause:

```bash
strace ./myserver
...
connect(3, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("192.168.1.100")}, 16) = -1 ETIMEDOUT (Connection timed out)
```

BAM! The server is trying to connect to MySQL at IP 192.168.1.100, which is unreachable, and blocks until the 30-second timeout expires. Problem found in 5 seconds.

**When to use strace/ltrace:**

  1. Program hangs - find what it's stuck on (waiting for network? disk? lock?)
  2. Slow performance - profile syscalls, see where the time is
  3. "Permission denied" - strace will show WHICH file is inaccessible
  4. File descriptor leaks - count `open()` vs `close()`
  5. DNS/network issues - see all `connect()`, `sendto()`, `recvfrom()`
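The file descriptor leak check from the list above (matching `open()` against `close()`) can be sketched in a few lines of Python. This is an illustration under stated assumptions: `SAMPLE` is a hypothetical log, `find_unclosed` is an invented helper name, and the sketch only tracks `open`/`openat`/`close`, ignoring other descriptor sources such as `socket()`, `pipe()`, or `dup()`.

```python
import re

# Hypothetical strace excerpt: fd 4 is opened but never closed.
SAMPLE = """\
open("/var/log/app.log", O_WRONLY|O_APPEND) = 3
open("/tmp/cache.bin", O_RDONLY) = 4
close(3) = 0
open("/etc/app.conf", O_RDONLY) = 3
close(3) = 0
open("/missing", O_RDONLY) = -1 ENOENT (No such file or directory)
"""

# Successful opens end with "= <fd>"; failed ones end with "= -1 E..." and are skipped.
OPEN_RE = re.compile(r'^(?:open|openat)\((?:AT_FDCWD, )?"([^"]+)".*\)\s*=\s*(\d+)$')
CLOSE_RE = re.compile(r'^close\((\d+)\)\s*=\s*0$')


def find_unclosed(trace: str) -> dict:
    """Return {fd: path} for descriptors that were opened but never closed."""
    open_fds = {}
    for line in trace.splitlines():
        opened = OPEN_RE.match(line)
        if opened:
            open_fds[int(opened.group(2))] = opened.group(1)
            continue
        closed = CLOSE_RE.match(line)
        if closed:
            open_fds.pop(int(closed.group(1)), None)
    return open_fds


leaked = find_unclosed(SAMPLE)
print(leaked)
```

Because the kernel reuses descriptor numbers, the sketch keys on the fd and overwrites the path on each successful open; a descriptor still present at the end of the log is a leak candidate, not proof of one.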

A production service started responding slowly. `top` shows low CPU usage (5%), but latency increased from 10ms to 1 second. Which tool will help find the problem?

Perf: CPU Profiling and Flame Graphs

**Perf** - the standard Linux profiler, using hardware performance counters (PMU - Performance Monitoring Unit). Shows exactly where the program spends CPU cycles - down to individual instructions and cache misses.

Real-world: Optimization at Dropbox

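Dropbox engineers used perf to optimize Python code. They profiled production servers:

```bash
# Sample call stacks of all python processes for 60 seconds
perf record -g -p "$(pgrep -d, python)" -- sleep 60
# Fold the stacks, then render the SVG (both scripts come from Brendan Gregg's FlameGraph repository)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```

The flame graph showed that 40% of CPU time was spent on JSON parsing. They replaced the standard `json` module with `ujson` (a C extension), and parsing speed doubled. One optimization saved thousands of servers.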

**Flame Graphs** - visualization of a profile. X-axis - alphabetical order (not time!), Y-axis - stack depth, width - the share of samples in which the function was on the stack (how much CPU time it and its callees took).
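To make the meaning of "width" concrete, here is a small Python sketch that computes it from folded stacks, the one-line-per-stack text format that flame graph tooling consumes. The stack names and sample counts in `FOLDED` are hypothetical, and `frame_widths` is an illustrative helper, not part of any profiler.

```python
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf <sample count>"
FOLDED = """\
main;handle_request;parse_json 40
main;handle_request;hash_table_lookup 25
main;handle_request 5
main;gc_collect 30
"""


def frame_widths(folded: str):
    """Return (total, self) sample counts per function name."""
    total, self_time = Counter(), Counter()
    for line in folded.splitlines():
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        samples = int(count)
        for frame in set(frames):          # count a recursive frame once per stack
            total[frame] += samples
        self_time[frames[-1]] += samples   # the leaf is what was actually on CPU
    return total, self_time


total, self_time = frame_widths(FOLDED)
all_samples = sum(self_time.values())
for name in sorted(total):  # alphabetical listing, not time order
    print(f"{name:20s} width={total[name] / all_samples:6.1%} "
          f"self={self_time[name] / all_samples:6.1%}")
```

A wide frame with little self time (like `main` here) is just a busy caller; a wide leaf is where the CPU cycles are actually spent.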

**Perf events: not just CPU cycles**

Perf can monitor hardware and software events:

  • `perf stat ./myapp` - basic statistics (instructions, cycles, branch mispredictions)
  • `perf record -e cache-misses` - profile cache misses (on most CPUs this event counts last-level cache misses)
  • `perf record -e page-faults` - where page faults occur (first touch of memory, swapping)
  • `perf record -e context-switches` - where context switches occur

Example:

```bash
perf stat -e instructions,cycles,cache-misses ./myapp
10,523,456,789 instructions   # ~10.5 billion instructions
19,487,882,943 cycles         # ~19.5 billion cycles
0.54 IPC                      # Instructions Per Cycle (efficiency) = instructions / cycles
12,345,678 cache-misses       # if high relative to instructions → optimize locality
```
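The derived metrics are simple ratios of the raw counters. As a worked example in Python, with illustrative counter values chosen so that IPC comes out near 0.54 (they are assumptions, not measurements):

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: higher means the CPU stalls less."""
    return instructions / cycles


def misses_per_kilo_instruction(cache_misses: int, instructions: int) -> float:
    """Cache misses per 1000 instructions (MPKI), a common locality metric."""
    return cache_misses / instructions * 1000


# Illustrative counter values (assumed for this example).
instructions = 10_523_456_789
cycles = 19_487_882_943
cache_misses = 12_345_678

ipc_value = ipc(instructions, cycles)
mpki = misses_per_kilo_instruction(cache_misses, instructions)
print(f"IPC  = {ipc_value:.2f}")
print(f"MPKI = {mpki:.2f}")
```

Normalizing misses by instruction count matters: a raw count in the millions looks alarming, but what indicates a locality problem is the rate per instruction.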

Netflix: Using perf to Find CPU Bottleneck

Netflix serves 200+ million users. Any optimization = millions in savings. They use perf in production:

  1. `perf record -ag -F 99 sleep 60` - profile all CPUs on a server for 60 seconds at 99 samples per second
  2. Generate a flame graph → find that 25% of CPU is spent on video compression with an outdated codec
  3. Switch to a new codec (SVT-AV1) → CPU usage drops by 20%
  4. Saved ~$10M/year on servers

All thanks to one flame graph.

**When to use perf:**

  1. Program is slow, but it's unclear where the bottleneck is
  2. Optimization: find functions consuming the most CPU
  3. Analyze cache efficiency (many cache misses?)
  4. Check after optimization (is it faster?)
  5. Production profiling - can be run on live servers (overhead ~1-5%)

A perf run produced a flame graph. A wide block of the function `hash_table_lookup()` shows at the top level (leaf). What does this mean?

eBPF: Kernel Programming Without Kernel Modules

**eBPF (extended Berkeley Packet Filter)** - a revolution in Linux monitoring. Allows running safe code directly in the kernel, without compiling kernel modules. Trace anything: syscalls, kernel functions, network packets, filesystem operations - all in real-time with minimal overhead.

**Why is eBPF safe?** Code passes a verifier before loading: no unbounded loops, no invalid memory accesses, limited stack. If the verifier rejects the program, it won't load. It's like bytecode verification in the JVM, but for the kernel; accepted programs are then JIT-compiled to native code.

Production Case: Unexplained Delays in Kubernetes

A Kubernetes cluster started showing random latency spikes (99th percentile 500ms → 5 seconds). Logs are clean, metrics normal. We run bpftrace on the node:

```bash
# Trace scheduler delays (how long a process waited for CPU after being woken)
sudo bpftrace -e '
tracepoint:sched:sched_wakeup {
  @qtime[args->pid] = nsecs;          // process became runnable
}
tracepoint:sched:sched_switch {
  $start = @qtime[args->next_pid];
  if ($start) {
    $wait = nsecs - $start;           // time spent on the run queue
    if ($wait > 1000000) {            // > 1ms
      printf("PID %d waited %d ms\n", args->next_pid, $wait / 1000000);
    }
  }
  delete(@qtime[args->next_pid]);
}'
```

It turns out that one Pod is consuming 100% CPU, causing scheduler thrashing for its neighbors. cgroup limits weren't working due to a bug in kernel 4.14. Updated kernel → problem solved.
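The scheduler delay measured here is the gap between a process being woken (becoming runnable) and actually being switched onto a CPU. The pairing logic can be simulated in plain Python; the `Event` type, the event stream, and `slow_waits` below are hypothetical stand-ins for the kernel tracepoints, meant only to show the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts_ns: int   # timestamp in nanoseconds
    kind: str    # "wakeup" or "switch_in"
    pid: int


# Hypothetical event stream standing in for sched_wakeup / sched_switch.
EVENTS = [
    Event(1_000_000, "wakeup", 101),
    Event(1_200_000, "switch_in", 101),  # waited 0.2 ms
    Event(2_000_000, "wakeup", 202),
    Event(5_500_000, "switch_in", 202),  # waited 3.5 ms
    Event(6_000_000, "switch_in", 303),  # no wakeup recorded: skipped
]


def slow_waits(events, threshold_ns=1_000_000):
    """Return (pid, wait_ns) for every run-queue wait above the threshold."""
    woke_at = {}
    slow = []
    for ev in events:
        if ev.kind == "wakeup":
            woke_at[ev.pid] = ev.ts_ns
        elif ev.kind == "switch_in":
            start = woke_at.pop(ev.pid, None)  # forget the wakeup once consumed
            if start is not None and ev.ts_ns - start > threshold_ns:
                slow.append((ev.pid, ev.ts_ns - start))
    return slow


result = slow_waits(EVENTS)
for pid, wait_ns in result:
    print(f"PID {pid} waited {wait_ns / 1_000_000:.1f} ms")
```

The timestamp must be stored at wakeup and consumed at switch-in; storing it at switch-in instead would measure how long a process had been running, not how long it waited.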

**BCC (BPF Compiler Collection)** - a set of ready-to-use eBPF tools:

  • `execsnoop` - logs all process executions (exec)
  • `opensnoop` - all file openings
  • `tcpconnect` - all TCP connections
  • `biolatency` - histogram of disk I/O latency
  • `funccount` - function call counter (kernel or userspace)
  • `trace` - universal tracer (similar to SystemTap)

Example:

```bash
# See who is connecting over the network
sudo tcpconnect
PID    COMM    SADDR      DADDR      DPORT
1234   chrome  127.0.0.1  1.2.3.4    443
5678   ssh     10.0.0.5   20.0.0.10  22
```

Cloudflare: eBPF for DDoS Protection

Cloudflare handles 46 million HTTP requests/sec. They use eBPF for filtering DDoS attacks directly in the kernel:

  1. eBPF program on XDP (eXpress Data Path) checks packets BEFORE the TCP/IP stack
  2. Blocks malicious packets at 24 million packets/sec on ONE core
  3. Regular iptables/netfilter: ~2 million pps

That is 10+ times faster! eBPF operates at the network driver level, before the packet enters the rest of the networking stack.
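The packet rates quoted above translate directly into a per-packet time budget, which shows why filtering has to happen so early. A short worked calculation in Python, using only the figures from the text:

```python
def ns_per_packet(packets_per_second: float) -> float:
    """Average time available to process one packet on one core, in nanoseconds."""
    return 1e9 / packets_per_second


xdp_pps = 24_000_000       # XDP figure quoted in the text
iptables_pps = 2_000_000   # iptables/netfilter figure quoted in the text

xdp_budget = ns_per_packet(xdp_pps)
iptables_budget = ns_per_packet(iptables_pps)
speedup = xdp_pps / iptables_pps

print(f"XDP:      {xdp_budget:.1f} ns per packet")
print(f"iptables: {iptables_budget:.1f} ns per packet")
print(f"speedup:  {speedup:.0f}x")
```

At roughly 42 ns per packet there is no room for buffer allocation or protocol parsing, which is why the drop decision is made at the driver level.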

**When to use eBPF:**

  1. Production debugging without reboot - inject code into the kernel on the fly
  2. Performance analysis: where are the delays? where is the contention? which syscalls?
  3. Security monitoring: trace all exec, file access, network connections
  4. Network performance: XDP for packet filtering/load balancing
  5. Kernel development: debug new features without rebuilding the kernel

How does eBPF differ from traditional kernel modules (loadable kernel modules)?

Memory Debugging: Valgrind, ASan, Leak Detection

**Memory bugs** - the most insidious: use-after-free, double-free, buffer overflow, memory leaks. They manifest unpredictably: a program may run for months, then suddenly crash. Memory debugging tools find bugs BEFORE they hit production.

**Types of memory bugs:**

  1. **Memory leak** - allocated memory, forgot to free (`malloc` without `free`)
  2. **Use-after-free** - using memory after `free()` (data may be overwritten)
  3. **Double-free** - called `free()` twice on one address (heap corruption)
  4. **Buffer overflow** - writing beyond array bounds (`arr[100]` when size is 50)
  5. **Uninitialized memory** - reading a variable before initialization (random garbage)
  6. **Stack overflow** - infinite recursion or huge stack array

**Valgrind drawbacks:** slows down the program by 10-50 times. Not suitable for production. But great for testing: run tests under Valgrind → find all memory issues.

Production Case: Intermittent Crash in Redis

Redis crashed about once a week with no pattern. The core dump didn't help (corrupted heap). Rebuilt Redis with ASan:

```bash
make SANITIZER=address
./src/redis-server
```

After 2 days ASan caught:

```
heap-use-after-free in module API callback
```

It turned out that a third-party Redis module held a pointer to a string that Redis had already freed. One line of ASan output pinpointed the bug and spared users the weekly crashes.

**Comparison of Memory Debugging Tools:**

| Tool | Speed | What it Finds | Usage |
|---|---|---|---|
| **Valgrind** | 10-50x slower | Leaks, use-after-free, overflows, uninit reads | Development/Testing |
| **ASan** | 2x slower | Overflows, use-after-free, double-free | Development/Testing/Canary prod |
| **TSan** | 5-15x slower | Data races, deadlocks | Development/Testing |
| **MSan** | 3x slower | Uninitialized memory reads | Development |
| **eBPF (bcc memleak)** | <5% overhead | Memory leaks in production | Production |

In practice: ASan in CI/CD, eBPF in production for monitoring.

Facebook: ASan on Production Canary

Facebook runs 1% of production servers with ASan enabled (canary deployments). A 2x overhead is acceptable for a small part of the fleet. Result:

  • Catch use-after-free and overflows BEFORE wide rollout
  • One caught bug prevented a potential security incident (RCE exploit via buffer overflow)
  • ASan became part of the deployment pipeline: canary → ASan clean → rollout 100%

**Best Practices for Memory Debugging:**

  1. **Development:** Always compile with ASan/UBSan (`-fsanitize=address,undefined`)
  2. **CI/CD:** Run tests under Valgrind/ASan (find bugs before merge)
  3. **Production:** eBPF `memleak` for monitoring leaks, ASan on canary servers
  4. **Core dumps:** Enable `ulimit -c unlimited` + analyze with gdb on crashes
  5. **Fuzz testing:** AFL, libFuzzer with ASan - automatic bug finding through random inputs

**Myth:** Memory leaks are not critical - modern OSes free all memory when the process ends

**Reality:** Memory leaks are critical for long-running processes: servers, daemons, and desktop applications run for days or months without restart

Yes, the OS will free memory when the process ends. But a server running 24/7 will consume more and more RAM until it hits the limit and crashes (OOM killer). Example: leak 1KB/sec → 86MB per day → 2.5GB per month → server dies in six months. In a microservices architecture, a leak in one service can bring down the entire cluster through cascade failure. Therefore, memory leaks are CRITICAL, especially in production systems.
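The leak arithmetic above is easy to check. A short Python calculation, using decimal units and assuming (purely for illustration) that the server has 16 GB of memory headroom before the OOM killer steps in:

```python
KB, MB, GB = 1000, 1000**2, 1000**3
SECONDS_PER_DAY = 86_400


def leaked_bytes(rate_bytes_per_sec: int, seconds: int) -> int:
    """Total memory leaked at a constant rate over the given time."""
    return rate_bytes_per_sec * seconds


def days_until_exhausted(rate_bytes_per_sec: int, headroom_bytes: int) -> float:
    """Days until a constant leak consumes the available headroom."""
    return headroom_bytes / (rate_bytes_per_sec * SECONDS_PER_DAY)


rate = 1 * KB                                    # leak rate from the text: 1 KB/sec
per_day = leaked_bytes(rate, SECONDS_PER_DAY)
per_month = per_day * 30
days = days_until_exhausted(rate, 16 * GB)       # 16 GB headroom is an assumption

print(f"per day:   {per_day / MB:.1f} MB")
print(f"per month: {per_month / GB:.2f} GB")
print(f"16 GB exhausted after about {days:.0f} days")
```

With these assumptions the server runs out of memory after roughly six months, which matches the estimate in the text; a faster leak or less headroom shortens that proportionally.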

A C++ program periodically crashes in production (once a week), but can't be reproduced locally. Core dump shows corrupted heap. What will help find the cause?

Key Ideas

  • **strace/ltrace - process X-ray:** show every syscall and library call in real-time. Indispensable for debugging: program hangs? strace shows on what. Permission denied? strace shows which file. Both can attach to a running process in production, but they slow syscall-heavy workloads noticeably, so keep tracing sessions short.
  • **perf - CPU profiling through hardware counters:** records where the program spends CPU cycles. Flame graphs visualize hotspots. Netflix/Dropbox optimize production using perf - they find functions consuming CPU and speed them up significantly.
  • **eBPF - kernel programming with guardrails:** safely inject code into the kernel for tracing syscalls, network, scheduler. The verifier rejects programs that could loop forever or access invalid memory. Cloudflare uses it for DDoS protection, Uber for performance monitoring.
  • **ASan/Valgrind - memory debugging:** find use-after-free, buffer overflow, leaks. ASan is 2x slower (suitable for canary prod), Valgrind is 10-50x (only dev/test). Facebook/Google run ASan on production canary - they catch critical bugs before wide rollout.

Related Topics

Debugging and profiling are related to all aspects of systems programming:

  • System Calls (syscalls) — strace shows every syscall - understanding how a program interacts with the kernel is critical for debugging
  • Memory Management — ASan/Valgrind find memory corruption - understanding how heap, stack, virtual memory work is necessary
  • Scheduling and Multithreading — perf shows context switches and CPU time - helps optimize multithreaded programs
  • Networking — eBPF/XDP are used for packet filtering and network debugging - understanding the TCP/IP stack is essential

Questions for Reflection

  • A production service runs at 10ms latency, but once a minute there is a spike to 5 seconds. Logs show no errors. Which tools (strace/perf/eBPF) suit the job, and in what sequence?
  • A C++ program slowly consumes memory (1GB per day), but Valgrind shows "no leaks." How is this possible? (Hint: reachable vs lost memory) Which tools would you use to find the problem?
  • Why is eBPF considered safe for production, while a regular kernel module is not? What guarantees does the eBPF verifier provide and what bugs can't it prevent?

Related Lessons

  • arch-01-binary