Operating Systems

Containerization from the Inside

Docker changed software development: "works on my machine" became far less common. Kubernetes orchestrates containers at massive scale in the clouds of Google, AWS, and Microsoft. Yet few people understand what happens under the hood. A container is a combination of three Linux mechanisms: namespaces (isolation), cgroups (resource limits), and OverlayFS (efficient storage). Understanding these mechanisms is the key to debugging production incidents, optimizing performance, and designing cloud-native systems.

**Google starts 2+ billion containers per week** through Borg (the predecessor of Kubernetes). Google's services (Search, Gmail, YouTube) run in containers. Understanding namespaces and cgroups is critical for SRE work at this scale.
**Why does a Kubernetes Pod crash with OOMKilled?** This is not a Kubernetes bug; it is the cgroup memory limit in action. The process consumed more memory than allowed by `resources.limits.memory`, and the kernel's OOM killer terminated it. Debugging: `kubectl describe pod` → container `Last State: Terminated` → `Reason: OOMKilled`. Fix: increase the limit or reduce the application's memory usage.
**A Docker image weighs 2 GB, but the build takes 10 seconds.** This is layer caching in action (the layers themselves are stacked with OverlayFS). Docker reuses cached layers: after a one-line code change, only the last layer (file copying) is rebuilt, while the other layers (apt-get install, npm install) come from the cache. Understanding layers can speed up CI/CD builds by an order of magnitude.

Lesson Objectives

Know the Linux namespaces: PID, NET, MNT, UTS, IPC, USER, CGROUP
Cgroups v1 vs v2: hierarchy, memory/cpu/io controllers
OverlayFS: union mount, copy-up on write, layered images
Container runtime stack: runc (low-level), containerd, Docker, Podman
Container security: rootless containers, user namespaces, seccomp-bpf profiles

Linux Namespaces - Process Isolation

**Linux Namespaces** - a system resource isolation mechanism that creates the illusion that a process is running in a separate operating system. This is the foundation of containerization: Docker, Kubernetes, containerd - all are built on namespaces.

Analogy: an apartment building. Each apartment (namespace) is isolated, has its own door (PID 1), its own windows (network), its own meters (filesystems). Residents of one apartment do not see what happens in the neighboring ones. But physically, all apartments are in one building (one Linux kernel).

**Types of namespaces in Linux:**
- **PID namespace** - isolation of process IDs (the container sees its own PID 1)
- **Network namespace** - separate network stack (IP, routes, firewall)
- **Mount namespace** - isolation of filesystem mount points
- **UTS namespace** - separate hostname and domain name
- **IPC namespace** - isolation of interprocess communication (SysV IPC, POSIX queues)
- **User namespace** - UID/GID mapping (root in the container ≠ root on the host)
- **Cgroup namespace** - hiding the cgroups hierarchy from the process
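On Linux, the namespaces a process belongs to are visible as symlinks under `/proc/<pid>/ns`, with targets of the form `pid:[4026531836]`. The sketch below reads them and compares two processes; the function names are illustrative, and it simply returns an empty result where procfs is not available.

```python
import os
import re

# Each entry in /proc/<pid>/ns is a symlink whose target looks like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid: str = "self", proc_root: str = "/proc") -> dict[str, int]:
    """Return {entry name: inode} for a process, or {} if procfs is unavailable."""
    ns_dir = os.path.join(proc_root, pid, "ns")
    result: dict[str, int] = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable entry, e.g. insufficient permissions
        result[name] = inode
    return result


def same_namespace(a: dict[str, int], b: dict[str, int], kind: str) -> bool:
    """Two processes share a namespace when the inode numbers match."""
    return kind in a and kind in b and a[kind] == b[kind]


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name:20} {inode}")
```

Running it on the host and again inside a container should show different inode numbers for every namespace type the container has unshared.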

**PID namespace:** Each container has its own isolated process hierarchy. A process with PID 1 inside the container may actually be process 15234 on the host. Processes inside the container do not see host processes.
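The double identity of a process can be observed from the host: the `NSpid` line in `/proc/<pid>/status` lists the process ID in each nested PID namespace, outermost first. The following sketch parses that line; the sample text is made up to match the numbers used above.

```python
def parse_nspid(status_text: str) -> list[int]:
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field missing (older kernel or truncated input)


# Illustrative excerpt of /proc/15234/status as seen from the host.
SAMPLE_STATUS = "Name:\tnginx\nPid:\t15234\nNSpid:\t15234\t1\n"

if __name__ == "__main__":
    pids = parse_nspid(SAMPLE_STATUS)
    print(f"host PID: {pids[0]}, PID inside the container: {pids[-1]}")
```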

Network namespace - how Docker creates virtual networks

Running `docker run -p 8080:80 nginx` causes Docker to create:
1. **A new network namespace** with its own loopback (127.0.0.1) and network stack
2. **A pair of veth devices** (a virtual Ethernet cable): one end in the container (eth0), the other on the host (vethXXX)
3. **Bridge docker0** on the host, to which all veth devices are connected
4. **iptables rules** for port forwarding 8080 (host) → 80 (container)

Each container thinks it has a real network card eth0 with an IP address. In reality, it is a virtual device in an isolated namespace.
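The four steps can be approximated by hand with `ip` and `iptables`. The sketch below only builds the command lines and does not execute them (running them requires root). The namespace name, interface names, and IP address are assumptions chosen for illustration, and Docker's real setup differs in detail (for example, it renames the container end to `eth0` and uses its own iptables chains).

```python
import shlex


def container_network_commands(
    netns: str = "demo",              # hypothetical namespace name
    bridge: str = "docker0",
    container_ip: str = "172.17.0.2",  # assumed address on the bridge subnet
    prefix_len: int = 16,
    host_port: int = 8080,
    container_port: int = 80,
) -> list[list[str]]:
    """Build (but do not run) commands that mimic a bridged container network."""
    host_if, ctr_if = "vethhost0", "vethctr0"  # hypothetical interface names
    in_ns = ["ip", "netns", "exec", netns]
    return [
        # 1. new network namespace
        ["ip", "netns", "add", netns],
        # 2. veth pair, then move one end into the namespace
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ctr_if],
        ["ip", "link", "set", ctr_if, "netns", netns],
        # 3. attach the host end to the bridge and bring it up
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # address and link state inside the namespace
        in_ns + ["ip", "addr", "add", f"{container_ip}/{prefix_len}", "dev", ctr_if],
        in_ns + ["ip", "link", "set", ctr_if, "up"],
        # 4. forward the host port to the container port
        ["iptables", "-t", "nat", "-A", "PREROUTING", "-p", "tcp",
         "--dport", str(host_port), "-j", "DNAT",
         "--to-destination", f"{container_ip}:{container_port}"],
    ]


if __name__ == "__main__":
    for command in container_network_commands():
        print(shlex.join(command))
```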

**User namespace - container security:** Without a user namespace, a process with UID 0 in the container is also UID 0 (root) on the host, restrained only by capabilities, seccomp, and similar filters. If such a process escapes the container, it can gain full control over the system.

With a user namespace:
- UID 0 in the container → an unprivileged UID on the host (for example, UID 1000, a regular user)
- Even if the process escapes, it has no root privileges on the host

Docker supports user namespaces as an opt-in (`--userns-remap`). Rootless Podman uses them by default. In Kubernetes they are opt-in per Pod (`hostUsers: false`).
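The kernel exposes the mapping in `/proc/<pid>/uid_map`, one range per line in the form `inside_start outside_start length`. The sketch below translates a container UID to a host UID; the mapping `0 100000 65536` is an assumed example of a remapped range, not a value taken from any particular setup.

```python
from typing import NamedTuple, Optional


class IdMapping(NamedTuple):
    inside_start: int   # first ID as seen inside the user namespace
    outside_start: int  # corresponding first ID in the parent namespace
    length: int         # number of consecutive IDs covered


def parse_id_map(text: str) -> list[IdMapping]:
    """Parse the contents of a uid_map or gid_map file."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            mappings.append(IdMapping(*(int(field) for field in fields)))
    return mappings


def to_host_id(container_id: int, mappings: list[IdMapping]) -> Optional[int]:
    """Translate an ID from inside the namespace; None means 'unmapped'."""
    for mapping in mappings:
        offset = container_id - mapping.inside_start
        if 0 <= offset < mapping.length:
            return mapping.outside_start + offset
    return None


if __name__ == "__main__":
    remapped = parse_id_map("         0     100000      65536\n")
    print("container root is host UID", to_host_id(0, remapped))
```

With an identity mapping (`0 0 4294967295`, which is what a process outside any user namespace typically shows), container root stays host root, which is exactly the risky case described above.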

Three Docker containers are running. What will a process with PID 1 inside the first container see when executing `ps aux`?

Control Groups - Resource Management

**Control Groups (cgroups)** - a Linux kernel mechanism for limiting, accounting, and isolating resources (CPU, memory, disk I/O, network) for groups of processes. If namespaces are responsible for visibility isolation, then cgroups control resource access.

Namespaces say: "the process does not see its neighbors." Cgroups say: "the process can use at most 512 MB of RAM and 0.5 CPU cores." Together they create a container: an isolated environment with bounded resources.

**Main cgroup controllers (v2):**
- **cpu** - CPU time limits (CFS bandwidth) and relative weights
- **memory** - RAM limitation (hard/soft limits, OOM killer)
- **io** - control of disk operations (bandwidth, IOPS)
- **pids** - maximum number of processes in the group
- **cpuset** - binding to specific CPU cores

Two features that were separate controllers in cgroups v1 work differently in v2:
- **devices** - access to devices (/dev) is controlled by eBPF programs attached to the cgroup
- **freezer** - pausing all processes in the group is done through the `cgroup.freeze` file

How Kubernetes uses cgroups for resource limits

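A Kubernetes manifest:

```yaml
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"      # 0.5 CPU cores
  requests:
    memory: "256Mi"
    cpu: "250m"
```

Kubernetes (via containerd/CRI-O) configures the container's cgroup (v2 file names shown):
- `memory.max = 536870912` (512 × 1024 × 1024 bytes; hard limit, OOM kill on exceedance)
- `cpu.max = "50000 100000"` (50 ms of CPU time per 100 ms period, i.e. 50% of one core)
- `cpu.weight` is derived from the CPU request (`250m`) and sets the container's relative share under contention

The memory request (`256Mi`) is used mainly for scheduling and eviction decisions. By default it is not written to a cgroup file; with the optional Memory QoS feature it is mapped to `memory.min`.

If a process in the pod tries to consume 600 MB, the **OOM killer kills it**. If it tries to load the CPU to 100%, **throttling** limits it to 50%.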
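The arithmetic behind the limit values can be sketched in a few lines. This is a simplified converter that handles only the common suffixes (`Ki`/`Mi`/`Gi`/`Ti`, `k`/`M`/`G`/`T`, and `m` for millicores), not the full Kubernetes quantity syntax, and the 100 ms period is the usual default rather than something read from a cluster.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL_SUFFIXES = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" to bytes (memory.max)."""
    for suffix, factor in {**_BINARY_SUFFIXES, **_DECIMAL_SUFFIXES}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_max(quantity: str, period_us: int = 100_000) -> str:
    """Build the "quota period" string written to cpu.max (microseconds)."""
    quota_us = cpu_to_millicores(quantity) * period_us // 1000
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    print("memory.max =", memory_to_bytes("512Mi"))
    print("cpu.max    =", cpu_max("500m"))
```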

**Memory cgroup and OOM Killer:** When the processes in a cgroup exceed `memory.max` and the kernel cannot reclaim enough memory, it triggers the Out-Of-Memory killer for that cgroup. The OOM killer selects a victim among the cgroup's processes based on:
1. **oom_score** - a heuristic for "how bad is this process", driven mostly by how much memory it uses
2. **oom_score_adj** - manual adjustment (from -1000 to +1000)

The Docker daemon lowers its own `oom_score_adj` so that containers are more likely to be killed than the daemon; a container's value can be changed with `--oom-score-adj`.
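Whether a cgroup has actually hit its limit can be checked in its `memory.events` file (cgroups v2), which holds counters such as `max`, `oom`, and `oom_kill`. A minimal parser, assuming the file content has already been read from a path like `/sys/fs/cgroup/<group>/memory.events` (the sample text below is invented):

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse "name value" lines from a cgroup v2 memory.events file."""
    events: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def was_oom_killed(events: dict[str, int]) -> bool:
    """True if at least one process in the cgroup was killed by the OOM killer."""
    return events.get("oom_kill", 0) > 0


# Invented sample: the limit was hit 3 times and one process was killed.
SAMPLE_EVENTS = "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"

if __name__ == "__main__":
    events = parse_memory_events(SAMPLE_EVENTS)
    print(events, "-> OOM killed:", was_oom_killed(events))
```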

**CPU throttling in action:** If a container with the limit `cpu.max = "50000 100000"` (50% of one core) tries to use more CPU:
- The kernel tracks the CPU time used within each period (100 ms)
- As soon as the container exhausts its quota (50 ms), it is **throttled**: its processes are no longer scheduled
- For the remainder of the period, the container receives no CPU time
- In the next period, the quota is refreshed

This ensures that a "noisy neighbor" does not steal CPU from other containers.
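The quota-per-period behavior can be illustrated with a toy simulation. It models a single-threaded workload whose unfinished work carries over to the next period; it is a simplification for intuition, not a model of the real CFS scheduler.

```python
def simulate_cfs(
    demand_ms: list[float],
    quota_ms: float = 50.0,
    period_ms: float = 100.0,
) -> list[tuple[float, bool]]:
    """For each period return (CPU ms actually received, was it throttled)."""
    backlog = 0.0
    timeline = []
    for demand in demand_ms:
        pending = backlog + demand
        # A single thread cannot use more CPU time than the wall-clock period.
        runnable = min(pending, period_ms)
        ran = min(runnable, quota_ms)
        throttled = runnable > quota_ms
        backlog = pending - ran  # unfinished work waits for the next period
        timeline.append((ran, throttled))
    return timeline


if __name__ == "__main__":
    # Work arriving per 100 ms period, in ms of CPU time.
    for index, (ran, throttled) in enumerate(simulate_cfs([30, 80, 80, 0, 0])):
        state = "throttled" if throttled else "ok"
        print(f"period {index}: ran {ran:.0f} ms ({state})")
```

In this run, two bursts of 80 ms take three throttled periods to drain, which is how a CPU limit turns into added latency even when average usage looks low.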

A container with a memory limit of 512 MB tries to allocate 600 MB RAM. What will happen?

OverlayFS - Layered Filesystem

**OverlayFS** - a union filesystem that combines multiple directories into one virtual filesystem. This trick allows Docker images to be small and containers to start instantly.

Analogy: transparent films with drawings. Stacked on top of each other, they form a combined picture. Lower layers (read-only) are the base image. The upper layer (read-write) holds the container's changes. Deleting the container removes the upper layer; the base image remains untouched.

**OverlayFS Structure:**
- **Lower layers (lowerdir)** - read-only layers of the base image (Ubuntu, Nginx, application code)
- **Upper layer (upperdir)** - read-write layer for container changes
- **Work directory (workdir)** - temporary directory for atomic operations
- **Merged view** - the combined filesystem seen by the container

How Docker uses layers to save space

Consider 10 containers based on the same Ubuntu 22.04 image (77 MB). Without OverlayFS, that would require 770 MB. With OverlayFS:
- **1 set of lower layers** with Ubuntu (77 MB), shared among all containers
- **10 upper layers** (one per container), holding only changes (usually kilobytes)

Total: ~77 MB instead of 770 MB, a saving of about 90%.

When starting a container, Docker does not copy the image. It simply mounts the lower layers read-only and creates an empty upper layer, so container startup is nearly instant.
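The saving is simple arithmetic: the image is counted once when layers are shared and once per container when they are not. A small sketch of that calculation (the function name and parameters are illustrative):

```python
def disk_usage_mb(
    image_mb: float,
    containers: int,
    written_per_container_mb: float = 0.0,
    shared_layers: bool = True,
) -> float:
    """Total disk usage: image copies plus each container's own writes."""
    image_copies = 1 if shared_layers else containers
    return image_copies * image_mb + containers * written_per_container_mb


if __name__ == "__main__":
    shared = disk_usage_mb(77, 10)
    copied = disk_usage_mb(77, 10, shared_layers=False)
    print(f"shared: {shared} MB, copied: {copied} MB, "
          f"saving: {1 - shared / copied:.0%}")
```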

**Copy-on-Write (CoW):** When a process in a container modifies a file from the lower layer, OverlayFS copies it to the upper layer and modifies the copy. The original in the lower remains untouched. This is called Copy-on-Write.

**Whiteout files - deletion in OverlayFS:** How can a file from the lower layer (read-only) be deleted? The lower layer is immutable. Solution: a **whiteout file** (character device 0/0) is created in the upper layer:

```bash
rm overlay/merged/file.txt

# A whiteout marker appeared in the upper layer
ls -l overlay/upper/
# c--------- 1 root root 0, 0 Dec 25 12:00 file.txt
```

OverlayFS sees the whiteout in the upper layer and hides the file from the lower layer. The file is "deleted" in the merged view, but physically remains in the lower layer (the image is untouched).
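Copy-on-write and whiteouts can be modeled with plain dictionaries standing in for directories. This is a toy model for intuition only: the class and its methods are invented for this example and ignore directories, permissions, and whiteouts inside lower layers.

```python
from typing import Optional

WHITEOUT = object()  # stands in for the 0/0 character device


class OverlaySketch:
    """Toy union of read-only lower layers and one writable upper layer."""

    def __init__(self, *lower_layers: dict[str, str]):
        self.lowers = list(lower_layers)  # ordered bottom to top, never modified
        self.upper: dict[str, object] = {}

    def _lookup_lower(self, path: str) -> Optional[str]:
        for layer in reversed(self.lowers):  # topmost lower layer wins
            if path in layer:
                return layer[path]
        return None

    def read(self, path: str) -> Optional[str]:
        if path in self.upper:
            entry = self.upper[path]
            return None if entry is WHITEOUT else str(entry)
        return self._lookup_lower(path)

    def append(self, path: str, data: str) -> None:
        # Copy-up: existing content is copied into upper, then modified there.
        current = self.read(path)
        self.upper[path] = (current or "") + data

    def delete(self, path: str) -> None:
        if self.read(path) is None:
            raise FileNotFoundError(path)
        if self._lookup_lower(path) is not None:
            self.upper[path] = WHITEOUT  # hide the lower file
        else:
            self.upper.pop(path, None)   # file existed only in upper

    def merged(self) -> dict[str, str]:
        view: dict[str, str] = {}
        for layer in self.lowers:
            view.update(layer)
        for path, entry in self.upper.items():
            if entry is WHITEOUT:
                view.pop(path, None)
            else:
                view[path] = str(entry)
        return view
```

For example, after `fs = OverlaySketch({"/file.txt": "hello"})` and `fs.delete("/file.txt")`, `fs.merged()` no longer lists the file, while the dictionary passed as the lower layer still contains it.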

Why Docker image build should be optimized by layers

Bad Dockerfile:

```dockerfile
FROM node:18
WORKDIR /app
# All code in one layer
COPY . .
# Dependencies installed after the code
RUN npm install
```

Changing **one file** of code invalidates the COPY layer and everything after it. Docker reruns `npm install` (downloading packages again), which wastes time.

Good Dockerfile:

```dockerfile
FROM node:18
WORKDIR /app
# Dependency manifests separately
COPY package*.json ./
# Cached while package*.json is unchanged
RUN npm install
# Code in the last layer
COPY . .
```

Now changing the code does not invalidate `npm install`. Docker reuses the cached layer, and the build takes seconds instead of minutes.
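The invalidation rule can be modeled as a hash chain: each layer's cache key depends on its parent's key, the instruction text, and a digest of any files the instruction copies. The sketch below is a simplified model for illustration, not Docker's actual cache-key derivation, and the digests are placeholder strings.

```python
import hashlib

Instruction = tuple[str, str]  # (instruction text, digest of copied files or "")


def layer_keys(instructions: list[Instruction]) -> list[str]:
    """Chain the cache keys: a change in one layer changes all later keys."""
    keys = []
    parent = "scratch"
    for text, input_digest in instructions:
        material = f"{parent}\n{text}\n{input_digest}".encode()
        parent = hashlib.sha256(material).hexdigest()
        keys.append(parent)
    return keys


def cache_hits(old: list[Instruction], new: list[Instruction]) -> list[bool]:
    """For each instruction of the new build, was its layer already cached?"""
    old_keys, new_keys = layer_keys(old), layer_keys(new)
    return [i < len(old_keys) and old_keys[i] == key
            for i, key in enumerate(new_keys)]


def bad_dockerfile(src: str) -> list[Instruction]:
    return [("FROM node:18", ""), ("WORKDIR /app", ""),
            ("COPY . .", src), ("RUN npm install", "")]


def good_dockerfile(pkg: str, src: str) -> list[Instruction]:
    return [("FROM node:18", ""), ("WORKDIR /app", ""),
            ("COPY package*.json ./", pkg), ("RUN npm install", ""),
            ("COPY . .", src)]


if __name__ == "__main__":
    print("bad: ", cache_hits(bad_dockerfile("src-v1"), bad_dockerfile("src-v2")))
    print("good:", cache_hits(good_dockerfile("pkg-v1", "src-v1"),
                              good_dockerfile("pkg-v1", "src-v2")))
```

With the bad ordering, a source change misses the cache for both `COPY` and `npm install`; with the good ordering, only the final `COPY` is rebuilt.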

5 containers run from one Ubuntu image (size 100 MB). Each container wrote 10 MB of logs. How much disk space is occupied?

Container Runtime - from Docker to Kubernetes

**Container Runtime** - a program that manages the lifecycle of containers: creating namespaces, setting up cgroups, mounting OverlayFS, starting the process. Docker, containerd, CRI-O, runc - all these are container runtimes at different levels of abstraction.

**Levels of container runtimes:**
- **High-level runtime** - manages images, networks, volumes (Docker, containerd, CRI-O)
- **Low-level runtime** - creates a container from a bundle (runc, crun, kata-runtime)
- **CRI (Container Runtime Interface)** - the Kubernetes standard for talking to runtimes

**runc - reference implementation of OCI (Open Container Initiative).** The OCI Runtime Spec describes a JSON config:
- Which rootfs to use (OverlayFS merged)
- Which namespaces to create (pid, net, mount, uts, ipc, user)
- Which cgroups to set up (memory.max, cpu.max)
- Which process to run (entrypoint + args)
- Which capabilities to give the process (CAP_NET_ADMIN, CAP_SYS_ADMIN)
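A trimmed-down `config.json` covering these five points can be generated as follows. The field names follow the OCI Runtime Spec (note that the spec spells the namespace types `network` and `mount`), but the values are assumptions for illustration, and the result omits parts such as `mounts` that a real bundle would normally need.

```python
import json


def minimal_oci_config(
    rootfs: str,
    args: list[str],
    memory_limit_bytes: int,
    cpu_quota_us: int,
    cpu_period_us: int = 100_000,
) -> dict:
    """Build a partial OCI runtime config (illustrative, not a complete bundle)."""
    capabilities = ["CAP_CHOWN", "CAP_NET_RAW"]  # assumed minimal set
    namespace_types = ("pid", "network", "mount", "uts", "ipc", "user")
    return {
        "ociVersion": "1.0.2",
        "root": {"path": rootfs, "readonly": False},
        "process": {
            "args": args,
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
            "capabilities": {
                "bounding": capabilities,
                "effective": capabilities,
                "permitted": capabilities,
            },
        },
        "linux": {
            "namespaces": [{"type": kind} for kind in namespace_types],
            # Assumed ID range: container root maps to an unprivileged host ID.
            "uidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
            "gidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
            "resources": {
                "memory": {"limit": memory_limit_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
            },
        },
    }


if __name__ == "__main__":
    config = minimal_oci_config("rootfs", ["nginx", "-g", "daemon off;"],
                                512 * 1024 * 1024, 50_000)
    print(json.dumps(config, indent=2))
```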

From Docker to Kubernetes: evolution of container runtime

**2013-2016: Docker as a monolith**

Docker Engine was all-in-one: build, pull, run, networking. Kubernetes talked to Docker through the Docker API, later via the `dockershim` adapter.

**2016: containerd split out of Docker**

Docker was split into components:
- `dockerd` - the daemon: API, build (the `docker` CLI is its client)
- `containerd` - container and image management
- `runc` - low-level runtime

Kubernetes later began supporting containerd directly (bypassing Docker).

**2020-2022: dockershim deprecated, then removed**

Kubernetes 1.20 (December 2020) deprecated `dockershim`, and Kubernetes 1.24 (2022) removed it. Now containerd or CRI-O is used directly via CRI. Containers and images remained the same (OCI standard); only the runtime changed.

**Why containerd, not Docker?**

Docker is too "heavy" for Kubernetes (build, volumes, networks, swarm). Kubernetes only needs a runtime. containerd is minimalist and focused on running containers.

**CNI (Container Network Interface):** Plugins for setting up container networks. When containerd creates a container, it creates a network namespace and then calls a CNI plugin (bridge, flannel, calico). For a bridge-style setup the overall sequence is:
1. Create a network namespace (done by the runtime before the plugin is invoked)
2. Create a veth pair (virtual cable)
3. Connect one end to the bridge, the other to the container's namespace
4. Assign an IP address (via an IPAM plugin)
5. Set up routes and iptables

**Security: capabilities and seccomp**

Linux capabilities are granular privileges that replace "all or nothing" root. By default, Docker gives the container a limited set of capabilities, including:
- `CAP_NET_RAW` - create raw sockets (ping)
- `CAP_CHOWN` - change file ownership

But it **does NOT give**:
- `CAP_SYS_ADMIN` - mount filesystems, change namespaces
- `CAP_NET_ADMIN` - configure the network (iptables)

**seccomp** (Secure Computing Mode) filters system calls. Docker's default profile is an allowlist that blocks or restricts dangerous syscalls such as `reboot()`, `swapon()`, and `mount()`. Even if a process is compromised, its ability to harm the host is sharply reduced.
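The capabilities a process actually holds are listed as hexadecimal bitmasks in `/proc/<pid>/status` (for example the `CapEff` line). The sketch below decodes such a mask for the four capabilities named above, using their bit numbers from the Linux headers. The sample mask `00000000a80425fb` is the value commonly seen in a default Docker container; treat it as illustrative input rather than a guaranteed value.

```python
# Bit positions from <linux/capability.h>; only the capabilities named above.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def parse_cap_eff(status_text: str) -> str:
    """Extract the CapEff hex mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("CapEff line not found")


def has_capability(mask_hex: str, name: str) -> bool:
    """Check whether the bit for the named capability is set in the mask."""
    return bool((int(mask_hex, 16) >> CAP_BITS[name]) & 1)


# Illustrative status excerpt for a process in a default Docker container.
SAMPLE_STATUS = "Name:\tnginx\nCapEff:\t00000000a80425fb\n"

if __name__ == "__main__":
    mask = parse_cap_eff(SAMPLE_STATUS)
    for cap in CAP_BITS:
        print(f"{cap:15} {'yes' if has_capability(mask, cap) else 'no'}")
```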

Myth: Containers are lightweight virtual machines

Reality: Containers are isolated processes on the host system that share a single Linux kernel

A VM emulates hardware: each VM has its own kernel, init system, and full OS. The hypervisor isolates VMs through hardware virtualization (Intel VT-x, AMD-V). A container is a regular Linux process, isolated through namespaces and cgroups. **One kernel** serves all containers. A container starts in milliseconds (essentially clone + exec), while a VM takes seconds (it has to boot a kernel).

Therefore:
- A container with Ubuntu cannot run on Windows without a Linux kernel, which WSL2 or a VM provides
- A VM with Ubuntu works on any hypervisor (VMware, VirtualBox, KVM)

Containers are lighter (MB instead of GB) and faster (ms instead of seconds), but less isolated (shared kernel = shared attack surface). Where stronger isolation is needed, Kata Containers (containers in micro-VMs) are an option.

Key Ideas

**Namespaces isolate resource visibility.** PID namespace - the container sees only its own processes (PID 1 inside ≠ PID on the host). Network namespace - separate IP stack (veth pairs + bridge). Mount namespace - filesystem isolation. User namespace - root in the container ≠ root on the host (security).
**Cgroups limit resource consumption.** Memory cgroup: `memory.max` = 512 MiB → OOM killer on exceedance. CPU cgroup: `cpu.max = "50000 100000"` → throttling to 50% of one core. IO cgroup: disk bandwidth limitation. Kubernetes `resources.limits` map directly to cgroup files.
**OverlayFS saves space and speeds up startup.** Base image (Ubuntu, Nginx) - read-only layers, shared among containers. Container changes - separate read-write layer (upper). Copy-on-Write: modifying a file from lower → copied to upper. 10 containers with one image = 1× image size + N× changes size (usually MB).
**Container runtime - layers of abstraction.** Docker/containerd (high-level): pull images, manage network, storage. runc (low-level): create namespaces, set up cgroups, exec process. Kubernetes uses containerd/CRI-O via CRI API. Docker images = OCI images (universal standard).

Questions for Reflection

Docker containers are considered less isolated than VMs. What attacks are possible through the shared Linux kernel? How to protect (user namespaces, seccomp, AppArmor)?
Kubernetes Pod with 3 containers: do they share a PID namespace or does each have its own? What about the network namespace? Why can containers in one Pod communicate via localhost?
A serverless platform (AWS Lambda, Cloud Run): how can a container start quickly enough to handle an HTTP request (cold start < 100 ms)? What optimizations are needed (pre-warmed containers, snapshot restore, Firecracker microVMs)?