DevOps

Kubernetes: Architecture

Where Kubernetes Came From

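2014. Google is launching roughly 2 billion containers per week through an internal cluster manager called Borg, which powers Search, Maps, and YouTube on a single infrastructure layer. Outside Google, nothing comparable is publicly available. A team of Google engineers builds Kubernetes as an open-source system inspired by Borg and releases it to the world. Within about five years it is widely described as the de facto operating system of the cloud.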

Control Plane: The Brain


Kubernetes splits into two planes: the control plane knows what should exist; the data plane makes it exist. The control plane runs on dedicated control plane nodes (historically called master nodes), which by default are tainted so that user containers are not scheduled there. It only makes decisions.

API Server

**kube-apiserver** is the single component that everything else talks to. `kubectl apply`, the dashboard, CI/CD pipelines, the cluster components themselves - all go through its REST API. It authenticates and validates requests, enforces RBAC, persists objects to etcd, and streams change notifications to clients that watch those objects. Control plane components do not talk to each other directly, only through the API server. One communication channel, one audit trail. This is a deliberate architectural constraint.
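The request path described above can be pictured with a toy in-memory model. This is only a sketch under loud assumptions: `ToyApiServer`, its `permissions` set, and the `apply`/`watch` methods are hypothetical names invented for illustration, and a plain dict stands in for etcd. The real API server has more stages (authentication, admission control) than shown here.

```python
class ToyApiServer:
    """Hypothetical in-memory stand-in for kube-apiserver; a dict plays the role of etcd."""

    def __init__(self, permissions):
        self.permissions = permissions  # toy RBAC: set of (user, verb, kind) tuples
        self.store = {}                 # stands in for etcd
        self.watchers = []              # callbacks registered by watching components
        self.audit_log = []             # one entry per request, allowed or not

    def watch(self, callback):
        self.watchers.append(callback)

    def apply(self, user, kind, name, spec):
        self.audit_log.append((user, "apply", kind, name))  # single audit trail
        if (user, "apply", kind) not in self.permissions:    # authorization
            raise PermissionError(f"{user} may not apply {kind}")
        if not isinstance(spec, dict) or not name:            # validation
            raise ValueError("invalid object")
        self.store[(kind, name)] = spec                       # persistence
        for notify in self.watchers:                          # notify watchers
            notify(kind, name, spec)


server = ToyApiServer({("alice", "apply", "Pod")})
seen = []
server.watch(lambda kind, name, spec: seen.append((kind, name)))
server.apply("alice", "Pod", "web", {"image": "nginx"})
print(seen)
```

Because every write passes through one method, the watcher and the audit log see the same stream of changes, which is the point of the single-channel constraint.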

etcd

The only stateful component in the control plane. A distributed key-value store built on Raft consensus, it holds all cluster state: which Pods should be running, where they are running, which ConfigMaps and Secrets exist. If etcd is lost and there is no backup, the cluster state is lost with it, and the API server can no longer read or persist objects. Kubernetes without etcd has no knowledge of itself. HA clusters typically run 3 or 5 etcd members: an odd count, because quorum requires a majority.
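The preference for odd member counts follows from majority arithmetic. A minimal sketch (the function names `quorum` and `tolerated_failures` are illustrative, not part of etcd):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure, which is why an even count adds cost but no resilience.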

Scheduler

**kube-scheduler** watches for Pods that have no assigned node (they show as `Pending`) and decides where to place them. The algorithm has two phases: filtering (eliminate nodes that cannot accept the Pod because of insufficient resources, untolerated taints, or unsatisfied affinity rules) and scoring (rank the remaining nodes using a set of weighted scoring rules). The scheduler does not start containers - it writes its decision to the API server by binding the Pod to a node. The kubelet on the chosen node then sees the newly assigned Pod and starts it.
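The two phases can be sketched in a few lines. This is a toy model, not the scheduler's actual plugin framework: the `Node` and `Pod` classes, their fields, and the "most free capacity wins" scoring rule are assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_memory_mib: int
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_millis: int
    memory_mib: int
    tolerations: set = field(default_factory=set)


def fits(pod: Pod, node: Node) -> bool:
    """Filtering phase: reject nodes that cannot run the Pod at all."""
    enough_resources = (node.free_cpu_millis >= pod.cpu_millis
                        and node.free_memory_mib >= pod.memory_mib)
    taints_tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return enough_resources and taints_tolerated


def score(pod: Pod, node: Node) -> tuple:
    """Scoring phase (toy rule): prefer most CPU left after placement, then most memory."""
    return (node.free_cpu_millis - pod.cpu_millis,
            node.free_memory_mib - pod.memory_mib)


def schedule(pod: Pod, nodes: list):
    """Return the chosen node name, or None if the Pod must stay Pending."""
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-a", 500, 1024),             # too little CPU for the Pod below
    Node("node-b", 4000, 8192, {"gpu"}),   # tainted
    Node("node-c", 2000, 4096),
]
print(schedule(Pod("web", 1000, 512), nodes))
```

In the real system the result is not returned to a caller but written to the API server as a binding, as the paragraph above describes.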

Controller Manager

**kube-controller-manager** runs dozens of controllers in a single process. The ReplicaSet controller watches replica counts: if 3 replicas are desired but only 2 are running, it creates one more; if 4 are running, it deletes the extra. The Node controller marks nodes as `NotReady` when heartbeats stop. The Endpoints controller updates the list of Pod IPs behind each Service. Every controller is an infinite reconciliation loop: read the desired state, compare it to the actual state, close the gap.

The reconciliation loop pattern underpins all of Kubernetes. Desired state is declared through the API server and stored in etcd. Controllers continuously close the gap between reality and the declaration. This is level-triggered (not edge-triggered): controllers react to the current state, not to individual events. Missing an event is harmless - the next reconciliation cycle will fix the drift.
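A level-triggered controller can be reduced to a pure function of desired and actual state. The sketch below is illustrative only: `reconcile` and the action tuples are hypothetical, and a real ReplicaSet controller chooses which Pods to delete with more care.

```python
def reconcile(desired_replicas: int, running_pods: list) -> list:
    """Return the actions that move the actual state toward the desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Too many Pods: delete the surplus (here simply the last ones in the list).
        return [("delete", name) for name in running_pods[diff:]]
    return []


pods = ["web-a", "web-b", "web-c"]
pods.remove("web-b")          # a Pod dies; assume the controller never saw the event
print(reconcile(3, pods))     # the next cycle still finds the gap by comparing state
```

The function never asks what happened, only what is true now, so a lost notification costs at most one cycle of delay.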

How does kube-scheduler record its placement decision?

Data Plane: The Hands

Worker nodes do not make decisions. They execute. Three components run on each: kubelet starts containers through a container runtime, kube-proxy programs Service networking, and a CNI plugin provides Pod connectivity. Services like Mastodon, Twitter, or GitLab, wherever they are deployed on Kubernetes, ultimately run on nodes like these, on someone's cloud servers.

kubelet

The primary agent on each worker node. kubelet registers the node with the cluster, then watches the API server for Pods assigned to its node. When a new Pod appears, kubelet instructs the container runtime (containerd, CRI-O) to start the required containers. When a Pod should terminate, kubelet has the runtime send SIGTERM, waits up to `terminationGracePeriodSeconds`, then sends SIGKILL to anything still running. With default settings kubelet sends a heartbeat to the API server every 10 seconds. If heartbeats stop, the node controller marks the node `NotReady` after about 40 seconds, and Pods on it are evicted roughly 5 minutes later. All of these intervals are configurable.
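The heartbeat timeline can be written down as a small classifier. The constants below are the figures quoted in the text; the function name and return strings are made up for illustration, and the sketch assumes the 5-minute eviction timer starts when the node becomes `NotReady` rather than at the last heartbeat.

```python
HEARTBEAT_INTERVAL_S = 10        # how often kubelet reports in
NOT_READY_AFTER_S = 40           # silence before the node is marked NotReady
EVICT_AFTER_NOT_READY_S = 300    # assumption: counted from the NotReady transition


def node_status(seconds_since_heartbeat: float) -> str:
    """Classify a node by how long ago its last heartbeat arrived."""
    if seconds_since_heartbeat < NOT_READY_AFTER_S:
        return "Ready"
    if seconds_since_heartbeat < NOT_READY_AFTER_S + EVICT_AFTER_NOT_READY_S:
        return "NotReady"
    return "NotReady, Pods evicted"


for age in (5, 45, 400):
    print(age, node_status(age))
```

With these numbers a node can miss three consecutive heartbeats and still be considered `Ready`, which keeps brief network hiccups from triggering rescheduling.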

kube-proxy

A Service in Kubernetes is a virtual IP that load-balances traffic across Pods. kube-proxy implements this abstraction. On each node it watches Service and endpoint objects (Endpoints or EndpointSlice), then programs iptables or IPVS rules. When a packet is sent to ClusterIP:port, the iptables/IPVS rules intercept it and rewrite the destination to one of the Pod IPs. In these modes kube-proxy does not proxy traffic itself (despite the name) - it only manages kernel rules.
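The split between "programming rules" and "forwarding packets" can be modelled with a lookup table. Everything here is a simplification: the `rules` dict stands in for kernel iptables/IPVS state, `sync_rules` for kube-proxy, `route` for the kernel's per-connection decision, and the IP addresses are invented examples.

```python
import random

# Hypothetical rule table: (ClusterIP, port) -> list of backend (Pod IP, port) pairs.
rules = {}


def sync_rules(service_ip, port, pod_endpoints):
    """kube-proxy's job: rewrite the table whenever Services or endpoints change."""
    rules[(service_ip, port)] = list(pod_endpoints)


def route(dst_ip, dst_port, rng=random):
    """The kernel's job: pick one backend and rewrite the packet's destination."""
    backends = rules.get((dst_ip, dst_port))
    if not backends:
        return None  # no matching Service rule
    return rng.choice(backends)


sync_rules("10.96.0.10", 80, [("10.244.1.5", 8080), ("10.244.2.7", 8080)])
print(route("10.96.0.10", 80))
```

Note that `route` never calls `sync_rules`: once the table is written, traffic flows without kube-proxy in the data path.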

CNI Plugin

Container Network Interface (CNI) is the standard plugin interface that the container runtime invokes when a Pod is created. The CNI plugin (Calico, Flannel, Cilium, Weave Net) receives the new Pod's network namespace and must create a network interface, assign an IP, and configure routing so that any Pod can reach any other Pod by IP without NAT. This is the fundamental networking principle in Kubernetes: a flat network. Cilium uses eBPF instead of iptables, which avoids walking long rule chains and can be substantially faster in clusters with thousands of rules.

Kubernetes requires three things: each Pod gets a unique IP, Pods on different nodes communicate directly (no NAT), and agents on a node (such as kubelet) can reach the Pods on that node. The CNI plugin implements these requirements. Plugin choice affects performance and features (not every plugin supports NetworkPolicy).
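One common way to satisfy the unique-IP requirement is to give every node its own Pod CIDR and allocate addresses from it locally. The sketch below assumes that design; the `NodeIpam` class, the CIDRs, and the Pod names are hypothetical, and real plugins also handle address release, routing, and interface setup.

```python
import ipaddress


class NodeIpam:
    """Toy per-node IP allocator: hands out addresses from the node's Pod CIDR."""

    def __init__(self, pod_cidr: str):
        # hosts() skips the network and broadcast addresses of the subnet.
        self._free = iter(ipaddress.ip_network(pod_cidr).hosts())
        self.assigned = {}

    def add_pod(self, pod_name: str):
        ip = next(self._free)  # raises StopIteration when the range is exhausted
        self.assigned[pod_name] = ip
        return ip


node_a = NodeIpam("10.244.1.0/24")
node_b = NodeIpam("10.244.2.0/24")
print(node_a.add_pod("web-1"), node_a.add_pod("web-2"), node_b.add_pod("db-1"))
```

Because the per-node ranges do not overlap, Pod IPs are unique cluster-wide without any coordination between nodes at allocation time.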

Which component ensures that traffic sent to ClusterIP:80 reaches one of the backend Pods?

Pod Lifecycle and Deployment vs StatefulSet

The Pod is the minimum deployable unit in Kubernetes - not a container. One Pod can hold multiple containers that share a network namespace (shared IP, shared ports) and can share volumes. The sidecar pattern (Envoy proxy next to the application in a Service Mesh) works exactly this way: two containers in one Pod on the same localhost.

Pod Lifecycle

A Pod moves through phases: `Pending` (waiting to be scheduled or for images to be pulled), `Running` (bound to a node, with at least one container running or starting), `Succeeded` (all containers exited with code 0 - typical for Jobs), `Failed` (all containers have terminated and at least one exited with a non-zero code), `Unknown` (the Pod's state cannot be obtained, usually because the node's kubelet is not responding). `CrashLoopBackOff` is not a phase - it is a container status reason: the container keeps crashing and kubelet restarts it with exponential backoff (10s, 20s, 40s... capped at 5 minutes).
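The backoff sequence is a doubling delay with a ceiling. A minimal sketch using the figures from the text (the function name and parameters are illustrative):

```python
def crashloop_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> list:
    """Delay before each successive restart: doubles from base_s, capped at cap_s."""
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]


print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After the sixth crash the delay stops growing, so a permanently broken container is retried about once every five minutes.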

Pods are ephemeral. Kubernetes does not heal Pods: when a Pod dies, it is gone permanently. A container inside a Pod can be restarted (if `restartPolicy` allows), but the Pod itself is never resurrected. This is why Pods are rarely created directly - a Deployment or StatefulSet controller creates replacement Pods when they disappear.

Deployment

A Deployment manages a ReplicaSet, which manages a set of identical Pods. Identical is the key word. Each Pod in a Deployment is interchangeable: no stable name, no dedicated persistent storage, no identity. A rolling update creates a new ReplicaSet with the new Pod template (for example, a new image) and gradually scales it up while scaling the old one down; traffic follows because Services route only to ready Pods. A rollback scales the previous ReplicaSet back up. For stateless applications (web servers, APIs) this is the ideal primitive.
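A rolling update can be pictured as two ReplicaSets scaled in opposite directions. The simulation below is a sketch under simplifying assumptions: new Pods become ready instantly, no Pod may be unavailable, and `max_surge` (a name borrowed from the Deployment strategy settings) is the only tuning knob. The function itself is hypothetical.

```python
def rolling_update(replicas: int, max_surge: int = 1) -> list:
    """Return the (old_pods, new_pods) counts after each scaling step."""
    if max_surge < 1:
        raise ValueError("this sketch needs max_surge >= 1 to make progress")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale the new ReplicaSet up, staying within replicas + max_surge in total.
        add = min(replicas - new, replicas + max_surge - (old + new))
        new += add
        steps.append((old, new))
        # Scale the old ReplicaSet down, never dropping below the desired total.
        remove = min(old, old + new - replicas)
        old -= remove
        steps.append((old, new))
    return steps


print(rolling_update(3))
# [(3, 0), (3, 1), (2, 1), (2, 2), (1, 2), (1, 3), (0, 3)]
```

At every step the total stays between 3 and 4 Pods, so capacity never dips during the update. A rollback is the same walk with the roles of the two ReplicaSets swapped.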

StatefulSet

StatefulSet solves the problem Deployment cannot: stable Pod identity. Pods get predictable names: `postgres-0`, `postgres-1`, `postgres-2`. When `postgres-0` is recreated, it gets the same name and reattaches to the same PersistentVolumeClaim, and therefore the same data. This enables clustered databases (Cassandra, MongoDB, PostgreSQL with Patroni) where each node knows its peers. By default, Pod creation and deletion follow strict ordering (0, 1, 2... for creation and 2, 1, 0 for deletion) - `postgres-1` is not started until `postgres-0` is running and ready.
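The identity a StatefulSet hands out is fully derivable from its name and the ordinal. The sketch below generates it; the helper function is hypothetical, and the headless Service name `postgres`, the claim template name `data`, the `default` namespace, and the `cluster.local` cluster domain are all assumptions for the example.

```python
def statefulset_identities(name, replicas, service,
                           namespace="default", claim_template="data"):
    """Derive the stable Pod name, PVC name, and DNS name for each ordinal."""
    return [
        {
            "pod": f"{name}-{i}",
            "pvc": f"{claim_template}-{name}-{i}",
            "dns": f"{name}-{i}.{service}.{namespace}.svc.cluster.local",
        }
        for i in range(replicas)
    ]


members = statefulset_identities("postgres", 3, service="postgres")
creation_order = [m["pod"] for m in members]
deletion_order = list(reversed(creation_order))
print(creation_order, deletion_order)
print(members[0]["dns"])
```

Since the names are predictable, a replica can compute its peers' addresses before they even exist, which is what clustered databases rely on when bootstrapping.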

When to use which: Deployment - stateless workloads. StatefulSet - databases, queues, Kafka, Elasticsearch. DaemonSet - one Pod per node (monitoring agent, log shipper, CNI plugin). Job/CronJob - one-time or periodic tasks.

Myth: Kubernetes restarts a crashed Pod.

Reality: Kubernetes restarts the container inside the Pod. If the Pod itself is deleted, its Deployment or StatefulSet creates a new Pod. The old Pod object is not restored.

Pods are ephemeral objects. Container restart (`restartPolicy`) and Pod recreation (controller) are separate mechanisms. `CrashLoopBackOff` means the container is being restarted repeatedly, while the Pod object is still alive.

Cassandra requires stable node identity (each node knows its peers by name) and its own PersistentVolume. Which Kubernetes object should be used?

Key Ideas

**Control plane** (API server + etcd + scheduler + controller manager) makes decisions. **Data plane** (kubelet + kube-proxy + CNI) executes them. No direct cross-component communication - only through the API server.
**Reconciliation loop** is the heart of K8s: controllers continuously compare desired state (in etcd) with actual state and close the gap. Missed events are harmless - the next cycle fixes drift.
**Pod** is ephemeral. **Deployment** is for stateless workloads with interchangeable replicas. **StatefulSet** is for stateful workloads such as databases: predictable Pod names and a stable PersistentVolumeClaim per Pod.

Questions for Reflection

etcd is the only stateful component in the control plane. What happens to already-running Pods if etcd goes down?
The reconciliation loop reacts to state, not events. What is the advantage of this approach during network partitions?
StatefulSet gives Pods predictable names (postgres-0, 1, 2). How does a database use this to configure replication?

Related Lessons

devops-05 — Docker Compose is the prior level of orchestration before K8s
devops-07 — Advanced K8s patterns build on top of this architecture
devops-08 — Service Mesh (Istio, Linkerd) runs on top of the K8s data plane
ds-05-replication — StatefulSet solves the same problems as database replication
devops-03 — CNI and kube-proxy are built on Linux networking primitives
dist-09-raft
net-48-kubernetes-networking