Networking for DevOps

Lesson Objectives

Understand the TCP/IP model and distinguish TCP vs UDP by use case
Use dig, curl, ss for diagnosing network issues
Configure DNS records and understand TTL for migrations
Distinguish L4 and L7 load balancing, choose the right algorithm
Apply default-deny when configuring firewalls

Prerequisites

Linux for DevOps

October 4, 2021. Facebook, Instagram, and WhatsApp vanished for about 6 hours. Engineers couldn't badge into the data centers - their badges ran on the same network. Cause: a maintenance command took down the company's backbone, and the BGP routes to Facebook's DNS servers were withdrawn. The servers were running, but nobody could resolve their names. 3.5 billion users offline, tens of millions of dollars in losses. One network error. One command.

**Every HTTP request** goes through DNS, TCP/IP, possibly a load balancer and a firewall. Without networking knowledge, debugging a production incident is guesswork
**Kubernetes networking** - pods, services, ingress - all based on TCP/IP, DNS, and L7 load balancing
**Microservices** communicate over the network. Network issues (timeout, DNS, firewall) are among the most common causes of incidents in distributed systems
**Security** - an open Redis = compromised in minutes. Understanding ports and firewalls saves companies from data breaches

The Birth of TCP/IP

In 1974, Vint Cerf and Bob Kahn published a paper on the TCP protocol - the foundation of the internet. The early ARPANET used NCP, but TCP/IP proved so successful that on January 1, 1983, ARPANET fully switched to it - a day known as the "birthday of the internet". Cerf and Kahn weren't thinking about Netflix and Kubernetes - they were thinking about reliable military communication. The protocol has outlived the network it was built for: ARPANET was shut down in 1990, and TCP/IP still carries the internet.

TCP/IP: How Data Travels Across a Network

The 2021 Facebook outage shows the pattern: network errors don't look like errors until they take down everything at once. Every HTTP request crosses **4 layers** of the TCP/IP model (Link, Internet, Transport, Application), and every layer is a potential failure point.

**TCP vs UDP** - two transport protocols with opposite guarantees. TCP is a contract: every packet delivered, order kept, losses retransmitted. UDP is a volley: faster, no promises. Streaming video would rather drop a frame than buffer for three seconds.

Characteristic	TCP	UDP
Delivery guarantee	Yes (retransmission)	No
Packet ordering	Guaranteed	Not guaranteed
Connection setup	3-way handshake	None (connectionless)
Speed	Slower (overhead)	Faster
Use cases	HTTP, SSH, PostgreSQL, email	DNS, video, VoIP, gaming
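The contrast in the table can be tried on a single machine. The sketch below is a minimal illustration, assuming a loopback interface is available; the function names `tcp_roundtrip` and `udp_roundtrip` are made up for this example. TCP needs a listener and a connection before any byte moves; UDP just sends a datagram and hopes.

```python
import socket


def tcp_roundtrip(payload: bytes) -> bytes:
    """Deliver payload over TCP on loopback; return the bytes the server read."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))        # port 0: the OS picks a free port
        server.listen(1)
        port = server.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(("127.0.0.1", port))  # connection setup happens here
            conn, _ = server.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # tell the server the stream is done
                chunks = []
                while True:
                    data = conn.recv(4096)
                    if not data:                 # empty read: peer closed its side
                        break
                    chunks.append(data)
    return b"".join(chunks)


def udp_roundtrip(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send one datagram on loopback; raises a timeout error if it never arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver:
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        port = receiver.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
            # No connect, no handshake: the datagram is simply fired off.
            sender.sendto(payload, ("127.0.0.1", port))
        data, _ = receiver.recvfrom(4096)
    return data


if __name__ == "__main__":
    print(tcp_roundtrip(b"hello over tcp"))
    print(udp_roundtrip(b"hello over udp"))
```

On loopback the datagram practically always arrives; across a real network the UDP receiver has to be ready for the timeout.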

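**3-way handshake** - a TCP connection comes up in three steps: SYN, SYN-ACK, ACK. Only then does data start flowing. The handshake costs one full network round trip before the first byte of data. That round trip is one part of TTFB (time to first byte), alongside DNS lookup, TLS setup, and server processing - and TTFB is the metric staring back from most DevOps dashboards.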
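The cost of connection setup can be observed by timing how long a connect call takes. This is a rough sketch, not a benchmark: `connect_time_ms` and `local_listener` are hypothetical helper names, and the demo connects to a throwaway listener on loopback, where the round trip is close to zero. Against a remote host the number approximates one network round trip.

```python
import socket
import time


def connect_time_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Return how long it takes to establish a TCP connection, in milliseconds."""
    start = time.perf_counter()
    # With an IP literal there is no DNS lookup, so this measures connection setup only.
    sock = socket.create_connection((host, port), timeout=timeout)
    elapsed = time.perf_counter() - start
    sock.close()
    return elapsed * 1000.0


def local_listener() -> socket.socket:
    """Open a listening TCP socket on a free loopback port (demo helper)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv


if __name__ == "__main__":
    srv = local_listener()
    port = srv.getsockname()[1]
    print(f"TCP connect to 127.0.0.1:{port} took {connect_time_ms('127.0.0.1', port):.3f} ms")
    srv.close()
```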

**Ports** - numbers from 0 to 65535 that identify a specific application on a host. The IP address is the building, the port is the apartment number. 127.0.0.1:5432 - PostgreSQL accepts only local connections. 0.0.0.0:5432 - PostgreSQL listens on every interface.
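The difference between the two bind addresses can be checked mechanically. A small sketch using the standard `ipaddress` module; the function name `listen_scope` is invented for this example, and it assumes IPv4 addresses in `host:port` form, the way `ss` prints them.

```python
import ipaddress


def listen_scope(bind_addr: str) -> str:
    """Classify an IPv4 'host:port' listen address by who can reach it."""
    host, _, _port = bind_addr.rpartition(":")
    ip = ipaddress.ip_address(host)
    if ip.is_loopback:        # 127.0.0.0/8: only processes on the same machine
        return "local only"
    if ip.is_unspecified:     # 0.0.0.0: every interface the host has
        return "all interfaces"
    return "one interface"    # bound to a single specific address


if __name__ == "__main__":
    for addr in ("127.0.0.1:5432", "0.0.0.0:5432", "10.0.0.5:5432"):
        print(addr, "->", listen_scope(addr))
```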

PostgreSQL is listening on 127.0.0.1:5432. Can it be reached from another server?

DNS: The Internet's Phone Book

People remember names (google.com); computers run on numbers (142.250.74.46). **DNS** - a distributed hierarchical database, online since 1983 - bridges the gap. When Facebook went dark in 2021, the BGP routes to its DNS servers were withdrawn, names stopped resolving, and the entire internet stopped seeing Facebook - even though the servers were physically running.

DevOps engineers live and breathe **DNS record types** - configuring them is part of every new service deployment.

Record Type	Purpose	Example
A	Domain → IPv4 address	api.example.com → 203.0.113.42
AAAA	Domain → IPv6 address	api.example.com → 2001:db8::1
CNAME	Alias (domain → domain)	www.example.com → example.com
MX	Mail server	example.com → mail.example.com (priority 10)
TXT	Arbitrary text	SPF, DKIM, domain verification
NS	Nameserver for zone	example.com → ns1.registrar.com
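How a CNAME differs from an A record is easiest to see in a lookup. The sketch below is a toy in-memory zone, not a real resolver: the `RECORDS` data reuses the example names from the table, the A record for `example.com` is an assumption added so the alias has something to point at, and `resolve` is a made-up helper.

```python
# Toy zone data: (name, record type) -> list of values.
RECORDS = {
    ("www.example.com", "CNAME"): ["example.com"],
    ("example.com", "A"): ["203.0.113.42"],      # assumed for this example
    ("api.example.com", "A"): ["203.0.113.42"],
    ("api.example.com", "AAAA"): ["2001:db8::1"],
}


def resolve(name, rtype="A", records=RECORDS, max_hops=8):
    """Return the values for name/rtype, following CNAME aliases along the way."""
    for _ in range(max_hops):
        if (name, rtype) in records:
            return records[(name, rtype)]
        alias = records.get((name, "CNAME"))
        if alias is None:
            return []            # no such record and no alias to follow
        name = alias[0]          # restart the lookup at the alias target
    raise RuntimeError("CNAME chain too long (possible loop)")


if __name__ == "__main__":
    print(resolve("www.example.com"))           # goes through the CNAME
    print(resolve("api.example.com", "AAAA"))
```

The alias costs an extra lookup step, but only one record has to change when the IP behind `example.com` moves.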

**TTL** (Time To Live) - how long a DNS record stays cached, in seconds. High TTL (86400 = 24 hours) - lighter DNS load, but slow cutover during migration. Low TTL (60 = 1 minute) - fast cutover, more queries hitting the server.

Before a server migration: drop the TTL to 60 seconds 24-48 hours BEFORE the cutover. When the IP flips, users follow within a minute instead of a full day. After migration, raise the TTL back.
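The 24-48 hour lead time follows from simple arithmetic: a cache that fetched the record just before the TTL was lowered keeps the old TTL for its full duration. A minimal sketch of that calculation; the function names and the dates are illustrative only.

```python
from datetime import datetime, timedelta


def ttl_lowering_deadline(cutover: datetime, old_ttl_s: int) -> datetime:
    """Latest moment to publish the low TTL so that every cache still
    holding the record with the old TTL has expired by the cutover."""
    return cutover - timedelta(seconds=old_ttl_s)


def worst_case_stale(ttl_s: int) -> timedelta:
    """Longest time a cache can keep serving the old IP after the record changes."""
    return timedelta(seconds=ttl_s)


if __name__ == "__main__":
    cutover = datetime(2024, 1, 10, 12, 0)   # hypothetical migration time
    print("Lower the TTL no later than:", ttl_lowering_deadline(cutover, 86400))
    print("Stale window with TTL 86400:", worst_case_stale(86400))
    print("Stale window with TTL 60:   ", worst_case_stale(60))
```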

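DNS caching produces surprises. The A record is updated, but some users still hit the old IP because the previous record's TTL has not expired in their resolver's cache. Wait out the TTL. To confirm the new value is published, query an authoritative nameserver directly, for example `dig @ns1.registrar.com example.com` - it answers from the zone itself, not from a cache. A public resolver such as `dig @8.8.8.8` is only a second opinion: it keeps its own cache and may still return the old record.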
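The stale-answer effect can be reproduced with a toy caching resolver. This is a simplified model, assuming a single cache and an injectable clock so that time can be moved by hand; `CachingResolver` and the IP addresses are hypothetical.

```python
class CachingResolver:
    """Toy recursive resolver: caches each answer until its TTL runs out."""

    def __init__(self, authoritative, clock):
        self.authoritative = authoritative   # name -> (ip, ttl_seconds)
        self.clock = clock                   # callable returning "now" in seconds
        self.cache = {}                      # name -> (ip, expires_at)

    def lookup(self, name):
        now = self.clock()
        cached = self.cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]                 # still fresh: the zone is not consulted
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + ttl)   # the TTL in force at fetch time applies
        return ip


if __name__ == "__main__":
    now = [0]
    zone = {"api.example.com": ("203.0.113.42", 86400)}
    resolver = CachingResolver(zone, clock=lambda: now[0])

    print(resolver.lookup("api.example.com"))          # old IP, cached for 24 hours
    zone["api.example.com"] = ("203.0.113.99", 60)     # record updated in the zone
    now[0] = 2 * 3600
    print(resolver.lookup("api.example.com"))          # two hours later: still the old IP
    now[0] = 86400
    print(resolver.lookup("api.example.com"))          # TTL expired: new IP
```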

A domain's A record was updated (TTL was 86400). Two hours have passed, but some users still see the old IP. Why?

Load Balancing

One server handles 1,000 RPS. Traffic grows to 10,000 RPS. Horizontal scaling: spin up many servers and slot a **load balancer** in front. This is exactly how Netflix rides peaks of 80 million simultaneous streams - not one giant box, but thousands of machines behind a smart load balancer.

**L4 vs L7** - the headline distinction. L4 runs at the transport layer (TCP/UDP), sees only IP and port - blistering speed, no smarts. L7 runs at the application layer (HTTP), reads URLs, headers, cookies - and routes `/api/*` to the backend cluster and `/static/*` to a CDN.
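What each layer can base its decision on is visible in the function signatures. A deliberately small sketch: the pool names (`backend-cluster`, `cdn-origin`, `web-frontend`, `tcp-pool-*`) are invented, and real load balancers match on far more than a path prefix.

```python
# L7 routing table: path prefix -> backend pool (hypothetical pool names).
L7_ROUTES = [
    ("/api/", "backend-cluster"),
    ("/static/", "cdn-origin"),
]


def route_l7(path: str, default: str = "web-frontend") -> str:
    """L7 decision: the balancer has parsed the HTTP request and can read the URL."""
    for prefix, pool in L7_ROUTES:
        if path.startswith(prefix):
            return pool
    return default


def route_l4(client_ip: str, dst_port: int) -> str:
    """L4 decision: only addresses and ports are visible; the payload is opaque bytes."""
    return f"tcp-pool-{dst_port}"


if __name__ == "__main__":
    print(route_l7("/api/users/42"))
    print(route_l7("/static/logo.png"))
    print(route_l4("198.51.100.7", 443))
```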

Algorithm	How It Works	When to Use
Round Robin	In turn: 1→2→3→1→2→3	Same-capacity servers, stateless applications
Least Connections	Route to server with fewest active connections	Mixed request weight (short + long-running)
IP Hash	Client IP hash determines the server	Sticky sessions without cookies
Weighted Round Robin	Proportional to server weights	Servers of different capacity (2:1:1)
Random	Random server	Simple implementation, works well with many servers

**Health checks** - the load balancer pokes each backend on a schedule to confirm it is alive. If a server stops responding, traffic shifts to the rest. Skip health checks and a dead server keeps receiving requests while users stare at errors.

The /health endpoint is the standard pattern. The load balancer fires GET /health every N seconds; a 200 response means the server is alive. A good health check verifies more than "process running" - it also confirms "database reachable, Redis responding, disk not full".
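A deep health check of this kind can be sketched as a function that runs every dependency check and collapses the results into one status code. The names `health_response` and `disk_has_space` are hypothetical, the database and Redis checks are stubbed with lambdas, and returning 503 for an unhealthy server is a common convention rather than something the load balancer requires.

```python
import json
import shutil
from typing import Callable, Dict, Tuple


def disk_has_space(path: str = "/", min_free_ratio: float = 0.10) -> bool:
    """True if at least min_free_ratio of the filesystem is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_ratio


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check; return (HTTP status, JSON body) for GET /health."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False        # a crashing check counts as a failure
    healthy = all(results.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": results})
    return (200 if healthy else 503), body


if __name__ == "__main__":
    status, body = health_response({
        "database": lambda: True,        # stand-in for a real "SELECT 1"
        "redis": lambda: True,           # stand-in for a real PING
        "disk": disk_has_space,
    })
    print(status, body)
```

Keep the checks cheap and give them timeouts: a health endpoint that hangs takes the server out of rotation just as surely as one that fails.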

There are 3 backend servers behind nginx. One goes down. What happens to requests with max_fails=3 fail_timeout=30s?

Firewalls and Network Security

A server is reachable from the internet. Without a **firewall**, any scanner finds PostgreSQL (5432), Redis (6379), SSH (22) and starts knocking. In 2017, thousands of Redis servers fell to exactly this: no firewall, no password - full server access via CONFIG SET. A firewall is the filter that decides which packets pass and which get dropped.

**Stateful vs Stateless.** Stateless inspects each packet in isolation - primitive but fast. Stateful tracks connections: an outgoing HTTP request was allowed, so response packets are permitted automatically. Every modern production firewall is stateful.
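The difference comes down to whether the filter remembers anything. The sketch below is a heavily simplified model: real connection tracking also follows the protocol, TCP state, and timeouts. The class and function names and the addresses are made up for the example.

```python
def stateless_inbound(allowed_ports, dst_port):
    """Stateless: each packet is judged on its own fields, with no memory."""
    return dst_port in allowed_ports


class StatefulFirewall:
    """Stateful: remembers outgoing connections and lets their replies back in."""

    def __init__(self):
        # Tracked connections as (local_ip, local_port, remote_ip, remote_port).
        self.connections = set()

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Outgoing traffic is allowed, and the connection is recorded."""
        self.connections.add((src_ip, src_port, dst_ip, dst_port))
        return True

    def inbound(self, src_ip, src_port, dst_ip, dst_port):
        """An incoming packet passes only if it answers a tracked connection."""
        return (dst_ip, dst_port, src_ip, src_port) in self.connections


if __name__ == "__main__":
    fw = StatefulFirewall()
    fw.outbound("10.0.0.5", 40000, "203.0.113.42", 80)           # our HTTP request
    print(fw.inbound("203.0.113.42", 80, "10.0.0.5", 40000))     # its response: allowed
    print(fw.inbound("198.51.100.7", 80, "10.0.0.5", 40000))     # unsolicited: dropped
```

A stateless filter would need a standing rule that admits traffic to the whole ephemeral port range to let that response through; the stateful one opens the path only for the duration of the connection.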

nftables succeeds iptables on modern Linux. Cleaner syntax, better performance. Ubuntu 22.04+ ships nftables as the default. The `iptables` command still works there, but as a compatibility layer that translates rules to nftables.

In the cloud, iptables gives way to virtual firewalls attached to instances or networks: **Security Groups** in AWS, firewall rules in GCP, Network Security Groups in Azure. Same principle: default deny, explicit allow. Terraform lets the entire rule set live in code.

Port	Service	Who Should Have Access
22	SSH	VPN / bastion host / office IP only
80	HTTP	Everyone (0.0.0.0/0)
443	HTTPS	Everyone (0.0.0.0/0)
5432	PostgreSQL	Backend servers only (10.0.0.0/24)
6379	Redis/KeyDB	Backend servers only (NEVER expose to internet!)
3000-9000	App servers	Only via Load Balancer
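The table translates directly into a default-deny rule set: list what is allowed, drop everything else. A minimal sketch using the standard `ipaddress` module; the rule list mirrors the table, except that the VPN range `10.8.0.0/24` for SSH is an assumption made up for this example, and `is_allowed` is a hypothetical helper.

```python
import ipaddress

# Explicit allow rules: (destination port, source CIDR).
ALLOW_RULES = [
    (22, "10.8.0.0/24"),      # SSH from the VPN only (hypothetical range)
    (80, "0.0.0.0/0"),        # HTTP from everyone
    (443, "0.0.0.0/0"),       # HTTPS from everyone
    (5432, "10.0.0.0/24"),    # PostgreSQL from backend servers only
    (6379, "10.0.0.0/24"),    # Redis from backend servers only
]


def is_allowed(src_ip, dst_port, rules=ALLOW_RULES):
    """Default deny: a packet passes only if some rule explicitly allows it."""
    addr = ipaddress.ip_address(src_ip)
    for port, cidr in rules:
        if port == dst_port and addr in ipaddress.ip_network(cidr):
            return True
    return False              # nothing matched: drop


if __name__ == "__main__":
    print(is_allowed("198.51.100.7", 443))    # public HTTPS
    print(is_allowed("198.51.100.7", 6379))   # Redis from the internet
    print(is_allowed("10.0.0.12", 5432))      # PostgreSQL from a backend server
```

The last line of the function is the whole policy: a forgotten rule results in a blocked service, which gets noticed, rather than an exposed one, which does not.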

Redis/Memcached exposed to the internet ranks among the most common vulnerabilities in the wild. In 2017, thousands of Redis servers fell to missing firewalls. Attackers used Redis CONFIG SET to plant SSH keys and walked straight onto the box.

Myth: A firewall provides complete server protection - if ports are closed, hacking is impossible.

Reality: A firewall is just one layer of protection in a defense-in-depth strategy. Open ports (80, 443) are still vulnerable to application-level attacks (SQL injection, XSS, RCE).

Why: A firewall filters traffic by IP/ports but doesn't inspect HTTP request content. An attacker can exploit a vulnerability in the application through the allowed port 443. All layers are needed: firewall + WAF + software updates + monitoring + least privilege principle.

A server has SSH (22), HTTP (80), HTTPS (443), PostgreSQL (5432), and Redis (6379) open to all IPs (0.0.0.0/0). What needs to be fixed FIRST?

Key Takeaways

TCP/IP - 4 layers: Link → Internet → Transport → Application. TCP guarantees delivery; UDP is fast but without guarantees
DNS - hierarchical system (root → TLD → authoritative). TTL determines cache duration. Lower TTL before migrations
Load Balancing - L4 (by IP/port, fast) vs L7 (by HTTP, smart). Algorithms: round-robin, least connections, IP hash
Health checks - the load balancer checks backend health via a /health endpoint
Firewall - default deny, open only what's needed. Databases and caches - NEVER expose to the internet
Facebook 2021: DNS and BGP are the backbone of availability. A single network misconfiguration can take down an entire service

Questions for Reflection

A service is responding slowly. How can curl, dig, and ss help determine whether the problem is DNS, networking, or the application?
Why should databases (PostgreSQL, Redis) never be accessible from the internet, even with a strong password?
For a service with 50,000 RPS - which load balancing type (L4 or L7) and which algorithm are optimal? Why?

Related Lessons

devops-01 — SLI/SLO from lesson 1 measure latency and availability
devops-02 — Network utilities are part of the Linux toolkit
devops-04 — Container networking builds on TCP/IP and DNS
devops-06 — Kubernetes networking: pods, services, ingress
sec-01 — Firewall rules are the first layer in defense-in-depth
cloud-03 — CDN and GeoDNS apply the same DNS principles at a global scale