Networking for DevOps
Lesson Objectives
- Understand the TCP/IP model and distinguish TCP vs UDP by use case
- Use dig, curl, ss for diagnosing network issues
- Configure DNS records and understand TTL for migrations
- Distinguish L4 and L7 load balancing, choose the right algorithm
- Apply default-deny when configuring firewalls
Prerequisites
October 4, 2021. Facebook, Instagram, and WhatsApp vanished for about 6 hours. Engineers couldn't badge into the data centers - their badges ran on the same network. Cause: a single maintenance command took down the backbone, and the BGP routes to the company's DNS servers were withdrawn, so no one could resolve its domains. 3.5 billion users offline, tens of millions of dollars in losses. One network error. One command.
- **Every HTTP request** goes through DNS, TCP/IP, possibly a load balancer and a firewall. Without networking knowledge, most production incidents are hard to debug
- **Kubernetes networking** - pods, services, ingress - all based on TCP/IP, DNS, and L7 load balancing
- **Microservices** communicate over the network. Network issues (timeout, DNS, firewall) are among the most common causes of incidents in distributed systems
- **Security** - an open Redis = compromised in minutes. Understanding ports and firewalls saves companies from data breaches
The Birth of TCP/IP
In 1974, Vint Cerf and Bob Kahn published a paper describing TCP - the foundation of the internet. The early ARPANET used NCP, but TCP/IP proved so successful that on January 1, 1983, ARPANET fully switched to it - a day often called the "birthday of the internet". Cerf and Kahn weren't thinking about Netflix and Kubernetes - they were thinking about reliable military communication. The protocol has long outlived that original purpose.
TCP/IP: How Data Travels Across a Network
The 2021 Facebook outage shows the pattern: network errors don't look like errors until they take down everything at once. Every HTTP request crosses **4 layers** of the TCP/IP model (Link, Internet, Transport, Application), and every layer is a potential failure point.
**TCP vs UDP** - two transport protocols with opposite guarantees. TCP is a contract: every packet delivered, order kept, losses retransmitted. UDP is a volley: faster, no promises. A live video call would rather drop a frame than freeze for three seconds.
| Characteristic | TCP | UDP |
|---|---|---|
| Delivery guarantee | Yes (retransmission) | No |
| Packet ordering | Guaranteed | Not guaranteed |
| Connection setup | 3-way handshake | None (connectionless) |
| Speed | Slower (overhead) | Faster |
| Use cases | HTTP, SSH, PostgreSQL, email | DNS, video, VoIP, gaming |
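The "Connection setup" row shows up directly in socket code. Below is a minimal sketch using Python's standard `socket` module on the loopback interface. The function names `udp_roundtrip` and `tcp_roundtrip` are illustrative and not part of the lesson.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram to a local UDP socket and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
        receiver.settimeout(2.0)
        # No connect() and no handshake: the datagram is just sent.
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data
    finally:
        sender.close()
        receiver.close()


def tcp_roundtrip(payload: bytes) -> bytes:
    """Open a TCP connection to a local listener, then send the payload."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        client.settimeout(2.0)
        # connect() returns only after the SYN / SYN-ACK / ACK exchange.
        client.connect(listener.getsockname())
        conn, _addr = listener.accept()
        with conn:
            conn.settimeout(2.0)
            client.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(4096)
                if not chunk:
                    break
                received += chunk
            return received
    finally:
        client.close()
        listener.close()
```

On loopback both calls return the payload unchanged. Across a lossy network only the TCP version would retransmit; the UDP datagram could simply disappear.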
**3-way handshake** - a TCP connection comes up in three steps: SYN, SYN-ACK, ACK. Only then does data start flowing. The handshake costs one full round trip before the first byte of the request is sent, so it is one component of TTFB (time to first byte), alongside DNS lookup, TLS negotiation, and server processing time.
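The cost of the handshake can be measured from user space: the time spent inside `connect()` is roughly one network round trip. This is a rough sketch, and `tcp_connect_time` is a hypothetical helper name. The demo targets a throwaway local listener, so the number it prints will be tiny compared with a real remote host.

```python
import socket
import time


def tcp_connect_time(host: str, port: int, timeout: float = 2.0) -> float:
    """Return the seconds spent establishing a TCP connection.

    create_connection() returns once the SYN / SYN-ACK / ACK exchange has
    completed, so the elapsed time approximates one round trip.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed


if __name__ == "__main__":
    # Local listener used only so the demo needs no external network.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(f"handshake took {tcp_connect_time('127.0.0.1', port) * 1000:.3f} ms")
    listener.close()
```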
**Ports** - numbers from 0 to 65535 that identify a specific application on a host. The IP address is the building, the port is the apartment number. The bind address matters as much as the port: 127.0.0.1:5432 - PostgreSQL only takes local connections. 0.0.0.0:5432 - PostgreSQL listens on every interface.
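The two bind addresses can be told apart programmatically, for example when reviewing `ss` output. Here is a small sketch using the standard `ipaddress` module. `listen_scope` is a hypothetical helper, and it assumes IPv4 `address:port` strings only.

```python
import ipaddress


def listen_scope(bind: str) -> str:
    """Classify an IPv4 'address:port' string by who can reach the socket."""
    host, _sep, port_text = bind.rpartition(":")
    port = int(port_text)
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    addr = ipaddress.ip_address(host)
    if addr.is_loopback:
        return "local only"       # e.g. 127.0.0.1: same machine only
    if addr.is_unspecified:
        return "all interfaces"   # 0.0.0.0: every interface on the host
    return "one interface"        # a specific address on one interface


if __name__ == "__main__":
    for bind in ("127.0.0.1:5432", "0.0.0.0:5432", "10.0.0.5:5432"):
        print(bind, "->", listen_scope(bind))
```

"All interfaces" does not mean "open to the world": a firewall in front of the host still decides which packets arrive.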
PostgreSQL is listening on 127.0.0.1:5432. Can it be reached from another server?
DNS: The Internet's Phone Book
People remember names (google.com); computers run on numbers (142.250.74.46). **DNS** - a distributed hierarchical database, in use since 1983, bridges the gap. When Facebook went dark in 2021, the BGP routes to its authoritative DNS servers were withdrawn, so resolvers could no longer reach them and the entire internet stopped seeing Facebook - even though the servers were physically running.
DevOps engineers live and breathe **DNS record types** - configuring them is part of every new service deployment.
| Record Type | Purpose | Example |
|---|---|---|
| A | Domain → IPv4 address | api.example.com → 203.0.113.42 |
| AAAA | Domain → IPv6 address | api.example.com → 2001:db8::1 |
| CNAME | Alias (domain → domain) | www.example.com → example.com |
| MX | Mail server | example.com → mail.example.com (priority 10) |
| TXT | Arbitrary text | SPF, DKIM, domain verification |
| NS | Nameserver for zone | example.com → ns1.registrar.com |
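The A and AAAA rows are what an application sees when it resolves a name. Below is a minimal sketch using `socket.getaddrinfo` from the standard library. `resolve` is a hypothetical helper. It goes through the system resolver (hosts file plus DNS), so it cannot show CNAME, MX, TXT, or NS records; those need a DNS client such as `dig`.

```python
import socket


def resolve(hostname: str) -> dict[str, list[str]]:
    """Group the addresses returned by the system resolver by record type."""
    result: dict[str, list[str]] = {"A": [], "AAAA": []}
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    for family, _type, _proto, _canon, sockaddr in infos:
        if family == socket.AF_INET:
            key = "A"       # IPv4 address
        elif family == socket.AF_INET6:
            key = "AAAA"    # IPv6 address
        else:
            continue
        address = sockaddr[0]
        if address not in result[key]:
            result[key].append(address)
    return result


if __name__ == "__main__":
    # An IP literal resolves to itself without any DNS query.
    print(resolve("127.0.0.1"))
```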
**TTL** (Time To Live) - how long a DNS record stays cached, in seconds. High TTL (86400 = 24 hours) - lighter DNS load, but slow cutover during migration. Low TTL (60 = 1 minute) - fast cutover, more queries hitting the server.
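Before a server migration: drop the TTL to 60 seconds 24-48 hours BEFORE the cutover, that is, at least one full old TTL in advance, so every cached copy of the long-TTL record has already expired. When the IP flips, users follow within about a minute instead of a full day. After migration, raise the TTL back.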
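DNS caching produces surprises. The A record is updated, but some users still hit the old IP because their resolvers cached the previous record and its TTL has not expired. Wait out the TTL. To confirm the new value is actually published, query the zone's authoritative nameserver directly, e.g. `dig @ns1.registrar.com example.com` - it answers from the zone itself, not from a cache. Public resolvers such as `dig @8.8.8.8` cache too, so they may still return the old answer.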
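The stale-answer effect is easy to reproduce with a toy resolver cache. This is a simplified model, not how any particular resolver is implemented. `TtlCache` and the `zone` dictionary are hypothetical, and time is passed in explicitly as seconds so the example stays deterministic.

```python
class TtlCache:
    """Toy caching resolver: remembers an answer until its TTL runs out."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, int]] = {}  # name -> (ip, expires_at)

    def lookup(self, name: str, now: int, zone: dict[str, tuple[str, int]]) -> str:
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]                 # cache hit, possibly stale
        ip, ttl = zone[name]                 # cache miss: ask the authoritative zone
        self._entries[name] = (ip, now + ttl)
        return ip


if __name__ == "__main__":
    zone = {"api.example.com": ("203.0.113.42", 86400)}
    cache = TtlCache()
    print(cache.lookup("api.example.com", 0, zone))       # cached for 24 hours
    zone["api.example.com"] = ("203.0.113.99", 86400)     # record updated
    print(cache.lookup("api.example.com", 7200, zone))    # 2 hours later
    print(cache.lookup("api.example.com", 86400, zone))   # TTL expired
```

Two hours after the update the cache still serves the old address. Only once the original 24-hour TTL has passed does it ask the zone again.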
A domain's A record was updated (TTL was 86400). Two hours have passed, but some users still see the old IP. Why?
Load Balancing
One server handles 1,000 RPS. Traffic grows to 10,000 RPS. Horizontal scaling: spin up many servers and slot a **load balancer** in front. This is how large streaming services such as Netflix ride peaks of tens of millions of simultaneous streams - not one giant box, but thousands of machines behind smart load balancing.
**L4 vs L7** - the headline distinction. L4 runs at the transport layer (TCP/UDP), sees only IP addresses and ports - blistering speed, no smarts. L7 runs at the application layer (HTTP), reads URLs, headers, cookies - and can route `/api/*` to the backend cluster and `/static/*` to whatever serves static content.
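The difference comes down to what the balancer is allowed to look at. Here is a schematic sketch. The pool names, the `ROUTES` table, and the `L4_LISTENERS` map are all made up for illustration and do not mirror any real load balancer's configuration.

```python
# L7: routing rules keyed on the URL path (longest matching prefix wins).
ROUTES = [("/api/", "backend-pool"), ("/static/", "static-pool")]
DEFAULT_POOL = "web-pool"

# L4: the only usable key is the destination port.
L4_LISTENERS = {443: "web-pool", 5432: "db-pool"}


def l7_route(path: str) -> str:
    """Pick a pool by inspecting the HTTP request path."""
    matches = [(prefix, pool) for prefix, pool in ROUTES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_POOL
    return max(matches, key=lambda match: len(match[0]))[1]


def l4_route(dst_port: int):
    """Pick a pool from the destination port alone; None means no listener."""
    return L4_LISTENERS.get(dst_port)


if __name__ == "__main__":
    print(l7_route("/api/users/42"))
    print(l7_route("/static/app.css"))
    print(l7_route("/index.html"))
    print(l4_route(443))
```

An L4 balancer handling port 443 sends `/api/users/42` and `/static/app.css` to the same pool, because it never sees the path.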
| Algorithm | How It Works | When to Use |
|---|---|---|
| Round Robin | In turn: 1→2→3→1→2→3 | Same-capacity servers, stateless applications |
| Least Connections | Route to server with fewest active connections | Mixed request weight (short + long-running) |
| IP Hash | Client IP hash determines the server | Sticky sessions without cookies |
| Weighted Round Robin | Proportional to server weights | Servers of different capacity (2:1:1) |
| Random | Random server | Simple implementation, works well with many servers |
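Most of these algorithms fit in a few lines each. The sketch below is deliberately naive: function names are illustrative, the weighted variant simply repeats servers in order, and the hash choice (SHA-256) is an assumption, not what any specific load balancer uses.

```python
import hashlib
import itertools


def round_robin(servers: list[str]):
    """Yield servers in turn, forever: 1 -> 2 -> 3 -> 1 -> ..."""
    return itertools.cycle(servers)


def weighted_round_robin(weights: dict[str, int]):
    """Yield each server as many times per cycle as its weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active: dict[str, int]) -> str:
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip: str, servers: list[str]) -> str:
    """Map a client IP to a server; the same IP always lands on the same one."""
    digest = hashlib.sha256(client_ip.encode("ascii")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


if __name__ == "__main__":
    rr = round_robin(["s1", "s2", "s3"])
    print([next(rr) for _ in range(4)])
    wrr = weighted_round_robin({"a": 2, "b": 1, "c": 1})
    print([next(wrr) for _ in range(4)])
    print(least_connections({"a": 5, "b": 2, "c": 7}))
    print(ip_hash("203.0.113.9", ["s1", "s2", "s3"]))
```

One design caveat of the modulo-based `ip_hash`: adding or removing a server changes `len(servers)` and remaps most clients, which is why sticky sessions break during scaling events.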
**Health checks** - the load balancer pokes each backend on a schedule to confirm it is alive. If a server stops responding, traffic shifts to the rest. Skip health checks and a dead server keeps receiving requests while users stare at errors.
The /health endpoint is the standard pattern. The load balancer fires GET /health every N seconds; a 200 response means the server is alive. A good health check verifies more than "process running" - it also confirms "database reachable, Redis responding, disk not full".
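A `/health` endpoint and the probe that calls it can be sketched with the standard library alone. Everything here is illustrative: `check_dependencies` returns hard-coded values where a real service would test its database, cache, and disk, and `probe` only mimics what a load balancer does on each interval.

```python
import json
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict:
    """Placeholder checks; a real service would test its actual dependencies."""
    return {"database": True, "cache": True, "disk": True}


class HealthHandler(BaseHTTPRequestHandler):
    checks = staticmethod(check_dependencies)

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        results = self.checks()
        # 200 only if every dependency is fine, otherwise 503.
        status = 200 if all(results.values()) else 503
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Probes talk to the backend directly, so ignore any proxy settings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only when the backend answers the health check with 200."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # non-2xx status, refused connection, or timeout


if __name__ == "__main__":
    import threading

    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(probe(f"http://127.0.0.1:{server.server_address[1]}/health"))
    server.shutdown()
    server.server_close()
```

Returning 503 when a dependency fails is a design choice: it pulls the instance out of rotation, which is only helpful if other instances are not failing for the same reason.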
There are 3 backend servers behind nginx. One goes down. What happens to requests with max_fails=3 fail_timeout=30s?
Firewalls and Network Security
Server reachable from the internet. No **firewall**, and any scanner finds PostgreSQL (5432), Redis (6379), SSH (22) and starts knocking. In 2017, thousands of Redis servers fell to exactly this: no firewall, no password - full server access via CONFIG SET. A firewall is the filter that decides which packets pass and which get dropped.
**Stateful vs Stateless.** Stateless inspects each packet in isolation - primitive but fast. Stateful tracks connections: an outgoing HTTP request was allowed, so response packets are permitted automatically. Most modern production firewalls are stateful, including Linux netfilter with connection tracking and cloud security groups.
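Connection tracking can be modelled in a few lines. This is a toy model under strong simplifications: no TCP flags, no timeouts, all outbound traffic allowed. The `Packet` and `StatefulFirewall` names are hypothetical.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    outbound: bool


def stateless_allow(packet: Packet, inbound_ports: set[int]) -> bool:
    """Judge each packet on its own: inbound passes only on an open port."""
    return packet.outbound or packet.dport in inbound_ports


class StatefulFirewall:
    def __init__(self, inbound_ports: set[int]) -> None:
        self.inbound_ports = inbound_ports
        self.connections: set[tuple[str, int, str, int]] = set()

    def allow(self, packet: Packet) -> bool:
        if packet.outbound:
            # Remember the flow so the reply can be recognised later.
            self.connections.add((packet.src, packet.sport, packet.dst, packet.dport))
            return True
        # An inbound packet is a reply if it mirrors a tracked outbound flow.
        reply_of = (packet.dst, packet.dport, packet.src, packet.sport)
        if reply_of in self.connections:
            return True
        return packet.dport in self.inbound_ports


if __name__ == "__main__":
    fw = StatefulFirewall(inbound_ports={443})
    request = Packet("10.0.0.5", 50000, "203.0.113.10", 80, outbound=True)
    reply = Packet("203.0.113.10", 80, "10.0.0.5", 50000, outbound=False)
    print(fw.allow(request), fw.allow(reply), stateless_allow(reply, {443}))
```

The stateless check rejects the reply because port 50000 is not open inbound; a stateless ruleset would need an extra rule for the whole ephemeral port range to let responses back in.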
nftables succeeds iptables on modern Linux. Cleaner syntax, better performance. On Ubuntu 22.04+ the default backend is nftables: the `iptables` command still works, but as a compatibility wrapper (iptables-nft) that translates rules into nftables.
In the cloud (AWS, GCP, Azure), iptables gives way to **Security Groups** - virtual firewalls attached to an instance. Same principle: default deny, explicit allow. Terraform lets the entire security group live in code.
| Port | Service | Who Should Have Access |
|---|---|---|
| 22 | SSH | VPN / bastion host / office IP only |
| 80 | HTTP | Everyone (0.0.0.0/0) |
| 443 | HTTPS | Everyone (0.0.0.0/0) |
| 5432 | PostgreSQL | Backend servers only (10.0.0.0/24) |
| 6379 | Redis/KeyDB | Backend servers only (NEVER expose to internet!) |
| 3000-9000 | App servers | Only via Load Balancer |
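The table translates into a default-deny rule set almost mechanically. Here is a sketch of the evaluation logic using the standard `ipaddress` module. `RULES` and `is_allowed` are illustrative, the VPN range `10.8.0.0/24` is an assumed value, and the app-server row is left out because its source is a load balancer rather than a CIDR from the table.

```python
import ipaddress

# (destination port, allowed source CIDR). Anything not listed is denied.
RULES = [
    (22, "10.8.0.0/24"),    # SSH: assumed VPN range, adjust to the real one
    (80, "0.0.0.0/0"),      # HTTP: everyone
    (443, "0.0.0.0/0"),     # HTTPS: everyone
    (5432, "10.0.0.0/24"),  # PostgreSQL: backend subnet only
    (6379, "10.0.0.0/24"),  # Redis: backend subnet only
]


def is_allowed(src_ip: str, dst_port: int, rules=RULES) -> bool:
    """Return True only if an explicit rule matches; otherwise deny."""
    source = ipaddress.ip_address(src_ip)
    for port, cidr in rules:
        if port == dst_port and source in ipaddress.ip_network(cidr):
            return True
    return False  # default deny


if __name__ == "__main__":
    print(is_allowed("203.0.113.9", 443))    # public client, HTTPS
    print(is_allowed("203.0.113.9", 6379))   # public client, Redis
    print(is_allowed("10.0.0.12", 5432))     # backend server, PostgreSQL
```

The important line is the final `return False`: a packet that matches nothing is dropped, so forgetting a rule fails closed instead of open.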
Redis/Memcached exposed to the internet ranks among the most common vulnerabilities in the wild. In 2017, thousands of Redis servers fell to missing firewalls. Attackers used Redis CONFIG SET to plant SSH keys and walked straight onto the box.
**Myth.** A firewall provides complete server protection - if ports are closed, hacking is impossible.
**Reality.** A firewall is just one layer of protection in a defense-in-depth strategy. Open ports (80, 443) are still vulnerable to application-level attacks (SQL injection, XSS, RCE).
A firewall filters traffic by IP/ports but doesn't inspect HTTP request content. An attacker can exploit a vulnerability in the application through the allowed port 443. All layers are needed: firewall + WAF + software updates + monitoring + least privilege principle.
A server has SSH (22), HTTP (80), HTTPS (443), PostgreSQL (5432), and Redis (6379) open to all IPs (0.0.0.0/0). What needs to be fixed FIRST?
Key Takeaways
- TCP/IP - 4 layers: Link → Internet → Transport → Application. TCP guarantees delivery; UDP is fast but without guarantees
- DNS - hierarchical system (root → TLD → authoritative). TTL determines cache duration. Lower TTL before migrations
- Load Balancing - L4 (by IP/port, fast) vs L7 (by HTTP, smart). Algorithms: round-robin, least connections, IP hash
- Health checks - the load balancer checks backend health via a /health endpoint
- Firewall - default deny, open only what's needed. Databases and caches - NEVER expose to the internet
- Facebook 2021: DNS and BGP are the backbone of availability. A single network misconfiguration can take down an entire service
Related Topics
Networking knowledge is the foundation for containerization, orchestration, and cloud infrastructure:
- Linux for DevOps — Network utilities (curl, ss, tcpdump, iptables) are part of the Linux toolkit
- What is DevOps — SLI/SLO from lesson 1 measure network characteristics: latency, availability
Questions for Reflection
- A service is responding slowly. How can curl, dig, and ss help determine whether the problem is DNS, networking, or the application?
- Why should databases (PostgreSQL, Redis) never be accessible from the internet, even with a strong password?
- For a service with 50,000 RPS - which load balancing type (L4 or L7) and which algorithm are optimal? Why?
Related Lessons
- devops-01 — SLI/SLO from lesson 1 measure latency and availability
- devops-02 — Network utilities are part of the Linux toolkit
- devops-04 — Container networking builds on TCP/IP and DNS
- devops-06 — Kubernetes networking: pods, services, ingress
- sec-01 — Firewall rules are the first layer in defense-in-depth
- cloud-03 — CDN and GeoDNS apply the same DNS principles at a global scale
- net-01-intro
- net-47-container-networking