Computer Networks

Network Monitoring

Amazon reportedly loses $66,240 per minute of downtime, and Netflix $55,000. How much does a minute of downtime cost your service? Monitoring is the eyes and ears that catch a problem before a user opens a ticket.

**Cloudflare (June 2022)** - 19-minute outage took down half the internet. Detected in 3 minutes thanks to monitoring. Without it, they would have found out from Twitter.
**Facebook (October 2021)** - 6-hour outage. BGP monitoring showed Facebook disappearing from routing tables instantly. Finding the cause took much longer.
**Netflix** collects thousands of metrics per microservice. Anomaly detection finds degradation before users complain.

Prerequisites

ICMP, Ping, and Traceroute

SNMP: Management Protocol

**SNMP (Simple Network Management Protocol)** - the standard for monitoring network devices. Routers, switches, and servers export metrics via SNMP: CPU load, port traffic, temperature. A collector (Zabbix, Nagios, PRTG) polls devices and builds graphs.
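A collector usually turns raw SNMP counters into rates by polling twice and dividing the difference by the polling interval. The sketch below assumes an octet counter such as `ifHCInOctets` and a known link speed. The function name, sample values, and link speed are hypothetical, and no real SNMP library is called.

```python
def link_utilization(octets_t0, octets_t1, interval_s, link_bps, counter_bits=64):
    """Return link utilization (0.0-1.0) from two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 2 ** counter_bits
    # Counters wrap to zero at 2**counter_bits; modular subtraction
    # gives the right delta as long as at most one wrap occurred.
    delta_octets = (octets_t1 - octets_t0) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps


# Two hypothetical samples taken 60 s apart on a 1 Gbit/s port.
util = link_utilization(1_000_000_000, 4_750_000_000, 60, 1_000_000_000)
print(f"{util:.1%}")
```

The `counter_bits` parameter matters on fast links: a 32-bit counter can wrap more than once between polls, which is why 64-bit counters are generally preferred there.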

**MIB (Management Information Base)** - the object tree of a device. Each parameter has an OID (Object Identifier): 1.3.6.1.2.1.1.1.0 = sysDescr (system description). Vendors add their own branches under the enterprise subtree: Cisco = 1.3.6.1.4.1.9.*, Juniper = 1.3.6.1.4.1.2636.*.
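Because OIDs form a tree, deciding which branch an OID belongs to is a prefix match on whole components, not on the string. A minimal sketch, assuming only the prefixes named above plus the standard MIB-2 subtree (1.3.6.1.2.1); the function names are hypothetical:

```python
ENTERPRISE_PREFIXES = {
    (1, 3, 6, 1, 4, 1, 9): "Cisco",
    (1, 3, 6, 1, 4, 1, 2636): "Juniper",
}
STANDARD_MIB2 = (1, 3, 6, 1, 2, 1)


def parse_oid(text):
    """Turn '1.3.6.1.2.1.1.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def classify_oid(text):
    """Name the branch an OID falls under, comparing whole components."""
    oid = parse_oid(text)
    for prefix, vendor in ENTERPRISE_PREFIXES.items():
        if oid[:len(prefix)] == prefix:
            return vendor
    if oid[:len(STANDARD_MIB2)] == STANDARD_MIB2:
        return "standard MIB-2"
    return "unknown"


print(classify_oid("1.3.6.1.2.1.1.1.0"))     # sysDescr
print(classify_oid("1.3.6.1.4.1.2636.3.1"))  # a Juniper branch
print(classify_oid("1.3.6.1.4.1.99.1"))      # 99 is not 9
```

Comparing tuples avoids the classic string-prefix bug where `1.3.6.1.4.1.99` would be mistaken for the Cisco branch `1.3.6.1.4.1.9`.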

How does an SNMP TRAP differ from a GET?

NetFlow and IPFIX

**NetFlow** - Cisco's technology for exporting traffic information. Unlike SNMP (byte counters), NetFlow shows who is communicating with whom. A flow = src IP + dst IP + src port + dst port + protocol. You can see: 192.168.1.5 → 8.8.8.8:443, 1.5 GB over an hour.
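To make the flow idea concrete, the sketch below aggregates exported records by their 5-tuple and lists the largest conversations. The record layout, `FlowKey`, and the sample values are hypothetical; a real collector would decode these fields from NetFlow export packets.

```python
from collections import defaultdict, namedtuple

FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port protocol")


def aggregate(records):
    """Sum byte counts per 5-tuple; records are (FlowKey, bytes) pairs."""
    totals = defaultdict(int)
    for key, byte_count in records:
        totals[key] += byte_count
    return totals


def top_talkers(totals, n=3):
    """Return the n flows with the most bytes, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


records = [
    (FlowKey("192.168.1.5", "8.8.8.8", 51000, 443, "tcp"), 900_000_000),
    (FlowKey("192.168.1.7", "10.0.0.2", 40000, 22, "tcp"), 5_000_000),
    (FlowKey("192.168.1.5", "8.8.8.8", 51000, 443, "tcp"), 600_000_000),
]
totals = aggregate(records)
for key, total in top_talkers(totals):
    print(f"{key.src_ip} -> {key.dst_ip}:{key.dst_port} {total / 1e9:.2f} GB")
```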

**IPFIX** - the IETF standard derived from NetFlow (RFC 7011). Cisco NetFlow v9 became the basis for IPFIX. Alternatives: sFlow (packet sampling, less overhead), J-Flow (Juniper). All solve the same problem - traffic visibility.
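Sampling trades accuracy for overhead: the device inspects roughly one packet in N and the collector scales the result back up. A minimal sketch of that scaling step, assuming a 1-in-N sampling rate; the function name and sample sizes are hypothetical:

```python
def estimate_total_bytes(sampled_packet_sizes, sampling_rate):
    """Estimate total bytes on the link from 1-in-N packet samples."""
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1")
    # Each sampled packet stands in for about `sampling_rate` packets.
    return sum(sampled_packet_sizes) * sampling_rate


# Four hypothetical sampled packets at a 1-in-1000 sampling rate.
estimate = estimate_total_bytes([1500, 64, 1500, 400], 1000)
print(estimate)
```

The result is a statistical estimate, so it suits trends and top talkers better than exact per-flow accounting.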

When should sFlow be used instead of NetFlow?

Key Network Metrics

**Network Metrics** - indicators of network health. They can be grouped into: availability (up/down), performance (latency, jitter, throughput), and utilization (bandwidth usage, CPU, memory). The right set of metrics provides visibility without noise.
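The performance metrics can be derived from a series of probe results. The sketch below assumes round-trip times in milliseconds with `None` marking a lost probe, and it defines jitter as the mean absolute difference between consecutive samples. That definition is one common simplification (RTP, for example, uses a smoothed estimator instead), and the function name is hypothetical.

```python
from statistics import mean


def summarize_rtt(samples_ms):
    """Compute latency, jitter, and loss from RTT samples (None = lost)."""
    if not samples_ms:
        raise ValueError("need at least one probe result")
    received = [s for s in samples_ms if s is not None]
    loss = 1 - len(received) / len(samples_ms)
    if len(received) < 2:
        jitter = 0.0
    else:
        # Mean absolute change between consecutive received samples.
        jitter = mean(abs(b - a) for a, b in zip(received, received[1:]))
    latency = mean(received) if received else None
    return {"latency_ms": latency, "jitter_ms": jitter, "loss": loss}


stats = summarize_rtt([20, 22, None, 21, 25])
print(stats)
```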

**USE Method** (Utilization, Saturation, Errors) - an analysis method by Brendan Gregg. For each resource (link, CPU, buffer): how much is being used? is there a queue? are there errors? A systematic approach to bottleneck hunting.
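The three USE questions map directly onto a per-resource check. A minimal sketch, assuming the collector already provides a utilization ratio, a queue depth, and an error count; the 80% threshold and all names are hypothetical:

```python
def use_check(name, utilization, queue_depth, error_count, util_threshold=0.8):
    """Return a list of USE findings for one resource (empty = healthy)."""
    findings = []
    if utilization >= util_threshold:
        findings.append(f"{name}: utilization {utilization:.0%}")
    if queue_depth > 0:
        findings.append(f"{name}: saturation, {queue_depth} queued")
    if error_count > 0:
        findings.append(f"{name}: {error_count} errors")
    return findings


for finding in use_check("uplink", utilization=0.95, queue_depth=12, error_count=0):
    print(finding)
```

Running the same check over every link, CPU, and buffer gives a uniform checklist instead of ad hoc guessing.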

What does high jitter with normal latency indicate?

Alerting: Smart Notifications

**Alerting** - notifications about problems. Goal: learn about a problem before users do. Anti-pattern: alert on every twitch → alert fatigue → ignored alerts. You need meaningful alerts with context and prioritization.

**Symptom vs Cause alerts:** Alert on the symptom (latency > 500ms), not the cause (CPU > 90%). There can be many causes, but only one symptom - bad UX. Cause alerts are useful for troubleshooting, not for paging.

Myth: more alerts = better monitoring.

Reality: every alert must be actionable. If an alert requires no action, it is a metric, not an alert.

Alert fatigue is a real problem. Teams start ignoring alerts when there are too many. Better to have 5 critical alerts per week with a clear action than 50 false positives per day.
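One common way to cut false positives is to fire only when the condition has held continuously for some time, which is the idea behind a `for: 5m` clause in an alert rule. The sketch below is a simplified illustration of that behavior, not the implementation of any particular alerting system; the class name, the 500 ms threshold, and the sample data are hypothetical.

```python
class ForDurationAlert:
    """Fire only after latency stays above the threshold for hold_s seconds."""

    def __init__(self, threshold_ms=500, hold_s=300):
        self.threshold_ms = threshold_ms
        self.hold_s = hold_s
        self.pending_since = None  # when the condition first became true

    def observe(self, timestamp_s, latency_ms):
        """Record one sample; return True if the alert should be firing."""
        if latency_ms <= self.threshold_ms:
            self.pending_since = None  # any recovery resets the timer
            return False
        if self.pending_since is None:
            self.pending_since = timestamp_s
        return timestamp_s - self.pending_since >= self.hold_s


alert = ForDurationAlert()
samples = [(0, 700), (60, 300), (120, 650), (180, 800),
           (240, 620), (300, 900), (360, 710), (420, 700)]
states = [alert.observe(t, latency) for t, latency in samples]
print(states)
```

The brief spike at t=0 never pages anyone because the next sample recovers; only the sustained degradation that starts at t=120 fires, five minutes later.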

Why is 'for: 5m' important in an alert rule?

Key Ideas

**SNMP** - pull model (GET polling) for device metrics, with TRAPs for device-initiated notifications. MIB describes the parameter tree. SNMPv3 adds authentication and encryption
**NetFlow/IPFIX** - traffic visibility: who communicates with whom, how much data. sFlow for high-speed links
**Key network signals**: Latency, Jitter, Packet Loss, Throughput. USE Method: Utilization, Saturation, Errors
**Alerting rules:** symptom-based, with 'for' delay, actionable. Alert fatigue is the enemy of monitoring
**Runbooks** attached to alerts - what to do, not who is to blame. Automate common actions

Questions for Reflection

How would you set up monitoring for a network where you cannot use agents (IoT devices, legacy)?
Which metrics are most important for a VoIP application? For video streaming? For a web application?
How can you distinguish a real problem from a false positive without human intervention?

Related Lessons

ds-24-bloom-filter