Computer Networks
Network Monitoring
Amazon reportedly loses $66,240 per minute of downtime; Netflix, $55,000. How much does a minute of downtime cost your service? Monitoring is the eyes and ears that catch a problem before a user opens a ticket.
- **Cloudflare (June 2022)** - an outage in 19 data centers made many major sites unreachable. Detected in 3 minutes thanks to monitoring. Without it, they would have found out from Twitter
- **Facebook (October 2021)** - 6-hour outage. BGP monitoring showed Facebook disappearing from routing tables instantly. Finding the cause took much longer
- **Netflix** uses thousands of metrics per microservice. Anomaly detection finds degradation before user complaints
Prerequisites
SNMP: Management Protocol
**SNMP (Simple Network Management Protocol)** - the standard for monitoring network devices. Routers, switches, and servers export metrics via SNMP: CPU load, port traffic, temperature. A collector (Zabbix, Nagios, PRTG) polls devices and builds graphs.
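A collector usually turns two successive readings of a byte counter into a traffic rate before graphing it. The following is a minimal sketch of that calculation, assuming a 32-bit counter that wraps around at most once between polls; the function name `rate_bps` and the sample values are hypothetical.

```python
COUNTER32_MODULUS = 2**32  # a 32-bit counter wraps back to 0 after 2**32 - 1


def rate_bps(prev_octets: int, curr_octets: int, interval_s: float,
             modulus: int = COUNTER32_MODULUS) -> float:
    """Average bits per second between two samples of an octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction gives the right delta even if the counter
    # wrapped once between the two polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Two hypothetical polls taken 60 seconds apart.
print(rate_bps(1_000_000, 76_000_000, 60))          # 10000000.0
# The counter wrapped between polls; the delta is still 1000 octets.
print(rate_bps(COUNTER32_MODULUS - 500, 500, 10))   # 800.0
```

On fast links a 32-bit counter can wrap more than once per polling interval, which this sketch cannot detect; polling 64-bit counters or shortening the interval avoids that.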
**MIB (Management Information Base)** - the object tree of a device. Each parameter has an OID (Object Identifier), for example 1.3.6.1.2.1.1.1.0 = sysDescr (system description). Vendors add their own branches under 1.3.6.1.4.1 (enterprises): Cisco = 1.3.6.1.4.1.9.*, Juniper = 1.3.6.1.4.1.2636.*
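Because OIDs form a tree, the vendor that owns an object can be read from its prefix. The sketch below illustrates this with only the two vendor numbers named above; the helper names `parse_oid` and `vendor_of`, and the sample OID under the Cisco branch, are hypothetical.

```python
from typing import Optional, Tuple

# iso.org.dod.internet.private.enterprises
ENTERPRISES_PREFIX = (1, 3, 6, 1, 4, 1)
# Only the two enterprise numbers mentioned in the text.
VENDORS = {9: "Cisco", 2636: "Juniper"}


def parse_oid(text: str) -> Tuple[int, ...]:
    """Turn '1.3.6.1.2.1.1.1.0' (with or without a leading dot) into a tuple."""
    return tuple(int(part) for part in text.strip(".").split("."))


def vendor_of(oid: str) -> Optional[str]:
    """Return the vendor for an enterprise OID, or None for standard OIDs."""
    nums = parse_oid(oid)
    n = len(ENTERPRISES_PREFIX)
    if len(nums) > n and nums[:n] == ENTERPRISES_PREFIX:
        return VENDORS.get(nums[n], "unknown enterprise")
    return None


print(vendor_of("1.3.6.1.2.1.1.1.0"))     # None (sysDescr is a standard object)
print(vendor_of("1.3.6.1.4.1.9.1.1"))     # Cisco
print(vendor_of("1.3.6.1.4.1.2636.3.1"))  # Juniper
```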
How does an SNMP TRAP differ from a GET?
NetFlow and IPFIX
**NetFlow** - Cisco's technology for exporting traffic information. Where SNMP exposes byte counters, NetFlow shows who is communicating with whom. A flow is identified by src IP + dst IP + src port + dst port + protocol (classic NetFlow keys also include the ToS byte and the input interface). Example: 192.168.1.5 → 8.8.8.8:443, 1.5 GB over an hour.
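The idea of a flow can be illustrated by grouping packets on the 5-tuple and summing their sizes. This is a simplified sketch, not how an exporter is implemented: the `FlowKey` type, the `aggregate` function, and the packet list are hypothetical, and real exporters also track timestamps, packet counts, and flow timeouts.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate(packets: Iterable[Tuple[FlowKey, int]]) -> Counter:
    """Sum bytes per flow; packets with the same 5-tuple share one entry."""
    totals: Counter = Counter()
    for key, size_bytes in packets:
        totals[key] += size_bytes
    return totals


# Hypothetical observed packets: (flow key, packet size in bytes).
packets = [
    (FlowKey("192.168.1.5", "8.8.8.8", 51000, 443, "tcp"), 1500),
    (FlowKey("192.168.1.5", "8.8.8.8", 51000, 443, "tcp"), 900),
    (FlowKey("192.168.1.7", "1.1.1.1", 40000, 53, "udp"), 80),
]

flows = aggregate(packets)
# Print flows from largest to smallest, i.e. the "top talkers".
for key, total in flows.most_common():
    print(f"{key.src_ip} -> {key.dst_ip}:{key.dst_port} "
          f"({key.protocol}) {total} bytes")
```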
**IPFIX** - the IETF standard for flow export (RFC 7011), based on Cisco NetFlow v9. Alternatives: sFlow (packet sampling, less overhead on the device), J-Flow (Juniper). All solve the same problem - traffic visibility.
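Sampling trades accuracy for overhead: the device looks at roughly one packet in N, and the collector scales the samples back up. A minimal sketch of that scaling step, assuming a fixed 1-in-N sampling rate; the function name and the numbers are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_totals(sampled_sizes: Sequence[int],
                    sampling_rate: int) -> Tuple[int, int]:
    """Scale sampled packets up to estimated (packets, bytes) on the link.

    sampling_rate is N in "1 out of every N packets is sampled".
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1")
    est_packets = len(sampled_sizes) * sampling_rate
    est_bytes = sum(sampled_sizes) * sampling_rate
    return est_packets, est_bytes


# Three sampled packets at a hypothetical 1-in-1000 sampling rate.
print(estimate_totals([1500, 64, 900], 1000))  # (3000, 2464000)
```

The result is a statistical estimate: large flows are measured well, while short flows may never be sampled at all.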
When should sFlow be used instead of NetFlow?
Key Network Metrics
**Network Metrics** - indicators of network health. They can be grouped into: availability (up/down), performance (latency, jitter, throughput), and utilization (bandwidth usage, CPU, memory). The right set of metrics provides visibility without noise.
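The performance metrics can be derived from a series of probe results. The sketch below assumes one simple definition of jitter, the mean absolute difference between consecutive round-trip times (other definitions exist, such as the smoothed estimator used for RTP); the function name and sample values are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def summarize(rtts_ms: Sequence[Optional[float]]) -> Tuple[float, float, float]:
    """Return (mean latency ms, jitter ms, loss %) for a series of probes.

    A lost probe is represented as None.
    """
    received = [r for r in rtts_ms if r is not None]
    if len(received) < 2:
        raise ValueError("need at least two successful probes")
    loss_pct = 100 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    latency_ms = mean(received)
    # Jitter here: mean absolute change between consecutive received probes
    # (lost probes are skipped rather than counted as a change).
    jitter_ms = mean(abs(b - a) for a, b in zip(received, received[1:]))
    return latency_ms, jitter_ms, loss_pct


# Five hypothetical probes, the third one lost.
latency, jitter, loss = summarize([20, 26, None, 22, 24])
print(latency, jitter, loss)  # 23 4 20.0
```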
**USE Method** (Utilization, Saturation, Errors) - an analysis method by Brendan Gregg. For each resource (link, CPU, buffer): how much is being used? is there a queue? are there errors? A systematic approach to bottleneck hunting.
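The three USE questions translate directly into a checklist that can be run against every resource. A minimal sketch, assuming each resource reports utilization as a fraction, a queue depth as its saturation signal, and an error count; the `ResourceSample` type and the 80% threshold are illustrative assumptions, not part of the method itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    queue_depth: int    # work waiting for the resource (saturation)
    errors: int         # errors seen in the sampling interval


def use_findings(sample: ResourceSample,
                 util_threshold: float = 0.8) -> List[str]:
    """Apply the three USE questions to one resource."""
    findings = []
    if sample.utilization >= util_threshold:
        findings.append(f"{sample.name}: utilization {sample.utilization:.0%}")
    if sample.queue_depth > 0:
        findings.append(f"{sample.name}: saturated, queue depth {sample.queue_depth}")
    if sample.errors > 0:
        findings.append(f"{sample.name}: {sample.errors} errors")
    return findings


# Hypothetical samples for a link and a CPU.
samples = [
    ResourceSample("uplink", utilization=0.93, queue_depth=12, errors=0),
    ResourceSample("cpu", utilization=0.35, queue_depth=0, errors=0),
]
for s in samples:
    for finding in use_findings(s):
        print(finding)
```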
What does high jitter with normal latency indicate?
Alerting: Smart Notifications
**Alerting** - notifications about problems. Goal: learn about a problem before users do. Anti-pattern: alert on every twitch → alert fatigue → ignored alerts. You need meaningful alerts with context and prioritization.
**Symptom vs Cause alerts:** Alert on the symptom (latency > 500ms), not the cause (CPU > 90%). There can be many causes, but only one symptom - bad UX. Cause alerts are useful for troubleshooting, not for paging.
Myth: More alerts = better monitoring
Reality: Every alert must be actionable. If an alert requires no action - it is a metric, not an alert
Alert fatigue is a real problem. Teams start ignoring alerts when there are too many. Better to have 5 critical alerts per week with a clear action than 50 false positives per day.
Why is 'for: 5m' important in an alert rule?
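The effect of a 'for' delay can be sketched as a small state machine: the condition must hold continuously for the whole window before the alert fires, so a brief spike only reaches the pending state. This is a simplified model loosely based on how such rules behave in Prometheus-style alerting, not its implementation; the `AlertRule` class, the threshold, and the timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold_ms: float
    for_seconds: float
    pending_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, latency_ms: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if latency_ms <= self.threshold_ms:
            # Condition cleared: the waiting period starts over next time.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


# Symptom-based rule: latency > 500 ms sustained for 5 minutes.
rule = AlertRule(threshold_ms=500, for_seconds=300)
samples = [(0, 700), (60, 300), (120, 800), (300, 850), (420, 900)]
states = [rule.evaluate(t, latency) for t, latency in samples]
print(states)  # ['pending', 'inactive', 'pending', 'pending', 'firing']
```

The spike at t=0 never pages anyone because latency recovers at t=60; only the degradation that lasts from t=120 to t=420 fires.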
Key Ideas
- **SNMP** - pull model for device metrics (plus TRAPs pushed by the device). MIB describes the parameter tree. SNMPv3 adds authentication and encryption
- **NetFlow/IPFIX** - traffic visibility: who communicates with whom, how much data. sFlow for high-speed links
- **Key network metrics**: Latency, Jitter, Packet Loss, Throughput (a network-level counterpart to the SRE golden signals: latency, traffic, errors, saturation). USE Method: Utilization, Saturation, Errors
- **Alerting rules:** symptom-based, with 'for' delay, actionable. Alert fatigue is the enemy of monitoring
- **Runbooks** attached to alerts - what to do, not who is to blame. Automate common actions
Related Topics
Monitoring works together with other systems:
- ICMP and Ping - Basic availability check via ICMP echo
- Packet Analysis - Deep analysis when metrics indicate a problem
- DDoS Protection - NetFlow helps detect anomalous traffic
Questions for Reflection
- How would you set up monitoring for a network where you cannot use agents (IoT devices, legacy)?
- Which metrics are most important for a VoIP application? For video streaming? For a web application?
- How can you distinguish a real problem from a false positive without human intervention?