Computer Networks

Masterclass: Troubleshooting

Outage at 3 AM. Monitoring is screaming. Users are complaining. You have 15 minutes before escalation. How do you find the problem fast instead of guessing?

**GitHub** ran in a degraded state for about 24 hours in 2018 after a network partition caused split-brain in its database. Faster diagnosis of the partition could have shortened the recovery.
**Cloudflare** had a global outage in 2020 due to an incorrect BGP announcement. A proper troubleshooting flow helped find the cause in minutes.
**Facebook** was inaccessible for 6 hours in 2021: a BGP withdrawal removed it from the internet. Diagnosis was complicated because internal tools were down as well.

Prerequisites

Troubleshooting Methodology

**Network troubleshooting** is a systematic process for finding the root cause of a problem. Not 'let me try rebooting', but the scientific method: hypothesis → experiment → conclusion. A good diagnostician saves hours; a bad one creates new problems.

The key principle: **Divide and Conquer**. A network is a stack of layers. If HTTP isn't working, the problem can be at any layer from physical to application. Narrow the scope until you find the broken layer.
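The layer-by-layer narrowing can be sketched as a small shell function. This is a minimal sketch rather than a definitive procedure: the name `check_layers` is hypothetical, and it assumes a Linux host with `getent`, iputils `ping`, a netcat that accepts `-z` and `-w`, and `curl`, with the service speaking HTTP(S) on the given port. It skips the physical and data-link layers, which have to be checked on the machine or switch itself.

```shell
# Hypothetical helper: walk up the stack for one host and port,
# stopping at the first layer that clearly fails.
check_layers() {
    host=$1
    port=${2:-443}
    scheme=${3:-https}   # assumption: the service speaks HTTP(S)

    # Name resolution first: without an address nothing else can be tested.
    if ! getent hosts "$host" >/dev/null 2>&1; then
        echo "FAIL dns: cannot resolve $host"
        return 1
    fi

    # L3: one ICMP echo. Only a warning, because ICMP is often filtered.
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "WARN l3: no ICMP reply from $host (may be filtered)"
    fi

    # L4: can a TCP handshake complete on the application's port?
    if ! nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "FAIL l4: TCP connect to $host:$port failed"
        return 1
    fi

    # L7: does anything answer HTTP on that port within 5 seconds?
    if ! code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$scheme://$host:$port/"); then
        echo "FAIL l7: no HTTP response from $scheme://$host:$port/"
        return 1
    fi
    echo "OK: $host:$port answered HTTP $code"
}
```

A call such as `check_layers www.example.com 443` prints the first failing layer, which tells you where to dig next instead of guessing.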

**Golden rule**: change only one thing at a time. If you changed three parameters and it started working - you don't know which one was the cause. If you changed three and it broke worse - you don't know what to roll back.

A user complains: 'the website won't open'. Where do you start diagnosing?

Common Issues

Most network problems come from a small set of common causes (the 80/20 rule of thumb). Knowing them lets you quickly check the most likely culprits before diving deep. The usual suspects are DNS, firewall rules, routing, MTU, and ARP.

**Common trap**: 'It works on my machine'. Always test from the user's perspective - a different machine, different network, different browser. The problem may be specific to one client.

An SSH connection is established, but transferring a large file via SCP hangs. Most likely cause?

Diagnostic Tools

Each OSI layer has its own tools. Knowing when to use which is half the battle. From simple ones (ping, ss) to heavy artillery (tcpdump, strace).

**Pro tip**: mtr (My TraceRoute) = traceroute + ping in one. Shows packet loss and latency at each hop. Run for a minute to see patterns: `mtr -r -c 60 host`
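One way to use `ss` for TCP state is to count sockets per state and watch whether any state keeps growing. The sketch below is illustrative: `tcp_state_counts` is a hypothetical helper name, and it assumes the first column of `ss -tan` output is the state, preceded by a single header line.

```shell
# Hypothetical helper: count sockets per TCP state.
# Reads `ss -tan` output on stdin and prints "count state", largest first.
tcp_state_counts() {
    # NR > 1 skips the header line; $1 is the state column.
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -rn
}
```

Typical use would be `ss -tan | tcp_state_counts`, repeated a few minutes apart. If one state dominates and keeps climbing, for example CLOSE-WAIT, `ss -tanp state close-wait` lists the owning processes (root may be needed to see sockets of other users).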

Which tool shows that an application is not closing TCP connections (connection leak)?

Case Studies

Theory is good, but real experience is invaluable. Let's go through several real diagnostic cases that show how to apply the methodology in practice.

**curl timing format** (`curl-timing.txt`):

```
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
```

Save to a file and use with `-w @curl-timing.txt`.
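The values curl reports are cumulative, measured from the start of the request, so the time spent in each phase is the difference between neighbouring values. Here is a small sketch of that arithmetic, assuming the `name: seconds` output produced by the format file above; `curl_phases` is a hypothetical helper name.

```shell
# Hypothetical helper: turn curl's cumulative timings into per-phase durations.
# Reads "name: seconds" lines, as produced by -w @curl-timing.txt, on stdin.
curl_phases() {
    LC_ALL=C awk '
        NF >= 2 { key = $1; sub(/:$/, "", key); t[key] = $2 + 0 }
        END {
            # For plain HTTP there is no TLS handshake and time_appconnect is 0,
            # so fall back to time_connect as the end of connection setup.
            setup = (t["time_appconnect"] > 0) ? t["time_appconnect"] : t["time_connect"]
            printf "dns %.3f\n",      t["time_namelookup"]
            printf "tcp %.3f\n",      t["time_connect"] - t["time_namelookup"]
            printf "tls %.3f\n",      setup - t["time_connect"]
            printf "wait %.3f\n",     t["time_starttransfer"] - setup
            printf "transfer %.3f\n", t["time_total"] - t["time_starttransfer"]
        }'
}
```

For example: `curl -s -o /dev/null -w @curl-timing.txt https://example.com/ | curl_phases`. As a rough guide, a large `dns` value points at the resolver, large `tcp` or `tls` values at the network path, and a large `wait` at the server itself.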

**Myth**: if ping works, the network is fine.

**Reality**: ping only checks ICMP connectivity. TCP on another port may still be blocked, and MTU may be a problem for large packets.

ICMP and TCP are different protocols. A firewall can pass ping but block HTTP. MTU problems only appear with large packets. Always test on the protocol and port the application actually needs.
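To check whether large packets get through, one option is to send pings with the Don't Fragment bit set and shrink the size until one succeeds. The sketch below assumes Linux iputils `ping` (where `-M do` forbids fragmentation) and IPv4; the helper names and the list of candidate MTUs are illustrative, not a standard. It also depends on ICMP echo being allowed along the path, which is not guaranteed.

```shell
# Largest ICMP payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Hypothetical helper: probe a few candidate MTUs from large to small
# and report the first one that passes without fragmentation.
find_path_mtu() {
    host=$1
    for mtu in 1500 1492 1450 1400 1280; do
        size=$(icmp_payload_for_mtu "$mtu")
        # -M do sets Don't Fragment, so an oversized probe fails instead of being split.
        if ping -c 1 -W 2 -M do -s "$size" "$host" >/dev/null 2>&1; then
            echo "path MTU to $host is at least $mtu"
            return 0
        fi
    done
    echo "no DF probe of 1280 bytes or more reached $host"
    return 1
}
```

If `find_path_mtu host` reports a value below the local interface MTU, large TCP segments may be dropped silently on the path while small packets, such as an interactive SSH session or a default ping, pass normally.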

tcpdump shows many TCP retransmissions to one server. What does this mean?

Key Takeaways

**Methodology**: Define → Gather → Analyze → Test → Fix → Document. Change one thing at a time
**Divide and Conquer**: start from the lower OSI layers and work up. Physical → L2 → L3 → L4 → L7
**Top issues**: DNS, Firewall, Routing, MTU, ARP - check these first
**Tools**: ss for TCP state, tcpdump for packets, mtr for traceroute+ping, curl -v for HTTP
**Retransmits and CLOSE-WAIT** are red flags. Retransmits point to packet loss on the network; connections stuck in CLOSE-WAIT point to an application that is not closing its sockets

Questions for Reflection

Think back to the last network problem you solved. What was the methodology? What could have been done better?
Which tools from this lesson have you not used yet? Try them safely on a working system.
How would you document a troubleshooting process for knowledge transfer to your team?

Related Lessons

alg-12-bfs