On-Call and Incident Management
In October 2021, Facebook was unavailable for about 6 hours due to a BGP configuration error. Estimated revenue loss: around $60M. Impact: roughly 3.5 billion users. The root cause was a configuration management process issue - not a hardware failure. Incident management is not just about responding faster. It is about learning faster.
- **Netflix** publishes all postmortems publicly - this has become the industry standard for blameless culture and helped hundreds of companies avoid similar incidents.
- **Atlassian** uses runbook automation: when OOM occurs in production, a heap dump is automatically collected, the service restarts, and only then is the on-call engineer paged - MTTR dropped from 15 to 3 minutes.
- **AWS** manages thousands of on-call rotations through its own PagerDuty-based system - every engineer spends 1 week per month on-call, creating direct feedback between code quality and reliability.
PagerDuty and Alerting
PagerDuty is an incident management platform that receives alerts from monitoring tools such as Prometheus, Datadog, CloudWatch, and Grafana and routes them to the right on-call engineer based on escalation policies. The key principle: alert only when an SLO is at risk.
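One common way to quantify "an SLO is at risk" is the burn rate: the observed error rate divided by the error rate the SLO allows. At a burn rate of 1 the error budget lasts exactly the SLO window; higher values exhaust it sooner. The sketch below is a minimal illustration, not PagerDuty's or Prometheus's API. The function names are hypothetical, and the default threshold of 14.4 is an illustrative choice (sustained for one hour, it would consume 2% of a 30-day budget, since 14.4 / 720 hours = 0.02).

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    slo is a fraction, e.g. 0.999 for a 99.9% target.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing is burning
    allowed_error_rate = 1.0 - slo
    return (failed / total) / allowed_error_rate


def should_page(failed: int, total: int, slo: float, threshold: float = 14.4) -> bool:
    """Page only when the budget is burning fast enough to threaten the SLO.

    The threshold is an assumption for illustration; tune it per service.
    """
    return burn_rate(failed, total, slo) >= threshold


# 10 failures in 10,000 requests against a 99.9% SLO: burn rate about 1, no page.
print(should_page(10, 10_000, 0.999))
# 200 failures in 10,000 requests: burn rate about 20, page.
print(should_page(200, 10_000, 0.999))
```

Alerting on burn rate rather than on raw error counts ties each page to the SLO itself, so a brief blip that barely touches the budget does not wake anyone up.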
Alert fatigue is as dangerous as missing alerts. Google SRE rule: every alert must be actionable (engineer does something specific), urgent (cannot wait until morning), and accurate (low false positive rate). Alerts that fail any of these criteria should be deleted.
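The three criteria can be checked against alert history during a periodic alert review. The following is a rough sketch under stated assumptions: the `AlertStats` fields and the two thresholds are hypothetical, and in practice the numbers would come from whatever the alerting platform exports.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    fired: int                    # times the alert fired in the review period
    acted_on: int                 # times an engineer took a specific action
    false_positives: int          # times nothing was actually wrong
    can_wait_until_morning: bool  # True if a delayed response is acceptable


def failed_criteria(
    alert: AlertStats,
    min_action_rate: float = 0.5,          # assumed threshold
    max_false_positive_rate: float = 0.1,  # assumed threshold
) -> list[str]:
    """Return which of the three criteria the alert fails (empty list = keep)."""
    if alert.fired == 0:
        return []  # no history to judge from
    reasons = []
    if alert.acted_on / alert.fired < min_action_rate:
        reasons.append("not actionable")
    if alert.can_wait_until_morning:
        reasons.append("not urgent")
    if alert.false_positives / alert.fired > max_false_positive_rate:
        reasons.append("not accurate")
    return reasons


noisy = AlertStats("disk-usage-warning", fired=80, acted_on=4,
                   false_positives=60, can_wait_until_morning=True)
print(failed_criteria(noisy))
```

An alert that returns a non-empty list is a candidate for deletion or rework under the rule above.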
An on-call engineer receives 80 alerts during an 8-hour shift, most of which resolve automatically. What is the correct action?
Runbooks
A runbook is a step-by-step guide for handling a specific incident. An on-call engineer at 3am should not need to think - they should execute the runbook. A good runbook assumes zero context about the system.
Runbook automation goes further: tools like Opsgenie and PagerDuty can auto-execute runbook steps (restart pod, flush cache) before paging the engineer. Atlassian reports MTTR reduction from 15 to 3 minutes with automated runbooks.
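The control flow of runbook automation can be sketched in a few lines. This is a simplified illustration, not the Opsgenie or PagerDuty API: the steps, the health check, and the `page` callback are all hypothetical callables supplied by the caller. It also takes one particular design choice, paging only if the automated steps fail or do not restore health.

```python
from typing import Callable, Iterable


def auto_remediate(
    steps: Iterable[Callable[[], None]],
    is_healthy: Callable[[], bool],
    page: Callable[[str], None],
) -> bool:
    """Run runbook steps in order and stop as soon as the service is healthy.

    Returns True if the service recovered without paging anyone.
    Pages the on-call engineer if a step raises or all steps are exhausted.
    """
    for step in steps:
        step_name = getattr(step, "__name__", repr(step))
        try:
            step()
        except Exception as exc:
            page(f"automated step {step_name} failed: {exc}")
            return False
        if is_healthy():
            return True
    page("automated steps completed but the service is still unhealthy")
    return False


# Usage with stand-in steps; real ones would call the orchestrator or cache.
def restart_pod() -> None:
    print("restarting pod")


def flush_cache() -> None:
    print("flushing cache")


recovered = auto_remediate([restart_pod, flush_cache],
                           is_healthy=lambda: False,
                           page=lambda msg: print("PAGE:", msg))
```

Because the page message records what automation already tried, the engineer starts from the first step that still needs a human.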
A runbook says 'check the logs'. An on-call engineer cannot find the relevant logs in 5 minutes. What does this reveal about the runbook?
Blameless Postmortem
A postmortem (or incident review) analyzes an incident after it is resolved. The goal is organizational learning, not assigning blame. Blameless means the investigation focuses on system and process failures, not individual mistakes.
A postmortem without action items with owners and deadlines is just a historical document. Tracking action items to completion is what converts incidents into systemic improvements.
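Tracking can be as simple as refusing to accept an action item that lacks an owner or a deadline, and listing the ones that are overdue. The sketch below assumes a hypothetical `ActionItem` record; a real team would more likely enforce the same fields in its issue tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def missing_fields(item: ActionItem) -> list[str]:
    """Return what keeps this item from being trackable (empty list = OK)."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no deadline")
    return issues


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [i for i in items if not i.done and i.due is not None and i.due < today]


vague = ActionItem("improve monitoring")
print(missing_fields(vague))
```

A check like this catches missing ownership and dates, but it cannot tell whether the description itself is specific enough to act on; that still needs a human reviewer.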
A postmortem action item reads 'improve monitoring'. What is wrong with this action item?
Escalation and SLO/Error Budget
SLA (Service Level Agreement) is a contract with customers: '99.9% uptime per month, or a refund'. SLO (Service Level Objective) is the internal target: '99.95% uptime' - stricter than SLA to provide a buffer. Error Budget is the allowed failure: 100% - SLO% = budget for incidents and planned maintenance.
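The error budget formula translates directly into minutes of allowed downtime. A minimal sketch, assuming availability is measured as uptime over a fixed window (the function name is illustrative):

```python
def error_budget_minutes(slo_percent: float, days: int) -> float:
    """Allowed downtime in minutes for an uptime SLO over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


# Over a 30-day window (43,200 minutes):
print(round(error_budget_minutes(99.9, 30), 1))   # SLA-level target: about 43.2 minutes
print(round(error_budget_minutes(99.95, 30), 1))  # stricter SLO: about 21.6 minutes
```

The gap between the two results is the buffer mentioned above: the team can exhaust its internal budget and still be well inside the contractual SLA.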
Error Budget makes the reliability vs. feature velocity tradeoff explicit and quantitative. When the budget is gone, reliability work takes priority over new features - both development and SRE agree on this policy in advance.
Misconception: an SLO violation means the team failed and needs to work harder.
Reality: an SLO violation means the error budget is consumed, which automatically triggers a policy response: freeze risky deployments and prioritize reliability work.
The error budget is a shared language between SRE and product teams. Exhausting it is not a failure - it is a signal that triggers a predefined response. The goal is to make reliability decisions data-driven, not emotional.
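A predefined response can be written down as a small decision function that both teams agree on in advance. The sketch below is illustrative only: the three policy names and the 25% "caution" threshold are assumptions, not a standard.

```python
def deployment_policy(budget_minutes: float, downtime_minutes: float) -> str:
    """Map remaining error budget to an agreed deployment policy.

    Policy names and the 25% threshold are hypothetical examples.
    """
    remaining = budget_minutes - downtime_minutes
    if remaining <= 0:
        return "freeze"   # budget exhausted: only reliability fixes ship
    if remaining < 0.25 * budget_minutes:
        return "caution"  # budget nearly gone: extra review for risky changes
    return "normal"       # budget available: ship features as usual


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
print(deployment_policy(43.2, 10))  # normal
print(deployment_policy(43.2, 40))  # caution
print(deployment_policy(43.2, 50))  # freeze
```

Encoding the policy this way keeps the decision mechanical: the argument happens once, when the thresholds are set, not during each incident.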
A team has an SLO of 99.9% for their payment API. In March, a single 47-minute outage occurred. What happens to deployments for the rest of the month?
Key Ideas
- **PagerDuty + intelligent alerts** - alert only when an SLO is at risk; an escalation policy ensures no incident goes unanswered.
- **Runbook** - step-by-step instructions that require no thinking at 3am: diagnosis, mitigation, escalation, rollback.
- **Blameless postmortem + Error Budget** - systemic root cause analysis with concrete action items prevents recurrence; the error budget balances reliability and development speed.
Related Topics
Incident management integrates with the full observability stack:
- ELK Stack and Logging — Kibana is the first tool the on-call engineer opens to investigate an incident via logs.
- Distributed Tracing — Jaeger helps quickly localize the problematic service during an incident in a microservices architecture.
Questions for Reflection
- How do you determine the right severity level for an incident that affects 5% of users but 100% of payments?
- What is needed for a blameless postmortem to actually work, rather than being a formality?
- With a 99.9% SLO, should game days still be conducted if the error budget is almost exhausted in the first 2 weeks of the month?