Software Engineering
SRE: Site Reliability Engineering
2003: Google hires the first SRE - Ben Treynor Sloss. Task: ensure reliability of a search engine processing billions of requests per day. Result: a new engineering discipline. Today SRE is the standard model for reliability at Netflix, Airbnb, Lyft, Uber, and thousands of other companies.
- **Google SRE Book**: open book on SRE practices (sre.google/sre-book). Became the industry standard, described how Google achieves reliability at scale with engineering, not manual operations.
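- **Atlassian Statuspage**: the company publicly publishes reliability targets for its products (Jira, Confluence). Once published to customers, such targets act as an external SLA rather than an internal SLO. An error budget policy determines when new feature work pauses in favor of reliability work.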
- **PagerDuty**: incident management platform used by thousands of SRE teams. PagerDuty data: average MTTR (Mean Time To Recovery) for teams with blameless postmortems is 2x lower than teams focused on blame.
SLO, SLI, SLA
SRE (Site Reliability Engineering) is a discipline created at Google in 2003 by Ben Treynor Sloss. The core idea: reliability is a software engineering problem. SRE teams apply software engineering practices to operations: automation, measuring reliability quantitatively, and managing risk through error budgets.
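SLI (Service Level Indicator) - a quantitative measure of how the service behaves, such as request success rate or latency. Good SLIs are measurable in real time, reflect user experience (request success rate rather than server CPU), and are controllable by the team. SLO (Service Level Objective) - an internal target for an SLI (for example, 99.9% availability). SLA (Service Level Agreement) - an external contract with customers that carries consequences for violation. The SLO must be stricter than the SLA to provide a buffer.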
SLO for the service: p99 latency < 200ms. This month, 0.2% of requests responded in 250ms. Is the SLO violated?
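One way to reason about the question above is to compute the percentile directly. The sketch below uses the nearest-rank definition of a percentile and a hypothetical month of 10,000 requests; the 120 ms latency for the fast requests is an assumption made only for illustration, as are the function names.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def latency_slo_met(samples_ms, q=99, threshold_ms=200.0):
    """True when the q-th percentile latency is below the threshold."""
    return percentile(samples_ms, q) < threshold_ms


# Hypothetical traffic: 0.2% of requests (20 of 10,000) took 250 ms.
# The 120 ms value for the remaining requests is an assumption, not from the text.
latencies_ms = [120.0] * 9980 + [250.0] * 20

print(percentile(latencies_ms, 99))   # 120.0
print(latency_slo_met(latencies_ms))  # True
```

Under these assumptions the SLO holds: a p99 target tolerates up to 1% of requests above the threshold, and only 0.2% exceeded it. If the slow share grew past 1%, the 99th percentile would land on a 250 ms sample and the check would fail.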
Error Budget
An error budget is the amount of failure an SLO allows over a period. With an SLO of 99.9% availability per month, the error budget is 0.1%: 0.1% of requests may fail or, measured as time, about 43.2 minutes of downtime in a 30-day month (30 days × 24 h × 60 min × 0.001). The error budget quantifies how much risk the team can take with new deployments.
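The arithmetic behind these numbers can be sketched in a few lines. The function names are illustrative, and a 30-day window is assumed by default.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_fraction(slo_percent):
    """Share of the period (or of requests) that is allowed to fail."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return (100.0 - slo_percent) / 100.0


def downtime_budget_minutes(slo_percent, days=30):
    """Time-based budget: minutes of downtime allowed in the window."""
    return error_budget_fraction(slo_percent) * days * MINUTES_PER_DAY


def allowed_failed_requests(slo_percent, total_requests):
    """Request-based budget: failed requests allowed, rounded to the nearest integer."""
    return round(error_budget_fraction(slo_percent) * total_requests)


print(round(downtime_budget_minutes(99.9), 1))    # 43.2
print(round(downtime_budget_minutes(99.99), 2))   # 4.32
print(allowed_failed_requests(99.9, 1_000_000))   # 1000
```

Each extra nine shrinks the budget tenfold: 99.99% leaves roughly 4.3 minutes per 30-day month.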
Error budget policy - a formal agreement: if the error budget is exhausted, feature development stops until reliability is restored. This creates a shared interest between development and operations: both want a healthy error budget. When the budget is healthy, teams can take more risks (new deploys, experiments).
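A policy like this can be expressed as a small decision function. The sketch below is a hypothetical illustration: the `BudgetStatus` type, the three outcome labels, and the "caution" rule based on burn rate are assumptions, not part of any standard policy.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_minutes: float    # total error budget for the window
    consumed_minutes: float  # downtime recorded so far
    day: int                 # current day of the window (1-based)
    window_days: int = 30

    @property
    def remaining_minutes(self):
        return self.budget_minutes - self.consumed_minutes

    @property
    def burn_rate(self):
        """Budget share consumed divided by window share elapsed.

        A value above 1.0 means the budget will run out before the window ends.
        """
        consumed_share = self.consumed_minutes / self.budget_minutes
        elapsed_share = self.day / self.window_days
        return consumed_share / elapsed_share


def release_decision(status):
    """Map budget status to a (hypothetical) policy outcome."""
    if status.remaining_minutes <= 0:
        return "freeze"   # budget exhausted: pause features, work on reliability
    if status.burn_rate > 1.0:
        return "caution"  # assumed rule: burning faster than the window elapses
    return "normal"       # healthy budget: deploys and experiments proceed


# Budget of 43.2 minutes fully spent by day 20 of a 30-day window.
print(release_decision(BudgetStatus(43.2, 43.2, day=20)))  # freeze
```

Making the rule explicit in code or configuration keeps the freeze decision mechanical rather than a negotiation between teams after each incident.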
A team has exhausted its error budget on day 20 of a 30-day month. According to error budget policy, what is the correct action?
Toil and Automation
Toil in SRE is manual, repetitive, automatable operational work without long-term value. Examples: manual deploys by checklist (30 min each), restarting services every Monday morning, manually resizing VMs during traffic spikes. Toil is tactical, not strategic: it does not improve the system.
Measuring toil: count manual operations per week and compare time spent on repetitive tasks with time spent on engineering. A useful test: 'if we doubled traffic, would this work double?' If yes, it is toil. Google SRE sets a 50% cap on toil: if more than 50% of an engineer's time goes to toil, the team gets engineering time to automate it.
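The measurement described above reduces to a ratio. Here is a minimal sketch, assuming time is tracked in hours per task; the task names and the weekly numbers are made up for illustration.

```python
TOIL_CAP = 0.5  # the 50% cap described above


def toil_fraction(hours_by_task, toil_tasks):
    """Share of total tracked hours spent on tasks classified as toil."""
    total = sum(hours_by_task.values())
    if total <= 0:
        raise ValueError("no hours tracked")
    toil = sum(hours for task, hours in hours_by_task.items() if task in toil_tasks)
    return toil / total


def exceeds_toil_cap(fraction, cap=TOIL_CAP):
    """True when toil takes more than the allowed share of time."""
    return fraction > cap


# Hypothetical 40-hour week for one engineer.
week = {
    "manual deploys": 10,
    "alert response": 9,
    "restarts": 5,
    "automation project": 16,
}
toil_tasks = {"manual deploys", "alert response", "restarts"}

fraction = toil_fraction(week, toil_tasks)
print(round(fraction, 2))          # 0.6
print(exceeds_toil_cap(fraction))  # True
```

Which tasks count as toil is a judgment call for the team; the code only makes the resulting ratio visible.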
An SRE spends 60% of their time on toil: manual deploys, alert response, restarts. What is the correct action according to SRE principles?
Incident Management
Incident management is a structured process for responding to production problems. A typical severity scale: SEV1 (complete service outage, all hands), SEV2 (significant degradation, on-call responds), SEV3 (minor issue, handled in business hours), SEV4 (no user impact, tracked). Clear severity definitions prevent both under-reaction and over-reaction.
A blameless postmortem is an analysis of an incident after it is resolved. Key principle: the aim is not to find who is guilty, but to find systemic problems. John Allspaw (Etsy): 'if the same person were placed in the same situation, they would make the same choice'. The goal is to change the system, not punish individuals.
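The severity levels above can be encoded so that triage is consistent across responders. This is a minimal sketch: the `classify` function and its `user_impact` labels ('none', 'minor', 'significant') are hypothetical and would need to match a team's own definitions.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "complete service outage, all hands"
    SEV2 = "significant degradation, on-call responds"
    SEV3 = "minor issue, handled in business hours"
    SEV4 = "no user impact, tracked"


def classify(full_outage, user_impact):
    """Pick a severity from two coarse signals.

    user_impact is one of 'none', 'minor', 'significant' (hypothetical labels).
    """
    if full_outage:
        return Severity.SEV1
    by_impact = {
        "significant": Severity.SEV2,
        "minor": Severity.SEV3,
        "none": Severity.SEV4,
    }
    if user_impact not in by_impact:
        raise ValueError(f"unknown user impact: {user_impact!r}")
    return by_impact[user_impact]


print(classify(False, "significant").name)  # SEV2
```

Real classifiers usually take more inputs (affected regions, revenue impact, data loss), but even a coarse mapping removes guesswork during the first minutes of an incident.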
Blameless culture requires psychological safety: engineers must feel safe reporting problems without fear of punishment. Google found that blameless postmortems increase incident reporting - teams surface near-misses before they become incidents, improving the overall reliability system.
Myth: SRE is just system administration under a new title
SRE introduces an engineering approach to operations: software solutions for reliability problems, quantitative measurement of reliability through SLO/SLI, error budgets as a risk management mechanism, and elimination of toil through automation.
Traditional sysadmins manage systems through manual operations and institutional knowledge. SREs write code to automate operations, treat reliability as a software problem, and use error budgets to make risk/reward trade-offs explicit - fundamentally different from traditional operations.
A postmortem revealed that developer Ivan deployed code with a bug that caused the incident. The blameless principle means:
Key Ideas
- **SLO/SLI/SLA**: quantitative reliability targets. SLI measures user experience, SLO is the internal team target, SLA is the external customer contract. SLO must be stricter than SLA.
- **Error Budget**: 100% - SLO = allowed failure rate per period. Exhausted budget = freeze features, focus on reliability. Shared incentive between development and operations.
- **Toil**: manual repetitive operational work. Google SRE: 50% cap on toil. Exceeding it triggers engineering work to automate - SRE's core value proposition.
Related Topics
SRE connects development processes with production reliability:
- Observability (Logs, Metrics, Traces): SLIs are measured through observability tools. Without good monitoring, SLOs are aspirational rather than enforceable.
- Chaos Engineering: chaos experiments proactively verify that reliability targets hold under failure conditions - the offensive complement to SRE's defensive monitoring.
Questions for Reflection
- SLO 99.99% sounds better than 99.9% - why is maximum availability not always the right target?
- How does error budget policy change the interaction between development and operations teams?
- Blameless postmortem culture - what prevents its adoption in teams and how to overcome resistance?