Software Engineering

SRE: Site Reliability Engineering

2003: Ben Treynor Sloss joins Google and founds its first SRE team. Task: ensure the reliability of a search engine operating at very large scale. Result: a new engineering discipline. Today SRE is a standard model for reliability at Netflix, Airbnb, Lyft, Uber, and thousands of other companies.

**Google SRE Book**: a freely available book on SRE practices (sre.google/sre-book). It became an industry reference and describes how Google achieves reliability at scale through engineering rather than manual operations.
**Atlassian Statuspage**: the company publicly publishes reliability targets for its products (Jira, Confluence), which act as an external SLA. An error budget policy determines when new feature work pauses in favor of reliability work.
**PagerDuty**: an incident management platform used by thousands of SRE teams. PagerDuty data: average MTTR (Mean Time To Recovery) for teams with blameless postmortems is half that of teams focused on blame.

SLO, SLI, SLA

SRE (Site Reliability Engineering) is a discipline created at Google in 2003 by Ben Treynor Sloss. The core idea: reliability is a software engineering problem. SRE teams apply software engineering practices to operations: automation, measuring reliability quantitatively, and managing risk through error budgets.

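SLI (Service Level Indicator) is a measured metric of service behavior, such as the share of successful requests. Good SLIs are measurable in real time, reflect user experience (request success rate rather than server CPU), and are controllable by the team. SLO (Service Level Objective) is the internal target for an SLI (for example, 99.9% availability). SLA (Service Level Agreement) is the external contract with customers, with consequences for violation. The SLO must be stricter than the SLA so that the team has a buffer before contractual penalties apply.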
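The relationship between the three terms can be sketched in a few lines of Python. This is an illustrative sketch only: the `RequestWindow` shape, the function names, and the 99.9% / 99.5% targets are assumptions chosen for the example, not values from any real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    successful: int


# Illustrative targets: the internal SLO is stricter than the external SLA.
SLO_TARGET = 0.999
SLA_TARGET = 0.995


def availability_sli(window: RequestWindow) -> float:
    """SLI: the fraction of requests that succeeded in the window."""
    if window.total == 0:
        return 1.0  # assumption: no traffic is treated as no failures
    return window.successful / window.total


def evaluate(window: RequestWindow) -> dict:
    """Compare the measured SLI against the SLO and the SLA."""
    sli = availability_sli(window)
    return {
        "sli": sli,
        "slo_met": sli >= SLO_TARGET,
        "sla_met": sli >= SLA_TARGET,
    }


window = RequestWindow(total=1_000_000, successful=998_500)
print(evaluate(window))
```

In this example the SLI is 99.85%: the internal SLO is missed while the external SLA still holds, which is exactly the buffer a stricter SLO is meant to provide.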

SLO for the service: p99 latency < 200ms. This month, 0.2% of requests responded in 250ms. Is the SLO violated?

Error Budget

An error budget is the amount of failure an SLO permits over a period. With a 99.9% monthly availability SLO, the error budget is 0.1%: one failed request in every 1,000, or, measured in time, 43.2 minutes of downtime in a 30-day month (0.001 × 43,200 minutes). The error budget quantifies how much risk the team can take with new deployments.
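The arithmetic behind the error budget can be checked with a short Python sketch. The function names and the 30-day default period are assumptions made for illustration.

```python
def error_budget_ratio(slo: float) -> float:
    """Allowed failure ratio: 100% minus the SLO."""
    return 1.0 - slo


def downtime_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the period."""
    return error_budget_ratio(slo) * period_days * 24 * 60


def request_budget(slo: float, total_requests: int) -> int:
    """Error budget expressed as a number of failed requests."""
    return round(error_budget_ratio(slo) * total_requests)


print(round(downtime_budget_minutes(0.999), 2))   # 43.2
print(round(downtime_budget_minutes(0.9999), 2))  # 4.32
print(request_budget(0.999, 10_000_000))          # 10000
```

Adding one more nine shrinks the monthly budget from about 43 minutes to about 4 minutes, which shows how quickly stricter targets reduce the room left for risky changes.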

Error budget policy - a formal agreement: if the error budget is exhausted, feature development stops until reliability is restored. This creates a shared interest between development and operations: both want a healthy error budget. When the budget is healthy, teams can take more risks (new deploys, experiments).
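An error budget policy can be expressed as a simple decision rule. The sketch below is a minimal illustration, assuming a request-based budget and a two-outcome policy ("ship" or "freeze"); real policies are written agreements and usually have more nuance. The function names are hypothetical, and `Fraction` is used so that a budget consumed exactly to 100% is not missed because of floating-point rounding.

```python
from fractions import Fraction


def budget_consumed(slo: Fraction, total_requests: int, failed_requests: int) -> Fraction:
    """Return the share of the error budget already used (1 means exhausted)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("no error budget available (100% SLO or no traffic)")
    return Fraction(failed_requests) / allowed_failures


def release_decision(consumed: Fraction) -> str:
    """Hypothetical policy: freeze feature releases once the budget is used up."""
    if consumed >= 1:
        return "freeze"
    return "ship"


slo = Fraction("0.999")
print(release_decision(budget_consumed(slo, 1_000_000, 400)))   # ship
print(release_decision(budget_consumed(slo, 1_000_000, 1000)))  # freeze
```

Because the rule depends only on measured numbers, neither development nor operations has to argue about whether a release is "safe enough"; the budget decides.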

A team has exhausted its error budget on day 20 of a 30-day month. According to error budget policy, what is the correct action?

Toil and Automation

Toil in SRE is manual, repetitive, automatable operational work without long-term value. Examples: manual deploys by checklist (30 min each), restarting services every Monday morning, manually resizing VMs during traffic spikes. Toil is tactical, not strategic: it keeps the system running but does not improve it.

Measuring toil: number of manual operations per week, time spent on repetitive tasks vs engineering. Test: 'if we doubled traffic, would toil double?' If yes - it is toil. Google SRE: 50% cap on toil. If more than 50% of time is toil, the team gets engineering time to automate.
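One way to track the 50% cap is to log hours by category and compute the toil share. The sketch below assumes a hypothetical weekly time log; the category names and the helper functions are invented for illustration.

```python
TOIL_CAP = 0.5  # the 50% cap on toil described above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours spent on categories marked as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


def exceeds_cap(fraction: float, cap: float = TOIL_CAP) -> bool:
    """True when toil takes more time than the cap allows."""
    return fraction > cap


# Hypothetical one-week time log, in hours.
week = {
    "manual_deploys": 10,
    "service_restarts": 6,
    "alert_response": 8,
    "automation_project": 16,
}
toil = {"manual_deploys", "service_restarts", "alert_response"}

share = toil_fraction(week, toil)
print(share, exceeds_cap(share))  # 0.6 True
```

A result above the cap is a signal to redirect time toward automating the largest toil category, rather than a metric to report and ignore.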

An SRE spends 60% of their time on toil: manual deploys, alert response, restarts. What is the correct action according to SRE principles?

Incident Management

Incident management is a structured process for responding to production problems. A commonly used severity scale: SEV1 (complete service outage, all hands), SEV2 (significant degradation, on-call responds), SEV3 (minor issue, handled in business hours), SEV4 (no user impact, tracked). Clear severity definitions prevent both under-reaction and over-reaction.
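Severity definitions are most useful when they are unambiguous enough to encode. The following sketch maps the four levels above to a classification function; the three boolean inputs are a simplifying assumption, and real triage usually considers more signals.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "complete service outage, all hands"
    SEV2 = "significant degradation, on-call responds"
    SEV3 = "minor issue, handled in business hours"
    SEV4 = "no user impact, tracked"


def classify(full_outage: bool, significant_degradation: bool, user_impact: bool) -> Severity:
    """Pick the highest severity whose condition holds (hypothetical rules)."""
    if full_outage:
        return Severity.SEV1
    if significant_degradation:
        return Severity.SEV2
    if user_impact:
        return Severity.SEV3
    return Severity.SEV4


incident = classify(full_outage=False, significant_degradation=True, user_impact=True)
print(incident.name, "-", incident.value)
```

Checking the conditions from most to least severe means an incident that matches several descriptions is always assigned the higher level.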

A blameless postmortem is an analysis of an incident after it has been resolved. The key principle is to find systemic problems, not a guilty party. John Allspaw (Etsy) expressed the idea this way: if the same person were placed in the same situation, they would make the same choice. The goal is to change the system, not punish individuals.

Blameless culture requires psychological safety: engineers must feel safe reporting problems without fear of punishment. Google found that blameless postmortems increase incident reporting: teams surface near-misses before they become incidents, which improves overall system reliability.

Myth: "SRE is just system administrators with a new title"

SRE introduces an engineering approach to operations: software solutions for reliability problems, quantitative measurement of reliability through SLO/SLI, error budgets as a risk management mechanism, and elimination of toil through automation.

Traditional sysadmins manage systems through manual operations and institutional knowledge. SREs write code to automate operations, treat reliability as a software problem, and use error budgets to make risk/reward trade-offs explicit - fundamentally different from traditional operations.

A postmortem revealed that developer Ivan deployed code with a bug that caused the incident. The blameless principle means:

Key Ideas

**SLO/SLI/SLA**: quantitative reliability targets. SLI measures user experience, SLO is the internal team target, SLA is the external customer contract. SLO must be stricter than SLA.
**Error Budget**: 100% - SLO = allowed failure rate per period. Exhausted budget = freeze features, focus on reliability. Shared incentive between development and operations.
**Toil**: manual repetitive operational work. Google SRE: 50% cap on toil. Exceeding it triggers engineering work to automate - SRE's core value proposition.

Questions for Reflection

SLO 99.99% sounds better than 99.9% - why is maximum availability not always the right target?
How does error budget policy change the interaction between development and operations teams?
Blameless postmortem culture - what prevents its adoption in teams and how to overcome resistance?

Related Lessons

devops-10
