DevOps

What is DevOps

In 2009, Flickr shocked the industry: **10 deploys a day** while most companies deployed once a month. By 2011, Amazon was reporting a deploy every 11.7 seconds. How? Not new servers, not a magic framework - but a revolution in HOW people work together. That's DevOps.

  • **Netflix** - thousands of deploys a day thanks to Chaos Engineering and a "freedom & responsibility" culture
  • **Amazon** - a deploy every 11.7 seconds, transitioning from monolith to microservices through DevOps culture
  • **Etsy** - one of the DevOps pioneers, continuous deployment since 2010, blameless postmortems as the standard
  • **Google** - created SRE (Site Reliability Engineering) - their own take on DevOps with a focus on reliability and error budgets

The Birth of DevOps

In 2008, Patrick Debois, a Belgian IT consultant, was frustrated by the chasm between Dev and Ops. In 2009, inspired by the Flickr talk, he organized the first DevOpsDays conference in Ghent. On Twitter the conference name was shortened to the hashtag #devops - and the movement got its name. In other words, the word DevOps was born out of Twitter's character limit.

DevOps Culture and CALMS

In 2009, at the Velocity conference, John Allspaw and Paul Hammond from Flickr presented the talk **"10+ Deploys Per Day"**. The industry was shocked: most companies deployed once a month, while Flickr deployed ten or more times a day. Their secret wasn't a magic tool - it was a **culture of collaboration** between developers and operations.

**CALMS** - a framework describing the five pillars of DevOps: **C**ulture, **A**utomation, **L**ean, **M**easurement, **S**haring.

Before DevOps, there was the so-called **"Wall of Confusion"** - a wall of misunderstanding between Dev and Ops. Developers wanted fast change, operators wanted stability. The result? A conflict of interests that slowed everyone down.

**Shift Left** - a key principle: move testing, security, and monitoring to earlier stages of development. The earlier a bug is found, the cheaper it is to fix. A commonly cited rule of thumb is that a bug in production can cost up to 100x more to fix than a bug caught during code review.

**Blameless postmortems** - incident reviews without finger-pointing. Amazon, Google, and Netflix run them after every major outage. The goal is not to punish, but to **find the systemic root cause** and prevent recurrence. If people fear punishment, they hide mistakes.

In 2001, Amazon had a monolith that took days to deploy. After migrating to a service-oriented architecture and adopting DevOps culture, they reported **a deploy every 11.7 seconds** (a figure from 2011). It's not tool magic - it's cultural transformation.

What is the "Wall of Confusion" in the context of DevOps?

Automation: CI/CD and IaC

Culture without automation is just nice words. **Automation** is what turns DevOps principles into daily reality. Three main pillars: **CI/CD pipeline**, **Infrastructure as Code**, and **Configuration Management**.

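**CI (Continuous Integration)** - every commit is automatically built and tested. **CD** covers two related practices: **Continuous Delivery** keeps every tested build ready to release, with the final push to production triggered by a person, while **Continuous Deployment** ships every build that passes the tests to production automatically. The chain: commit → build → tests → deploy.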
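
The commit → build → tests → deploy chain can be sketched in a few lines. This is only an illustration, not how any particular CI system is implemented: `run_pipeline` and the lambda stages are hypothetical stand-ins for real build, test, and deploy commands.

```python
from typing import Callable

# A stage is a name plus a function that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that actually ran.
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            print(f"{name} failed: pipeline stopped, later stages are skipped")
            break
    return executed


# Hypothetical stand-ins for real commands.
stages = [
    ("build", lambda: True),
    ("tests", lambda: False),  # simulate a failing unit test
    ("deploy", lambda: True),
]

ran = run_pipeline(stages)
print(ran)  # ['build', 'tests']
```

The ordering is the design choice that matters: deploy sits behind the tests, so a broken commit cannot reach production through this path.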

**Infrastructure as Code (IaC)** - describing infrastructure in configuration files instead of manually configuring servers. Terraform, Pulumi, and CloudFormation create identical environments with a single command.
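
The core idea behind declarative IaC is comparing a declared desired state with what actually exists and deriving the changes. The sketch below illustrates that idea only; it is not how Terraform, Pulumi, or CloudFormation are implemented, and the `plan` function and server names are hypothetical.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare declared infrastructure with what exists and list the changes."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Desired state, as it would be declared in a file under version control.
desired = {
    "web-1": {"size": "small", "image": "ubuntu-22.04"},
    "web-2": {"size": "small", "image": "ubuntu-22.04"},
    "db-1": {"size": "large", "image": "ubuntu-22.04"},
}

# Actual state, as reported by a (hypothetical) cloud API.
actual = {
    "web-1": {"size": "small", "image": "ubuntu-20.04"},  # drifted from the declaration
    "db-1": {"size": "large", "image": "ubuntu-22.04"},
    "tmp-1": {"size": "small", "image": "ubuntu-22.04"},  # created by hand
}

changes = plan(desired, actual)
print(changes)  # {'create': ['web-2'], 'update': ['web-1'], 'delete': ['tmp-1']}
```

Once the environment matches the declaration, running the plan again yields no changes, which is what makes the process repeatable.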

**Why automate?** People make mistakes in repetitive manual work. As an illustration, if 1 in 10 manual deploys goes wrong, then 100 deploys per month means about 10 failed deploys. Automation makes the process **repeatable, fast, and predictable**.
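
The arithmetic above can be written out explicitly. The 10% error rate is the illustrative figure from the text, not a measured value, and the function names are hypothetical.

```python
def expected_failures(deploys: int, error_rate: float) -> float:
    """Average number of failed deploys, assuming independent errors."""
    return deploys * error_rate


def chance_of_any_failure(deploys: int, error_rate: float) -> float:
    """Probability that at least one deploy fails, assuming independent errors."""
    return 1.0 - (1.0 - error_rate) ** deploys


manual = expected_failures(100, 0.10)         # about 10 failed deploys per month
automated = expected_failures(100, 0.01)      # about 1 failed deploy per month
any_manual = chance_of_any_failure(100, 0.10)  # very close to 1

print(manual, automated, any_manual)
```

Under these assumptions a month without a single failed manual deploy is practically impossible, which is why lowering the per-deploy error rate matters more than trying to be careful.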

| Aspect | Manual Process | Automated |
|---|---|---|
| Deploy time | 30–60 minutes | 2–5 minutes |
| Errors | ~10% of deploys | <1% of deploys |
| Rollback | "Who remembers what changed?" | git revert + auto-deploy |
| New server | 2–3 days of setup | terraform apply (5 min) |
| Documentation | Goes stale instantly | Code = documentation |

What happens in a CI/CD pipeline if unit tests fail?

Metrics: DORA and SLx

**"What gets measured gets managed."** The line is often attributed to Peter Drucker and general management, but it's especially true in DevOps. How does a team know whether its DevOps is working? That's what **DORA metrics** are for - the gold standard of the industry.

**DORA** (DevOps Research and Assessment) - a research program, now part of Google, that has surveyed 30,000+ professionals since 2014. Their conclusion: **4 key metrics** predict the success of an IT organization.

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple times per day) | Once a week – once a month | Once a month – once every six months | Less than once every six months |
| Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 6 months |
| MTTR (Mean Time to Recovery) | < 1 hour | < 1 day | 1 day – 1 week | > 6 months |
| Change Failure Rate | 0–15% | 16–30% | 16–30% | 46–60% |
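
As a rough illustration of how the four metrics in the table might be computed from raw deployment records, here is a minimal sketch. The `Deploy` record and its field names are hypothetical and not part of any DORA tooling, and recovery time is approximated as the time from a failing deploy to restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deploy:  # hypothetical record format
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deploy], period_days: int) -> dict[str, Optional[float]]:
    """Compute the four DORA metrics over a reporting period."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.failed]
    restore_times = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "lead_time_hours": mean(hours(d.deployed_at - d.committed_at) for d in deploys),
        "mttr_hours": mean(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deploys),
    }


t0 = datetime(2024, 1, 1, 9, 0)
h, day = timedelta(hours=1), timedelta(days=1)
history = [
    Deploy(t0, t0 + 2 * h),
    Deploy(t0 + day, t0 + day + 4 * h, failed=True, restored_at=t0 + day + 5 * h),
    Deploy(t0 + 2 * day, t0 + 2 * day + 3 * h),
    Deploy(t0 + 3 * day, t0 + 3 * day + 3 * h),
]

metrics = dora_metrics(history, period_days=4)
print(metrics)
```

For this sample history the result is one deploy per day, a 3-hour average lead time, a 1-hour recovery time, and a 25% change failure rate.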

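Beyond DORA, DevOps engineers also track **SLI/SLO/SLA** - three tiers of reliability commitments. An **SLI** (Service Level Indicator) is what is actually measured, such as the share of successful requests. An **SLO** (Service Level Objective) is the internal target for that indicator. An **SLA** (Service Level Agreement) is the contractual promise to customers, with consequences if it is missed.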

**Monitoring vs Observability.** Monitoring answers the question **"what broke?"** (CPU 100%, disk full). Observability answers **"why did it break?"** - through three pillars: **logs**, **metrics**, **traces**.

Google's SRE practice uses an **error budget** - a budget for failures. If SLO = 99.9% uptime, then error budget = 0.1% = ~43 minutes of downtime in a 30-day month. While the budget remains, the team can ship features. Once it is exhausted, only stability fixes go out.
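
The error-budget arithmetic above is easy to check in code. A minimal sketch, assuming a 30-day month and an availability SLO expressed as a fraction; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


def can_ship_features(slo: float, downtime_minutes: float, period_days: int = 30) -> bool:
    """True while some error budget is left for the period."""
    return downtime_minutes < error_budget_minutes(slo, period_days)


budget = error_budget_minutes(0.999)  # about 43.2 minutes per 30 days
print(round(budget, 1))

print(can_ship_features(0.999, downtime_minutes=20))  # budget left: keep shipping
print(can_ship_features(0.999, downtime_minutes=50))  # budget spent: stability work only
```

Each extra nine shrinks the budget tenfold: under the same assumptions, a 99.99% SLO leaves only about 4.3 minutes per month.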

A team deploys once a month, Lead Time = 3 weeks, MTTR = 2 days. What DORA level are they?

Sharing: DevOps as a Culture

With Culture, Automation, and Measurement covered, we turn to **Sharing**, the last pillar of CALMS. This is where the most common misconception about DevOps lives: many people think DevOps is a set of tools. Docker, Kubernetes, Terraform. Buy it, install it - and DevOps is done.

**DevOps is NOT tools.** A team can run Kubernetes and still deploy once a quarter via a Jira ticket. Or deploy 50 times a day with simple bash scripts. Tools help, but culture comes first.

**Blameless culture** - no one is to blame for outages. The **system** that allowed the failure is at fault. If a developer accidentally took down production - the question isn't "why did they do that", but "why did the system allow this to happen?". No code review? No automated tests? No canary deploy?

**Shared responsibility** - "you build it, you run it". Developers are responsible for their code in production. This isn't punishment - it's **fast feedback**. The developer who wrote the code best understands how to fix it.

**Documentation as code** - documentation lives alongside code, in the same repository. README, ADRs (Architecture Decision Records), runbooks. If documentation is in a separate Wiki - it will be stale within a week. If it's in Git next to the code - it's updated together with it.

**Three pillars of knowledge sharing:** 1) Blameless postmortems after every incident. 2) Internal tech talks and demo days. 3) Documentation in Git alongside code. All three only work if people are **not afraid** to share mistakes.

**Myth:** DevOps is a separate role or team responsible for deployments and servers.

**Reality:** DevOps is a culture of shared responsibility where Dev and Ops work as one team with common goals.

Creating a separate "DevOps team" often creates another wall instead of tearing down the existing one. DevOps is a set of practices (CI/CD, IaC, monitoring, blameless culture) that the ENTIRE organization must adopt - not just one department.

A company hired a "DevOps engineer" and bought Kubernetes. Six months later, deploys are still once a month. What's the problem?

Key Takeaways

  • DevOps = **CALMS**: Culture, Automation, Lean, Measurement, Sharing - not a set of tools
  • The **"Wall of Confusion"** between Dev and Ops is broken down through shared goals and shared responsibility
  • **CI/CD pipeline** automates the path from commit to production, sharply reducing human error
  • **DORA metrics** - 4 indicators that predict the effectiveness of an IT organization
  • **Blameless culture** - systems fail, people aren't to blame. Find the root cause, not the culprit
  • The Flickr talk "10+ Deploys Per Day" is now demystified - it's not magic, it's culture + automation + metrics

Related Topics

DevOps is the foundation on which everything else in this course is built:

  • Linux for DevOps - the operating system behind an often-cited 96% of the world's top web servers
  • Networking for DevOps — Without understanding networking, production troubleshooting is impossible

Questions for Reflection

  • A new DevOps engineer joins a company where Dev and Ops don't talk to each other. Where should the transformation begin?
  • Which DORA metrics matter most for a startup? And for a bank?
  • Why is blameless culture so hard to implement? What makes it so difficult to stop looking for someone to blame?

Related Lessons

  • devops-02 — Linux fundamentals are required to operate the automation pipelines introduced here
  • st-01-feedback-loops — CI/CD pipeline is a feedback loop that shortens the Dev-Ops cycle from months to minutes
  • alg-01-big-o — DORA metrics measure process efficiency the same way Big-O measures algorithm efficiency
  • sd-01-intro — System Design decisions shape what DevOps must automate and monitor
  • sec-01 — DevSecOps integrates security into the CI/CD pipeline - Shift Left principle
  • dist-03-fallacies
  • os-19-containers