DevOps
Ansible and Configuration Management
2015. An AI lab brings up its first 100-machine GPU cluster. CUDA, ML frameworks, monitoring, and SSH keys have to land on every box in one day: one Ansible playbook does it. Without it, five engineers and a week. Today's GPU fleets are often run the same way: held together by YAML.
- **Red Hat bought Ansible for a reported USD 150M** in 2015 - a landmark DevOps acquisition at the time
- **NASA uses Ansible** to manage cloud infrastructure and web applications, per a published Red Hat case study
- **Cisco and Juniper** maintain official Ansible collections, which made it a common standard for network automation
- **ML infrastructure:** inference deployments on GPU fleets are often packaged as Ansible roles
Playbooks: YAML That Knows How to Admin
2012. Michael DeHaan looks at Puppet and Chef and wants to throw up. Both require an agent on every machine, both ship their own DSL, both are monsters. He writes Ansible in his evenings. The idea: ssh + Python on the remote side (already there anyway) + YAML describing 'what should be'. That is it. Three years later Red Hat buys the company for USD 150M.
A playbook is a YAML file with the *desired state*. Not 'run this command' but 'this package must be installed'. Not 'sed the config' but 'this line must be present in the file'. Declarative - the same core idea as Terraform, Kubernetes manifests, or a declarative config for a Stable Diffusion pipeline. You describe the endpoint, the engine figures out how to get there.
Each task invokes one module - a ready-made script, usually Python, that Ansible copies to the target and runs there. Modules do everything: install packages, copy files, restart services, hit AWS APIs. The community distribution ships 3000+ modules. One YAML can roll out LLM inference on 100 GPU machines in five minutes.
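A minimal sketch of such a playbook, assuming Debian/Ubuntu targets and a hypothetical inventory group called `webservers`. Each task names one module and states what must be true, not which command to run:

```yaml
# site.yml - illustrative only; the group name and config values are assumptions
- name: Web servers converge to the desired state
  hosts: webservers
  become: true
  tasks:
    - name: The nginx package must be installed
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: This line must be present in the config
      ansible.builtin.lineinfile:
        path: /etc/nginx/nginx.conf
        regexp: '^worker_processes'
        line: 'worker_processes auto;'

    - name: The service must be running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

It would be run with something like `ansible-playbook -i inventory site.yml`.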
**Agentless is the key advantage:** no agent has to be installed on remote hosts. SSH access and Python (already present on most modern Linux distributions) are enough. Same advantage as serverless over Kubernetes: fewer moving parts means fewer failure points.
What architecturally separates Ansible from Puppet and Chef?
Roles: Playbook Organization, LoRA-style
Once a playbook hits 500 lines, the pain starts. The same tasks copy across projects. Configs tangle with tasks, variables with handlers. A role is a standard folder layout that slices a playbook into reusable parts. Write nginx as a role once, apply it in every project.
The structure is strict: `tasks/main.yml` - what to do, `defaults/main.yml` - default variables, `templates/` - Jinja2 templates, `handlers/main.yml` - reactions to changes, `files/` - static assets. Same modular logic as LoRA adapters: the base model stays untouched, task-specific layers are bolted on top.
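A sketch of how those files could look for a hypothetical `nginx` role. Four separate files are shown in one listing, split by `---`; the variable name and the template file name are assumptions made for illustration:

```yaml
# roles/nginx/defaults/main.yml - default variables, easy to override
nginx_worker_connections: 1024
---
# roles/nginx/tasks/main.yml - what to do
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Render the config from templates/nginx.conf.j2
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: "0644"
  notify: Reload nginx
---
# roles/nginx/handlers/main.yml - runs only when a notifying task reported "changed"
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
---
# site.yml - apply the role to a host group
- name: Web tier
  hosts: webservers
  become: true
  roles:
    - role: nginx
```

The playbook shrinks to a list of roles, and the same `roles/nginx` folder can be dropped into another project unchanged.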
Ansible Galaxy is the public role registry. 30 thousand prebuilt roles: nginx, kafka, postgres, monitoring, kubernetes. Roles from well-known namespaces such as vmware and geerlingguy (Jeff Geerling) are the de facto standard. Hugging Face for DevOps: download somebody else's work, apply your own variables.
**Variable precedence:** Ansible has 22 levels of override, from defaults all the way to CLI extra-vars. Sounds insane, solves a real problem: env-specific override without copy-paste. Same logic as Hydra-config or OmegaConf for ML engineers - inherit and override in layers.
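A small sketch of three of those layers, using a hypothetical variable `nginx_worker_connections`. Two files are shown in one listing, split by `---`:

```yaml
# roles/nginx/defaults/main.yml - role default, near the bottom of the precedence list
nginx_worker_connections: 1024
---
# group_vars/production.yml - group variable, overrides the role default
# for every host in the "production" group
nginx_worker_connections: 4096

# Extra-vars on the command line sit at the top and override both:
#   ansible-playbook site.yml -e nginx_worker_connections=8192
```

Dev hosts keep 1024, production gets 4096, and a one-off run can force another value without touching any file.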
What is the main purpose of carving Ansible playbooks into roles?
Inventory: The World Map in YAML
Ansible must know which machines to visit. Inventory is the list of hosts grouped by role. The simplest format is an INI file with groups in brackets; the richer one is YAML with nested groups and variables. Modern projects keep inventory in Git alongside playbooks.
Dynamic inventory is inventory generated by a script or plugin at run time. The AWS plugin walks the EC2 API and returns instances, with tags turned into groups. The Kubernetes plugin turns pods into hosts. Manual lists disappear: the cloud itself tells you what it has. Same logic as service discovery in Kubernetes - the registry lives in one place, consumers poll it.
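A sketch of a dynamic inventory config for the `amazon.aws.aws_ec2` plugin. The region and the `Role` tag are assumptions, and the plugin needs the `amazon.aws` collection plus AWS credentials on the control node:

```yaml
# inventory/prod.aws_ec2.yml - the file name must end in aws_ec2.yml or aws_ec2.yaml
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1
filters:
  # Only running instances become hosts
  instance-state-name: running
keyed_groups:
  # An instance tagged Role=webserver lands in the group "role_webserver"
  - key: tags.Role
    prefix: role
hostnames:
  - private-ip-address
```

`ansible-inventory -i inventory/prod.aws_ec2.yml --graph` prints the groups the plugin discovered, which is a quick way to check the config before running a playbook against it.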
Groups nest. `[production:children]` unites `webservers`, `dbservers`, `cache`. Variables at the group level are inherited by every host in the group. At the host level they can be overridden. That gives environment-specific configs (dev/staging/prod) without duplication.
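The same nesting in the YAML inventory format, as a sketch with hypothetical hostnames and variables:

```yaml
# inventory/hosts.yml - hostnames and variable names are illustrative
all:
  children:
    production:
      vars:
        deploy_env: prod          # inherited by every host in the child groups
      children:
        webservers:
          vars:
            http_port: 80         # group-level value
          hosts:
            web1.example.com:
            web2.example.com:
              http_port: 8080     # host-level value overrides the group-level one
        dbservers:
          hosts:
            db1.example.com:
        cache:
          hosts:
            cache1.example.com:
```

Here `web1.example.com` listens on 80, `web2.example.com` on 8080, and all four hosts see `deploy_env: prod`.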
**ansible-vault:** sensitive variables (DB passwords, API keys) are stored AES-256 encrypted in the Git repo itself. The vault password lives separately (a vault password file or a CI secret). This solves an old pain: code in Git, secrets not readable in Git, but logically linked. Same solution as sealed-secrets in k8s or Mozilla SOPS.
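One common layout, sketched with hypothetical file and variable names: secrets go into a file that is encrypted as a whole, and a plain-text file next to it points at them. Two files are shown in one listing, split by `---`:

```yaml
# group_vars/production/vault.yml - shown decrypted; the value is a placeholder.
# Encrypt it before committing:
#   ansible-vault encrypt group_vars/production/vault.yml
vault_db_password: "change-me"
---
# group_vars/production/vars.yml - plain text, references the encrypted value
db_password: "{{ vault_db_password }}"

# At run time the vault password is supplied from outside the repo:
#   ansible-playbook site.yml --vault-password-file ~/.vault_pass
#   ansible-playbook site.yml --ask-vault-pass
```

The `vault_` prefix is only a naming convention; it makes it possible to grep for where a secret is used without decrypting anything.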
Why use dynamic inventory instead of a static list?
Idempotency: Run It Ten Times, Nothing Breaks
The most important property of Ansible: running the same playbook twice must change nothing if the state is already desired. That is idempotency. Without it infrastructure becomes a house of cards: every run could add, remove, or break something.
Every module knows how to check the current state before acting. `apt: name=nginx state=present` first asks whether the package is installed. If yes - skip. If not - install. It returns `changed: false` or `changed: true`. Loosely the same logic as DDIM with deterministic sampling in diffusion: identical inputs produce identical outputs on every run.
Commands via the `shell` and `command` modules are an anti-pattern precisely because of idempotency. They run unconditionally and report `changed` on every run. That is why `creates`/`removes` exist - guard conditions that skip the command when a file already exists or is already gone. And the rule: always use a specialised module when one exists. `file:` instead of `chmod`. `git:` instead of `git clone`. `lineinfile:` instead of `sed`.
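A sketch of both approaches side by side. All paths, the repository URL, and the installer script are hypothetical; the installer is assumed to create the marker file itself:

```yaml
- name: Guards and modules instead of bare shell
  hosts: all
  become: true
  tasks:
    - name: Run the installer only while the marker file is missing
      ansible.builtin.command:
        cmd: /opt/app/install.sh
        creates: /opt/app/.installed   # task is skipped once this file exists

    - name: file instead of chmod
      ansible.builtin.file:
        path: /opt/app/run.sh
        state: file
        mode: "0755"

    - name: git instead of git clone
      ansible.builtin.git:
        repo: https://example.com/app.git
        dest: /opt/app/src
        version: main

    - name: lineinfile instead of sed
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PermitRootLogin'
        line: 'PermitRootLogin no'
```

The last three tasks compare current state with desired state on every run, so they report `changed` only when something actually differed.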
**Idempotency check on CI:** professionals run the playbook twice in a row - the first run makes the changes, the second must report `changed=0` in the recap. If the second run still changes anything, the playbook has a non-idempotent task and must be fixed before merge. Same practice as Terraform: apply, then plan again and expect no changes.
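One way to wire this into CI, sketched as a GitHub Actions workflow. The CI system, the file names, and the idea of running the playbook against the runner itself are all assumptions; it also assumes `ansible-playbook` is available on the runner and that `site.yml` targets `hosts: all`:

```yaml
# .github/workflows/idempotence.yml - hypothetical workflow
name: idempotence
on: pull_request
jobs:
  converge-twice:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: First run applies the changes
        run: ansible-playbook -i localhost, -c local site.yml

      - name: Second run must be a no-op
        run: |
          set -o pipefail
          ansible-playbook -i localhost, -c local site.yml | tee second-run.log
          # Fail if any recap line shows a non-zero changed or failed count
          ! grep -Eq '(changed|failed)=[1-9]' second-run.log
```

Tools such as Molecule package the same two-run check, but the plain version above is enough to show the principle.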
Misconception: idempotency just means 'does not crash on a second run'.
In fact: idempotency means 'the second run reports zero changes because the state has already converged to the desired one'.
Not crashing is a weak bar. Stable state requires every module to check the current state and skip work when no change is needed. Then the playbook becomes the reference state and any drift is caught with `--check` in CI. Without it, infrastructure is a pile of hacks that only works on the first run.
Which playbook execution correctly reflects idempotency?
Related topics
Where Ansible leads next:
- Docker and containerization — Ansible often deploys Docker hosts and orchestrates docker-compose deployments
- CI/CD pipelines — Ansible as one CI step - apply changes after successful tests
- Linux internals — Ansible lives on ssh, systemd, apt/yum - Linux is the foundation of every task
- Observability stack — Rolling out node_exporter, fluent-bit, OpenTelemetry collector - the classic Ansible scenario
Key ideas
- Ansible is agentless: ssh plus Python on the target, no daemons
- Playbook is declarative YAML with the desired state, not imperative commands
- Roles are standard reuse structure - the LoRA adapters of infrastructure
- Inventory is the world map with groups and variables, dynamic from cloud APIs
- Idempotency means: run ten times, second and later runs report zero changes
Questions for reflection
- When does Ansible fit better than Terraform for the same task?
- Where to draw the line between Ansible roles and Docker images in your own infra?
- How to make tasks idempotent when they are intrinsically mutating (DB migrations, for instance)?
Related lessons
- devops-01 — Basic Linux is needed - Ansible lives on ssh and shell commands
- devops-04 — Docker hosts are often provisioned via Ansible playbooks
- devops-08 — CI/CD pipelines run Ansible as one of their steps
- os-19-containers — Containers solve the same class of problems through isolation rather than configuration
- sd-22-observability — Rolling out node_exporter, fluent-bit on a fleet is a textbook Ansible scenario
- aie-44-ai-backend-node — Deploying LLM inference on a GPU fleet often ships as an Ansible role
- os-21-linux-internals