Terraform: Advanced

In 2017 a GitLab engineer accidentally ran `rm -rf` on a production database after mixing up terminal windows. A year later HashiCorp noticed that 30% of incidents at large Terraform customers were similar stories: 'wrong workspace', 'wrong state', 'wrong environment variable'. That is how Terraform Enterprise and Sentinel were born: control over WHAT can be deployed, FROM WHERE, and BY WHOM - before apply does damage.

  • **Coinbase**: 200+ engineers, multi-region infrastructure - Terragrunt + state splitting by team, each team owns its own state file
  • **HashiCorp Sentinel in production**: GitHub Enterprise uses Sentinel to block public S3 buckets and enforce tags on all resources
  • **Lyft in 2019**: migration from one 8000-resource state to 60+ isolated states - apply time dropped from 45 minutes to 3, blast radius shrank by 99%

Workspaces: one codebase, many environments

In 2021 a GitLab engineer accidentally rolled out a production-database migration to a staging instance - the wrong Terraform workspace was selected. A **workspace** in Terraform is a named instance of a state file for a single configuration. The same `main.tf` can manage dev, staging, and prod if a workspace is created for each: `terraform workspace new prod`. The `terraform.workspace` value (written `${terraform.workspace}` inside strings) is available inside the config and lets resource names, instance types, and pricing tiers vary per environment.
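A minimal sketch of how one configuration can vary by workspace. The map keys, instance sizes, the `ami_id` variable, and the `app` resource name are illustrative assumptions, not values from the lesson.

```hcl
locals {
  # Hypothetical per-workspace settings; keys must match workspace names.
  instance_type_by_env = {
    dev     = "t3.micro"
    staging = "t3.small"
    prod    = "m5.large"
  }

  # Any workspace not listed above (e.g. a feature branch) gets the small size.
  instance_type = lookup(local.instance_type_by_env, terraform.workspace, "t3.micro")
}

variable "ami_id" {
  type        = string
  description = "AMI to launch (placeholder input for this sketch)"
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type

  tags = {
    # String interpolation form of the same value
    Name        = "app-${terraform.workspace}"
    Environment = terraform.workspace
  }
}
```

After `terraform workspace select prod`, the same code plans an `m5.large` instance tagged `app-prod`. Nothing in the code itself guards against selecting the wrong workspace.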

Workspaces work well for feature branches (short-lived isolated stacks) but are NOT recommended for long-lived environment separation (dev/staging/prod). Reason: all workspaces share one backend and one set of credentials, so a single mistaken `terraform workspace select` can point an apply at production. Best practice for prod is separate directories or repositories, each with its own backend (Terragrunt or simply `envs/prod/main.tf`, `envs/dev/main.tf`).

Why does HashiCorp officially NOT recommend workspaces for separating prod/staging/dev?

Remote State and locking

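A local `terraform.tfstate` is death for a team. If two engineers run `terraform apply` at the same time, one write overwrites the other and the state no longer matches the real infrastructure. A **remote backend** stores state in S3/GCS/Azure Blob and locks concurrent operations (a DynamoDB table for S3, native locking for GCS, blob leases for Azure). Each apply acquires the lock and releases it when done; a second run started in the meantime does not queue but fails with a lock error unless `-lock-timeout` tells it to keep retrying.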

State contains **secrets in plaintext** - database passwords, API keys - along with a full map of the infrastructure, including private IPs. Therefore: (1) backend encryption (S3 SSE-KMS), (2) IAM policies with least-privilege access, (3) S3 bucket versioning for rollback after state corruption, (4) **never** commit tfstate to git. A leaked S3 bucket with state gives an attacker a dump of the entire infrastructure.
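The backend described above could be declared roughly as follows. This is a sketch: the bucket, key path, region, KMS key ARN, and table name are placeholders. Bucket versioning and IAM policies are configured on the AWS side, not in this block.

```hcl
terraform {
  backend "s3" {
    bucket = "example-tfstate"            # hypothetical bucket; enable versioning on it
    key    = "network/terraform.tfstate"  # path of this stack's state inside the bucket
    region = "us-east-1"

    # Server-side encryption of the state object with a KMS key (placeholder ARN)
    encrypt    = true
    kms_key_id = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

    # DynamoDB table used for locking; it needs a string partition key named LockID
    dynamodb_table = "example-tf-locks"
  }
}
```

Running `terraform init` in a directory that still holds a local state file should offer to migrate it to the new backend. Newer Terraform releases also document an S3-native lock option, so check the backend documentation for the version in use.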

A team of 10 engineers migrates from local state to an S3 backend. What is the minimum set of components needed?

Sentinel: Policy-as-Code

In 2019 a financial company burned $30k overnight - an engineer accidentally spun up 50 GPU p3.16xlarge instances for an experiment and forgot to shut them down. **Sentinel** by HashiCorp (part of Terraform Enterprise/Cloud) is a policy-as-code framework that checks the Terraform plan before apply. Policies are written in a readable DSL and can be `advisory` (warning), `soft-mandatory` (requires manual override), or `hard-mandatory` (blocks apply).

An open-source alternative is **OPA (Open Policy Agent)** with the Rego policy language. OPA is not tied to HashiCorp and works with any JSON input (Kubernetes admission control, API gateways, Terraform). For Terraform it is typically used via `conftest`, which evaluates the plan exported as JSON (`terraform show -json`). Another option, independent of OPA, is **terraform-compliance**, a standalone tool with BDD-style `Given/When/Then` rules.

Which Sentinel enforcement level fits the rule 'all S3 buckets must have encryption enabled'?

Blast radius and how to shrink it

**Blast radius** is the volume of resources a `terraform destroy` or accidental apply can wipe out in one action. A monolithic state with 5000 resources = catastrophic blast radius (one wrong refactor can take down everything). The cure is **state splitting**: separating infrastructure into isolated state files by zone of responsibility (network, security, data, app), plus composition through the `terraform_remote_state` data source for shared outputs.
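A sketch of composition between split states, assuming a separately applied network stack that stores its state at the key shown and declares an output named `vpc_id`. The bucket, key, and resource names are hypothetical.

```hcl
# In the app stack: read outputs published by the network stack's state.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-tfstate"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  name = "app"

  # Only root-module outputs of the network stack are visible here.
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

The app stack can read the network stack's outputs but cannot modify or destroy its resources. That one-way boundary is what limits the blast radius.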

**Terragrunt** (a Terraform wrapper from Gruntwork) automates state splitting via `terragrunt.hcl` inside each environment folder. Each folder gets its own backend and variable set, and DRY is achieved via `include` and `dependency` blocks. Large teams (Lyft, Coinbase) run on Terragrunt specifically because of built-in state splitting and dependency graphs between modules.
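A per-folder `terragrunt.hcl` might look like the sketch below. The directory layout, the module path, and a shared `root.hcl` holding the `remote_state` settings are assumptions made for illustration.

```hcl
# envs/prod/app/terragrunt.hcl (hypothetical path)

# Inherit shared settings, such as the remote_state block, from a parent root.hcl.
include "root" {
  path = find_in_parent_folders("root.hcl")
}

# The Terraform module this folder deploys (placeholder path).
terraform {
  source = "../../../modules/app"
}

# Declare that this stack depends on the sibling network stack.
dependency "network" {
  config_path = "../network"
}

# Pass the network stack's outputs in as module inputs.
inputs = {
  vpc_id = dependency.network.outputs.vpc_id
}
```

Each folder becomes its own state, and the `dependency` blocks give Terragrunt the ordering it needs when several stacks are applied together.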

Myth: the bigger a single tfstate is, the simpler - everything in one place, all dependencies explicit.

Reality: a huge state means a huge blast radius, slow plan/apply, and a high risk of concurrent conflicts. State should be split along zone-of-responsibility boundaries (network, security, data, app).

By default Terraform refreshes every resource in state on each plan (even when only one changes). With 5000 resources that can mean 20+ minutes per plan, and any `destroy` without `-target` becomes an existential threat. State splitting has been an industry standard since Terraform 0.12.

A team maintains a monolithic tfstate with 3000 resources and every `terraform plan` takes 20 minutes. Which option is NOT a correct first step to reduce blast radius?

Key ideas

  • **Workspaces** - several named states for one config, useful for feature branches, BUT NOT for prod/staging separation - human error causes disasters
  • **Remote state** - S3+DynamoDB (or equivalent) provides locking against concurrent runs, versioning, and encryption at rest; local tfstate inside a team is unacceptable
  • **Sentinel/OPA** - policy-as-code at plan time: blocks expensive or unsafe resources before apply, enforcement levels from advisory to hard-mandatory
  • **Blast radius** - state splitting by domain + terraform_remote_state for composition; Terragrunt automates this with DRY config
  • A monolithic state = slow plan/apply + catastrophic blast radius - the industry standard is splitting

Related topics

Advanced Terraform fits inside a wider DevOps picture:

  • GitOps and ArgoCD — Same 'state in git as source of truth' approach for Kubernetes - and the same blast-radius problems
  • Secrets management (Vault) — Vault stores dynamic secrets outside Terraform state, reducing leak risk
  • Kubernetes operators — An alternative IaC model - declarative state in etcd instead of tfstate, control loop instead of apply

Questions for reflection

  • Is the current team state monolithic or split? If monolithic - what is the real blast radius of a single mistaken `terraform apply`?
  • Does CI run a policy-as-code check (Sentinel/OPA/checkov) on plans? If not - what three rules would land first?
  • If the S3 bucket holding tfstate is compromised tomorrow, what would the attacker learn about the infrastructure? Which data should move into Vault?

Related lessons

  • devops-12 — Terraform basics - the notorious state and resource graph
  • devops-14 — Ansible manages configuration on top of Terraform infrastructure
  • devops-11 — GitOps with ArgoCD - remote state as single source of truth
  • ds-05-replication — Remote state lock - same mutual exclusion as in distributed systems
  • ds-12-service-discovery