Reliability Engineering at Scale
Amazon Prime Day 2018: technical problems in the first 2 hours cost an estimated $72M in lost sales. After that incident, Amazon accelerated Cell Architecture adoption. Today Amazon deploys to hundreds of cells - a bug that reaches a single cell affects roughly 0.5% of users, not 100%, and is caught before it propagates further.
- **Netflix** has run Chaos Monkey every business day since 2011 - terminating random production EC2 instances; incidents became so rare that new engineers often do not notice Chaos Monkey is active.
- **Stripe** has used Cell Architecture since 2016 - when one cell (15% of merchants) has an incident, the other 85% continue normally; rolling deployments out cell by cell confines a bad release to the first cell it reaches.
- **Google** has run DiRT (Disaster Recovery Testing) since 2006 and described it publicly in 2012 - annual exercises where teams simulate losing an entire datacenter and verify that failover actually works as documented.
Cell Architecture
Cell Architecture partitions the system into independent cells, each serving a subset of users. A failure in one cell affects only its users. Deployments roll out cell-by-cell, limiting deployment blast radius.
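The cell-by-cell rollout described above can be sketched as a loop that stops at the first unhealthy cell. This is a minimal illustration, not any company's actual deployment tooling: `rollout`, `deploy`, and `healthy` are hypothetical names, and a real pipeline would also wait for a bake period and roll the failed cell back.

```python
from typing import Callable, List


def rollout(cells: List[str],
            deploy: Callable[[str], None],
            healthy: Callable[[str], bool]) -> List[str]:
    """Deploy to one cell at a time and stop at the first unhealthy cell.

    Returns the cells that received the new version.
    """
    deployed = []
    for cell in cells:
        deploy(cell)
        deployed.append(cell)
        if not healthy(cell):
            # Halt here: every remaining cell keeps running the old version.
            break
    return deployed


if __name__ == "__main__":
    cells = [f"cell-{i}" for i in range(1, 6)]
    # Hypothetical health check: pretend the new version breaks cell-2.
    touched = rollout(cells, deploy=lambda c: None, healthy=lambda c: c != "cell-2")
    print(touched)  # ['cell-1', 'cell-2']
```

With five cells and a failure in the second, only two cells ever see the bad release; the other three are untouched.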
Cell size trade-off: smaller cells limit blast radius but increase operational overhead. Stripe and Shopify use cells covering 1-5% of users. Cell boundaries are invisible to users - routing happens at the infrastructure layer.
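One way a routing layer could assign users to cells is a deterministic hash of the user ID. The sketch below assumes a fixed number of cells and a hypothetical `cell_for_user` helper; it is an illustration of the idea, not a description of how Stripe or Shopify route traffic.

```python
import hashlib


def cell_for_user(user_id: str, num_cells: int) -> int:
    """Map a user ID to a cell index in [0, num_cells), deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_cells


if __name__ == "__main__":
    # The same user always lands in the same cell.
    print(cell_for_user("user-42", 20) == cell_for_user("user-42", 20))  # True
```

Plain modulo hashing reassigns most users when `num_cells` changes, so a production router would more likely keep an explicit user-to-cell mapping table or use consistent hashing.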
What is the main advantage of Cell Architecture over standard horizontal scaling?
Blast Radius Reduction
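Blast radius is the scope of damage when a component fails. Reduction techniques: circuit breakers (stop calling a dependency that keeps failing, so its failures do not propagate to callers), bulkheads (give each dependency its own resource pool so one cannot exhaust the others), and graceful degradation (return partial results instead of an error).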
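A circuit breaker can be sketched in a few lines. The version below is a simplified, single-threaded illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and assumed defaults (3 failures, 30-second reset); production libraries add thread safety, metrics, and finer half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("dependency is marked unhealthy")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call as `breaker.call(fetch, arg)` and treat `CircuitOpenError` as a signal to use a fallback.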
Graceful degradation is the contract: 'the core product works even when secondary features fail'. Amazon's product pages show 'reviews temporarily unavailable' rather than 500 errors when the review service is degraded.
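The product-page example can be sketched as a render function that treats reviews as optional. The names `render_product_page` and `fetch_reviews` and the page structure are hypothetical; the point is only that a failure in the secondary feature is caught at the boundary and replaced with a notice.

```python
def render_product_page(product, fetch_reviews):
    """Build the page; product data is required, reviews are optional."""
    page = {"title": product["title"], "price": product["price"]}
    try:
        page["reviews"] = fetch_reviews(product["id"])
    except Exception:
        # The secondary feature failed: degrade instead of failing the page.
        page["reviews"] = None
        page["reviews_notice"] = "Reviews temporarily unavailable"
    return page


if __name__ == "__main__":
    def broken_reviews(product_id):
        raise TimeoutError("review service did not respond")

    page = render_product_page({"id": 1, "title": "Kettle", "price": 29.0}, broken_reviews)
    print(page["reviews_notice"])  # Reviews temporarily unavailable
```

The broad `except Exception` is deliberate in this sketch; a real service would typically catch only timeout and connection errors and log the failure.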
What is the bulkhead pattern in the context of microservices?
Chaos Engineering
Chaos Engineering intentionally injects failures into production (or production-like) systems to verify resilience before real failures occur. Netflix's Chaos Monkey randomly terminates EC2 instances in production every business day.
Chaos Engineering only works in organizations with blameless culture. If engineers fear punishment for incidents, they will resist chaos experiments. Netflix's Chaos Monkey works because Netflix built blameless culture first.
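The core of a Chaos Monkey style experiment is small: pick one eligible instance at random, and only at times when engineers are available to respond. The sketch below is not Netflix's implementation; `pick_victim`, the 09:00-17:00 window, and the opt-out set are assumptions for illustration, and actually terminating the instance is left to the caller.

```python
import random
from datetime import datetime


def pick_victim(instances, now, rng=random, opted_out=frozenset()):
    """Return one instance ID to terminate, or None if no experiment should run."""
    # Only run on weekdays during working hours (assumed here to be 09:00-17:00).
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return None
    # Services can opt out, for example while they are already in an incident.
    eligible = [i for i in instances if i not in opted_out]
    if not eligible:
        return None
    return rng.choice(eligible)


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]
    victim = pick_victim(fleet, datetime.now(), opted_out={"i-ccc"})
    print(victim)  # an eligible instance ID, or None outside working hours
```

For a team's first experiment, starting with a single non-critical service and a short opt-out list keeps the blast radius of the experiment itself small.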
Why run chaos experiments in production rather than only in staging?
Multi-Region Deployment
Multi-region deployment runs the system in multiple AWS/GCP/Azure regions for geographical redundancy. Active-Passive: one region serves all traffic while the other stays on standby. Active-Active: both regions serve traffic simultaneously, which requires conflict resolution for concurrent writes.
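The two modes differ mainly in what has to be decided at runtime: active-passive needs a failover choice, active-active needs a rule for conflicting writes. The sketch below shows one simple form of each; the function names and the `(timestamp, region, value)` record shape are hypothetical, and last-write-wins is only one of several possible conflict resolution strategies.

```python
def route_active_passive(regions, healthy):
    """Send all traffic to the first healthy region in priority order."""
    for region in regions:
        if healthy(region):
            return region
    raise RuntimeError("no healthy region available")


def merge_last_write_wins(a, b):
    """Resolve an active-active write conflict by keeping the newer record.

    Each record is a (timestamp, region, value) tuple. Ties on timestamp are
    broken by region name so that every region picks the same winner.
    """
    return max(a, b, key=lambda rec: (rec[0], rec[1]))


if __name__ == "__main__":
    # The primary is down, so traffic fails over to the standby.
    print(route_active_passive(["us-east-1", "eu-west-1"],
                               healthy=lambda r: r != "us-east-1"))  # eu-west-1
    # Two regions wrote the same key; the later timestamp wins.
    print(merge_last_write_wins((10, "eu-west-1", "old"),
                                (12, "us-east-1", "new")))  # (12, 'us-east-1', 'new')
```

Last-write-wins silently discards the losing write and depends on reasonably synchronized clocks, which is why some systems use merge functions or CRDTs instead.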
Multi-region removes the single region as a Single Point of Failure, but it introduces write latency (synchronous replication between regions adds roughly 50-150 ms per write) and operational complexity (each region needs its own monitoring, on-call, and runbooks).
Common misconception: multi-region automatically means 99.99% availability.
Multi-region eliminates single-region outages but not all causes of downtime: application bugs, database corruption, DNS failures, and certificate expiry can all affect all regions simultaneously.
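A short availability calculation shows why a shared dependency caps the benefit of adding regions. The figures below are illustrative assumptions (99.9% for each region and for the shared service), and the formula assumes failures are independent, which real outages often are not.

```python
def parallel(*availabilities):
    """Availability of redundant components: down only if all are down at once."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def serial(*availabilities):
    """Availability of a chain: up only if every component is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Two independent regions at 99.9% each: about 0.999999 combined.
regions = parallel(0.999, 0.999)

# Both regions depend on one shared service at 99.9% (DNS, certificates, config store):
# about 0.998999 overall, which is below the 99.99% target.
system = serial(regions, 0.999)

if __name__ == "__main__":
    print(f"regions only: {regions:.6f}")
    print(f"with shared dependency: {system:.6f}")
```

Under these assumptions the system as a whole is no more available than its weakest shared component, regardless of how many regions are added.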
AWS us-east-1 S3 outage (February 2017): an S3 failure in us-east-1 caused cascading failures in services that used S3 for configuration storage - including workloads in other regions that depended on a bucket hosted in us-east-1. Multi-region did not help them because the failure was in a shared dependency.
What does RPO = 1 minute mean in the context of multi-region disaster recovery?
Summary
- **Cell Architecture** isolates users in independent cells - blast radius is limited to N% of users; deployments roll out cell-by-cell for safe rollout.
- **Blast Radius Reduction** via circuit breakers, bulkheads, and graceful degradation - component failures do not cascade; the system degrades in a controlled way.
- **Chaos Engineering + Multi-Region** - resilience verification via intentional failures in production; for cross-region failover, Aurora Global Database typically provides an RPO (maximum window of lost data) of about 1 second and an RTO (time to restore service) of under 1 minute.
Related Topics
Reliability at scale integrates incident management and edge resilience:
- On-Call and Incident Management — Chaos experiments require a clear incident response process - chaos without runbooks is dangerous.
- CDN and Edge Computing — Cloudflare geo-routing and Load Balancing provide the first level of multi-region failover without DNS TTL propagation delay.
Questions for Reflection
- At what company size and traffic volume does Cell Architecture start justifying its operational complexity?
- How do you run the first chaos experiment in a team that has never done this and is afraid?
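- Aurora Global Database replicates across regions asynchronously, so a regional failure can lose the most recent writes; synchronous cross-region replication would bring RPO to zero but add latency to every write. How do you balance RPO and write performance?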