Cloud Computing

AWS Well-Architected Framework

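The AWS Well-Architected Framework (WAF in this lesson, not to be confused with the AWS WAF firewall) is a set of pillars through which Amazon examines architecture. There are six pillars; this lesson covers five of them, and the sixth, Sustainability, was added in 2021. Yet most companies violate at least one pillar critically. The typical startup overpays for cloud by 40% while having no real incident runbooks. WAF is a checklist of 200+ questions that prevents 'we never thought of that'.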

  • Netflix runs a Well-Architected Review before every major launch - Chaos Engineering became part of the Reliability pillar in their internal process
  • Spotify reduced cloud costs by USD 23M per year after a WAF Cost Optimization review - purely through rightsizing and Savings Plans
  • Capital One completely rebuilt its Security pillar after the 2019 breach: Zero Trust, PrivateLink, least-privilege principle for every Lambda

Reliability Pillar: Systems That Work at 3 AM

Reliability is the ability of a system to recover from failures and continue operating. Not 'never break' - that is impossible. Rather 'break predictably and recover automatically'. Netflix's Chaos Monkey deliberately terminates instances in production to confirm that recovery actually works.

Four areas of the Reliability pillar: foundations (Service Quotas, network topology), workload architecture (circuit breakers, bulkhead pattern, timeouts), change management (blue/green, canary deployments), failure management (backups, fault isolation, recovery testing). WAF requires designing for failure, not against it.
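The circuit breaker mentioned above can be sketched in a few lines. This is a minimal illustration, not a production library: the class name `CircuitBreaker`, its parameters, and the injectable `clock` are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from piling up on a dependency that is already down, which is the 'break predictably' half of the Reliability definition.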

Recovery objectives: RTO (Recovery Time Objective) - the maximum acceptable time the system may be unavailable after an incident. RPO (Recovery Point Objective) - the maximum acceptable amount of data loss, measured as the time between the last recoverable copy and the incident. For fintech: RTO under 1 minute, RPO = 0. For a blog: RTO = 24 hours, RPO = 1 hour. Architecture cost rises sharply as these numbers approach zero.
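A rough way to sanity-check a recovery plan against these objectives is sketched below. The `RecoveryPlan` fields and the simplification that worst-case data loss equals the backup interval are assumptions of this example, not part of WAF.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    backup_interval: timedelta  # time between consecutive backups or snapshots
    detection_time: timedelta   # time until the failure is noticed
    restore_time: timedelta     # time to restore and bring the service back


def meets_objectives(plan, rto, rpo):
    # Simplification: everything written since the last backup is lost.
    worst_case_data_loss = plan.backup_interval
    # Downtime is the time to notice the failure plus the time to recover.
    worst_case_downtime = plan.detection_time + plan.restore_time
    return {
        "rpo_ok": worst_case_data_loss <= rpo,
        "rto_ok": worst_case_downtime <= rto,
    }


# Hypothetical blog setup: hourly backups, 30 minutes to notice, 4 hours to restore.
blog = RecoveryPlan(timedelta(hours=1), timedelta(minutes=30), timedelta(hours=4))
print(meets_objectives(blog, rto=timedelta(hours=24), rpo=timedelta(hours=1)))
```

The same plan fails both checks against the fintech targets from the text (RTO under 1 minute, RPO = 0), which is why those targets call for synchronous replication and automatic failover rather than backups.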

Service Quotas are a hidden Reliability threat. Typical AWS defaults: 5 VPCs per region, 5 security groups per network interface, and On-Demand EC2 limits counted in vCPUs per instance family rather than in instances. During rapid growth, quotas run out without warning, and quota increase requests can take 1-5 days. A Service Quotas dashboard plus CloudWatch alarms on approaching limits is essential.
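The alarm logic itself is simple, as the sketch below suggests. It works on plain dictionaries; in practice the numbers would come from the Service Quotas and CloudWatch APIs. The function name, the quota keys, and the 80% threshold are illustrative assumptions.

```python
def quotas_near_limit(usage, limits, threshold=0.8):
    """Return (quota_name, utilization) pairs at or above the threshold, highest first."""
    flagged = []
    for name, limit in limits.items():
        if limit <= 0:
            continue  # skip quotas with no meaningful limit
        utilization = usage.get(name, 0) / limit
        if utilization >= threshold:
            flagged.append((name, round(utilization, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative numbers only: both limits are 5 by default in a fresh account.
limits = {"vpcs_per_region": 5, "elastic_ips_per_region": 5}
usage = {"vpcs_per_region": 4, "elastic_ips_per_region": 2}
print(quotas_near_limit(usage, limits))
```

Alerting at 80% rather than 100% leaves room for the days a quota increase request may take.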

What is RTO?

Performance Efficiency: Do More With the Same

Performance Efficiency means choosing the right resource type for the task and adapting as load changes. Not maximum performance - optimal performance. A Graviton3 instance typically costs around 20% less than an equivalent Intel x86 instance for most web workloads.

Democratization of advanced technologies: instead of a custom ML pipeline - SageMaker. Instead of a homemade search engine - OpenSearch. Instead of a custom video transcoding pipeline - MediaConvert. AWS invested billions in these services - using them instead of DIY increases performance at lower cost.

Mechanical sympathy - understanding hardware for the right choice. c5 (compute-optimized) for ML inference, r5 (memory-optimized) for self-hosted Redis, i3 (storage-optimized) for Elasticsearch, p3 (GPU) for model training. Wrong instance type means 2-5x overspend at the same load.

Which EC2 type is best for an in-memory database (Redis)?

Cost Optimization: Pay Only for What Is Needed

The average company overpays for AWS by 35-45%, according to data from CloudHealth and Flexera. Causes: over-provisioning ('just in case'), forgotten resources (dev environments running on weekends), wrong pricing model (On-Demand instead of Reserved/Spot). The WAF Cost pillar systematizes the approach.

FinOps - a cultural shift: every team sees its own costs through Cost Allocation Tags. The tag team=checkout on all checkout team resources. A Grafana dashboard with the Cost Explorer API. Weekly anomaly reviews. Amazon internally works exactly this way - every service pays for infrastructure as an internal customer.
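Per-team visibility boils down to grouping billing line items by tag. The sketch below assumes a hypothetical line-item shape (`cost_usd`, `tags`); real data would come from Cost Explorer or a billing export with a different schema.

```python
from collections import defaultdict


def cost_by_team(line_items, tag_key="team"):
    """Sum cost per value of the given tag; resources without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get(tag_key, "untagged")
        totals[team] += item["cost_usd"]
    return dict(totals)


# Hypothetical line items for illustration.
items = [
    {"resource": "i-checkout-1", "cost_usd": 120.0, "tags": {"team": "checkout"}},
    {"resource": "rds-checkout", "cost_usd": 30.5, "tags": {"team": "checkout"}},
    {"resource": "i-forgotten", "cost_usd": 10.0, "tags": {}},
]
print(cost_by_team(items))
```

The 'untagged' bucket is worth watching in the weekly review: it is usually where forgotten resources hide.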

Spot Instances for ML training: p3.8xlarge On-Demand - about USD 12/hour. Spot - about USD 3.6/hour (Spot prices fluctuate). Training a small GPT-style model for 100 instance-hours: USD 1200 vs USD 360. A USD 840 difference. Checkpointing every 30 minutes is required for recovery after interruption. This is standard in research organizations.
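The arithmetic above, plus the cost of redoing work after interruptions, can be sketched as follows. The rates are the approximate figures from the text; the function names and the worst-case assumption (each interruption loses one full checkpoint interval) are illustrative.

```python
def training_cost(instance_hours, hourly_rate):
    """Cost of an uninterrupted run."""
    return instance_hours * hourly_rate


def spot_cost_with_interruptions(instance_hours, spot_rate, interruptions,
                                 checkpoint_interval_hours):
    """Worst case: every interruption loses all work since the last checkpoint."""
    rework_hours = interruptions * checkpoint_interval_hours
    return (instance_hours + rework_hours) * spot_rate


on_demand = training_cost(100, 12.0)  # 100 instance-hours at about USD 12/hour
spot = training_cost(100, 3.6)        # same run at about USD 3.6/hour
# Five interruptions with 30-minute checkpoints add at most 2.5 hours of rework.
spot_worst = spot_cost_with_interruptions(100, 3.6, interruptions=5,
                                          checkpoint_interval_hours=0.5)
print(round(on_demand), round(spot), round(spot_worst))
```

Even with five interruptions the Spot run stays at roughly USD 369, so the rework overhead is small next to the USD 840 saving. Without checkpoints a single interruption could cost the whole run.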

What are Spot Instances?

Security Pillar: Defence in Depth

Security in WAF is governed by the Shared Responsibility Model: AWS is responsible for security of the cloud (physical data center, hypervisor, managed service patching), the customer is responsible for security in the cloud (IAM, encryption, network config, application). Confusion about this boundary is a frequent source of breaches.

Defence in depth: multiple layers. Perimeter (AWS WAF, the web application firewall, plus Shield and CloudFront), Network (Security Groups, NACLs, VPC Flow Logs), Identity (IAM, MFA, SCPs), Data (KMS encryption at rest, TLS in transit), Application (code scanning, secret management). An attacker must break through every layer.

Zero Trust in AWS: being in a VPC does not mean trust. Every service authenticates through an IAM Role, every request is verified. AWS PrivateLink removes VPC peering complexity - services communicate through private endpoints without exposing VPC CIDRs. Service Mesh (AWS App Mesh) provides mTLS between microservices.

What is AWS NOT responsible for under the Shared Responsibility Model?

Operational Excellence: Automate Everything Possible

Operational Excellence is the ability to run and monitor systems to deliver business value and continuously improve processes. Amazon internally runs on the principle: if something is done manually twice - automate it. If something breaks without notification - add an alarm.

A Runbook is a documented procedure for operational tasks. A Playbook is a set of steps for responding to incidents. AWS Systems Manager Automation turns them into code: pressing a button in the console or making an API call triggers SSM to run 20 checks and fixes automatically. A human only approves critical steps.
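The idea of a runbook as code with a human approval gate can be illustrated in plain Python. This is not the SSM Automation document format; `Step`, `run_runbook`, and the step names are hypothetical and only show the control flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], str]    # performs the check or fix and returns a short result
    needs_approval: bool = False  # critical steps wait for a human decision


def run_runbook(steps, approve):
    """Run steps in order; stop at the first critical step that is not approved."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append((step.name, "skipped: not approved"))
            break  # later steps may depend on this one, so do not continue
        log.append((step.name, step.action()))
    return log


# Hypothetical runbook: the actions are stand-ins for real checks and fixes.
steps = [
    Step("check disk usage", lambda: "ok"),
    Step("restart service", lambda: "restarted", needs_approval=True),
    Step("verify health", lambda: "healthy"),
]
print(run_runbook(steps, approve=lambda name: True))
```

Returning a log of every step gives the audit trail that a manual procedure rarely leaves behind.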

Observability trinity: Metrics (CloudWatch, Prometheus), Logs (CloudWatch Logs, OpenSearch), Traces (X-Ray, OpenTelemetry). WAF Operational Excellence requires all three. A metric alone without a trace does not explain why p99 latency increased. A trace without logs does not show the error details.
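What ties the three signals together is a shared identifier. The sketch below emits a log record and a metric record that carry the same `trace_id`; it uses only the standard library and is not the X-Ray or OpenTelemetry API. The function name and record fields are assumptions for illustration.

```python
import json
import time
import uuid


def handle_request(work, trace_id=None, emit=print):
    """Run work() and emit JSON records that share one trace_id."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        return work()
    except Exception as exc:
        status = "error"
        # Log: the error details that a latency metric alone cannot show.
        emit(json.dumps({"type": "log", "trace_id": trace_id,
                         "level": "ERROR", "message": str(exc)}))
        raise
    finally:
        # Metric: request duration, tagged with the same trace_id.
        duration_ms = (time.perf_counter() - start) * 1000
        emit(json.dumps({"type": "metric", "trace_id": trace_id,
                         "name": "request_duration_ms",
                         "value": round(duration_ms, 3), "status": status}))
```

With the identifier on every record, a p99 latency spike in the metrics can be followed to the slow requests and from there to their error logs.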

Misconception: a Well-Architected Review is a one-time check before launching to production.

In practice, WAR is a continuous process; AWS recommends repeating it regularly, for example every quarter, for each workload.

Architecture changes: load grows, features are added, new AWS services appear. A workload that was optimal six months ago may not be optimal today. The AWS Well-Architected Tool in the console lets teams track how answers change between reviews.

How does a Runbook differ from a Playbook in the WAF context?

Related Topics

WAF spans the full stack from infrastructure to culture:

  • Compliance and Audit — Security pillar - audit and guardrails
  • Backpressure and Rate Limiting — Reliability patterns for resilience
  • CAP Theorem — Theoretical basis of Reliability trade-offs

Key Ideas

  • Reliability: design for failure. RTO/RPO determine architecture cost
  • Performance Efficiency: right resource type + managed services instead of DIY
  • Cost Optimization: rightsizing + Savings Plans + Spot can cut spend by 40-70%
  • Security: Shared Responsibility + Defence in Depth + Zero Trust
  • Operational Excellence: automate everything, measure everything, runbooks as code

Questions for Reflection

  • How do you prioritize pillars with limited resources - what matters most for a startup?
  • When are Spot Instances unacceptable even if the savings are 70%?
  • How do you integrate Well-Architected Review into an agile process without turning it into a quarterly ritual?

Related Lessons

  • cloud-15 — Compliance is part of the Security pillar of WAF
  • devops-16 — Prometheus/Grafana is the tool for Operational Excellence
  • bt-22-backpressure — Backpressure is a Reliability pillar pattern
  • ds-02-cap-theorem — CAP theorem is the theoretical foundation of Reliability trade-offs
  • devops-01