Cloud Computing
EC2 and Virtual Machines
August 25, 2006. Amazon launches EC2 in beta. First instance type: m1.small - 1 vCPU, 1.7 GB RAM, $0.10/hour. Before this, launching a server cost $2,000-5,000 and 2-8 weeks of procurement wait. Two years later, in 2008, Netflix begins migrating from its own data centers. By 2016, Netflix is fully on AWS, managing thousands of EC2 instances through code. The idea of 'buy exactly as much compute as you need right now' changed the industry.
- **Airbnb** scales its fleet from 200 to 2,000+ instances in hours during holidays via Auto Scaling Groups - without EC2 this would take months of procurement
- **Netflix** uses Spot instances for video encoding - 70% savings; on interruption, the encoding job restarts from a checkpoint
- **Stripe** holds its baseline on Reserved Instances (3yr, all upfront) - 72% savings on predictable payment processing load
Historical context
The idea for EC2 came from Amazon's internal pain: every team spent weeks provisioning servers for new projects. Chris Pinkham (VP Infrastructure) and a team in Cape Town built a virtualization system on top of the Xen hypervisor. Andy Jassy (then VP AWS) pushed the idea as an external service. The limited beta launched in August 2006, with public availability in 2008. First major customer: SmugMug (2007), saving $500K per year. Netflix began migrating in 2008. By 2023, EC2 generates $40B+ in annual revenue, and AWS is the largest cloud provider with ~32% market share.
Instance Types
**EC2 (Elastic Compute Cloud)** is AWS's compute service, providing virtual machines on demand. Virtualization: current-generation instances run on AWS's own Nitro hypervisor (introduced in 2017); earlier generations ran on Xen. Each instance type is a fixed combination of vCPU, RAM, networking, and storage, optimized for specific workloads.
| Family | Optimized for | Examples | Typical use |
|---|---|---|---|
| t3/t4g | General purpose, burstable | t3.micro (2vCPU, 1GB) | Dev/test, low traffic, CI runners |
| m6i/m7i | Balanced CPU+RAM | m6i.xlarge (4vCPU, 16GB) | Application servers, microservices |
| c6i/c7g | Compute-optimized | c6i.2xlarge (8vCPU, 16GB) | Video encoding, ML inference, HFT |
| r6i/r7i | Memory-optimized | r6i.2xlarge (8vCPU, 64GB) | In-memory databases, Redis, SAP HANA |
| p4d/p5 | GPU | p4d.24xlarge (96vCPU, 8 A100 GPU) | ML training, rendering |
| i3/i4i | Storage-optimized (NVMe) | i3.4xlarge (16vCPU, 122GB, 2x1.9TB NVMe) | Cassandra, Elasticsearch, OLAP |
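The c, m, and r rows in the table differ mainly in how much RAM comes with each vCPU (2, 4, and 8 GB respectively in the listed examples). A minimal sketch of using that ratio to shortlist a family follows; the function name and the three-family lookup are illustrative assumptions, not an AWS API, and a real choice would also weigh network, storage, and price.

```python
# GB of RAM per vCPU, derived from the example rows in the table above:
# c6i.2xlarge 16/8 = 2, m6i.xlarge 16/4 = 4, r6i.2xlarge 64/8 = 8.
RAM_PER_VCPU = {"c6i": 2, "m6i": 4, "r6i": 8}


def pick_family(vcpus: int, ram_gb: float) -> str:
    """Return the leanest family whose RAM-per-vCPU ratio covers the request."""
    needed_ratio = ram_gb / vcpus
    # Walk families from least to most memory per vCPU.
    for family, ratio in sorted(RAM_PER_VCPU.items(), key=lambda kv: kv[1]):
        if ratio >= needed_ratio:
            return family
    # Nothing matches the ratio: fall back to the most memory-rich family
    # (in practice the instance would be sized by RAM, leaving vCPUs spare).
    return "r6i"


if __name__ == "__main__":
    print(pick_family(8, 16))   # compute-heavy request
    print(pick_family(4, 12))   # balanced request
    print(pick_family(8, 64))   # memory-heavy request
```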
**Burstable instances (t3/t4g):** accumulate CPU credits during low-utilization periods and spend them during peaks. One credit equals one vCPU running at 100% for one minute. When credits are exhausted, performance drops to the baseline, which ranges from 5% to 40% of a vCPU depending on instance size (10% for t3.micro). Unlimited mode allows going into credit debt for an extra charge. For production workloads with predictable load, fixed m-series instances are preferable.
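The credit mechanism can be illustrated with a small hour-by-hour simulation. This is a simplified model, not AWS's exact accounting: it assumes standard (not unlimited) mode and settles credits once per hour. The t3.micro figures used in the example (2 vCPUs, 12 credits earned per hour, 288-credit cap) are quoted from memory of the AWS documentation and should be verified before relying on them.

```python
from dataclasses import dataclass


@dataclass
class BurstableModel:
    vcpus: int
    credits_per_hour: float  # credits earned each hour
    max_balance: float       # cap on accrued credits


def simulate(model: BurstableModel, hourly_utilization, start_balance: float = 0.0):
    """Return [(credit balance, delivered utilization)] for each hour.

    Utilization is the average fraction (0..1) across all vCPUs.
    One credit = one vCPU at 100% for one minute.
    """
    balance = start_balance
    history = []
    for wanted in hourly_utilization:
        available = balance + model.credits_per_hour
        needed = wanted * model.vcpus * 60        # credits to run at `wanted` for an hour
        spent = min(needed, available)            # cannot spend more than we have
        delivered = spent / (model.vcpus * 60)    # throttled utilization actually achieved
        balance = min(available - spent, model.max_balance)
        history.append((round(balance, 2), round(delivered, 3)))
    return history


if __name__ == "__main__":
    t3_micro = BurstableModel(vcpus=2, credits_per_hour=12, max_balance=288)
    # Two idle hours, then two hours asking for 50% CPU.
    for hour, row in enumerate(simulate(t3_micro, [0.0, 0.0, 0.5, 0.5]), start=1):
        print(hour, row)
```

In this model the fourth hour delivers only 10% utilization: with no credits left, the instance can spend just what it earns, which is the baseline.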
An ML team trains neural networks. Which EC2 instance family is optimal?
AMI - Amazon Machine Image
**AMI (Amazon Machine Image)** is a template for launching EC2 instances: root volume snapshot + launch permissions + block device mapping. Analogous to a Docker image, but for virtual machines. AMIs are region-specific (copy them for multi-region deployments).
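A minimal sketch of a cross-region copy, assuming a boto3 EC2 client created for the destination region. The helper name `copy_ami_to_region` and the AMI ID and image name in the usage comment are hypothetical placeholders. The copy is requested from the destination side, runs asynchronously, and yields a new AMI ID that is valid only in that region.

```python
def copy_ami_to_region(dest_ec2_client, source_region: str, source_ami_id: str, name: str) -> str:
    """Start a cross-region AMI copy and return the new AMI ID.

    `dest_ec2_client` is expected to be an EC2 client bound to the
    destination region, e.g. boto3.client("ec2", region_name="eu-west-1").
    """
    response = dest_ec2_client.copy_image(
        Name=name,
        SourceImageId=source_ami_id,
        SourceRegion=source_region,
    )
    # The copy continues in the background; the image must reach the
    # "available" state before instances can be launched from it.
    return response["ImageId"]


# Usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   ec2_eu = boto3.client("ec2", region_name="eu-west-1")
#   new_id = copy_ami_to_region(ec2_eu, "us-east-1", "ami-0123456789abcdef0", "golden-ami-copy")
```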
**Golden AMI pattern:** create a base AMI with pre-installed agents (CloudWatch, SSM, antivirus), application runtime, and hardening (CIS Benchmark). Auto Scaling launches instances from this golden AMI - startup time 30-60 seconds instead of 5+ minutes with bootstrap scripts.
An AMI is created in us-east-1. The team wants to launch an instance in eu-west-1. What must be done?
Spot Instances
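**Spot Instances** sell unused EC2 capacity at a steep discount. Spot started as a bidding auction, but since 2017 the price moves gradually with longer-term supply and demand rather than with individual bids. Discount: 70-90% off the On-Demand price. Catch: AWS can reclaim an instance with 2 minutes' notice (Spot interruption) when it needs the capacity back. Interruption rate depends on instance type, region, and Availability Zone - typically 5-15% per month.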
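The 2-minute notice is delivered through the instance metadata service: as far as I know, the path `/latest/meta-data/spot/instance-action` returns 404 until an interruption is scheduled, and then a small JSON document with the action and its time. The sketch below only parses such a document and computes the time left; fetching it (including the IMDSv2 token handshake) is left out, and the function name is an assumption for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Parse a Spot instance-action document and return the seconds remaining.

    Expected shape: {"action": "terminate", "time": "2024-01-01T12:02:00Z"}
    """
    notice = json.loads(notice_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
    return (deadline - now).total_seconds()


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(seconds_until_interruption(sample, now))
```

A worker would typically poll for the notice every few seconds and, once it appears, stop taking new work and flush a checkpoint within the remaining window.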
| Workload | Spot suitable? | Why |
|---|---|---|
| ML training (with checkpoints) | Yes | Checkpoint every N batches, resume on interruption |
| Batch video encoding | Yes | Stateless jobs from SQS, retry on interruption |
| Stateful production API | No | Interruption = user downtime |
| CI/CD runners | Yes | Failed build restarts automatically |
| Database master | No | Data loss on interruption without persistent volume |
| Kubernetes worker nodes | Yes | Karpenter automatically replaces interrupted nodes |
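The "checkpoint every N batches, resume on interruption" pattern from the table can be sketched with a local JSON file standing in for durable storage. On a real Spot instance the checkpoint would need to live somewhere that survives the instance (for example S3 or a persistent volume); the file layout and the `interrupt_at` parameter are assumptions made purely to simulate an interruption.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, next_batch: int) -> None:
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_batch": next_batch}, f)
    os.replace(tmp, path)


def run_training(total_batches: int, checkpoint_path: str, checkpoint_every: int, interrupt_at=None):
    """Run (or resume) training. Returns (batches executed in this run, finished?)."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]
    executed = 0
    for batch in range(start, total_batches):
        if batch == interrupt_at:
            # Simulated Spot interruption: work since the last checkpoint is lost.
            return executed, False
        # ... the real training step for `batch` would run here ...
        executed += 1
        if (batch + 1) % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, batch + 1)
    return executed, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        print(run_training(100, ckpt, checkpoint_every=10, interrupt_at=37))
        print(run_training(100, ckpt, checkpoint_every=10))
```

In this run the interruption at batch 37 costs only the 7 batches since the checkpoint at batch 30; a smaller checkpoint interval reduces the loss at the price of more checkpoint writes.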
An ML pipeline trains a model for 8 hours on a Spot instance. The instance is interrupted after 3 hours. How can the lost work be minimized?
Reserved Instances and Savings Plans
**Reserved Instances and Savings Plans** are commitment-based discount mechanisms: a 1- or 3-year commitment to pay for compute in exchange for 30-72% off On-Demand. Savings Plans (newer) are more flexible: a $/hour spend commitment rather than a specific instance type.
| Type | Discount (up to) | Flexibility | When to choose |
|---|---|---|---|
| On-Demand | 0% | Full | Unpredictable load, testing |
| Spot | 70-90% | Interruptible | Batch, ML, stateless, fault-tolerant |
| Savings Plans (Compute) | 66% | High (any instance type/region) | Stable baseline load, flexibility matters |
| Standard RI (1yr) | 40% | Medium (fixed type) | Stable load, known instance type |
| Standard RI (3yr, all upfront) | 72% | Low | Mature infrastructure, long-term commitment |
**Unused RI/SP problem:** the commitment keeps billing even when no instance is running. Before buying a 3-year RI, confirm via Cost Explorer that baseline utilization has been stable for 6+ months. Unused Standard RIs can be resold on the Reserved Instance Marketplace, usually at a discount; Savings Plans cannot be resold.
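The risk of an unused commitment comes down to simple arithmetic: a commitment with discount d costs (1 - d) of the On-Demand rate for every hour of the term, so it only beats On-Demand if the instance is actually needed more than (1 - d) of the time. A short sketch of that calculation follows; the function names and the $0.10/hour rate are illustrative assumptions, not real prices.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours an instance must run for a commitment to beat On-Demand."""
    return round(1.0 - discount, 4)


def wasted_spend(on_demand_rate: float, discount: float, hours_committed: float, hours_used: float) -> float:
    """Money paid for committed hours during which nothing was running."""
    committed_rate = on_demand_rate * (1.0 - discount)
    idle_hours = max(hours_committed - hours_used, 0)
    return round(committed_rate * idle_hours, 2)


if __name__ == "__main__":
    # With the 72% discount from the table, the instance must be needed
    # for more than this fraction of the term to come out ahead.
    print(break_even_utilization(0.72))
    # A 40% discount on a hypothetical $0.10/hour instance, used 700 of 1000 hours.
    print(wasted_spend(0.10, 0.40, 1000, 700))
```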
**Misconception:** Reserved Instances should be purchased immediately at project launch for maximum savings.
**Reality:** RIs are purchased after 3-6 months of operation, once baseline load has stabilized. Early purchase risks locking in the wrong instance type for 1-3 years.
**Why:** Architecture and load change in the first months. Start with On-Demand + Spot. After stabilization, analyze Cost Explorer and buy RI/SP for the real baseline. This matches the cost-optimization guidance in the AWS Well-Architected Framework.
A startup launches production: 4 m6i.large instances running constantly, plus scaling to 12 during business hours. What is the optimal purchasing strategy?
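One way to reason about a fleet with a constant baseline and a business-hours peak is to price the two parts separately: a commitment for the baseline, On-Demand for the extra instances. The sketch below compares that mix with running everything On-Demand. The hourly rate ($0.10), the 40% commitment discount, and the 220 peak hours per month are placeholder assumptions, not actual m6i.large pricing.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(baseline: int, peak_extra: int, peak_hours: float, rate: float, commit_discount: float):
    """Return (all On-Demand cost, committed baseline + On-Demand peak cost)."""
    baseline_on_demand = baseline * rate * HOURS_PER_MONTH
    peak_on_demand = peak_extra * rate * peak_hours
    all_on_demand = baseline_on_demand + peak_on_demand
    # Only the always-on baseline is covered by the commitment;
    # the peak instances stay On-Demand because they run part-time.
    mixed = baseline_on_demand * (1.0 - commit_discount) + peak_on_demand
    return round(all_on_demand, 2), round(mixed, 2)


if __name__ == "__main__":
    # 4 instances always on, 8 more (12 - 4) for about 10 hours x 22 business days.
    on_demand, mixed = monthly_cost(baseline=4, peak_extra=8, peak_hours=220,
                                    rate=0.10, commit_discount=0.40)
    print(on_demand, mixed)
```

Under these assumptions the mix costs 351.20 against 468.00 for pure On-Demand. Whether the peak could also move to Spot depends on how well the workload tolerates interruptions.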
Key ideas
- **Instance types** - families optimized for workload: t (burstable), m (balanced), c (compute), r (memory), p (GPU), i (storage NVMe)
- **AMI** - VM template: region-bound, must be copied to launch in another region. Golden AMI pattern for fast startup
- **Spot** - 70-90% discount for interruptibility. Pattern: checkpoint + retry, diversified fleet across multiple types
- **Buying strategy:** Savings Plans for baseline, On-Demand/Spot for peaks. Buy RI after 3-6 months of stabilization
Questions for reflection
- A startup spends $50K/month on EC2 On-Demand: 20 m6i.xlarge instances running constantly. AWS Cost Explorer shows 95% utilization over the last 6 months. What purchasing strategy is recommended and what is the potential savings?
Related lessons
- devops-04 — Docker container vs EC2 VM - the key trade-off in modern infrastructure
- devops-05 — Kubernetes on EC2: EKS manages containers on top of EC2 instances
- ds-04-consistent-hashing — Consistent hashing in Auto Scaling: distributing instances across AZs
- bt-04-dns-tls — Route 53 + ALB: DNS and TLS termination in front of EC2 instances
- opt-04 — Optimizing cloud spend: RI/Spot/On-Demand mix as an optimization problem
- cloud-01
- cloud-02
- cloud-03
- sd-10-microservices
- os-12-virtualization