Cloud Computing

EC2 and Virtual Machines

August 25, 2006. Amazon launches EC2 in beta. First instance type: m1.small, with 1 vCPU and 1.7 GB RAM at $0.10/hour. Before this, bringing a new server online cost $2,000-5,000 and took 2-8 weeks of procurement. Two years later, in 2008, Netflix begins migrating from its own data centers. By 2016, Netflix runs fully on AWS, managing thousands of EC2 instances through code. The idea of buying exactly as much compute as is needed right now changed the industry.

**Airbnb** scales its fleet from 200 to 2,000+ instances in hours during holidays via Auto Scaling Groups - without EC2 this would take months of procurement
**Netflix** uses Spot instances for video encoding - 70% savings; on interruption, the encoding job restarts from a checkpoint
**Stripe** holds its baseline on Reserved Instances (3yr, all upfront) - 72% savings on predictable payment processing load

Historical context

The idea for EC2 came from Amazon's internal pain: every team spent weeks provisioning servers for new projects. Chris Pinkham (VP Infrastructure) and a team in Cape Town built a virtualization system on top of the Xen hypervisor. Andy Jassy (then VP AWS) pushed the idea as an external service. Limited beta launched in August 2006, public availability followed in 2008. First major customer: Smugmug (2007), saving $500K per year. Netflix began migrating in 2008. By 2023, EC2 is estimated to generate $40B+ in annual revenue, and AWS is the largest cloud infrastructure provider with ~32% market share.

Instance Types

**EC2 (Elastic Compute Cloud)** is AWS's compute service, providing virtual machines on demand. Virtualization: AWS uses its own Nitro hypervisor (since 2017), previously Xen. Each instance type is a fixed combination of vCPU, RAM, networking, and storage, optimized for specific workloads.

Family	Optimized for	Examples	Typical use
t3/t4g	General purpose, burstable	t3.micro (2vCPU, 1GB)	Dev/test, low traffic, CI runners
m6i/m7i	Balanced CPU+RAM	m6i.xlarge (4vCPU, 16GB)	Application servers, microservices
c6i/c7g	Compute-optimized	c6i.2xlarge (8vCPU, 16GB)	Video encoding, ML inference, HFT
r6i/r7i	Memory-optimized	r6i.2xlarge (8vCPU, 64GB)	In-memory databases, Redis, SAP HANA
p4d/p5	GPU	p4d.24xlarge (96vCPU, 8 A100 GPU)	ML training, rendering
i3/i4i	Storage-optimized (NVMe)	i3.4xlarge (16vCPU, 122GB, 2x1.9TB NVMe)	Cassandra, Elasticsearch, OLAP
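One way to read the table is through the memory-per-vCPU ratio of each example. The sketch below uses only the vCPU and memory figures listed above; the helper names `gib_per_vcpu` and `closest_instance` are hypothetical, and real sizing would also weigh network, storage, and price.

```python
# (vCPU count, memory) copied from the example column of the table above.
INSTANCES = {
    "t3.micro": (2, 1),
    "m6i.xlarge": (4, 16),
    "c6i.2xlarge": (8, 16),
    "r6i.2xlarge": (8, 64),
    "i3.4xlarge": (16, 122),
}


def gib_per_vcpu(instance_type: str) -> float:
    """Return memory per vCPU for an instance type listed in INSTANCES."""
    vcpus, memory = INSTANCES[instance_type]
    return memory / vcpus


def closest_instance(required_gib_per_vcpu: float) -> str:
    """Pick the listed example whose memory/vCPU ratio is nearest the requirement."""
    return min(
        INSTANCES,
        key=lambda name: abs(gib_per_vcpu(name) - required_gib_per_vcpu),
    )


if __name__ == "__main__":
    for name in INSTANCES:
        print(f"{name:12s} {gib_per_vcpu(name):5.2f} per vCPU")
```

The ratios make the family names concrete: the c example carries 2 per vCPU, the m example 4, and the r example 8.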

**Burstable instances (t3/t4g):** accumulate CPU credits during low-utilization periods, spend them during peaks. When credits are exhausted, performance drops to the baseline, which ranges from 5% to 40% per vCPU depending on instance size (for example, 10% for t3.micro and 30% for t3.large). Unlimited mode, the default for t3/t4g, allows going into credit debt for an extra charge. For production workloads with predictable load, fixed m-series instances are preferable.
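The credit mechanism can be sketched as a small hourly simulation. This is a simplified model, not AWS's exact accounting: it assumes standard (non-unlimited) mode, hourly steps, and t3.micro figures from AWS's published credit table (2 vCPUs, 12 credits earned per hour, a 288-credit cap), where one credit is one vCPU at 100% for one minute. The class name `BurstableModel` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BurstableModel:
    """Hourly CPU-credit model; defaults are assumed t3.micro figures."""

    vcpus: int = 2
    credits_per_hour: float = 12.0
    max_credits: float = 288.0
    balance: float = 0.0

    def run_hour(self, demanded_utilization: float) -> float:
        """Advance one hour and return the average per-vCPU utilization delivered.

        demanded_utilization is a fraction between 0 and 1.
        """
        available = self.balance + self.credits_per_hour
        # One credit = one vCPU-minute at 100%, so an hour at full load
        # on all vCPUs needs vcpus * 60 credits.
        demanded = demanded_utilization * self.vcpus * 60
        spent = min(demanded, available)
        self.balance = min(self.max_credits, available - spent)
        return spent / (self.vcpus * 60)


if __name__ == "__main__":
    model = BurstableModel()
    for _ in range(24):          # idle for a day to fill the credit balance
        model.run_hour(0.0)
    for hour in range(1, 6):     # then demand 100% on both vCPUs
        delivered = model.run_hour(1.0)
        print(f"hour {hour}: delivered {delivered:.0%}, balance {model.balance:.0f}")
```

Under these assumptions, a full credit balance sustains full load for a little over two hours, after which delivered utilization settles at the rate the hourly credit income can pay for.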

An ML team trains neural networks. Which EC2 instance family is optimal?

AMI - Amazon Machine Image

**AMI (Amazon Machine Image)** is a template for launching EC2 instances: root volume snapshot + launch permissions + block device mapping. Analogous to a Docker image, but for virtual machines. AMIs are region-specific (copy them for multi-region deployments).
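Because AMIs are region-specific, a multi-region deployment needs a copy per region. A minimal sketch, assuming `dest_client` is a boto3 EC2 client created in the destination region (for example `boto3.client("ec2", region_name="eu-west-1")`); the wrapper name `copy_ami` and the AMI ID in the comment are hypothetical, while `copy_image` with `Name`, `SourceImageId`, and `SourceRegion` is the underlying EC2 API call.

```python
def copy_ami(dest_client, source_image_id: str, source_region: str, name: str) -> str:
    """Copy an AMI into the region that dest_client is bound to.

    dest_client is expected to behave like a boto3 EC2 client for the
    destination region. Returns the ID of the new AMI, which differs from
    the source ID because AMI IDs are per region.
    """
    response = dest_client.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2_eu = boto3.client("ec2", region_name="eu-west-1")
#   new_id = copy_ami(ec2_eu, "ami-0123456789abcdef0", "us-east-1", "golden-ami")
```

The copy runs asynchronously, so the new AMI is not launchable immediately; callers typically wait until its state becomes available before pointing a launch template at it.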

**Golden AMI pattern:** create a base AMI with pre-installed agents (CloudWatch, SSM, antivirus), application runtime, and hardening (CIS Benchmark). Auto Scaling launches instances from this golden AMI - startup time 30-60 seconds instead of 5+ minutes with bootstrap scripts.

An AMI is created in us-east-1. The team wants to launch an instance in eu-west-1. What must be done?

Spot Instances

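**Spot Instances** sell unused EC2 capacity at a steep discount: typically 70-90% off the On-Demand price. Spot began as an auction, but since 2017 the price is no longer set by bidding; it adjusts gradually with supply and demand. Catch: AWS can interrupt an instance with 2 minutes' notice (Spot interruption) when it needs the capacity back. Interruption rate depends on instance type and region; the Spot Instance Advisor reports it per type in bands from under 5% to over 20% per month.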

Workload	Spot suitable?	Why
ML training (with checkpoints)	Yes	Checkpoint every N batches, resume on interruption
Batch video encoding	Yes	Stateless jobs from SQS, retry on interruption
Stateful production API	No	Interruption = user downtime
CI/CD runners	Yes	Failed build restarts automatically
Database master	No	Data loss on interruption without persistent volume
Kubernetes worker nodes	Yes	Karpenter automatically replaces interrupted nodes
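The checkpoint-and-resume pattern from the table can be sketched as follows. This is a simulation under stated assumptions: a local JSON file stands in for durable storage (in practice something that survives the instance, such as S3 or an EBS volume), the `SpotInterruption` exception stands in for the real interruption, and all function names are hypothetical.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Stands in for a Spot interruption in this simulation."""


def load_checkpoint(path: str) -> int:
    """Return the index of the next batch to process (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_batch"]


def save_checkpoint(path: str, next_batch: int) -> None:
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_batch": next_batch}, f)
    os.replace(tmp, path)  # a half-written file never replaces a good one


def train(path, total_batches, checkpoint_every, interrupt_at=None) -> int:
    """Resume from the last checkpoint; return batches processed in this run."""
    start = load_checkpoint(path)
    processed = 0
    for batch in range(start, total_batches):
        if interrupt_at is not None and batch == interrupt_at:
            raise SpotInterruption(f"interrupted before batch {batch}")
        # ... the real training step for this batch would go here ...
        processed += 1
        if (batch + 1) % checkpoint_every == 0:
            save_checkpoint(path, batch + 1)
    save_checkpoint(path, total_batches)
    return processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        try:
            train(ckpt, total_batches=100, checkpoint_every=10, interrupt_at=37)
        except SpotInterruption:
            print("interrupted; will resume at batch", load_checkpoint(ckpt))
        print("batches run after resume:", train(ckpt, 100, 10))
```

The work lost to an interruption is bounded by the checkpoint interval, so the interval is a trade-off between checkpoint overhead and the amount of recomputation after a restart.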

An ML pipeline trains a model for 8 hours on a Spot instance. The instance is interrupted after 3 hours. How to minimize losses?

Reserved Instances and Savings Plans

**Reserved Instances and Savings Plans** are commitment-based discount mechanisms: a 1- or 3-year commitment to pay for compute in exchange for 30-72% off On-Demand. Savings Plans (newer) are more flexible: a $/hour spend commitment rather than a specific instance type.

Type	Discount (up to)	Flexibility	When to choose
On-Demand	0%	Full	Unpredictable load, testing
Spot	70-90%	Interruptible	Batch, ML, stateless, fault-tolerant
Savings Plans (Compute)	66%	High (any instance type/region)	Stable baseline load, flexibility matters
Standard RI (1yr)	40%	Medium (fixed type)	Stable load, known instance type
Standard RI (3yr, all upfront)	72%	Low	Mature infrastructure, long-term commitment

**Unused RI/SP problem:** the commitment keeps billing even when no instance is running. Before buying a 3-year RI, confirm via Cost Explorer that baseline utilization has been stable for 6+ months. Unused Standard RIs can be listed for sale on the Reserved Instance Marketplace, usually at a discount; Savings Plans cannot be resold.
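The unused-commitment risk has a simple break-even: a commitment bills every hour at the discounted rate, while On-Demand bills only the hours used. A rough sketch, assuming a flat hourly price and ignoring upfront-payment timing and resale; the function names and the $0.10/hour price in the demo are hypothetical.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours an instance must run for a commitment to match On-Demand.

    Commitment cost per hour of the term: price * (1 - discount), always billed.
    On-Demand cost per hour of the term:  price * utilization.
    Setting the two equal gives utilization = 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def monthly_costs(hourly_price, discount, utilization, hours=730):
    """Return (committed_cost, on_demand_cost) for one instance over a month."""
    committed = hourly_price * (1 - discount) * hours
    on_demand = hourly_price * utilization * hours
    return committed, on_demand


if __name__ == "__main__":
    for d in (0.40, 0.66, 0.72):
        print(f"discount {d:.0%}: break-even at {break_even_utilization(d):.0%} utilization")
    print(monthly_costs(hourly_price=0.10, discount=0.40, utilization=0.5))
```

With a 40% discount, an instance that runs only half the time is cheaper On-Demand, which is why stable baseline utilization matters more than the headline discount.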

Myth: Reserved Instances should be purchased immediately at project launch for maximum savings.

Reality: RIs are best purchased after 3-6 months of operation, once baseline load has stabilized. Buying early risks locking in the wrong instance type for 1-3 years.

Architecture and load change in the first months. Start with On-Demand + Spot. After stabilization, analyze Cost Explorer and buy RI/SP for the real baseline. This is consistent with the cost optimization guidance in the AWS Well-Architected Framework.
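To compare purchasing mixes, the baseline and the peak can be priced separately. The sketch below is illustrative only: it assumes a fleet of 4 always-on instances plus 8 more during business hours (taken here as 220 hours a month), a hypothetical hourly price of $0.096, and the upper-bound discounts from the table (66% for a Compute Savings Plan, 70% for Spot). Real discounts depend on term, payment option, region, and Spot market conditions.

```python
HOURS_PER_MONTH = 730


def monthly_cost(
    price: float,
    baseline: int,
    peak_extra: int,
    peak_hours: float,
    baseline_discount: float = 0.0,
    peak_discount: float = 0.0,
) -> float:
    """Monthly cost of always-on baseline instances plus part-time peak instances."""
    baseline_cost = baseline * HOURS_PER_MONTH * price * (1 - baseline_discount)
    peak_cost = peak_extra * peak_hours * price * (1 - peak_discount)
    return baseline_cost + peak_cost


if __name__ == "__main__":
    # Assumed inputs: price, fleet shape, and business hours are hypothetical.
    fleet = dict(price=0.096, baseline=4, peak_extra=8, peak_hours=220)

    all_on_demand = monthly_cost(**fleet)
    sp_baseline = monthly_cost(**fleet, baseline_discount=0.66)
    sp_plus_spot = monthly_cost(**fleet, baseline_discount=0.66, peak_discount=0.70)

    print(f"all On-Demand:               ${all_on_demand:.2f}")
    print(f"SP baseline, On-Demand peak: ${sp_baseline:.2f}")
    print(f"SP baseline, Spot peak:      ${sp_plus_spot:.2f}")
```

Moving the peak to Spot only makes sense if the peak instances are stateless and tolerate interruption; otherwise the middle option is the safer mix.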

A startup launches production: 4 m6i.large instances running constantly, plus scaling to 12 during business hours. Optimal purchasing strategy?

Key ideas

**Instance types** - families optimized for workload: t (burstable), m (balanced), c (compute), r (memory), p (GPU), i (storage NVMe)
**AMI** - VM template: region-bound, identified by a region-specific AMI ID. Golden AMI pattern for fast startup
**Spot** - 70-90% discount for interruptibility. Pattern: checkpoint + retry, diversified fleet across multiple types
**Buying strategy:** Savings Plans for baseline, On-Demand/Spot for peaks. Buy RI after 3-6 months of stabilization

Questions for reflection

A startup runs 20 m6i.xlarge instances constantly on EC2 On-Demand (roughly $2.8K/month at a us-east-1 Linux rate of about $0.192/hour). AWS Cost Explorer shows 95% utilization over the last 6 months. What purchasing strategy is recommended and what is the potential savings?

Related lessons

devops-04 — Docker container vs EC2 VM - the key trade-off in modern infrastructure
devops-05 — Kubernetes on EC2: EKS manages containers on top of EC2 instances
ds-04-consistent-hashing — Consistent hashing in Auto Scaling: distributing instances across AZs
bt-04-dns-tls — Route 53 + ALB: DNS and TLS termination in front of EC2 instances
opt-04 — Optimizing cloud spend: RI/Spot/On-Demand mix as an optimization problem
cloud-01
cloud-02
cloud-03
sd-10-microservices
os-12-virtualization