DevOps
Autoscaling and HPA
Black Friday: Pinterest traffic grows 8x in 10 minutes. Before autoscaling, the team manually added servers in advance - and guessed wrong by 30%. After autoscaling, the system scales itself. The team watches dashboards instead of SSH-ing into servers at 2am.
- **Delivery Hero** uses KEDA to scale order-processing workers: 500 workers at dinner time, 0 pods (and 0 cost) at 3am.
- **Spotify** migrated from Cluster Autoscaler to Karpenter - node provisioning speed improved from 3-4 minutes to 45 seconds during load spikes after an artist release.
- **Shopify** used VPA to analyze resource requests for 3,000+ microservices - 60% were over-provisioned 3-5x; savings of $2M/year.
HPA (Horizontal Pod Autoscaler)
HPA automatically changes the number of replicas in a Deployment based on metrics. Default: CPU utilization relative to pod requests, read from Metrics Server. Extended: custom metrics (RPS, queue depth) exposed through the custom metrics API, for example by Prometheus Adapter.
Accurate resource requests are the foundation of correct HPA behavior. If CPU requests are set too low, HPA sees artificially high utilization and over-scales. Use VPA recommendations to calibrate requests before enabling HPA.
HPA is configured at CPU 70%. A pod has requests of 100m CPU and actually consumes 80m. What CPU utilization does HPA see?
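The arithmetic behind this question can be sketched in a few lines of Python. The formula follows the documented HPA rule `desired = ceil(current * currentMetric / targetMetric)` with the default 10% tolerance. The replica count of 4 and the 50m request in the second case are assumptions added for illustration, not values from the lesson.

```python
import math


def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as HPA sees it: usage relative to the pod's CPU request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, tolerance: float = 0.1) -> int:
    """ceil(current * currentMetric / targetMetric); no change inside the tolerance band."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# The pod from the question: 80m used against a 100m request.
utilization = cpu_utilization_percent(usage_millicores=80, request_millicores=100)
print(utilization)                           # 80.0, above the 70% target
print(desired_replicas(4, utilization, 70))  # 5 (4 current replicas is an assumed value)

# Same usage with a request set too low (assumed 50m): HPA sees 160% and over-scales.
low_request_util = cpu_utilization_percent(usage_millicores=80, request_millicores=50)
print(desired_replicas(4, low_request_util, 70))  # 10
```

The second case shows why requests matter: the workload is identical, but halving the request doubles the utilization HPA sees and the replica count it asks for.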
VPA (Vertical Pod Autoscaler)
VPA automatically recommends and sets correct resource requests/limits for pods based on historical consumption. Mode Off: recommendations only. Mode Auto: applies changes (restarts pods).
Shopify analyzed 3,000+ microservices with VPA and found 60% were over-provisioned 3-5x. Applying VPA recommendations saved $2M/year without any service degradation.
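As a rough illustration of how a recommendation can be derived from historical consumption, the sketch below takes a high percentile of observed CPU usage and adds a safety margin. This is a simplification and not the VPA recommender itself, which works on decaying histograms. The 90th percentile, the 15% margin, the usage samples, and the 500m current request are all assumed values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommend_cpu_request(usage_millicores, pct=90, safety_margin=0.15):
    """Simplified recommendation: a high percentile of usage plus a safety margin."""
    return math.ceil(percentile(usage_millicores, pct) * (1 + safety_margin))


def overprovisioning_factor(current_request, recommended_request):
    """How many times larger the current request is than the recommendation."""
    return current_request / recommended_request


# Hypothetical CPU usage samples in millicores for one service.
usage = [90, 95, 100, 105, 110, 100, 98, 102, 97, 104]
recommended = recommend_cpu_request(usage)
print(recommended)  # 121

# Against an assumed current request of 500m, the service is over-provisioned about 4x.
print(round(overprovisioning_factor(500, recommended), 1))  # 4.1
```

Running this kind of analysis in Off mode gives the numbers without restarting any pods.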
Why can VPA in Auto mode and HPA on CPU not be used simultaneously for the same Deployment?
Cluster Autoscaler and Karpenter
Cluster Autoscaler (CA) scales cluster nodes: adds nodes when pods are Pending (no space), removes nodes when underutilized. Karpenter is the next-generation alternative: 30-60s provisioning vs 2-5 minutes for CA, and selects optimal instance type per workload rather than using predefined node pools.
Karpenter bin-packing optimization can reduce node count by 20-40% by selecting instance types that fit actual workload requirements rather than using fixed node pools.
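The idea of picking an instance type to fit the workload, rather than always adding nodes from one fixed pool, can be sketched with first-fit-decreasing bin packing. This is a toy model and not Karpenter's actual scheduling logic. The instance catalog, the prices (in arbitrary cost units), and the pending pods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millicores: int
    memory_mib: int
    hourly_cost: int  # hypothetical cost units, not real prices


def pack(pods, instance):
    """First-fit decreasing: number of nodes of this type needed, or None if a pod cannot fit."""
    free = []  # remaining [cpu, memory] on each node opened so far
    for cpu, mem in sorted(pods, reverse=True):
        if cpu > instance.cpu_millicores or mem > instance.memory_mib:
            return None
        for node in free:
            if node[0] >= cpu and node[1] >= mem:
                node[0] -= cpu
                node[1] -= mem
                break
        else:
            free.append([instance.cpu_millicores - cpu, instance.memory_mib - mem])
    return len(free)


def cheapest_option(pods, catalog):
    """Return (total_cost, node_count, instance_name) for the cheapest type that fits."""
    options = []
    for instance in catalog:
        count = pack(pods, instance)
        if count is not None:
            options.append((count * instance.hourly_cost, count, instance.name))
    return min(options)


# Hypothetical catalog and six pending pods of (1500m CPU, 2048 MiB) each.
catalog = [
    InstanceType("small", 2000, 4096, 10),
    InstanceType("medium", 4000, 8192, 20),
    InstanceType("wide", 6000, 12288, 28),
]
pending = [(1500, 2048)] * 6

print(pack(pending, catalog[0]))          # 6 nodes if the pool is fixed to "small"
print(cheapest_option(pending, catalog))  # (56, 2, 'wide')
```

With a fixed pool of "small" nodes, each 1500m pod strands 500m of CPU and six nodes are needed. Choosing the type per workload fits the same pods onto two nodes.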
What makes Karpenter better than Cluster Autoscaler during load spikes?
KEDA (Event-Driven Autoscaling)
KEDA (Kubernetes Event-Driven Autoscaling) scales Deployments based on external events: SQS queue depth, Kafka consumer lag, Prometheus metrics, cron schedule. Unlike HPA, KEDA can scale to 0 replicas when there is no work to process.
Scale-to-zero is KEDA's key differentiator from HPA: batch workers that process messages for 4 hours/day cost nothing for the remaining 20 hours. With HPA minimum replicas=2, they cost 24/7.
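A minimal sketch of the two ideas above: replica count derived from queue depth, and the cost difference between scaling to zero and keeping a minimum of 2 replicas. It ignores KEDA details such as activation thresholds and cooldown periods. The per-replica target of 50 messages, the cap of 500 replicas, and the 10 replicas during the active window are assumed values.

```python
import math


def queue_based_replicas(queue_depth, messages_per_replica, max_replicas):
    """Replicas for a queue-driven worker: zero when the queue is empty, capped at max_replicas."""
    if queue_depth == 0:
        return 0
    return min(max_replicas, math.ceil(queue_depth / messages_per_replica))


def daily_pod_hours(active_hours, active_replicas, idle_replicas):
    """Pod-hours per day: active window plus whatever stays running while idle."""
    return active_hours * active_replicas + (24 - active_hours) * idle_replicas


print(queue_based_replicas(0, 50, 500))       # 0: nothing to process, nothing running
print(queue_based_replicas(120, 50, 500))     # 3
print(queue_based_replicas(100000, 50, 500))  # 500: capped at max_replicas

# Workers busy 4 hours/day with an assumed 10 replicas while active.
print(daily_pod_hours(4, 10, idle_replicas=0))  # 40 pod-hours with scale-to-zero
print(daily_pod_hours(4, 10, idle_replicas=2))  # 80 pod-hours with minimum replicas=2
```

In this example the idle floor of 2 replicas doubles the daily pod-hours even though the workers do the same amount of work.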
Myth: "Set HPA and forget - autoscaling will handle everything automatically"
Effective autoscaling requires: accurate resource requests (VPA), appropriate scaling triggers (HPA or KEDA), and sufficient node capacity (Cluster Autoscaler or Karpenter). Misconfigured requests distort the utilization HPA sees, so it over- or under-scales; without node autoscaling, new pods stay Pending once the existing nodes are full.
A common failure: HPA scales pods correctly, but with Cluster Autoscaler new nodes can take up to 5 minutes to provision. During that time, the new pods are Pending and the extra traffic is unserved. Karpenter reduces that gap to 30-60 seconds.
Why can KEDA scale to 0 replicas while standard HPA cannot?
Key Ideas
- **HPA** scales pods on CPU/custom metrics; accurate resource requests are the foundation; a stabilization window (5 minutes by default for scale-down) prevents flapping.
- **VPA** recommends correct requests/limits from historical consumption; use Off mode for analysis, and do not run Auto mode together with HPA on CPU for the same Deployment.
- **Karpenter + KEDA** - next level: Karpenter provisions nodes in 30-60 seconds with an optimal instance type; KEDA scales to 0 on external events (SQS, Kafka).
Related Topics
Autoscaling layers work together from pod to node to event:
- K8s: Advanced Patterns - HPA is a built-in Kubernetes API, while VPA and Karpenter are add-ons configured through custom resources; HPA and VPA target the same Deployment and StatefulSet resources.
- Serverless: Lambda and Cloud Functions - Lambda scales automatically, up to a default limit of 1,000 concurrent executions, without configuration; Kubernetes autoscaling is explicit configuration of the same behavior.
Questions for Reflection
- If a service processes batch tasks from an SQS queue at night - HPA on CPU or KEDA on queue depth? Why?
- During a traffic spike, pods scale quickly but nodes take 2-3 minutes to provision. How do you minimize this gap?
- VPA recommends reducing CPU requests from 500m to 120m for a service. What are the risks of accepting this recommendation?