Cloud Computing
Serverless: Lambda and Cloud Functions
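AWS Lambda is reported to process on the order of 10 trillion requests per month. Those requests are served by short-lived, isolated execution environments that Lambda creates, reuses across invocations, and retires automatically, so not every request is a fresh start-stop cycle. Lambda launched in 2014 and popularized the serverless model: run code without managing a server, and pay only while the function runs. A typical chain is Amazon S3 event -> Lambda -> DynamoDB, with no EC2 instance required.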
- Vercel Serverless Functions: Next.js API routes run as Lambda under the hood - each /api/* endpoint is a serverless function
- GitHub Actions: each job runs on a fresh, ephemeral runner and is billed only for execution time
- Stripe webhooks: Lambda handles 100 events/sec without EC2 - burst traffic on Black Friday
- ML pipeline: S3 event -> Lambda -> Bedrock inference -> DynamoDB - full pipeline with no servers to manage
Cold Start and Execution Model
**AWS Lambda** executes code without managing servers. Each function instance runs in an isolated execution environment, a microVM powered by Firecracker, which handles one invocation at a time and is reused for later invocations. The execution lifecycle has two main phases: **Init Phase** and **Invoke Phase**. Init Phase has three steps: container creation (Container Init), runtime initialization - loading the Python/Node/Java interpreter (Runtime Init), and function initialization - executing code outside the handler (Function Init). When a warm container already exists, Lambda skips Init Phase entirely and goes straight to Invoke. The extra latency of a cold invocation compared with a warm one is the cold start.
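The split between Function Init and Invoke can be sketched in a single handler module. This is a minimal illustration rather than AWS's runtime: `load_model` and the `INIT_COUNT` counter are hypothetical stand-ins for expensive setup such as creating SDK clients or loading model weights.

```python
import time

INIT_COUNT = 0  # hypothetical counter, only here to make the init step observable


def load_model():
    """Stand-in for expensive setup (SDK clients, model weights, config)."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)  # pretend this step is slow
    return {"name": "demo-model"}


# Function Init: module-level code runs once per execution environment,
# i.e. only on a cold start.
MODEL = load_model()


def handler(event, context):
    # Invoke phase: runs on every request and reuses MODEL on warm invocations.
    return {"model": MODEL["name"], "echo": event.get("msg")}


if __name__ == "__main__":
    # Simulate three invocations inside the same (warm) environment.
    for i in range(3):
        handler({"msg": i}, None)
    print("init ran", INIT_COUNT, "time(s)")  # init ran 1 time(s)
```

Moving `load_model()` inside `handler` would make every invocation pay the setup cost, which is why initialization is usually kept at module level.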
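**Cold start benchmarks** (typical ranges; actual values depend on memory size, package size, and region): Node.js: 100-300ms. Python: 200-500ms. Java (JVM): 1-4 seconds. Python with PyTorch: 3-8 seconds. Go/Rust: 50-150ms. Deployment package size matters directly: as a rough rule of thumb, each MB adds ~10-20ms. VPC cold starts historically added 1-2 seconds or more, because Lambda had to create an ENI (Elastic Network Interface) inside the VPC before execution. Since 2019 the network interface is created when the function is created or its VPC configuration changes, and it is shared across execution environments, so the per-invocation VPC penalty is now small.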
**Provisioned Concurrency** removes cold starts for requests served within the provisioned level: Lambda keeps N pre-initialized instances ready at all times. Traffic beyond N falls back to on-demand instances and can still hit a cold start. Cost: payment for idle time (~0.015 USD/GB-hour). For APIs with a latency SLA under 100ms this is often the only viable approach. Application Auto Scaling can adjust Provisioned Concurrency on a schedule - more instances during business hours, fewer overnight.
A Python Lambda with PyTorch shows p99 latency of 9 seconds, while p50 is 300ms. What is happening?
Concurrency Model
Lambda scales horizontally and automatically: each concurrent request gets its own function instance. 1000 simultaneous requests = 1000 parallel instances. Default account limit: **1000 concurrent executions** per region (increase via Service Quotas). The scaling rate is also limited: under the model AWS introduced in late 2023, each function can add up to 1000 concurrent executions every 10 seconds until the account limit is reached (the earlier model allowed a regional burst of 500-3000 instances, then 500 more per minute). When the limit is exceeded, synchronous invocations are throttled and fail with **TooManyRequestsException (HTTP 429)**. Callers must implement exponential backoff.
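The caller-side backoff mentioned above could look like the following sketch. It uses capped exponential backoff with full jitter, which is one common variant and not the only option. `ThrottledError` is a hypothetical stand-in for whatever exception a real client raises on HTTP 429, and the `sleep` and `rng` parameters exist only so the function can be tested without waiting.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the service."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on ThrottledError wait up to base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Throttled on the first two calls, succeeds on the third.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ThrottledError()
        return "ok"

    print(call_with_backoff(flaky, sleep=lambda s: None))  # ok
```

Jitter matters because many throttled callers retrying on the same schedule would otherwise arrive together and be throttled again.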
**Reserved Concurrency** reserves N slots for a specific function. It guarantees the function always gets N slots even when other functions saturate the account limit. It also acts as a ceiling: the function cannot exceed N regardless of traffic. **Provisioned Concurrency** keeps N pre-warmed instances running (no cold start), billed for the reserved capacity.
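The guarantee-plus-ceiling behavior of Reserved Concurrency can be modeled with a few lines of arithmetic. This is a toy model under simplified assumptions, not an AWS API: the function names and the `payments` example are hypothetical.

```python
def concurrency_budget(account_limit, reserved):
    """Return (caps for functions with a reservation, size of the shared unreserved pool)."""
    total_reserved = sum(reserved.values())
    if total_reserved > account_limit:
        raise ValueError("reservations exceed the account limit")
    return dict(reserved), account_limit - total_reserved


def admitted(demand, cap):
    """Of `demand` concurrent requests, return (running, throttled) under a cap."""
    running = min(demand, cap)
    return running, demand - running


if __name__ == "__main__":
    caps, shared = concurrency_budget(1000, {"payments": 100})
    print(shared)                            # 900 slots left for all other functions
    print(admitted(250, caps["payments"]))   # (100, 150): the reservation is also a ceiling
```

The second result shows the trade-off: the reserved function is protected from its neighbors, but it is throttled at its own cap even while the shared pool has spare capacity.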
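**LLM webhook processing**: an LLM provider such as Anthropic or OpenAI may deliver batch notifications simultaneously (100-500 requests/sec for popular apps). Lambda auto-creates 100-500 parallel instances. Without a Reserved Concurrency cap on the webhook function, that burst can starve other functions in the account. With SQS in front of Lambda, the burst is buffered in the queue and drained at a controlled rate: batch size sets how many messages each invocation receives, and the event source mapping's maximum concurrency setting (or Reserved Concurrency) caps how many instances run in parallel.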
A production account has 5 Lambda functions. The critical payment function suddenly starts receiving throttling errors (HTTP 429). Other functions work fine. What happened?
Event Sources and Triggers
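Lambda is invoked by events from various sources. Two integration patterns exist: **Push model** - a service invokes Lambda directly, either synchronously (API Gateway) or asynchronously (S3, SNS, EventBridge); **Poll model** - Lambda reads from a queue or stream itself through an event source mapping (SQS, Kinesis, DynamoDB Streams). The distinction is critical for retry semantics. For async invocations (S3, SNS), Lambda retries twice on failure by default and then sends the event to a Dead Letter Queue or on-failure destination if one is configured; otherwise the event is dropped. For the SQS poll model, a failed message becomes visible in the queue again after the visibility timeout and is reprocessed until the queue's redrive policy (maxReceiveCount) moves it to a dead-letter queue.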
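For the SQS poll model, a handler can report which messages in a batch failed so that only those return to the queue. The sketch below assumes the event source mapping has the `ReportBatchItemFailures` response type enabled; without it, Lambda ignores the return value and a raised exception retries the whole batch. `process` and the `order_id` field are hypothetical business logic.

```python
import json


def process(body):
    """Hypothetical business logic; raises on input it cannot handle."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed; successfully handled
            # messages in the same batch are removed from the queue.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

A message that keeps failing is retried until the queue's maxReceiveCount is reached, so a dead-letter queue is still needed to stop poison messages from looping.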
**API Gateway + Lambda**: synchronous invocation, the client waits for a response. The API Gateway integration timeout is 29 seconds by default (long a hard limit; AWS now allows raising it for some REST API types). Lambda timeout can be up to 15 minutes, but the 29s cap is what matters for this integration. **S3 Event + Lambda**: asynchronous. S3 notifies Lambda of a new object; Lambda processes it. On failure: 2 automatic retries. For guaranteed processing: S3 -> SQS -> Lambda (at-least-once delivery via queue).
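Because retries and at-least-once delivery can hand the same S3 event to a handler more than once, the handler should tolerate duplicates. The sketch below deduplicates on bucket, object key, and the `sequencer` field of the S3 event record. The in-memory `_processed` set is only a stand-in: it lives in one warm environment, so a real deployment would need a durable store (for example a DynamoDB table with conditional writes), which is an assumption outside the original text.

```python
_processed = set()  # stand-in for a durable store shared by all function instances


def claim(key):
    """Return True the first time a key is seen, False for duplicates."""
    if key in _processed:
        return False
    _processed.add(key)
    return True


def handler(event, context):
    handled = []
    for record in event["Records"]:
        s3 = record["s3"]
        # bucket + object key + sequencer identifies one specific write event
        dedupe_key = (
            s3["bucket"]["name"],
            s3["object"]["key"],
            s3["object"].get("sequencer"),
        )
        if not claim(dedupe_key):
            continue  # duplicate delivery: skip side effects
        handled.append(s3["object"]["key"])  # real processing would go here
    return handled
```

Calling `handler` twice with the same event processes the object once and returns an empty list the second time.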
A Lambda function processes S3 events. Occasionally the same file gets processed twice. Why does this happen?
Serverless Economics
Lambda pricing has two dimensions: **request count** (1M requests/month free, then 0.20 USD per million) and **duration** in GB-seconds (execution time multiplied by allocated memory). Free tier: 400,000 GB-seconds per month. Example: a 128MB function running for 200ms consumes 0.125 GB x 0.2 s = 0.025 GB-seconds per invocation, so 1M invocations = 25,000 GB-seconds, well within the free tier. Break-even point vs EC2: as a rule of thumb, Lambda is cheaper below ~15% utilization. At constant 24/7 load, EC2 (especially Reserved Instances) wins by 3-10x.
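The pricing arithmetic above can be reproduced with a small calculator. The request price and free-tier figures come from the text; the per-GB-second price (`GB_SECOND_PRICE`) is an assumed x86 list price that does not appear in the text and should be checked against current AWS pricing.

```python
REQUEST_PRICE_PER_MILLION = 0.20   # USD, from the text
GB_SECOND_PRICE = 0.0000166667     # USD, ASSUMED list price; verify before relying on it
FREE_REQUESTS = 1_000_000          # per month, from the text
FREE_GB_SECONDS = 400_000          # per month, from the text


def gb_seconds(memory_mb, duration_ms, invocations):
    """Allocated memory (GB) x execution time (s) x number of invocations."""
    return (memory_mb / 1024) * (duration_ms / 1000) * invocations


def monthly_cost(memory_mb, duration_ms, invocations):
    """Estimated monthly Lambda bill in USD after the free tier."""
    billable_requests = max(0, invocations - FREE_REQUESTS)
    billable_gbs = max(0.0, gb_seconds(memory_mb, duration_ms, invocations) - FREE_GB_SECONDS)
    return (billable_requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
            + billable_gbs * GB_SECOND_PRICE)


if __name__ == "__main__":
    print(round(gb_seconds(128, 200, 1_000_000)))  # 25000
    print(monthly_cost(128, 200, 1_000_000))       # 0.0 (inside the free tier)
```

The model ignores billing granularity and architecture-specific pricing, so it is suited to order-of-magnitude comparisons rather than invoices.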
**Hidden serverless costs**: API Gateway is priced separately - 1.00 USD per 1M requests (HTTP API) or 3.50 USD (REST API). Data transfer: outbound traffic from Lambda is billed the same as EC2 egress - 0.09 USD/GB. Provisioned Concurrency adds 0.015 USD/GB-hour on top. Under high load, these costs can exceed the price of an EC2 instance.
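To see how these line items add up, the sketch below totals them using only the list prices quoted above. The function name and the example workload (100M REST API requests, 500 GB of egress, ten 1 GB provisioned instances for a 730-hour month) are hypothetical.

```python
def hidden_costs(requests_millions, egress_gb, provisioned_gb_hours=0.0, rest_api=False):
    """Monthly USD for the extra line items, using the prices quoted in the text."""
    api_gateway = requests_millions * (3.50 if rest_api else 1.00)
    egress = egress_gb * 0.09
    provisioned = provisioned_gb_hours * 0.015
    return {
        "api_gateway": api_gateway,
        "egress": egress,
        "provisioned_concurrency": provisioned,
        "total": api_gateway + egress + provisioned,
    }


if __name__ == "__main__":
    # 10 instances x 1 GB x 730 hours = 7300 GB-hours of Provisioned Concurrency
    costs = hidden_costs(100, 500, provisioned_gb_hours=10 * 1 * 730, rest_api=True)
    for name, usd in costs.items():
        print(f"{name}: {usd:.2f} USD")
```

In this hypothetical workload the API Gateway charge alone is several times larger than the other two items combined, and none of it shows up in the Lambda line of the bill.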
**Serverless is not always cheaper**: for constant 24/7 load, EC2 Reserved Instances cost 3-10x less. Lambda is optimal for: event-driven workloads (S3 events, webhooks), infrequent tasks (cron jobs), unpredictable traffic spikes, and minimal operational overhead. Always compare total cost: Lambda + API Gateway + data transfer vs EC2 + ALB + EBS.
Common misconception: serverless means no servers exist - code runs in a cloud with no physical hardware
Serverless means no server management, not no servers. AWS Firecracker microVMs run on Amazon's physical servers. Serverless is an operational model, not an execution architecture.
Understanding the physical reality of serverless explains cold start (a container must be created on real hardware), network latency (data physically travels between data centers), and the billing model (billed duration corresponds to time that real hardware is allocated to your code).
A Lambda API (256MB, 100ms avg) handles 500M requests/month. The team is evaluating migration to EC2 (t3.medium, On-Demand, 33 USD/month). Lambda costs roughly 300 USD/month at list prices (about 100 USD for requests plus about 208 USD for 12.5M GB-seconds). Which is preferable?
Key Ideas
- **Cold start** - Init Phase (container + runtime + function init). Heavy runtimes cause long cold starts: 1-4s for the JVM, 3-8s for Python with PyTorch. Mitigations: ONNX instead of PyTorch, Provisioned Concurrency, initializing resources outside the handler
- **Concurrency** - 1 request = 1 instance, auto-scales to 1000 (account limit). Reserved Concurrency guarantees slots for critical functions. Throttling returns 429; callers must implement backoff
- **Triggers** - Push model (API Gateway, S3, SNS) vs Poll model (SQS, Kinesis). Async invocations: 2 retries + DLQ. At-least-once semantics means handlers must be idempotent
- **Economics** - Pay-per-use is cost-effective at < 15% utilization and for spiky traffic. Sustained 24/7 load: EC2 Reserved is 3-10x cheaper. Hidden costs: API Gateway + egress + Provisioned Concurrency
Related Concepts
Serverless intersects with several areas of cloud and distributed systems
- EC2 and Virtual Machines — Lambda vs EC2 - the core trade-off between operational simplicity and cost efficiency at scale
- Kubernetes and Containers — Knative/EKS on Fargate - serverless Kubernetes as a bridge between Lambda and EC2
- Queues and Streaming — SQS in front of Lambda - buffering burst traffic and controlling concurrency
Questions for Reflection
- A startup is building a real-time voice assistant: microphone -> API -> ML model (Whisper + LLM) -> response. The team proposes Lambda for the entire chain. Cold start for Python + Whisper is 6-8 seconds. How should the architecture handle the latency problem while keeping serverless benefits where they make sense?