AI Engineering
Model Serving: Deploying Models to Production - TGI, vLLM, Triton, SageMaker
Lesson Objectives
- Choose an inference framework for a specific scenario: Ollama / TGI / vLLM / Triton
- Deploy LLMs via TGI and vLLM in Docker with production-ready configuration
- Understand continuous batching and PagedAttention - key optimizations for high throughput
- Configure autoscaling for GPU workloads in Kubernetes, accounting for the roughly 2-minute cold start
- Monitor inference servers: TTFT, TPS, KV-cache utilization, GPU health
vLLM delivers up to 24x the throughput of naive inference through PagedAttention. The gap resembles blocking I/O vs async, applied to GPU memory. A single Ollama container collapses at around 50 concurrent requests, while vLLM sustains 1,400 tok/s on the same A100 at 128 concurrent users. Autoscaling on spot instances saves up to 70% on GPU costs. All of this requires no model changes and no fine-tuning, just the right serving infrastructure.
- HuggingFace TGI serves ChatUI for 30K+ concurrent users on an H100 cluster - continuous batching at the core
- Perplexity AI: 100M+ requests/month on vLLM, PagedAttention cut their GPU fleet by 60% at the same throughput
- Kwon et al. 2023 (UC Berkeley) - original vLLM/PagedAttention paper; the vLLM team reports up to 24x throughput vs a naive baseline (HuggingFace Transformers)
- One unoptimized inference server costs 10,000 dollars/month; an optimized one - 2,000 at the same throughput
- P99 TTFT <200ms is a hard requirement in fintech and healthcare - every second of latency costs real money
PagedAttention: how Berkeley students changed inference
June 2023. Woosuk Kwon, Zhuohan Li, and colleagues at UC Berkeley release vLLM, later described in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023). The idea is simple to the point of genius: manage the KV-cache as virtual memory pages, the way operating systems have managed RAM since the 1960s. Before this, serving systems pre-reserved a contiguous memory block sized for the maximum generation length for every request, regardless of how many tokens it actually produced. The vLLM authors report that 60-80% of KV-cache memory was wasted this way. PagedAttention allocates pages on demand and frees them the moment a request finishes. Result: up to 24x throughput over naive inference on the same GPU. vLLM became the de-facto production inference standard within months.
Prerequisites
Serving Landscape: TGI, vLLM, Triton, Managed Services
Running a model on a laptop via Ollama takes five minutes. Serving 10,000 concurrent users with a 99.9% SLA and p99 latency under two seconds is a completely different discipline. The gap between these two worlds is called **model serving** - the infrastructure layer between model weights and production traffic.
A naive deployment (one Docker container with Ollama) collapses around 50 concurrent requests. Not because the model is bad. Because Ollama processes requests **sequentially** - no continuous batching, no PagedAttention, no KV-cache management. It's like blocking I/O instead of async: one request occupies the GPU, the rest queue up and wait.
**Key requirements for production serving:**
- **High throughput** - hundreds to thousands of requests per second on the minimum number of GPUs
- **Low latency** - time to first token <200ms, end-to-end <2-5s for a typical response
- **Horizontal scaling** - adding GPU nodes as load grows
- **Health checks & graceful shutdown** - zero-downtime deploys
- **Monitoring** - token throughput, queue depth, GPU utilization, error rate
- **Model versioning** - A/B testing, canary deploys, rollback
**Major inference frameworks:**
| Framework | Company | Key Feature | Best For | Complexity |
|---|---|---|---|---|
| vLLM | UC Berkeley → community | PagedAttention, continuous batching | Pure LLM serving, max throughput | Medium |
| TGI (Text Generation Inference) | HuggingFace | Production-ready out of the box, metrics | HF ecosystem, Docker deploy | Low |
| Triton Inference Server | NVIDIA | Multi-model, multi-framework | Heterogeneous workloads (LLM + CV + audio) | High |
| TensorRT-LLM | NVIDIA | Max performance on NVIDIA GPUs | When every ms counts | High |
| Ollama | Community | Simplicity | Dev, demo, small production | Very low |
| SageMaker Endpoints | AWS | Managed, auto-scaling | AWS-first companies | Low (but expensive) |
| Ray Serve | Anyscale | Distributed computing | Complex pipelines, multi-model |
TGI: Production Deployment with Docker
**Text Generation Inference (TGI)** from HuggingFace is the most "batteries-included" solution on the market. One Docker container delivers: Prometheus metrics, health endpoints, OpenAI-compatible API, flash attention, quantization (AWQ, GPTQ), tensor parallelism across multiple GPUs. No boilerplate - everything out of the box.
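The launch command looks long, but every flag matters. `--max-concurrent-requests 128` caps the number of requests the server accepts at once: beyond that limit TGI rejects new requests instead of letting them wait indefinitely, so clients need retry logic. `--quantize awq` loads 4-bit AWQ weights, which shrinks weight memory to roughly a quarter of FP16 with little quality loss on most tasks. `--num-shard 2` enables tensor parallelism for 70B models that don't fit on a single GPU.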
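As a rough sketch of how these flags fit together, the Python snippet below assembles a `docker run` invocation for TGI and prints it rather than executing it. The image tag, model ID, port mapping, and volume path are assumptions for illustration; only the three flags discussed above come from the text.

```python
import shlex

# Hypothetical values: pin a specific image tag and model for a real deployment.
IMAGE = "ghcr.io/huggingface/text-generation-inference:latest"
MODEL_ID = "your-org/your-70b-model-awq"  # placeholder, not a real repository


def build_tgi_command(model_id: str, num_shard: int = 2,
                      max_concurrent: int = 128, quantize: str = "awq") -> list[str]:
    """Return the argv list for launching TGI in Docker."""
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", "8080:80",                    # host port 8080 -> container port 80
        "-v", "/data/models:/data",         # keep downloaded weights on the host
        IMAGE,
        "--model-id", model_id,
        "--num-shard", str(num_shard),      # tensor parallelism across GPUs
        "--quantize", quantize,             # load AWQ-quantized weights
        "--max-concurrent-requests", str(max_concurrent),
    ]


command = build_tgi_command(MODEL_ID)
print(shlex.join(command))
```

Building the command as a list keeps each flag next to its value, which makes it easier to review and to vary per environment.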
**Docker Compose for production with monitoring:**
**Integration with a Node.js backend** - TGI provides an OpenAI-compatible endpoint:
**TGI vs raw vLLM Docker:** TGI has built-in Prometheus metrics, structured logging, watermarking support, and better documentation. vLLM has slightly higher raw throughput. For most production deployments the difference is not critical - both work great.
In the docker-compose.yml for TGI, the healthcheck has start_period: 120s. Why?
vLLM: Continuous Batching and Throughput Optimization
vLLM grew out of the UC Berkeley work on PagedAttention (Kwon et al., 2023, "Efficient Memory Management for Large Language Model Serving with PagedAttention"). The core idea: store the KV-cache as virtual memory pages, the way an OS manages RAM. The result is up to 24x higher throughput than naive inference on the same hardware.
Think of it as the difference between blocking I/O and async. Only for GPU memory.
**PagedAttention in detail:**
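The bookkeeping behind on-demand paging can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class and method names are hypothetical, the block size of 16 tokens and the 2048-token static reservation are assumptions, and real systems store tensors per block rather than just counting slots.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # pool of physical blocks
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> blocks
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Store one more token, allocating a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:    # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(100):                  # a request that ends up generating 100 tokens
    cache.append_token("req-A")

paged_reserved = len(cache.block_tables["req-A"]) * cache.block_size
static_reserved = 2048                # assumed max length reserved up front
print(f"paged: {paged_reserved} token slots, static: {static_reserved}")
cache.free("req-A")                   # blocks are reusable immediately
```

With paging, the 100-token request holds 7 blocks (112 slots), and the waste is bounded by one partially filled block per sequence.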
The second innovation is **continuous batching**. Classic static batching waits for an entire batch to finish before accepting new requests. Continuous batching adds new requests the moment a slot frees up. The GPU never idles.
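The scheduling difference can be shown with a small simulation that counts forward passes (decode steps). It is a sketch under simplifying assumptions: every request needs a fixed number of steps, prefill is ignored, and the workload and slot count are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """A waiting request takes over a slot as soon as one frees up."""
    queue = deque(lengths)
    running: list[int] = []     # remaining decode steps per active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1              # one forward pass advances every active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: output lengths in tokens, 4 slots on the GPU.
lengths = [100, 10, 10, 10] * 3
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)  # 300 120
```

In the static case each batch is held hostage by its one long request; in the continuous case the short requests cycle through the free slots while the long ones keep running.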
**Production deployment of vLLM** with optimal parameters:
**Throughput benchmark** (Llama 3.1 8B, A100 80GB):
| Concurrent requests | Ollama (tok/s total) | vLLM (tok/s total) | TGI (tok/s total) |
|---|---|---|---|
| 1 | 55 | 50 | 48 |
| 4 | 55 | 180 | 165 |
| 16 | 55 | 520 | 470 |
| 32 | 52 | 860 | 750 |
| 64 | 48 (queue!) | 1200 | 1050 |
| 128 | timeout | 1400 | 1200 |
**Prefix caching** is an optimization for repeating system prompts. If all requests start the same way (system prompt, few-shot examples), vLLM computes the KV-cache for that prefix once and reuses it:
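The reuse logic can be sketched as a lookup keyed by the prompt prefix, one entry per full block. This is a toy model, not vLLM's implementation: the `PrefixCache` class is hypothetical, token ids are plain integers, and storing whole prefixes as keys is chosen for readability rather than efficiency.

```python
class PrefixCache:
    """Toy model: remembers which block-aligned prefixes already have a KV-cache."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.cached: set[tuple[int, ...]] = set()

    def tokens_to_compute(self, tokens: list[int]) -> int:
        """Return how many prompt tokens need a fresh prefill pass."""
        reused = 0
        full = len(tokens) - len(tokens) % self.block_size  # only full blocks are cached
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.cached:
                reused = end            # KV-cache for tokens[:end] already exists
            else:
                self.cached.add(key)    # computed now, reusable by later requests
        return len(tokens) - reused


cache = PrefixCache(block_size=16)
system_prompt = list(range(64))         # 64-token shared system prompt
first = cache.tokens_to_compute(system_prompt + [1000, 1001, 1002])
second = cache.tokens_to_compute(system_prompt + [2000, 2001])
print(first, second)  # 67 2
```

The first request pays for the full prompt; the second only prefills its two user-specific tokens.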
**Ollama throughput doesn't scale with concurrency** - it lacks continuous batching. At 64 concurrent requests, Ollama processes them sequentially (~48 tok/s total), while vLLM processes them in parallel (~1200 tok/s total). For production with >10 concurrent users - vLLM or TGI is mandatory.
**Misconception:** one GPU handles one LLM request at a time, and the rest just wait in line.
**Reality:** continuous batching lets a GPU handle dozens of requests simultaneously: a new request takes a freed slot immediately, without waiting for an entire batch to finish.
The CPU-world intuition (one thread = one request) doesn't apply to GPU inference. GPUs are massively parallel: an A100 has 108 streaming multiprocessors, and a single forward pass can process many sequences at once. Continuous batching leverages this - as soon as request A finishes, its KV-cache slot is freed and request D jumps in immediately. No waiting. At 64 concurrent users, Ollama (no CB) delivers 48 tok/s. vLLM (with CB) delivers 1,200 tok/s. A 25x gap on identical hardware.
GPU Autoscaling: Kubernetes, Spot Instances, Cold Start
LLM inference has a characteristic load pattern: **spiky traffic**, with peak hours hitting 5-10x above average. Provisioning GPUs for peak load around the clock means paying for idle hardware most of the day. Autoscaling covers the gap - companies running A100 clusters save anywhere from 3,000 to 50,000 dollars a month depending on scale.
**The GPU autoscaling challenge:** unlike CPU workloads, LLM pods start **slowly** - 1-3 minutes to load a model into GPU. Standard HPA reacts in ~60 seconds, then the pod needs another 120 to come up. The traffic spike has already passed, users got timeouts, the pod just became ready.
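A back-of-the-envelope calculation shows why the cold start hurts. The sketch below uses the 60-second HPA reaction and 120-second pod startup from the paragraph above; the request rates and per-pod capacity are hypothetical numbers chosen for illustration.

```python
import math


def backlog_during_cold_start(spike_rps: float, capacity_rps: float,
                              hpa_reaction_s: float = 60.0,
                              pod_startup_s: float = 120.0) -> float:
    """Requests that pile up before the new pod starts serving traffic."""
    gap_s = hpa_reaction_s + pod_startup_s
    excess_rps = max(0.0, spike_rps - capacity_rps)
    return excess_rps * gap_s


def replicas_for(rps: float, per_pod_rps: float) -> int:
    """Pods needed to serve a given request rate."""
    return math.ceil(rps / per_pod_rps)


# Hypothetical: one pod serves 10 req/s, the spike brings 25 req/s.
backlog = backlog_during_cold_start(spike_rps=25, capacity_rps=10)
needed = replicas_for(rps=25, per_pod_rps=10)
print(f"{backlog:.0f} requests queued or dropped, {needed} pods needed")
```

With these numbers, 2,700 requests arrive before extra capacity exists, which is why warm standby replicas or scheduled pre-scaling matter more for GPU pods than for CPU services.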
**Kubernetes deployment with HPA:**
**Spot/preemptible instances** - 60-70% savings on GPUs:
| GPU | On-Demand ($/hr) | Spot ($/hr) | Savings |
|---|---|---|---|
| NVIDIA L4 | USD 0.70 | USD 0.24 | 66% |
| NVIDIA A100 40GB | USD 3.67 | USD 1.10 | 70% |
| NVIDIA A100 80GB | USD 4.38 | USD 1.31 | 70% |
| NVIDIA H100 | USD 8.50 | USD 2.55 | 70% |
**Spot instances can be reclaimed** with as little as 30 seconds' notice (the exact window depends on the cloud provider). In-flight requests are interrupted. Solution: a graceful shutdown handler (stop accepting new requests, finish the current ones), retry middleware on the client, and fallback to a cloud API.
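The graceful shutdown part can be sketched as a small in-flight tracker. This is a minimal illustration with hypothetical names; wiring `drain()` to SIGTERM or to the provider's preemption notice, and the HTTP layer itself, are left out.

```python
import threading


class DrainableServer:
    """Tracks in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()                      # no requests yet, so we are idle

    def try_begin(self) -> bool:
        """Call when a request arrives; False means reject it (client retries elsewhere)."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self) -> None:
        """Call when a request finishes."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout_s: float) -> bool:
        """Stop accepting work; return True if in-flight requests finished in time."""
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout_s)


server = DrainableServer()
accepted = server.try_begin()                 # one request is in flight
threading.Timer(0.05, server.end).start()     # it finishes 50 ms later
drained = server.drain(timeout_s=1.0)         # preemption notice arrives
accepted_after = server.try_begin()           # new requests are now refused
print(accepted, drained, accepted_after)
```

The drain timeout should stay below the reclaim notice so the process exits cleanly before the instance disappears.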
An LLM serving pod on Kubernetes takes 2 minutes to start (model loading). A traffic spike hits in 30 seconds. What strategy is best?
Monitoring Inference Servers: Metrics, Alerts, Debugging
An LLM inference server is not a regular HTTP service. A GPU costs 1-10 dollars an hour. If GPU utilization sits at 30% instead of 80%, that's not inefficiency - it's direct money loss. One poorly optimized A100-based inference server runs 10,000 dollars a month. An optimized one at the same throughput costs 2,000.
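The link between throughput and cost can be made concrete with a short calculation. The sketch below takes the A100 80GB on-demand price and the aggregate throughput figures from the tables earlier in this lesson; it assumes the GPU is kept fully busy, which real traffic rarely achieves.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """GPU cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


A100_80GB_USD_PER_HOUR = 4.38   # on-demand price from the spot pricing table

vllm_cost = cost_per_million_tokens(A100_80GB_USD_PER_HOUR, tokens_per_second=1400)
sequential_cost = cost_per_million_tokens(A100_80GB_USD_PER_HOUR, tokens_per_second=55)
print(f"vLLM at 1400 tok/s: ${vllm_cost:.2f} per 1M tokens")
print(f"sequential at 55 tok/s: ${sequential_cost:.2f} per 1M tokens")
```

The hardware bill is identical in both cases; only the number of tokens it buys changes.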
Standard metrics (CPU, memory, error rate) tell almost nothing useful about inference health. What matters is **tokens/s**, **KV-cache utilization**, **TTFT**, **avg batch size**. These are the signals that reveal the real system state - seconds before an OOM or throughput collapse.
**Key metrics:**
| Metric | What It Measures | Healthy Range | Alert Threshold |
|---|---|---|---|
| TTFT (Time to First Token) | Delay before the first token | <200ms | >500ms (p95) |
| TPS (Tokens per Second) | Generation speed | 50-150 tok/s per GPU | <30 tok/s |
| Request queue depth | Queue of waiting requests | 0-10 | >50 (scale up!) |
| GPU utilization | GPU compute load | 60-90% | <30% (over-provisioned) or >95% (throttling) |
| GPU memory used | VRAM usage | 80-90% | >95% (OOM risk) |
| KV-cache utilization | KV-cache fill level | 50-80% | >90% (preemption risk) |
| Request error rate | Percentage of errors | <0.1% | >1% |
| Avg batch size | Average batch size | depends on load | Always 1 = no batching (problem) |
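The alert thresholds from the table can be expressed as a simple rule check. This is a sketch with hypothetical field names; how the values are collected (for example from the server's Prometheus endpoint) is outside its scope, and the batch-size rule checks a single snapshot rather than "always 1" over time.

```python
from dataclasses import dataclass


@dataclass
class InferenceSnapshot:
    ttft_p95_ms: float
    tokens_per_s: float
    queue_depth: int
    gpu_util_pct: float
    gpu_mem_pct: float
    kv_cache_pct: float
    error_rate_pct: float
    avg_batch_size: float


def alerts(s: InferenceSnapshot) -> list[str]:
    """Apply the alert thresholds from the metrics table to one snapshot."""
    checks = [
        (s.ttft_p95_ms > 500, "TTFT p95 above 500 ms"),
        (s.tokens_per_s < 30, "TPS below 30 tok/s"),
        (s.queue_depth > 50, "queue depth above 50: scale up"),
        (s.gpu_util_pct < 30, "GPU utilization below 30%: over-provisioned"),
        (s.gpu_util_pct > 95, "GPU utilization above 95%: throttling risk"),
        (s.gpu_mem_pct > 95, "GPU memory above 95%: OOM risk"),
        (s.kv_cache_pct > 90, "KV-cache above 90%: preemption risk"),
        (s.error_rate_pct > 1, "error rate above 1%"),
        (s.avg_batch_size <= 1, "average batch size is 1: batching is not happening"),
    ]
    return [message for fired, message in checks if fired]


snapshot = InferenceSnapshot(ttft_p95_ms=620, tokens_per_s=85, queue_depth=12,
                             gpu_util_pct=88, gpu_mem_pct=91, kv_cache_pct=93,
                             error_rate_pct=0.05, avg_batch_size=14)
fired = alerts(snapshot)
print(fired)
```

Here the rising TTFT and the nearly full KV-cache fire together, which points at memory pressure rather than a compute bottleneck.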
**Custom monitoring middleware** for a Node.js backend:
**Typical failure modes** and their diagnostics:
| Symptom | Cause | Diagnostics | Solution |
|---|---|---|---|
| TTFT increased 10x | GPU throttling (overheating >83°C) | `nvidia-smi -q` | Improve cooling, reduce batch size |
| All requests timeout | OOM - model crashed | `dmesg \| grep oom` | Reduce max-model-len or max-num-seqs |
| Throughput dropped 50% |
Key Takeaways
- Serving landscape: Ollama (dev) → TGI (simple production) → vLLM (max throughput) → Triton (multi-model)
- TGI: Docker one-liner, Prometheus metrics, health endpoints - batteries-included production deploy
- vLLM (Kwon et al. 2023): PagedAttention (95% memory utilization) + continuous batching = up to 24x throughput vs naive
- Prefix caching: shared system prompt computed once → massive savings with repeating prompts
- GPU autoscaling: predictive (cron) + warm standby + queue buffer - covering cold start (2 min pod startup)
- Spot instances: 60-70% savings, but require graceful shutdown, retry middleware, and fallback to cloud API
- Monitoring: TTFT (p95 <500ms), TPS (>30/GPU), queue depth (<50), KV-cache (<90%)
Questions for Reflection
- PagedAttention solves VRAM fragmentation using virtual memory. What other OS patterns could apply to LLM inference - for example, swap, huge pages, or memory-mapped files?
- Throughput vs latency tradeoff: larger batch size increases throughput but TTFT grows. How to pick the optimal max-num-seqs for a service with a p99 latency SLA of <500ms under uneven load?
- Speculative decoding (a draft model generates tokens, the main model verifies) theoretically cuts latency without quality loss. Why isn't it the default in vLLM/TGI?
What's Next
Model serving is the final element of ML infrastructure. Next up - system design and specialized applications.
- AI System Design — Model serving as part of an end-to-end AI system: routing, caching, serving, monitoring
- Cost Management — Optimizing GPU costs through spot instances, quantization, model routing
- Observability — Inference monitoring as part of the overall observability strategy for AI systems
Related Lessons
- aie-39-local-models — Serving scales local inference to production
- aie-42-ai-system-design — Serving is a core AI system component
- aie-29-cost-management — Batching and GPU use govern serving cost
- aie-35-observability — Monitor latency and throughput in serving
- ml-46-model-serving — Same serving patterns for ML models
- sd-03-scalability