AI Engineering

Model Serving: Deploying Models to Production - TGI, vLLM, Triton, SageMaker

Lesson Objectives

Choose an inference framework for a specific scenario: Ollama / TGI / vLLM / Triton
Deploy LLMs via TGI and vLLM in Docker with production-ready configuration
Understand continuous batching and PagedAttention - key optimizations for high throughput
Configure autoscaling for GPU workloads in Kubernetes accounting for the cold start (2 min) problem
Monitor inference servers: TTFT, TPS, KV-cache utilization, GPU health

vLLM delivers up to 24x the throughput of naive inference through PagedAttention. The gap resembles blocking I/O versus async, applied to GPU memory. In this lesson's benchmark, one Ollama container collapses at around 50 concurrent requests, while vLLM sustains 1,400 tok/s on the same A100 at 128 concurrent requests. Autoscaling on spot instances saves up to 70% on GPU costs. All of this comes with no model changes and no fine-tuning, just the right serving infrastructure.

HuggingFace TGI serves ChatUI for 30K+ concurrent users on an H100 cluster - continuous batching at the core
Perplexity AI: 100M+ requests/month on vLLM, PagedAttention cut their GPU fleet by 60% at the same throughput
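Kwon et al. 2023 (UC Berkeley) - original vLLM/PagedAttention paper; the vLLM team reported up to 24x throughput vs naive HuggingFace Transformers inference, and the paper reports 2-4x vs FasterTransformer and Orca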
One unoptimized inference server costs 10,000 dollars/month; an optimized one - 2,000 at the same throughput
P99 TTFT <200ms is a hard requirement in fintech and healthcare - every second of latency costs real money

PagedAttention: how Berkeley students changed inference

June 2023. Woosuk Kwon, Zhuohan Li, and colleagues at UC Berkeley release vLLM; their paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" follows later that year at SOSP 2023. The idea is simple to the point of genius: manage KV-cache as virtual memory pages - exactly the way an OS has managed RAM since the 1960s. Before this, every request pre-reserved a contiguous memory block sized for the maximum generation length, regardless of how many tokens it actually produced. By the authors' measurements, 60-80% of KV-cache memory was wasted. PagedAttention allocates pages on demand and frees them the moment a request finishes. Result: up to 24x the throughput of naive HuggingFace Transformers inference on a single GPU. vLLM shipped as the open-source implementation and became the de-facto production inference standard within months.

Prerequisites

Local LLM: Ollama, llama.cpp, vLLM - Running Models on Self-Hosted Hardware

Serving Landscape: TGI, vLLM, Triton, Managed Services

Running a model on a laptop via Ollama takes five minutes. Serving 10,000 concurrent users with a 99.9% SLA and p99 latency under two seconds is a completely different discipline. The gap between these two worlds is called **model serving** - the infrastructure layer between model weights and production traffic.

A naive deployment (one Docker container with Ollama) collapses around 50 concurrent requests. Not because the model is bad. Because Ollama processes requests **sequentially** - no continuous batching, no PagedAttention, no KV-cache management. It's like blocking I/O instead of async: one request occupies the GPU, the rest queue up and wait.

**Key requirements for production serving:**

**High throughput** - hundreds to thousands of requests per second on the minimum number of GPUs
**Low latency** - time to first token <200ms, end-to-end <2-5s for a typical response
**Horizontal scaling** - adding GPU nodes as load grows
**Health checks & graceful shutdown** - zero-downtime deploys
**Monitoring** - token throughput, queue depth, GPU utilization, error rate
**Model versioning** - A/B testing, canary deploys, rollback

**Major inference frameworks:**

Framework	Company	Key Feature	Best For	Complexity
vLLM	UC Berkeley → community	PagedAttention, continuous batching	Pure LLM serving, max throughput	Medium
TGI (Text Generation Inference)	HuggingFace	Production-ready out of the box, metrics	HF ecosystem, Docker deploy	Low
Triton Inference Server	NVIDIA	Multi-model, multi-framework	Heterogeneous workloads (LLM + CV + audio)	High
TensorRT-LLM	NVIDIA	Max performance on NVIDIA GPUs	When every ms counts	High
Ollama	Community	Simplicity	Dev, demo, small production	Very low
SageMaker Endpoints	AWS	Managed, auto-scaling	AWS-first companies	Low (but expensive)
Ray Serve	Anyscale	Distributed computing	Complex pipelines, multi-model	Medium

**vLLM vs TGI** - the two main candidates for most projects. vLLM has slightly better throughput (especially at high concurrency), while TGI is simpler to set up and has better built-in metrics. In practice the difference is ~10-20% - the choice is more often determined by team familiarity and existing infrastructure.

A company is deploying a single LLM for a chatbot with 200 concurrent users. Infrastructure is Kubernetes on AWS. What to choose?

TGI: Production Deployment with Docker

**Text Generation Inference (TGI)** from HuggingFace is the most "batteries-included" solution on the market. One Docker container delivers: Prometheus metrics, health endpoints, OpenAI-compatible API, flash attention, quantization (AWQ, GPTQ), tensor parallelism across multiple GPUs. No boilerplate - everything out of the box.

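The launch command looks long, but every flag matters. `--max-concurrent-requests 128` caps the number of requests TGI accepts at once; beyond that it refuses new requests instead of letting them wait indefinitely, which gives clients backpressure. `--quantize awq` loads 4-bit AWQ weights, roughly a quarter the size of FP16 weights, usually with little quality loss. `--num-shard 2` enables tensor parallelism for 70B models that don't fit on a single GPU.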
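The sizing logic behind `--quantize` and `--num-shard` can be checked with back-of-the-envelope arithmetic. The sketch below is an illustration, not part of TGI: the function names and the 90% usable-VRAM fraction are assumptions, and it counts weights only (KV-cache and activations need additional room).

```python
import math


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def min_shards(params_billion, bits_per_weight, gpu_vram_gb, usable_fraction=0.9):
    """Smallest GPU count whose combined usable VRAM holds the weights.

    usable_fraction is an assumed safety margin, not a TGI setting.
    """
    usable_gb = gpu_vram_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, bits_per_weight) / usable_gb)


fp16_gb = weight_memory_gb(70, 16)       # 140.0 GB for a 70B model in FP16
awq_gb = weight_memory_gb(70, 4)         # 35.0 GB with 4-bit weights
shards_fp16 = min_shards(70, 16, 80)     # 2 GPUs with 80 GB each
shards_awq = min_shards(70, 4, 80)       # 1 GPU with 80 GB
```

Under these assumptions a 70B model needs about 140 GB in FP16, which is why it has to be split across at least two 80 GB GPUs, while 4-bit weights (about 35 GB) fit on one. Real quantized checkpoints carry some extra metadata, so treat the numbers as lower bounds.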

**Docker Compose for production with monitoring:**

**Integration with a Node.js backend** - TGI provides an OpenAI-compatible endpoint:

**TGI vs raw vLLM Docker:** TGI has built-in Prometheus metrics, structured logging, watermarking support, and better documentation. vLLM has slightly higher raw throughput. For most production deployments the difference is not critical - both work great.

In the docker-compose.yml for TGI, the healthcheck has start_period: 120s. Why?

vLLM: Continuous Batching and Throughput Optimization

vLLM came out of UC Berkeley in June 2023; the accompanying paper by Woosuk Kwon and colleagues, "Efficient Memory Management for Large Language Model Serving with PagedAttention", followed at SOSP 2023. The core idea: store KV-cache as virtual memory pages - exactly how an OS manages RAM. The vLLM team reported up to 24x higher throughput than naive HuggingFace Transformers inference on the same hardware.

Think of it as the difference between blocking I/O and async. Only for GPU memory.

**PagedAttention in detail:**
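A minimal sketch of the bookkeeping, assuming a block size of 16 tokens. `BlockAllocator` and `Sequence` are hypothetical names for illustration, not vLLM's internal classes; a real implementation also handles block sharing and preemption.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for this sketch)


class BlockAllocator:
    """Hands out physical KV-cache blocks from a fixed pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


class Sequence:
    """One request; its block table maps logical blocks to physical ones."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new block is requested only when the previous one is full,
        # so memory grows with the actual length, not the maximum length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self):
        # Blocks return to the pool the moment the request completes.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
blocks_used = len(seq.block_table)  # 3 blocks cover 40 tokens
```

The only memory wasted per request is the unused tail of its last block (here 8 of 48 token slots), instead of a reservation sized for the longest possible output.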

The second innovation is **continuous batching**. Classic static batching waits for an entire batch to finish before accepting new requests. Continuous batching adds new requests the moment a slot frees up. The GPU never idles.
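The difference can be made concrete with a toy simulation. It assumes every running request produces one token per step, that a step costs the same regardless of batch size, and that prefill is free; the function names are illustrative only.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """A waiting request takes a slot on the step after one frees up."""
    waiting = list(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill every free slot before the next decoding step.
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))
        steps += 1
        # Requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)          # 200
continuous_steps = continuous_batching_steps(lengths, slots=4)  # 110
```

With static batching the three short requests in each batch leave their slots idle for 90 steps; with continuous batching those slots are reused immediately, so the same work finishes in 110 steps instead of 200.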

**Production deployment of vLLM** with optimal parameters:

**Throughput benchmark** (Llama 3.1 8B, A100 80GB):

Concurrent requests	Ollama (tok/s total)	vLLM (tok/s total)	TGI (tok/s total)
1	55	50	48
4	55	180	165
16	55	520	470
32	52	860	750
64	48 (queue!)	1200	1050
128	timeout	1400	1200
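Two ratios are worth deriving from the table: the aggregate speedup over Ollama and the per-request generation speed. The snippet below only restates the table's numbers; the helper names are illustrative.

```python
# concurrency -> (Ollama, vLLM, TGI) total tok/s; None marks the timeout row.
BENCHMARK = {
    1: (55, 50, 48),
    4: (55, 180, 165),
    16: (55, 520, 470),
    32: (52, 860, 750),
    64: (48, 1200, 1050),
    128: (None, 1400, 1200),
}
OLLAMA, VLLM, TGI = 0, 1, 2


def speedup_vs_ollama(concurrency, column=VLLM):
    """Aggregate throughput relative to Ollama, or None if Ollama timed out."""
    ollama = BENCHMARK[concurrency][OLLAMA]
    if ollama is None:
        return None
    return BENCHMARK[concurrency][column] / ollama


def per_request_tps(concurrency, column=VLLM):
    """Average generation speed seen by a single request."""
    return BENCHMARK[concurrency][column] / concurrency


gap_at_64 = speedup_vs_ollama(64)             # 25.0
single_user = per_request_tps(1)              # 50.0
at_128_users = round(per_request_tps(128), 1)  # 10.9
```

Aggregate throughput grows with concurrency while each individual request gets slower (about 50 tok/s at concurrency 1 versus about 11 tok/s at 128 for vLLM), which is the throughput-versus-latency tradeoff in numbers.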

**Prefix caching** is an optimization for repeating system prompts. If all requests start the same way (system prompt, few-shot examples), vLLM computes the KV-cache for that prefix once and reuses it:
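A toy model of the effect, counting how many tokens actually go through prefill. `PrefixCache` is a hypothetical class for illustration; vLLM's real cache works on hashed KV blocks and can evict entries under memory pressure, which this sketch ignores. The 20-token user message is an assumed value.

```python
class PrefixCache:
    """Maps a token prefix to a placeholder for its computed KV state."""

    def __init__(self):
        self._store = {}
        self.prefill_tokens = 0  # tokens actually run through prefill

    def prefill(self, prefix, suffix):
        key = tuple(prefix)
        if key not in self._store:
            # First request with this prefix pays for it once.
            self.prefill_tokens += len(prefix)
            self._store[key] = "kv-state"
        # The request-specific part is always computed.
        self.prefill_tokens += len(suffix)


system_prompt = list(range(500))  # stand-in for 500 system-prompt tokens
user_message = list(range(20))    # assumed 20-token user message
requests = 1000

cache = PrefixCache()
for _ in range(requests):
    cache.prefill(system_prompt, user_message)

with_cache = cache.prefill_tokens                                   # 20500
without_cache = requests * (len(system_prompt) + len(user_message))  # 520000
```

Under these assumptions prefill work drops from 520,000 to 20,500 tokens, because the 500-token prefix is computed once rather than 1,000 times.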

**Ollama throughput doesn't scale with concurrency** - it lacks continuous batching. At 64 concurrent requests, Ollama processes them sequentially (~48 tok/s total), while vLLM processes them in parallel (~1200 tok/s total). For production with >10 concurrent users - vLLM or TGI is mandatory.

Misconception: one GPU handles one LLM request at a time - the rest just wait in line

Reality: continuous batching lets a GPU handle dozens of requests simultaneously: a new request takes a freed slot immediately, without waiting for an entire batch to finish

The CPU-world intuition (one thread = one request) doesn't apply to GPU inference. GPUs are massively parallel: an A100 has 108 streaming multiprocessors, and a single forward pass can process many sequences at once. Continuous batching leverages this - as soon as request A finishes, its KV-cache slot is freed and request D jumps in immediately. No waiting. At 64 concurrent users, Ollama (no CB) delivers 48 tok/s. vLLM (with CB) delivers 1,200 tok/s. A 25x gap on identical hardware.

vLLM with --enable-prefix-caching serves a chatbot. All 1,000 requests per minute have the same system prompt (500 tokens). How many times does vLLM process that system prompt?

GPU Autoscaling: Kubernetes, Spot Instances, Cold Start

LLM inference has a characteristic load pattern: **spiky traffic** - peak hours hitting 5-10x above average. Provisioning GPUs for peak load around the clock is money left on the table. Autoscaling covers the gap - companies running A100 clusters save anywhere from 3,000 to 50,000 dollars a month depending on scale.

**The GPU autoscaling challenge:** unlike CPU workloads, LLM pods start **slowly** - 1-3 minutes to load a model into GPU. Standard HPA reacts in ~60 seconds, then the pod needs another 120 to come up. The traffic spike has already passed, users got timeouts, the pod just became ready.
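The timing problem can be put in numbers. The replica formula below is the one documented for the Kubernetes HPA; the queue-depth metric, the request rates, and the helper names are assumptions chosen for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def seconds_until_capacity(reaction_s=60, startup_s=120):
    """Delay between the spike and the first new pod serving traffic."""
    return reaction_s + startup_s


def backlog_during_gap(arrival_rps, capacity_rps, gap_s):
    """Requests that pile up while the new pods are still loading."""
    return max(0, arrival_rps - capacity_rps) * gap_s


# Assumed scenario: 2 pods, queue depth 40 per pod against a target of 10.
replicas = desired_replicas(2, current_metric=40, target_metric=10)  # 8
gap = seconds_until_capacity()                                       # 180
# Assumed rates: 50 req/s arriving, 20 req/s served by the existing pods.
backlog = backlog_during_gap(arrival_rps=50, capacity_rps=20, gap_s=gap)  # 5400
```

The HPA picks the right replica count, but with these numbers 5,400 requests queue up or time out before the new pods are ready, which is why reactive scaling alone is not enough for GPU workloads.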

**Kubernetes deployment with HPA:**

**Spot/preemptible instances** - 60-70% savings on GPUs:

GPU	On-Demand ($/hr)	Spot ($/hr)	Savings
NVIDIA L4	USD 0.70	USD 0.24	66%
NVIDIA A100 40GB	USD 3.67	USD 1.10	70%
NVIDIA A100 80GB	USD 4.38	USD 1.31	70%
NVIDIA H100	USD 8.50	USD 2.55	70%
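The savings column and a monthly fleet cost follow directly from the hourly prices. The sketch below uses the table's prices; the 730-hour month, the fleet sizes, and the function names are assumptions for illustration.

```python
# GPU -> (on-demand $/hr, spot $/hr), copied from the table above.
PRICES = {
    "NVIDIA L4": (0.70, 0.24),
    "NVIDIA A100 40GB": (3.67, 1.10),
    "NVIDIA A100 80GB": (4.38, 1.31),
    "NVIDIA H100": (8.50, 2.55),
}
HOURS_PER_MONTH = 730  # assumed average month


def savings_pct(gpu):
    """Spot discount relative to on-demand, rounded to a whole percent."""
    on_demand, spot = PRICES[gpu]
    return round((1 - spot / on_demand) * 100)


def monthly_cost(gpu, on_demand_gpus, spot_gpus):
    """Cost of a mixed fleet running around the clock."""
    on_demand, spot = PRICES[gpu]
    hourly = on_demand_gpus * on_demand + spot_gpus * spot
    return hourly * HOURS_PER_MONTH


all_on_demand = monthly_cost("NVIDIA A100 80GB", on_demand_gpus=8, spot_gpus=0)
mixed_fleet = monthly_cost("NVIDIA A100 80GB", on_demand_gpus=2, spot_gpus=6)
```

With these inputs, eight on-demand A100 80GB GPUs cost about 25,579 dollars a month, while keeping two on-demand as a stable baseline and running six on spot costs about 12,133 dollars.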

**Spot instances can be reclaimed** on short notice: about 2 minutes on AWS, about 30 seconds on GCP and Azure. In-flight requests are interrupted unless the server drains them in time. Solution: graceful shutdown handler (finish current requests, stop accepting new ones), retry middleware on the client, fallback to cloud API.
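A minimal sketch of these pieces, assuming a single-threaded server loop. `DrainController` and `call_with_fallback` are hypothetical helpers, not part of any serving framework; in a real deployment `begin_shutdown` would be triggered by the SIGTERM that Kubernetes sends when the pod is terminated.

```python
class DrainController:
    """Tracks in-flight requests so a server can shut down gracefully."""

    def __init__(self):
        self.accepting = True
        self.in_flight = 0

    def try_begin(self):
        """Admit a request unless shutdown has started."""
        if not self.accepting:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1

    def begin_shutdown(self):
        """Stop admitting new requests; running ones may complete."""
        self.accepting = False

    @property
    def drained(self):
        return not self.accepting and self.in_flight == 0


def call_with_fallback(primary, fallback, retries=2):
    """Retry the self-hosted endpoint, then fall back (e.g. to a cloud API)."""
    for _ in range(retries):
        try:
            return primary()
        except ConnectionError:
            continue
    return fallback()
```

The server exits only once `drained` is true or the reclaim deadline expires, and the client side hides the interruption by retrying and then switching to the fallback.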

An LLM serving pod on Kubernetes takes 2 minutes to start (model loading). A traffic spike hits in 30 seconds. What strategy is best?

Monitoring Inference Servers: Metrics, Alerts, Debugging

An LLM inference server is not a regular HTTP service. A GPU costs 1-10 dollars an hour. If GPU utilization sits at 30% instead of 80%, that's not inefficiency - it's direct money loss. One poorly optimized A100-based inference server runs 10,000 dollars a month. An optimized one at the same throughput costs 2,000.

Standard metrics (CPU, memory, error rate) tell almost nothing useful about inference health. What matters is **tokens/s**, **KV-cache utilization**, **TTFT**, **avg batch size**. These are the signals that reveal the real system state - seconds before an OOM or throughput collapse.

**Key metrics:**

Metric	What It Measures	Healthy Range	Alert Threshold
TTFT (Time to First Token)	Delay before the first token	<200ms	>500ms (p95)
TPS (Tokens per Second)	Generation speed	50-150 tok/s per GPU	<30 tok/s
Request queue depth	Queue of waiting requests	0-10	>50 (scale up!)
GPU utilization	GPU compute load	60-90%	<30% (over-provisioned) or >95% (throttling)
GPU memory used	VRAM usage	80-90%	>95% (OOM risk)
KV-cache utilization	KV-cache fill level	50-80%	>90% (preemption risk)
Request error rate	Percentage of errors	<0.1%	>1%
Avg batch size	Average batch size	depends on load	Always 1 = no batching (problem)
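The alert column translates directly into threshold checks. The sketch below hard-codes the table's thresholds; the metric key names and the sample values are assumptions, and in production these rules would normally live in the alerting system rather than in application code.

```python
# (metric key, predicate that fires the alert, message)
RULES = [
    ("ttft_p95_ms", lambda v: v > 500, "TTFT p95 above 500 ms"),
    ("tps", lambda v: v < 30, "generation speed below 30 tok/s"),
    ("queue_depth", lambda v: v > 50, "queue depth above 50 (scale up)"),
    ("gpu_util_pct", lambda v: v < 30, "GPU utilization below 30% (over-provisioned)"),
    ("gpu_util_pct", lambda v: v > 95, "GPU utilization above 95% (throttling)"),
    ("gpu_mem_pct", lambda v: v > 95, "GPU memory above 95% (OOM risk)"),
    ("kv_cache_pct", lambda v: v > 90, "KV-cache above 90% (preemption risk)"),
    ("error_rate_pct", lambda v: v > 1, "error rate above 1%"),
]


def check_metrics(metrics):
    """Return the alert messages triggered by one metrics snapshot."""
    return [msg for key, fires, msg in RULES if fires(metrics[key])]


healthy = {
    "ttft_p95_ms": 180, "tps": 90, "queue_depth": 4, "gpu_util_pct": 75,
    "gpu_mem_pct": 85, "kv_cache_pct": 65, "error_rate_pct": 0.05,
}
overloaded = dict(healthy, queue_depth=80, kv_cache_pct=93)

healthy_alerts = check_metrics(healthy)        # []
overloaded_alerts = check_metrics(overloaded)  # queue depth and KV-cache alerts
```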

**Custom monitoring middleware** for a Node.js backend:

**Typical failure modes** and their diagnostics:

Symptom	Cause	Diagnostics	Solution
TTFT increased 10x	GPU throttling (overheating >83°C)	nvidia-smi -q	Improve cooling, reduce batch size
All requests timeout	OOM - model crashed	dmesg | grep oom	Reduce max-model-len or max-num-seqs
Throughput dropped 50%	KV-cache overflow, preemption	vLLM logs: preempting	Reduce max-num-seqs or add VRAM
Requests hang forever	GPU hang (driver issue)	nvidia-smi (processes stuck)	Restart container, update driver
Garbage output	Model corruption / wrong quant	Test: simple prompt	Redownload the model
Sporadic CUDA OOM	Long prompts with full KV-cache	Logs: max seq length	Limit max-input-length

**nvidia-smi** is the primary GPU diagnostics tool. The command `watch -n 1 nvidia-smi` shows utilization, memory, and temperature in real time. For production - **DCGM (Data Center GPU Manager)** exports all GPU metrics to Prometheus via dcgm-exporter.
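For ad-hoc scripting, the CSV output of `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits` is easy to parse. The sketch below assumes exactly that field order; `parse_gpu_stats` is a hypothetical helper and the sample output is made up for illustration.

```python
def parse_gpu_stats(output):
    """Parse one CSV line per GPU: utilization %, MiB used, MiB total, deg C."""
    stats = []
    for line in output.strip().splitlines():
        util, used, total, temp = (float(field) for field in line.split(","))
        stats.append({
            "util_pct": util,
            "mem_pct": round(used / total * 100, 1),
            "temp_c": temp,
        })
    return stats


# Made-up sample for two GPUs, in the field order given above.
sample_output = """87, 70123, 81920, 71
12, 2048, 81920, 45
"""
gpus = parse_gpu_stats(sample_output)
```

In a script, `sample_output` would be replaced by the command's real output, for example captured with `subprocess.run(..., capture_output=True, text=True).stdout`.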

vLLM logs show 'preempting 12 sequences'. GPU memory usage is at 95%. What's happening?


Key Takeaways

Serving landscape: Ollama (dev) → TGI (simple production) → vLLM (max throughput) → Triton (multi-model)
TGI: Docker one-liner, Prometheus metrics, health endpoints - batteries-included production deploy
vLLM (Kwon et al. 2023): PagedAttention (95% memory utilization) + continuous batching = up to 24x throughput vs naive
Prefix caching: shared system prompt computed once → massive savings with repeating prompts
GPU autoscaling: predictive (cron) + warm standby + queue buffer - covering cold start (2 min pod startup)
Spot instances: 60-70% savings, but require graceful shutdown, retry middleware, and fallback to cloud API
Monitoring: TTFT (p95 <500ms), TPS (>30/GPU), queue depth (<50), KV-cache (<90%)

Questions for Reflection

PagedAttention solves VRAM fragmentation using virtual memory. What other OS patterns could apply to LLM inference - for example, swap, huge pages, or memory-mapped files?
Throughput vs latency tradeoff: larger batch size increases throughput but TTFT grows. How to pick the optimal max-num-seqs for a service with a p99 latency SLA of <500ms under uneven load?
Speculative decoding (a draft model generates tokens, the main model verifies) theoretically cuts latency without quality loss. Why isn't it the default in vLLM/TGI?

What's Next

Model serving is the final element of ML infrastructure. Next up - system design and specialized applications.

AI System Design — Model serving as part of an end-to-end AI system: routing, caching, serving, monitoring
Cost Management — Optimizing GPU costs through spot instances, quantization, model routing
Observability — Inference monitoring as part of the overall observability strategy for AI systems

Related Lessons

aie-39-local-models — Serving scales local inference to production
aie-42-ai-system-design — Serving is a core AI system component
aie-29-cost-management — Batching and GPU use govern serving cost
aie-35-observability — Monitor latency and throughput in serving
ml-46-model-serving — Same serving patterns for ML models
sd-03-scalability
