AI Engineering

Model Serving: Deploying Models to Production - TGI, vLLM, Triton, SageMaker

Lesson Objectives

  • Choose an inference framework for a specific scenario: Ollama / TGI / vLLM / Triton
  • Deploy LLMs via TGI and vLLM in Docker with production-ready configuration
  • Understand continuous batching and PagedAttention - key optimizations for high throughput
  • Configure autoscaling for GPU workloads in Kubernetes, accounting for the ~2-minute cold start
  • Monitor inference servers: TTFT, TPS, KV-cache utilization, GPU health

vLLM delivers up to 24x throughput over naive inference through PagedAttention. The gap is the same as blocking I/O vs async - only for GPU memory. One Ollama container collapses at 50 concurrent requests. vLLM sustains 1,400 tok/s on the same A100 at 128 concurrent users. Autoscaling on spot instances saves up to 70% on GPU costs. All of this with no model changes and no fine-tuning, just the right serving infrastructure.

  • HuggingFace TGI serves ChatUI for 30K+ concurrent users on an H100 cluster - continuous batching at the core
  • Perplexity AI: 100M+ requests/month on vLLM, PagedAttention cut their GPU fleet by 60% at the same throughput
  • Kwon et al. 2023 (UC Berkeley) - original vLLM/PagedAttention paper; the vLLM launch benchmarks report up to 24x throughput vs HuggingFace Transformers
  • One unoptimized inference server costs $10,000/month; an optimized one costs $2,000 at the same throughput
  • P99 TTFT <200ms is a hard requirement in fintech and healthcare - every second of latency costs real money

PagedAttention: how Berkeley students changed inference

June 2023. Woosuk Kwon, Zhuohan Li, and colleagues at UC Berkeley release vLLM; the accompanying paper, "Efficient Memory Management for Large Language Model Serving with PagedAttention", follows in September 2023. The idea is simple to the point of genius: manage KV-cache as virtual memory pages - the way an OS has managed RAM since the 1960s. Before this, every request pre-reserved one contiguous memory block sized for the maximum generation length, regardless of how many tokens it actually produced. The authors measured roughly 60-80% of KV-cache memory wasted in such systems. PagedAttention allocates pages on demand and frees them the moment a request finishes. Result: up to 24x the throughput of HuggingFace Transformers on a single GPU. vLLM shipped as the open-source implementation and became the de-facto production inference standard within months.

Prerequisites

  • Local LLM: Ollama, llama.cpp, vLLM - Running Models on Self-Hosted Hardware

Serving Landscape: TGI, vLLM, Triton, Managed Services

Running a model on a laptop via Ollama takes five minutes. Serving 10,000 concurrent users with a 99.9% SLA and p99 latency under two seconds is a completely different discipline. The gap between these two worlds is called **model serving** - the infrastructure layer between model weights and production traffic.

A naive deployment (one Docker container with Ollama) collapses around 50 concurrent requests. Not because the model is bad. Because Ollama processes requests **sequentially** - no continuous batching, no PagedAttention, no KV-cache management. It's like blocking I/O instead of async: one request occupies the GPU, the rest queue up and wait.

**Key requirements for production serving:**

  • **High throughput** - hundreds to thousands of requests per second on the minimum number of GPUs
  • **Low latency** - time to first token <200ms, end-to-end <2-5s for a typical response
  • **Horizontal scaling** - adding GPU nodes as load grows
  • **Health checks & graceful shutdown** - zero-downtime deploys
  • **Monitoring** - token throughput, queue depth, GPU utilization, error rate
  • **Model versioning** - A/B testing, canary deploys, rollback

**Major inference frameworks:**

| Framework | Company | Key Feature | Best For | Complexity |
|---|---|---|---|---|
| vLLM | UC Berkeley → community | PagedAttention, continuous batching | Pure LLM serving, max throughput | Medium |
| TGI (Text Generation Inference) | HuggingFace | Production-ready out of the box, metrics | HF ecosystem, Docker deploy | Low |
| Triton Inference Server | NVIDIA | Multi-model, multi-framework | Heterogeneous workloads (LLM + CV + audio) | High |
| TensorRT-LLM | NVIDIA | Max performance on NVIDIA GPUs | When every ms counts | High |
| Ollama | Community | Simplicity | Dev, demo, small production | Very low |
| SageMaker Endpoints | AWS | Managed, auto-scaling | AWS-first companies | Low (but expensive) |
| Ray Serve | Anyscale | Distributed computing | Complex pipelines, multi-model | Medium |

**vLLM vs TGI** - the two main candidates for most projects. vLLM has slightly better throughput (especially at high concurrency), while TGI is simpler to set up and has better built-in metrics. In practice the difference is ~10-20% - the choice is more often determined by team familiarity and existing infrastructure.

A company is deploying a single LLM for a chatbot with 200 concurrent users. Infrastructure is Kubernetes on AWS. What to choose?

TGI: Production Deployment with Docker

**Text Generation Inference (TGI)** from HuggingFace is the most "batteries-included" solution on the market. One Docker container delivers: Prometheus metrics, health endpoints, OpenAI-compatible API, flash attention, quantization (AWQ, GPTQ), tensor parallelism across multiple GPUs. No boilerplate - everything out of the box.

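The launch command looks long, but every flag matters. `--max-concurrent-requests 128` caps how many requests the server accepts at once; TGI refuses requests beyond the limit instead of letting them wait, which gives clients backpressure. `--quantize awq` loads 4-bit AWQ weights, which cuts weight memory by more than half relative to FP16 with little quality loss. `--num-shard 2` enables tensor parallelism for 70B models that don't fit on a single GPU.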

**Docker Compose for production with monitoring:**

**Integration with a Node.js backend** - TGI provides an OpenAI-compatible endpoint:

**TGI vs raw vLLM Docker:** TGI has built-in Prometheus metrics, structured logging, watermarking support, and better documentation. vLLM has slightly higher raw throughput. For most production deployments the difference is not critical - both work great.

In the docker-compose.yml for TGI, the healthcheck has start_period: 120s. Why?

vLLM: Continuous Batching and Throughput Optimization

June 2023. Woosuk Kwon and a team from UC Berkeley release vLLM, later described in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention". The core idea: store KV-cache as virtual memory pages - exactly how an OS manages RAM. The result is up to a 24x throughput increase over naive inference on the same hardware.

Think of it as the difference between blocking I/O and async. Only for GPU memory.

**PagedAttention in detail:**
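
The mechanism can be sketched with a toy allocator. This is a minimal illustration, not vLLM's code: the class `PagedKVCache`, the helper `preallocated_blocks`, and the 16-token block size are assumptions made for the example. It counts how many blocks three sequences occupy when blocks are handed out on demand, and how many would be held if each sequence reserved room for a 2,048-token maximum up front.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for illustration)


@dataclass
class PagedKVCache:
    """Toy block allocator: hands out fixed-size blocks on demand."""

    total_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> list of block ids
    lengths: dict = field(default_factory=dict)       # seq_id -> tokens stored

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.total_blocks))

    def append_token(self, seq_id: str) -> None:
        """Store one more token; take a new block only when the last one is full."""
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return every block of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def used_blocks(self) -> int:
        return self.total_blocks - len(self.free_blocks)


def preallocated_blocks(num_seqs: int, max_len: int) -> int:
    """Blocks held when every sequence reserves room for max_len tokens up front."""
    return num_seqs * -(-max_len // BLOCK_SIZE)  # ceiling division


cache = PagedKVCache(total_blocks=1024)
for seq_id, n_tokens in {"a": 100, "b": 37, "c": 512}.items():
    for _ in range(n_tokens):
        cache.append_token(seq_id)
print(cache.used_blocks(), preallocated_blocks(3, 2048))  # 42 vs 384 blocks
```

In this toy run the paged allocator holds 42 blocks where up-front reservation would hold 384; the difference is the over-reservation that on-demand paging removes.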

The second innovation is **continuous batching**. Classic static batching waits for an entire batch to finish before accepting new requests. Continuous batching adds new requests the moment a slot frees up. The GPU never idles.
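
The difference between the two scheduling policies can be sketched as a step-counting simulation. This is a simplified model under stated assumptions: each request needs a known number of decode steps (at least 1), one step advances every running request by one token, and prefill cost is ignored. The function names are hypothetical and not part of vLLM or TGI.

```python
def static_batching_steps(lengths: list, batch_size: int) -> int:
    """Decode steps when each batch must fully finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # the batch waits for its longest request
    return steps


def continuous_batching_steps(lengths: list, batch_size: int) -> int:
    """Decode steps when a waiting request takes a slot the moment one frees up."""
    waiting = list(lengths)
    running = []  # remaining decode steps of each request currently in the batch
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))  # fill free slots before the next step
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, 4))      # 200
print(continuous_batching_steps(lengths, 4))  # 110
```

With two long and six short requests, the static scheduler spends 200 steps because the short requests idle in their slots, while the continuous scheduler finishes in 110 by refilling slots as they free up.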

**Production deployment of vLLM** with optimal parameters:

**Throughput benchmark** (Llama 3.1 8B, A100 80GB):

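| Concurrent requests | Ollama (tok/s total) | vLLM (tok/s total) | TGI (tok/s total) |
|---|---|---|---|
| 1 | 55 | 50 | 48 |
| 4 | 55 | 180 | 165 |
| 16 | 55 | 520 | 470 |
| 32 | 52 | 860 | 750 |
| 64 | 48 (queue!) | 1200 | 1050 |
| 128 | timeout | 1400 | 1200 |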

**Prefix caching** is an optimization for repeating system prompts. If all requests start the same way (system prompt, few-shot examples), vLLM computes the KV-cache for that prefix once and reuses it:
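
A minimal sketch of the accounting, assuming a toy cache keyed on the whole prefix. The class `PrefixCache` is hypothetical; a real engine caches at finer granularity and evicts entries under memory pressure, both of which this example ignores.

```python
class PrefixCache:
    """Toy prefix cache: remembers which prompt prefixes already have KV state."""

    def __init__(self) -> None:
        self._cached = set()
        self.prefill_tokens_computed = 0

    def prefill(self, prefix: list, suffix: list) -> None:
        """Count the prompt tokens that actually need a forward pass."""
        key = tuple(prefix)
        if key not in self._cached:
            self.prefill_tokens_computed += len(prefix)  # cache miss: compute once
            self._cached.add(key)
        self.prefill_tokens_computed += len(suffix)  # per-request part is always computed


system_prompt = list(range(500))  # stand-in for 500 system-prompt token ids
user_message = list(range(20))    # stand-in for a 20-token user message

cache = PrefixCache()
for _ in range(1000):
    cache.prefill(system_prompt, user_message)
print(cache.prefill_tokens_computed)  # 20500 instead of 520000 without caching
```

For 1,000 requests sharing a 500-token system prompt, the shared prefix is computed once and only the 20-token user part is computed per request.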

**Ollama throughput doesn't scale with concurrency** - it lacks continuous batching. At 64 concurrent requests, Ollama processes them sequentially (~48 tok/s total), while vLLM processes them in parallel (~1200 tok/s total). For production with >10 concurrent users - vLLM or TGI is mandatory.

**Misconception:** one GPU handles one LLM request at a time - the rest just wait in line

**Reality:** continuous batching lets a GPU handle dozens of requests simultaneously: a new request takes a freed slot immediately, without waiting for an entire batch to finish

The CPU-world intuition (one thread = one request) doesn't apply to GPU inference. GPUs are massively parallel: an A100 has 108 streaming multiprocessors, and a single forward pass can process many sequences at once. Continuous batching leverages this - as soon as request A finishes, its KV-cache slot is freed and request D jumps in immediately. No waiting. At 64 concurrent users, Ollama (no CB) delivers 48 tok/s. vLLM (with CB) delivers 1,200 tok/s. A 25x gap.

vLLM with --enable-prefix-caching serves a chatbot. All 1,000 requests per minute have the same system prompt (500 tokens). How many times does vLLM process that system prompt?

GPU Autoscaling: Kubernetes, Spot Instances, Cold Start

LLM inference has a characteristic load pattern: **spiky traffic** - peak hours hitting 5-10x above average. Provisioning GPUs for peak load around the clock is money left on the table. Autoscaling covers the gap - companies running A100 clusters save anywhere from 3,000 to 50,000 dollars a month depending on scale.

**The GPU autoscaling challenge:** unlike CPU workloads, LLM pods start **slowly** - 1-3 minutes to load a model into GPU. Standard HPA reacts in ~60 seconds, then the pod needs another 120 to come up. The traffic spike has already passed, users got timeouts, the pod just became ready.
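
The timing problem is easy to put into numbers. The sketch below uses the figures from the paragraph above (about 60 s for the HPA to react, about 120 s for the pod to load the model); the function names and the 5 requests-per-second overload are assumptions for illustration.

```python
def backlog_before_new_pod(
    extra_rps: float,
    spike_duration_s: float,
    hpa_reaction_s: float = 60.0,
    pod_startup_s: float = 120.0,
) -> float:
    """Requests above current capacity that arrive before a new pod can serve."""
    delay = hpa_reaction_s + pod_startup_s
    return extra_rps * min(spike_duration_s, delay)


def new_pod_arrives_in_time(
    spike_duration_s: float,
    hpa_reaction_s: float = 60.0,
    pod_startup_s: float = 120.0,
) -> bool:
    """True if the new pod becomes ready while the spike is still running."""
    return hpa_reaction_s + pod_startup_s < spike_duration_s


# A 150-second spike that exceeds capacity by 5 requests per second:
print(backlog_before_new_pod(extra_rps=5, spike_duration_s=150))  # 750
print(new_pod_arrives_in_time(spike_duration_s=150))              # False
```

In this scenario 750 requests have to be queued or rejected, and the new pod only becomes ready 30 seconds after the spike has ended, which is why reactive scaling alone is not enough for GPU workloads.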

**Kubernetes deployment with HPA:**

**Spot/preemptible instances** - 60-70% savings on GPUs:

| GPU | On-Demand ($/hr) | Spot ($/hr) | Savings |
|---|---|---|---|
| NVIDIA L4 | 0.70 | 0.24 | 66% |
| NVIDIA A100 40GB | 3.67 | 1.10 | 70% |
| NVIDIA A100 80GB | 4.38 | 1.31 | 70% |
| NVIDIA H100 | 8.50 | 2.55 | 70% |
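
The savings column and the monthly impact follow directly from the hourly prices. The sketch below recomputes them; the prices are the ones listed in the table, while the 730-hour month and the helper names are assumptions for illustration.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours

# GPU -> (on-demand $/hr, spot $/hr), taken from the table above
PRICES = {
    "NVIDIA L4": (0.70, 0.24),
    "NVIDIA A100 40GB": (3.67, 1.10),
    "NVIDIA A100 80GB": (4.38, 1.31),
    "NVIDIA H100": (8.50, 2.55),
}


def savings_pct(on_demand: float, spot: float) -> int:
    """Spot discount relative to the on-demand price, rounded to a whole percent."""
    return round((on_demand - spot) / on_demand * 100)


def monthly_cost(hourly_price: float, gpus: int = 1) -> float:
    """Cost of running the given number of GPUs around the clock for a month."""
    return hourly_price * HOURS_PER_MONTH * gpus


for gpu, (on_demand, spot) in PRICES.items():
    print(f"{gpu}: {savings_pct(on_demand, spot)}% cheaper on spot, "
          f"${monthly_cost(on_demand):,.0f} vs ${monthly_cost(spot):,.0f} per month")
```

At these prices a single A100 80GB running all month comes to about $3,197 on demand versus about $956 on spot.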

**Spot instances can be reclaimed** with 30s notice. In-flight requests are interrupted. Solution: graceful shutdown handler (finish current requests, stop accepting new ones), retry middleware on the client, fallback to cloud API.
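
The graceful shutdown part can be sketched as a small drain controller. This is a minimal illustration, not code from TGI or vLLM: the class `DrainController` and its method names are hypothetical. A real server would typically call `drain` from its termination-signal handler with a timeout shorter than the reclaim notice; that wiring is left out here.

```python
import threading


class DrainController:
    """Tracks in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight yet

    def try_start(self) -> bool:
        """Register a new request; returns False if the server is shutting down."""
        with self._lock:
            if self._draining:
                return False  # caller should answer 503 so the client retries elsewhere
            self._in_flight += 1
            self._idle.clear()
            return True

    def finish(self) -> None:
        """Mark one request as completed."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout_s: float) -> bool:
        """Stop accepting requests and wait for in-flight ones; True if all finished."""
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout_s)


controller = DrainController()
controller.try_start()                 # one request is being generated
print(controller.drain(0.01))          # False: still in flight when the wait expires
print(controller.try_start())          # False: new requests are refused while draining
controller.finish()
print(controller.drain(0.01))          # True: safe to exit
```

Requests refused during the drain are the ones the client-side retry middleware or the cloud API fallback has to pick up.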

An LLM serving pod on Kubernetes takes 2 minutes to start (model loading). A traffic spike hits in 30 seconds. What strategy is best?

Monitoring Inference Servers: Metrics, Alerts, Debugging

An LLM inference server is not a regular HTTP service. A GPU costs 1-10 dollars an hour. If GPU utilization sits at 30% instead of 80%, that's not inefficiency - it's direct money loss. One poorly optimized A100-based inference server runs 10,000 dollars a month. An optimized one at the same throughput costs 2,000.

Standard metrics (CPU, memory, error rate) tell almost nothing useful about inference health. What matters is **tokens/s**, **KV-cache utilization**, **TTFT**, **avg batch size**. These are the signals that reveal the real system state - seconds before an OOM or throughput collapse.
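
TTFT and tokens/s can be measured from the client side of any streaming response. The sketch below assumes the response is available as an iterable of tokens; the function `measure_stream` and its injectable `clock` parameter are hypothetical names for this example, not part of any serving framework.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> dict:
    """Consume a token stream and report TTFT and overall tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the first token has arrived
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "tokens": count,
        "tokens_per_s": count / total if total > 0 else 0.0,
    }


def fake_stream():
    """Stand-in for a streaming response: slow first token, then steady output."""
    time.sleep(0.05)
    for word in ["Hello", ",", " world"]:
        yield word
        time.sleep(0.01)


print(measure_stream(fake_stream()))
```

Passing the clock in as a parameter keeps the function easy to test with a deterministic time source.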

**Key metrics:**

| Metric | What It Measures | Healthy Range | Alert Threshold |
|---|---|---|---|
| TTFT (Time to First Token) | Delay before the first token | <200ms | >500ms (p95) |
| TPS (Tokens per Second) | Generation speed | 50-150 tok/s per GPU | <30 tok/s |
| Request queue depth | Queue of waiting requests | 0-10 | >50 (scale up!) |
| GPU utilization | GPU compute load | 60-90% | <30% (over-provisioned) or >95% (throttling) |
| GPU memory used | VRAM usage | 80-90% | >95% (OOM risk) |
| KV-cache utilization | KV-cache fill level | 50-80% | >90% (preemption risk) |
| Request error rate | Percentage of errors | <0.1% | >1% |
| Avg batch size | Average batch size | depends on load | Always 1 = no batching (problem) |
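
The alert thresholds from the table can be expressed as a small rule list. This is a sketch under stated assumptions: the metric keys (`ttft_p95_ms`, `kv_cache_pct`, and so on) are hypothetical names, and in practice the same rules would more likely live in the alerting system than in application code.

```python
def check_alerts(metrics: dict) -> list:
    """Return a message for every metric that is past its alert threshold."""
    rules = [
        ("ttft_p95_ms", lambda v: v > 500, "TTFT p95 above 500 ms"),
        ("tokens_per_s_per_gpu", lambda v: v < 30, "TPS below 30 tok/s"),
        ("queue_depth", lambda v: v > 50, "queue depth above 50: scale up"),
        ("gpu_util_pct", lambda v: v < 30, "GPU utilization below 30%: over-provisioned"),
        ("gpu_util_pct", lambda v: v > 95, "GPU utilization above 95%: throttling"),
        ("gpu_mem_pct", lambda v: v > 95, "GPU memory above 95%: OOM risk"),
        ("kv_cache_pct", lambda v: v > 90, "KV-cache above 90%: preemption risk"),
        ("error_rate_pct", lambda v: v > 1, "error rate above 1%"),
        ("avg_batch_size", lambda v: v <= 1, "average batch size of 1: no batching"),
    ]
    # Metrics missing from the snapshot are skipped rather than treated as failures.
    return [msg for key, breached, msg in rules if key in metrics and breached(metrics[key])]


snapshot = {"ttft_p95_ms": 640, "queue_depth": 12, "kv_cache_pct": 93, "gpu_util_pct": 85}
for alert in check_alerts(snapshot):
    print(alert)
```

For the sample snapshot this reports the TTFT and KV-cache alerts and stays silent about queue depth and GPU utilization, which are inside their healthy ranges.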

**Custom monitoring middleware** for a Node.js backend:

**Typical failure modes** and their diagnostics:

| Symptom | Cause | Diagnostics | Solution |
|---|---|---|---|
| TTFT increased 10x | GPU throttling (overheating >83°C) | `nvidia-smi -q` | Improve cooling, reduce batch size |
| All requests timeout | OOM - model crashed | `dmesg \| grep oom` | Reduce max-model-len or max-num-seqs |
| Throughput dropped 50% | KV-cache overflow, preemption | vLLM logs: preempting | Reduce max-num-seqs or add VRAM |
| Requests hang forever | GPU hang (driver issue) | `nvidia-smi` (processes stuck) | Restart container, update driver |
| Garbage output | Model corruption / wrong quant | Test: simple prompt | Redownload the model |
| Sporadic CUDA OOM | Long prompts with full KV-cache | Logs: max seq length | Limit max-input-length |

**nvidia-smi** is the primary GPU diagnostics tool. The command `watch -n 1 nvidia-smi` shows utilization, memory, and temperature in real time. For production - **DCGM (Data Center GPU Manager)** exports all GPU metrics to Prometheus via dcgm-exporter.

vLLM logs show 'preempting 12 sequences'. GPU memory usage is at 95%. What's happening?

Key Takeaways

  • Serving landscape: Ollama (dev) → TGI (simple production) → vLLM (max throughput) → Triton (multi-model)
  • TGI: Docker one-liner, Prometheus metrics, health endpoints - batteries-included production deploy
  • vLLM (Kwon et al. 2023): PagedAttention (95% memory utilization) + continuous batching = up to 24x throughput vs naive
  • Prefix caching: shared system prompt computed once → massive savings with repeating prompts
  • GPU autoscaling: predictive (cron) + warm standby + queue buffer - covering cold start (2 min pod startup)
  • Spot instances: 60-70% savings, but require graceful shutdown, retry middleware, and fallback to cloud API
  • Monitoring: TTFT (p95 <500ms), TPS (>30/GPU), queue depth (<50), KV-cache (<90%)

Questions for Reflection

  • PagedAttention solves VRAM fragmentation using virtual memory. What other OS patterns could apply to LLM inference - for example, swap, huge pages, or memory-mapped files?
  • Throughput vs latency tradeoff: larger batch size increases throughput but TTFT grows. How to pick the optimal max-num-seqs for a service with a p99 latency SLA of <500ms under uneven load?
  • Speculative decoding (a draft model generates tokens, the main model verifies) theoretically cuts latency without quality loss. Why isn't it the default in vLLM/TGI?

What's Next

Model serving is the final element of ML infrastructure. Next up - system design and specialized applications.

  • AI System Design — Model serving as part of an end-to-end AI system: routing, caching, serving, monitoring
  • Cost Management — Optimizing GPU costs through spot instances, quantization, model routing
  • Observability — Inference monitoring as part of the overall observability strategy for AI systems

Related Lessons

  • aie-39-local-models — Serving scales local inference to production
  • aie-42-ai-system-design — Serving is a core AI system component
  • aie-29-cost-management — Batching and GPU use govern serving cost
  • aie-35-observability — Monitor latency and throughput in serving
  • ml-46-model-serving — Same serving patterns for ML models
  • sd-03-scalability