Generative AI
Serving LLMs: vLLM and TGI
Mistral AI serves its models at $0.14 per million tokens. The enabling technology is vLLM-style continuous batching and PagedAttention on A100/H100 clusters. Served from a naive PyTorch stack, a model such as Llama 3 70B costs 10-20x more per token. Production LLM serving is a systems engineering discipline that sits between ML research and cloud infrastructure.
- Anyscale benchmarks showed vLLM achieving up to 23x higher throughput than naive HuggingFace Transformers serving on the same GPUs, primarily through continuous batching.
- Hugging Face TGI (Text Generation Inference) serves public inference for hundreds of models on the HuggingFace Hub. It powers the HuggingFace Inference API used by thousands of companies.
- Together AI serves 200+ open-source models using an optimized inference cluster. Their per-token cost is 90% below OpenAI API pricing for equivalent model quality, enabled by optimized batching and quantization.
Prerequisites
vLLM, PagedAttention, and Continuous Batching
In 2022 researchers introduced Orca with the idea of continuous batching: instead of waiting for every request in a batch to finish, the scheduler injects new requests at each generation step and keeps the GPU busy. In 2023 Woosuk Kwon and a team at UC Berkeley released vLLM with PagedAttention, a mechanism that manages the KV cache like an operating system manages virtual memory, in pages, eliminating fragmentation and sharply raising throughput. Around the same time HuggingFace released Text Generation Inference (TGI), a production server for its models. These projects defined what LLM serving looks like today: high GPU utilization, low latency, and a per-token cost that fell several-fold.
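The paging idea described above can be sketched as a toy block allocator. This is a minimal illustration, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the tiny block size are hypothetical, and only the bookkeeping is modeled (no tensors are stored).

```python
class PagedKVCache:
    """Toy allocator: each sequence maps its token positions onto
    fixed-size physical blocks that need not be contiguous."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> physical blocks
        self.lengths: dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:              # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for seq in ["a", "b", "a", "b", "a"]:                  # two sequences grow interleaved
    cache.append_token(seq)
print(cache.block_tables)                              # "a" owns two non-adjacent blocks
cache.free("a")                                        # its blocks are reusable at once
```

Because a sequence only ever holds whole blocks, the unused space per sequence is bounded by one partially filled block, which is the property that removes fragmentation.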
vLLM
**vLLM** is an open-source LLM serving system from UC Berkeley. It combines PagedAttention, which manages the KV cache in non-contiguous pages much as an operating system manages virtual memory, with continuous batching and an OpenAI-compatible API. Together these address the memory fragmentation and idle GPU time that limit throughput in naive serving stacks.
vLLM comes up regularly in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding its trade-offs and failure modes demonstrates production-level expertise.
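A back-of-the-envelope calculation shows why KV cache management matters so much. The sketch below is illustrative only: the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption matching the published Llama 3 70B configuration, and the block size of 16 tokens and the request lengths are made up for the example.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def reserved_slots_contiguous(seq_lens: list[int], max_len: int) -> int:
    """Naive scheme: every request reserves max_len token slots up front."""
    return len(seq_lens) * max_len


def reserved_slots_paged(seq_lens: list[int], block_size: int) -> int:
    """Paged scheme: every request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
seq_lens = [200, 1000, 50]                       # assumed current request lengths
naive = reserved_slots_contiguous(seq_lens, max_len=4096) * per_token
paged = reserved_slots_paged(seq_lens, block_size=16) * per_token
print(f"{per_token / 1024:.0f} KiB per token")
print(f"contiguous: {naive / 2**30:.2f} GiB, paged: {paged / 2**20:.0f} MiB")
```

Under these assumptions each token costs 320 KiB, so three short requests reserve 3.75 GiB when preallocated to the maximum length but only 400 MiB when paged. The freed memory is what lets the scheduler run more requests concurrently.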
What problem does vLLM primarily solve in generative AI systems?
Text Generation Inference (TGI)
**Text Generation Inference (TGI)** is HuggingFace's production serving system. It combines flash attention, tensor parallelism, and a Rust-based HTTP server, and it is the standard serving option in the HuggingFace ecosystem.
TGI comes up regularly in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding its trade-offs and failure modes demonstrates production-level expertise.
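From a client's point of view, a TGI server is a plain HTTP endpoint. The sketch below builds a request for TGI's documented `/generate` route using only the standard library; the helper name `build_tgi_request`, the host and port, and the sampling values are assumptions for illustration, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_tgi_request(base_url: str, prompt: str, max_new_tokens: int = 64,
                      temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request for a TGI /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed address of a locally running server; adjust to your deployment.
request = build_tgi_request("http://localhost:8080", "Explain continuous batching.")
print(request.get_method(), request.full_url)

# With a server running, sending it would look like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["generated_text"])
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a GPU server available.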
What problem does Text Generation Inference (TGI) primarily solve in generative AI systems?
Tensor Parallelism
**Tensor parallelism** splits a model's weight matrices across multiple GPUs so that models too large for one GPU can be served. Because the GPUs exchange partial results at every layer, it requires a high-bandwidth interconnect such as NVLink or InfiniBand.
Tensor parallelism comes up regularly in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding its trade-offs and failure modes demonstrates production-level expertise.
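One common form of tensor parallelism is the column-parallel linear layer: each device holds a slice of the weight matrix's columns, computes its part of the output independently, and the parts are then concatenated. The sketch below simulates this with plain Python lists standing in for GPUs; all function names are hypothetical, and a real system would run the shards on separate devices and gather results over the interconnect.

```python
def matmul(x: list[list[int]], w: list[list[int]]) -> list[list[int]]:
    """Multiply x of shape [n][k] by w of shape [k][m]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: list[list[int]], num_shards: int) -> list[list[list[int]]]:
    """Cut w into num_shards equal column slices, one per simulated device."""
    num_cols = len(w[0])
    assert num_cols % num_shards == 0, "columns must divide evenly across shards"
    step = num_cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards: int):
    shards = split_columns(w, num_shards)
    # Each simulated device multiplies the full input by its own weight slice.
    partials = [matmul(x, shard) for shard in shards]
    # "All-gather": concatenate the partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert column_parallel_matmul(x, w, num_shards=2) == matmul(x, w)
print(column_parallel_matmul(x, w, num_shards=2))
```

Each shard stores only a fraction of the weights, which is what lets a large model fit. The concatenation step is the communication that makes interconnect bandwidth the limiting factor.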
What problem does Tensor Parallelism primarily solve in generative AI systems?
Continuous Batching Architecture
**Continuous batching** lets requests join and leave the batch dynamically: the scheduler admits new requests at each generation step and releases finished ones. This eliminates the idle GPU time that a static batch spends waiting for its longest sequence.
Continuous batching comes up regularly in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding its trade-offs and failure modes demonstrates production-level expertise.
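The effect of refilling batch slots at every step can be seen in a small simulation. This is a simplified sketch under stated assumptions: each request is reduced to the number of tokens it must generate, every decode step costs the same, prefill is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch occupies the GPU until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Finished requests leave after every step and waiting ones take their slots."""
    waiting = deque(lengths)
    running: list[int] = []          # tokens still to generate per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())   # admit new work into free slots
        steps += 1                              # one token for every running request
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [8, 2, 2, 2]               # one long request and three short ones
print(static_batching_steps(lengths, batch_size=2))      # 10
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

In the static schedule the short request paired with the long one finishes after two steps and its slot then sits idle for six. The continuous schedule fills that slot with the remaining short requests, so the whole workload completes in the time the long request needs on its own.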
Misconception: LLM serving requires specialized AI research expertise unavailable to most engineering teams.
In practice: LLM serving is implementable with standard open-source tools and cloud APIs; the key skill is understanding the trade-offs and knowing when to apply each technique.
The LLM ecosystem (vLLM, TRL, LangChain, LlamaIndex, Instructor) has productized most generative AI patterns. The engineering challenge is choosing the right tools and understanding their failure modes, not building from scratch.
What problem does Continuous Batching Architecture primarily solve in generative AI systems?
Key Ideas
- **vLLM:** UC Berkeley open-source LLM serving system with PagedAttention (manages KV cache in non-contiguous pages like OS virtual memory), continuous batching, and OpenAI-compatible API
- **TGI:** HuggingFace production serving system with flash attention, tensor parallelism, and a Rust-based HTTP server; standard in the HuggingFace ecosystem
- **Tensor parallelism:** splits model weight matrices across multiple GPUs; enables models too large for one GPU; requires high-bandwidth interconnect (NVLink, InfiniBand)
- **Continuous batching:** requests join/leave the batch dynamically as generation completes; eliminates idle GPU time waiting for the longest sequence in a static batch
Related Topics
These topics surround LLM serving with vLLM and TGI:
- Inference Optimization — vLLM and TGI implement quantization, KV caching, and continuous batching as a unified serving stack
- GenAI System Design — Serving infrastructure decisions (vLLM vs TGI vs Triton, horizontal vs vertical scaling) are core to LLM system design interviews
- Multi-Agent Systems — High-throughput serving is the prerequisite for cost-effective multi-agent systems at production scale
Questions for Reflection
- How does LLM serving change when moving from a prototype to a production system serving 1 million users?
- What are the primary failure modes in LLM serving, and what monitoring catches them before users are affected?
- How would you explain the trade-offs in LLM serving to a non-technical stakeholder who needs to approve the infrastructure budget?
Related Lessons
- gai-19 — Serving stacks implement inference optimizations
- gai-23 — Serving is a building block of GenAI system design
- aie-40-model-serving — Production model serving and deployment
- ml-54-distributed-training — Tensor parallelism shards models like distributed training
- ml-46-model-serving — LLM serving generalizes ML model serving
- sd-03-scalability