Deep Learning System Design
A model that works in a Jupyter notebook and a model that reliably serves millions of users in production are almost entirely different engineering problems. Public estimates, none of them confirmed by the companies, give a sense of the scale: OpenAI's GPT-4 is estimated to run on 1000+ A100 GPUs for roughly 13 million daily active users, against targets of about 99.9% uptime and p99 latency under 2 seconds, and Google's ranking and recommendation models sit in the path of an estimated 8.5 billion search queries per day. The machine learning system design interview at FAANG companies tests exactly this gap: not whether you can train a neural network, but whether you can design a reliable, scalable, monitorable production system around it.
- **Netflix's recommendation system** uses a multi-stage DL pipeline (candidate generation -> ranking -> reranking) serving 220 million subscribers with A/B tests running continuously across 1000+ model variants - the serving infrastructure itself is as sophisticated as the models.
- **Uber's Michelangelo** ML platform handles the full ML lifecycle for hundreds of models across all Uber products - from feature engineering and training to serving and monitoring - processing millions of predictions per second for surge pricing, fraud detection, and ETA estimation.
- **Waymo's ML infrastructure** runs continuous retraining loops across millions of miles of driving data, with strict canary deployment protocols that test new models in simulation, shadow mode, and limited geographic rollout before production - because model failures have direct safety consequences.
The paper that named ML's hidden costs
In 2015 D. Sculley and colleagues at Google published 'Hidden Technical Debt in Machine Learning Systems', arguing that the model code is a tiny fraction of a real ML system, surrounded by configuration, data pipelines, feature management, monitoring, and serving glue. The paper named anti-patterns engineers had felt but not articulated: glue code, pipeline jungles, undeclared consumers, and feedback loops. It became the intellectual foundation of the MLOps discipline, reframing production ML as a systems engineering problem rather than a modeling one.
Prerequisites
Training Infrastructure
Large-scale DL training infrastructure must solve four problems: GPU cluster management (scheduling, fault tolerance, checkpointing), distributed training coordination (data parallel, model parallel, pipeline parallel), storage throughput (streaming training data to thousands of GPUs), and experiment tracking (metrics, artifacts, hyperparameters). A single training run of GPT-4 reportedly used ~25,000 A100 GPUs over 90+ days (an unconfirmed estimate); at that scale, any infrastructure failure without automatic recovery can waste millions of dollars of compute.
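Model FLOPs Utilization (MFU) measures what fraction of the hardware's theoretical peak FLOP/s is actually spent on the model's required forward and backward computation. The PaLM paper, which introduced the metric, reported 46.2% MFU for PaLM 540B on TPU v4 and estimated about 21% for GPT-3, which was trained on V100 GPUs rather than A100s. Well-optimized A100 code with FlashAttention and tensor parallelism has been reported in the 55-65% range. Low MFU indicates communication bottlenecks, memory bandwidth limits, or suboptimal kernel choices.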
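The definition above can be turned into a quick estimate. The sketch below assumes the common approximation of about 6 FLOPs per parameter per training token (forward plus backward, ignoring the attention term) and the A100's published dense BF16 peak of 312 TFLOP/s; the model size, GPU count, and measured throughput are hypothetical numbers chosen for illustration.

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate MFU with the ~6 * params FLOPs-per-token approximation.

    The factor 6 covers forward (2) plus backward (4) matmul FLOPs per
    parameter per token. It ignores attention FLOPs, so it slightly
    underestimates the true model FLOPs for long sequences.
    """
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


A100_BF16_PEAK = 312e12  # dense BF16 peak FLOP/s for one A100

# Hypothetical run: 7B parameters, 1024 GPUs, 3.5M tokens/s measured end to end.
mfu = model_flops_utilization(7e9, 3.5e6, 1024, A100_BF16_PEAK)
print(f"MFU = {mfu:.1%}")
```

Because throughput is measured end to end, time lost to data loading, communication, and checkpoint writes all show up as lower MFU.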
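Fault tolerance is non-optional for multi-week training runs. With elastic checkpointing, when a node fails the job resumes from the last checkpoint on the remaining nodes, adjusting batch size proportionally; large runs such as Llama's depend on this kind of automated checkpoint-and-restart recovery. Without it, a single GPU failure every 3 days in a 1000-GPU job can waste on the order of 10% of total compute. For example, with checkpoints 12 hours apart and a one-hour restart, each failure loses about half a checkpoint interval plus the restart, or roughly 7 of every 72 hours.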
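The cost of failures can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a description of any particular cluster: it assumes a failure lands on average halfway through a checkpoint interval, that every failure stalls the whole job, and it ignores the time spent writing checkpoints. The function name and the example numbers are illustrative.

```python
def expected_waste_fraction(mtbf_hours: float,
                            checkpoint_interval_hours: float,
                            restart_hours: float) -> float:
    """Average fraction of wall-clock time lost to failures.

    Per failure, the job loses the work done since the last checkpoint
    (half an interval on average) plus the fixed cost of restarting.
    """
    lost_per_failure = checkpoint_interval_hours / 2.0 + restart_hours
    return lost_per_failure / mtbf_hours


# One failure every 3 days (72 h), checkpoints 12 h apart, 1 h to restart.
sparse = expected_waste_fraction(72, 12, 1)
# Same failure rate with hourly checkpoints and a 15-minute automated restart.
frequent = expected_waste_fraction(72, 1, 0.25)
print(f"sparse checkpoints: {sparse:.1%}, frequent checkpoints: {frequent:.1%}")
```

Under these assumptions, frequent checkpoints and fast automated restarts reduce the loss from about 10% to about 1%, which is why recovery speed is treated as a first-class infrastructure metric.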
What does Model FLOPs Utilization (MFU) measure, and why is achieving >45% considered good for LLM training?
Model Serving Infrastructure
ML model serving at scale requires: a serving framework (Triton Inference Server, TorchServe, vLLM for LLMs), a load balancer, auto-scaling based on request queue depth, model versioning, A/B traffic splitting, and latency/throughput SLAs. The key serving metrics: throughput (tokens/second for LLMs, images/second for vision), latency (time-to-first-token, end-to-end latency), and GPU utilization.
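A/B traffic splitting is usually made deterministic so that a given user always sees the same model version. The following is a minimal sketch of hash-based assignment; the function name, the salt, and the variant names are hypothetical, and real serving stacks typically do this in the load balancer or gateway rather than in application code.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, weights: dict, salt: str = "exp-1") -> str:
    """Map a user to a model variant, stable across requests.

    `weights` maps variant name -> traffic share; shares should sum to 1.
    Changing `salt` reshuffles users for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # fallback if float rounding leaves the shares just under 1


weights = {"model-v1": 0.9, "model-v2": 0.1}
counts = Counter(assign_variant(f"user-{i}", weights) for i in range(10_000))
print(counts)
```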
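vLLM (Kwon et al., UC Berkeley, 2023) is a widely used LLM serving framework. Its PagedAttention mechanism manages the KV cache the way an operating system manages paged virtual memory: each sequence's cache is split into fixed-size blocks that need not be contiguous, so memory is allocated on demand instead of being reserved up front for the maximum sequence length. KV cache reuse (prefix caching) is critical for chatbot applications where system prompts are identical across users: vLLM can cache the system prompt's KV tensors and reuse them across requests, cutting time-to-first-token for repeated prefixes by roughly half in favorable cases, with the saving depending on how much of each prompt is the shared prefix.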
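The paging idea can be illustrated with a toy allocator. This is a simplified sketch of block-based KV cache bookkeeping, not vLLM's actual implementation or API: the class name, the block size, and the methods are hypothetical, and it tracks only which blocks belong to which sequence, not the tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class PagedKVCache:
    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> physical block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache full; a real server would queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")   # 20 tokens -> 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens -> 1 block
print(cache.block_tables, len(cache.free_blocks))
```

With contiguous preallocation for a 64-token maximum, each of these two sequences would reserve 4 blocks, filling the 8-block cache; with paging they use 3 blocks in total, which leaves room to batch more requests.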
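Speculative decoding can speed up LLM decoding by roughly 2-3x: a small draft model (for example 7B) generates K candidate tokens autoregressively, which is cheap because the model is small; the large target model (for example 70B) then verifies all K tokens in a single forward pass, which is far cheaper than K sequential passes. Tokens are accepted up to the first disagreement, where the target's own token is used instead. When all K tokens are accepted (common for easy text), the 70B model yields K + 1 tokens, the K drafts plus one of its own, for the cost of one forward pass.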
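The accept-or-correct loop is easiest to see in the greedy case. The sketch below is a toy illustration under stated assumptions: both "models" are plain Python functions that return a next token id, the target's single batched verification pass is simulated with K + 1 separate calls, and only greedy decoding is covered (sampling-based variants use a rejection-sampling rule instead of exact match). All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function: context -> token id


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one greedy speculative decoding step and return the tokens to append."""
    # 1. The draft model proposes k tokens one at a time.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target scores positions 0..k (one batched pass in a real system).
    accepted: List[int] = []
    for i in range(k + 1):
        target_token = target(context + proposal[:i])
        accepted.append(target_token)
        # Stop at the first disagreement, or after the bonus token at i == k.
        if i == k or proposal[i] != target_token:
            break
    return accepted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10           # "true" model: count upward mod 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10   # wrong whenever it sees a 4


print(speculative_step(toy_target, toy_draft, [1], k=4))  # draft errs on the 4th token
print(speculative_step(toy_target, toy_draft, [5], k=3))  # all drafts accepted
```

Every returned token is one the target would have produced on its own, so in the greedy case the output matches ordinary decoding with the target model; the draft only changes how many target passes are needed.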
How does vLLM's PagedAttention improve LLM serving throughput over naive implementation?
ML Pipeline and Experiment Management
A production ML pipeline is more than training and deploying a model: it includes data ingestion, feature engineering, data validation (Great Expectations, Deequ), model training (with hyperparameter optimization), model evaluation, model registry (MLflow, W&B artifacts), and deployment. MLOps platforms (Kubeflow, MLflow, SageMaker Pipelines) orchestrate these steps with versioning, reproducibility, and failure handling.
Feature stores (Feast, Tecton, Vertex Feature Store) solve the training-serving skew problem: during training, features are computed from historical data in batch; during serving, the same features must be computed in real-time consistently. Inconsistency causes 'training-serving skew' - a model that performs well offline but degrades in production because features are computed differently.
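One way to catch skew early is a parity check: log the features used at serving time, recompute them with the batch pipeline, and compare. The sketch below assumes both sides are available as dictionaries keyed by entity id; the function name, feature names, and tolerance are hypothetical and do not correspond to any feature store's API.

```python
import math


def skew_report(offline_rows: dict, online_rows: dict, rel_tol: float = 1e-6) -> dict:
    """Return {feature_name: mismatch_rate} over entities present on both sides."""
    mismatches: dict = {}
    totals: dict = {}
    for key, offline in offline_rows.items():
        online = online_rows.get(key)
        if online is None:
            continue  # entity was not served; nothing to compare
        for name, offline_value in offline.items():
            totals[name] = totals.get(name, 0) + 1
            online_value = online.get(name)
            if online_value is None or not math.isclose(
                    offline_value, online_value, rel_tol=rel_tol):
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / totals[name] for name in totals}


offline = {"u1": {"avg_spend_7d": 12.5, "n_sessions": 4.0},
           "u2": {"avg_spend_7d": 3.0, "n_sessions": 1.0}}
online = {"u1": {"avg_spend_7d": 12.5, "n_sessions": 4.0},
          "u2": {"avg_spend_7d": 0.0, "n_sessions": 1.0}}  # stale online value
report = skew_report(offline, online)
print(report)
```

A nonzero mismatch rate for a feature points at a divergence between the batch and real-time computation paths before it shows up as a drop in model quality.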
Shadow mode deployment is critical for DL system safety: the new model receives all production traffic but its outputs are discarded - only the old model's outputs are returned to users. This enables collecting real-world performance data and comparing outputs without any user impact before the canary release.
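The shadow pattern can be sketched as a thin wrapper around the two models. This is a simplified, synchronous illustration with hypothetical names; a production system would normally run the shadow call asynchronously or on mirrored traffic so it adds no latency to the user-facing path.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow")


def serve_with_shadow(request: Any,
                      primary: Callable[[Any], Any],
                      shadow: Callable[[Any], Any],
                      record: Callable[[Any, Any, Any], None]) -> Any:
    """Return the primary model's output; run the shadow model for comparison only."""
    response = primary(request)
    try:
        record(request, response, shadow(request))
    except Exception:
        # A failure in the shadow path must never affect the user.
        logger.exception("shadow model failed")
    return response


comparisons = []
result = serve_with_shadow(
    "image-001",
    primary=lambda req: "cat",
    shadow=lambda req: "dog",
    record=lambda req, old, new: comparisons.append((req, old, new)),
)
print(result, comparisons)
```

The recorded pairs are what later analysis uses to compute agreement rates, latency differences, and error rates before any user sees the new model's output.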
What is training-serving skew and why is it a critical problem in production ML?
Production Monitoring
Production DL systems degrade silently over time through data drift (input distribution changes), concept drift (the relationship between inputs and correct outputs changes), and model staleness (world facts change but the model does not update). Monitoring must detect these before they cause business impact: serving quality often degrades for weeks before users complain.
Model monitoring has three layers: (1) infrastructure monitoring (latency, throughput, GPU utilization, error rates); (2) data monitoring (input feature distributions vs. training baseline, embedding drift using cosine similarity or population stability index); (3) quality monitoring (output distribution, prediction confidence, human evaluation sampling). Each layer requires different tooling and different alerting thresholds.
Population Stability Index (PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%), where expected% is the training baseline and actual% is production) is a widely used metric for detecting distribution shift in production ML. The conventional rule-of-thumb thresholds are: PSI < 0.1 = no significant change, 0.1-0.25 = minor shift (monitor), > 0.25 = major shift (investigate, likely retrain).
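The PSI formula is short enough to implement directly. The sketch below uses only the standard library and assumes the bin edges are chosen in advance from the training baseline; the function name, the `eps` floor for empty bins, and the example data are illustrative choices.

```python
import math
from bisect import bisect_right


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a baseline and a production sample.

    `edges` are interior cut points, giving len(edges) + 1 bins.
    `eps` floors empty bins so the logarithm stays finite.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect_right(edges, value)] += 1
        return [max(count / len(values), eps) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [0] * 50 + [1] * 50      # training data: 50% / 50% across two bins
production = [0] * 20 + [1] * 80    # production data: 20% / 80%
score = psi(baseline, production, edges=[0.5])
print(f"PSI = {score:.3f}")  # about 0.416, above the 0.25 "major shift" threshold
```

In practice the score is computed per feature on a schedule, and the thresholds are tuned per feature, since PSI is sensitive to the number of bins and to sample size.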
Misconception: A model that performs well in offline evaluation will perform equally well in production.
Reality: Offline-online gaps of 5-20% are common due to training-serving skew, distribution shift, and the difference between held-out evaluation sets and true production traffic.
Why: Offline test sets are static snapshots of historical data; production traffic evolves continuously. Features may be computed differently, users may behave differently than the evaluation cohort, and the world changes while the model does not.
Why is monitoring input data distributions (not just model outputs) important for production DL systems?
Key Ideas
- **Training infrastructure** efficiency is measured by MFU (Model FLOPs Utilization, target > 45%); it requires elastic fault tolerance for multi-week runs and careful data pipeline design to prevent GPU starvation.
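- **LLM serving** with vLLM's PagedAttention raises throughput through KV cache memory management: the vLLM paper reports 2-4x over prior serving systems, and the project's blog reports up to 24x over naive Hugging Face Transformers serving. Speculative decoding adds a further speedup of roughly 2-3x for high-acceptance-rate workloads.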
- **Production monitoring** requires three layers: infrastructure (latency/errors), data quality (distribution drift via PSI), and model quality (sampled human evaluation) - offline metrics alone are insufficient.
Related Topics
DL system design synthesizes training, compression, and deployment:
- Quantization and Pruning — Model compression decisions (INT8, FP16, distillation) are made within the serving system context - latency SLAs and hardware constraints determine which techniques are applied
- Distributed Training: Scaling to a Cluster — The training infrastructure design uses the data-parallel, model-parallel, and pipeline-parallel strategies covered in the distributed training lesson
Questions for Reflection
- Design the serving infrastructure for a real-time content recommendation system that must serve 10 million users simultaneously with p99 < 100ms using a 7B parameter model. What are the critical components and bottlenecks?
- A production DL model's output distribution shifts significantly on Mondays (more short queries) vs. Fridays (more long queries). How would monitoring detect this, and would a retrain be warranted?
- When designing a shadow mode deployment for a new image classification model, what metrics would indicate the new model is ready for canary release, and what threshold would trigger an abort?
Related Lessons
- dl-12 — Training infrastructure builds on distributed training
- dl-19 — Serving relies on quantization to cut latency and cost
- dl-21 — System design knowledge powers interview scaling questions
- ml-55-ml-system-design — Same end-to-end ML system design discipline
- ml-47-model-monitoring — Production monitoring detects drift and degradation
- ml-45-mlops-pipeline — Pipelines and experiment tracking structure the workflow