AI Engineering

Model Distillation: Making a Small Model as Smart as a Large One

Цели урока

  • Understand knowledge distillation: teacher-student paradigm, soft labels, temperature scaling
  • Build the pipeline: seed generation → teacher inference → quality filtering → JSONL
  • Implement end-to-end distillation: GPT-4o → Llama 8B with QLoRA
  • Know when distillation beats fine-tuning and how to combine them

DeepSeek R1 distilled knowledge from a larger model - and got 90% quality at 10% inference cost. Distillation is when a large model teaches a small one to think the same way. GPT-4o-mini is a distillate of GPT-4. Stanford Alpaca 2023: USD 500 and one weekend - LLaMA 7B at GPT-3.5 level. Microsoft Orca: systematic distillation from GPT-4 produced a 13B model competing with ChatGPT. One engineer with a dinner budget can build a model that handles a specific production task better than the flagship.

  • GPT-4o-mini (2024) - officially a distillate of GPT-4, 16x cheaper with comparable quality on most tasks
  • DeepSeek R1 (2025) - cascaded distillation of reasoning from larger models: 90% quality, 10% cost
  • Stanford Alpaca: 52K examples from GPT-3.5 for USD 500 → LLaMA 7B at GPT-3.5 level
  • Enterprise pattern: distilled models handle 80% of requests, GPT-4o as fallback for the hard 20%

From Hinton to DeepSeek

**2015: Hinton, Vinyals, Dean** - "Distilling the Knowledge in a Neural Network". Soft labels with temperature scaling: instead of a hard label, the teacher passes a probability distribution. One idea - and the small model gets 10x more signal. **2023: Stanford Alpaca** - 52K synthetic examples for USD 500, LLaMA 7B competes with GPT-3.5. The era of mass LLM distillation opens. **2024: GPT-4o-mini** - OpenAI officially calls it a distillate. 16x cheaper than GPT-4o at 90%+ quality on standard tasks. **2025: DeepSeek R1** - cascaded distillation of reasoning. 671B → 70B → 8B. Each step retains 85-90% of the previous model's quality. In 10 years, distillation went from an academic technique to a core production AI strategy.

Предварительные знания

  • Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data
  • Open Source Models: Llama, Mistral, Qwen, Gemma - Choosing an Alternative to GPT

Knowledge Distillation: Transferring Knowledge Between Models

2015. Geoffrey Hinton publishes "Distilling the Knowledge in a Neural Network". The idea is simple and explosive: instead of training a small model on correct answers - train it to **think like a large one**. Teacher-student. The teacher (large model) passes the student (small model) not just "the correct class" but a probability distribution across all classes.

What's the difference? A hard label says: "this is a cat". A soft label from the teacher says: "cat 0.85, lynx 0.09, dog 0.04". These **soft probabilities** carry information about the structure of the problem - how similar the classes are to each other. The student gets 10x more signal from the same dataset. This is the magic of Hinton's distillation.

In 2024-2025, distillation reached a new level. GPT-4o-mini is a distillate of GPT-4. DeepSeek R1 distilled knowledge from larger reasoning models and achieved 90% quality at 10% inference cost. In the LLM context the mechanics differ: there's no access to a closed model's logits, so the student learns from **teacher-generated texts** via Supervised Fine-Tuning.

Why does distillation work? GPT-4o has ~1.8T parameters. Most of them encode "general knowledge" - grammar, logic, world facts. For a specific task (ticket classification, summarization), only a small fraction is needed. A small model can't store everything GPT-4o knows - but it can learn the patterns for one task. Distillation is the art of **squeezing what's needed from a large model into a small one**.

MetricGPT-4o (teacher)Llama 8B (base)Llama 8B (distilled)Savings
Accuracy94%72%89%-
Latency (p50)800ms80ms80ms10x
Cost per 1K reqUSD 2.50USD 0 (local)USD 0 (local)100%
Monthly (100K req/day)USD 7,500~USD 800 (GPU)~USD 800 (GPU)9x

**Real case: Stanford Alpaca (2023).** 52K examples generated by text-davinci-003 for USD 500. LLaMA 7B fine-tuned on this data showed quality comparable to GPT-3.5. One weekend, 500 dollars - and an open-source model competes with OpenAI's flagship. This opened the era of mass distillation.

What's the difference between LLM distillation (2024-2026) and classical knowledge distillation (Hinton 2015)?

Teacher-Student Pipeline: Generating Synthetic Data

The most critical part of distillation is generating high-quality training data. The principle is simple: ask the teacher the same questions the student will face in production, collect high-quality answers. Garbage in, garbage out. The student imitates the teacher - including its mistakes.

VolumeGPT-4o cost (generation)GPT-4o-mini cost (filtering)Time
1,000 examples~USD 6~USD 0.70~15 min
5,000 examples~USD 30~USD 3.40~1 hr
10,000 examples~USD 60~USD 6.80~2 hrs
50,000 examples~USD 300~USD 34~10 hrs

**Terms of Service.** OpenAI Terms prohibit using GPT-4 output to train **competing** models. A specialized model for internal use is allowed. Always check the provider's ToS before distillation.

The teacher (GPT-4o) generated 5,000 responses. 15% contain errors. What happens if the student trains without filtering?

End-to-End Pipeline: From GPT-4o to Llama 8B

Full pipeline - from task definition to deployed student model. Example: classifying support tickets into 12 categories. GPT-4o processes one ticket for ~USD 0.003. At 100K requests per day that's USD 300 per day - USD 9,000/month. A distilled Llama 8B on one A100 GPU (USD 2 per hour), with 50ms latency, handles the same load for USD 1,440/month. At 89% accuracy vs 94%.

**Step 3: Fine-tuning the student** with QLoRA via Unsloth:

In the pipeline, synthetic data comes from GPT-4o, and the student is fine-tuned via QLoRA on Llama 8B. Why QLoRA?

When Distillation Beats Fine-Tuning and Vice Versa

Distillation and fine-tuning are different tools. Fine-tuning uses **real data** (human-labeled), distillation uses **synthetic data** from the teacher. The choice depends on resources and the task. Neither approach always wins - but there's a clear decision framework.

CriterionFine-tuning (real data)Distillation (synthetic)
Data requirementsLabeled examples from humansOnly teacher + prompts
Data costUSD 0.10-5/example (annotation)~USD 0.003 per example (API)
Quality ceilingLimited by human labelsLimited by teacher model
Iteration speedSlow (annotation)Fast (hours)
Domain expertiseNeed domain expertsTeacher must understand the domain
ScalingExpensive (more annotators)Cheap (more API calls)
Unique tasksBetter (experts know nuances)Worse (teacher may not know)

The **combined approach** often yields the best results:

ApproachAccuracyCostTime to deploy
GPT-4o zero-shot94%USD 0 (pay per use)1 day
Fine-tuning only (300 human)85%~USD 1,6003 weeks
Distillation only (5K synthetic)89%~USD 803 days
Combined (5K + 300 human)92%~USD 1,6803 weeks
Combined + alignment stage93%~USD 1,7003.5 weeks

**Progressive Distillation** - an advanced technique: GPT-4o → Llama 70B → Llama 8B. The intermediate model generates more "relevant" training data for the small model because it's architecturally closer. DeepSeek R1 used a similar cascading approach.

Task: summarizing medical reports. There are 200 expert-labeled examples and a budget for the GPT-4o API. Best approach?

Distillation is just fine-tuning on teacher outputs

Classical distillation (Hinton 2015) is fundamentally different: the student learns from soft labels with temperature scaling, not from correct answers

A hard label "cat" carries 1 bit of information. A soft label "cat: 0.85, lynx: 0.09, tiger: 0.04" carries information about the **structure of the class space** - how similar the classes are to each other. Temperature scaling (T=4) softens the distribution further, making small probabilities significant. The student gains knowledge about semantic proximity between concepts - knowledge that's completely lost in hard labels. In the LLM context, when there's no access to logits, SFT on texts is used - this is closer to imitation than to classical distillation. GPT-4o-mini was trained with access to soft labels - hence its advantage over naive SFT.

Key Takeaways

  • Hinton 2015: soft labels with temperature carry task structure, not just the correct answer
  • LLM distillation (2024+): teacher generates synthetic data → student fine-tuned via SFT
  • GPT-4o-mini is a distillate of GPT-4; DeepSeek R1 uses cascaded reasoning distillation
  • Pipeline: seed generation → teacher inference (temperature=0) → quality filtering → JSONL
  • Student achieves 90-95% quality with 10x latency reduction and USD 0 inference cost
  • Combined approach (synthetic + human with oversampling) beats each method individually
  • ToS: can't distill into a competing LLM service; internal use is allowed

What's Next

A distilled model needs to be deployed. The next lessons cover local deployment and production serving.

  • Local LLM — Distilled GGUF → Ollama for local inference
  • Model Serving — Production deployment of a distilled model - TGI, vLLM, autoscaling
  • Cost Management — Distillation as a cost reduction strategy alongside caching and routing

Связанные уроки

  • aie-36-fine-tuning — Distillation builds on fine-tuning workflow
  • aie-37-open-source-models — Student models are usually open weights
  • aie-39-local-models — Small distilled models run locally
  • aie-40-model-serving — Smaller models serve faster and cheaper
  • ml-41-transfer-learning — Transfer teacher knowledge into a smaller student
  • ml-07
Model Distillation: Making a Small Model as Smart as a Large One

0

1

Sign In