AI Engineering
Model Distillation: Making a Small Model as Smart as a Large One
Цели урока
- Understand knowledge distillation: teacher-student paradigm, soft labels, temperature scaling
- Build the pipeline: seed generation → teacher inference → quality filtering → JSONL
- Implement end-to-end distillation: GPT-4o → Llama 8B with QLoRA
- Know when distillation beats fine-tuning and how to combine them
DeepSeek R1 distilled knowledge from a larger model - and got 90% quality at 10% inference cost. Distillation is when a large model teaches a small one to think the same way. GPT-4o-mini is a distillate of GPT-4. Stanford Alpaca 2023: USD 500 and one weekend - LLaMA 7B at GPT-3.5 level. Microsoft Orca: systematic distillation from GPT-4 produced a 13B model competing with ChatGPT. One engineer with a dinner budget can build a model that handles a specific production task better than the flagship.
- GPT-4o-mini (2024) - officially a distillate of GPT-4, 16x cheaper with comparable quality on most tasks
- DeepSeek R1 (2025) - cascaded distillation of reasoning from larger models: 90% quality, 10% cost
- Stanford Alpaca: 52K examples from GPT-3.5 for USD 500 → LLaMA 7B at GPT-3.5 level
- Enterprise pattern: distilled models handle 80% of requests, GPT-4o as fallback for the hard 20%
From Hinton to DeepSeek
**2015: Hinton, Vinyals, Dean** - "Distilling the Knowledge in a Neural Network". Soft labels with temperature scaling: instead of a hard label, the teacher passes a probability distribution. One idea - and the small model gets 10x more signal. **2023: Stanford Alpaca** - 52K synthetic examples for USD 500, LLaMA 7B competes with GPT-3.5. The era of mass LLM distillation opens. **2024: GPT-4o-mini** - OpenAI officially calls it a distillate. 16x cheaper than GPT-4o at 90%+ quality on standard tasks. **2025: DeepSeek R1** - cascaded distillation of reasoning. 671B → 70B → 8B. Each step retains 85-90% of the previous model's quality. In 10 years, distillation went from an academic technique to a core production AI strategy.
Предварительные знания
Knowledge Distillation: Transferring Knowledge Between Models
2015. Geoffrey Hinton publishes "Distilling the Knowledge in a Neural Network". The idea is simple and explosive: instead of training a small model on correct answers - train it to **think like a large one**. Teacher-student. The teacher (large model) passes the student (small model) not just "the correct class" but a probability distribution across all classes.
What's the difference? A hard label says: "this is a cat". A soft label from the teacher says: "cat 0.85, lynx 0.09, dog 0.04". These **soft probabilities** carry information about the structure of the problem - how similar the classes are to each other. The student gets 10x more signal from the same dataset. This is the magic of Hinton's distillation.
In 2024-2025, distillation reached a new level. GPT-4o-mini is a distillate of GPT-4. DeepSeek R1 distilled knowledge from larger reasoning models and achieved 90% quality at 10% inference cost. In the LLM context the mechanics differ: there's no access to a closed model's logits, so the student learns from **teacher-generated texts** via Supervised Fine-Tuning.
Why does distillation work? GPT-4o has ~1.8T parameters. Most of them encode "general knowledge" - grammar, logic, world facts. For a specific task (ticket classification, summarization), only a small fraction is needed. A small model can't store everything GPT-4o knows - but it can learn the patterns for one task. Distillation is the art of **squeezing what's needed from a large model into a small one**.
| Metric | GPT-4o (teacher) | Llama 8B (base) | Llama 8B (distilled) | Savings |
|---|---|---|---|---|
| Accuracy | 94% | 72% | 89% | - |
| Latency (p50) | 800ms | 80ms | 80ms | 10x |
| Cost per 1K req | USD 2.50 | USD 0 (local) | USD 0 (local) | 100% |
| Monthly (100K req/day) | USD 7,500 | ~USD 800 (GPU) | ~USD 800 (GPU) | 9x |
**Real case: Stanford Alpaca (2023).** 52K examples generated by text-davinci-003 for USD 500. LLaMA 7B fine-tuned on this data showed quality comparable to GPT-3.5. One weekend, 500 dollars - and an open-source model competes with OpenAI's flagship. This opened the era of mass distillation.
What's the difference between LLM distillation (2024-2026) and classical knowledge distillation (Hinton 2015)?
Teacher-Student Pipeline: Generating Synthetic Data
The most critical part of distillation is generating high-quality training data. The principle is simple: ask the teacher the same questions the student will face in production, collect high-quality answers. Garbage in, garbage out. The student imitates the teacher - including its mistakes.
| Volume | GPT-4o cost (generation) | GPT-4o-mini cost (filtering) | Time |
|---|---|---|---|
| 1,000 examples | ~USD 6 | ~USD 0.70 | ~15 min |
| 5,000 examples | ~USD 30 | ~USD 3.40 | ~1 hr |
| 10,000 examples | ~USD 60 | ~USD 6.80 | ~2 hrs |
| 50,000 examples | ~USD 300 | ~USD 34 | ~10 hrs |
**Terms of Service.** OpenAI Terms prohibit using GPT-4 output to train **competing** models. A specialized model for internal use is allowed. Always check the provider's ToS before distillation.
The teacher (GPT-4o) generated 5,000 responses. 15% contain errors. What happens if the student trains without filtering?
End-to-End Pipeline: From GPT-4o to Llama 8B
Full pipeline - from task definition to deployed student model. Example: classifying support tickets into 12 categories. GPT-4o processes one ticket for ~USD 0.003. At 100K requests per day that's USD 300 per day - USD 9,000/month. A distilled Llama 8B on one A100 GPU (USD 2 per hour), with 50ms latency, handles the same load for USD 1,440/month. At 89% accuracy vs 94%.
**Step 3: Fine-tuning the student** with QLoRA via Unsloth:
In the pipeline, synthetic data comes from GPT-4o, and the student is fine-tuned via QLoRA on Llama 8B. Why QLoRA?
When Distillation Beats Fine-Tuning and Vice Versa
Distillation and fine-tuning are different tools. Fine-tuning uses **real data** (human-labeled), distillation uses **synthetic data** from the teacher. The choice depends on resources and the task. Neither approach always wins - but there's a clear decision framework.
| Criterion | Fine-tuning (real data) | Distillation (synthetic) |
|---|---|---|
| Data requirements | Labeled examples from humans | Only teacher + prompts |
| Data cost | USD 0.10-5/example (annotation) | ~USD 0.003 per example (API) |
| Quality ceiling | Limited by human labels | Limited by teacher model |
| Iteration speed | Slow (annotation) | Fast (hours) |
| Domain expertise | Need domain experts | Teacher must understand the domain |
| Scaling | Expensive (more annotators) | Cheap (more API calls) |
| Unique tasks | Better (experts know nuances) | Worse (teacher may not know) |
The **combined approach** often yields the best results:
| Approach | Accuracy | Cost | Time to deploy |
|---|---|---|---|
| GPT-4o zero-shot | 94% | USD 0 (pay per use) | 1 day |
| Fine-tuning only (300 human) | 85% | ~USD 1,600 | 3 weeks |
| Distillation only (5K synthetic) | 89% | ~USD 80 | 3 days |
| Combined (5K + 300 human) | 92% | ~USD 1,680 | 3 weeks |
| Combined + alignment stage | 93% | ~USD 1,700 | 3.5 weeks |
**Progressive Distillation** - an advanced technique: GPT-4o → Llama 70B → Llama 8B. The intermediate model generates more "relevant" training data for the small model because it's architecturally closer. DeepSeek R1 used a similar cascading approach.
Task: summarizing medical reports. There are 200 expert-labeled examples and a budget for the GPT-4o API. Best approach?
Distillation is just fine-tuning on teacher outputs
Classical distillation (Hinton 2015) is fundamentally different: the student learns from soft labels with temperature scaling, not from correct answers
A hard label "cat" carries 1 bit of information. A soft label "cat: 0.85, lynx: 0.09, tiger: 0.04" carries information about the **structure of the class space** - how similar the classes are to each other. Temperature scaling (T=4) softens the distribution further, making small probabilities significant. The student gains knowledge about semantic proximity between concepts - knowledge that's completely lost in hard labels. In the LLM context, when there's no access to logits, SFT on texts is used - this is closer to imitation than to classical distillation. GPT-4o-mini was trained with access to soft labels - hence its advantage over naive SFT.
Key Takeaways
- Hinton 2015: soft labels with temperature carry task structure, not just the correct answer
- LLM distillation (2024+): teacher generates synthetic data → student fine-tuned via SFT
- GPT-4o-mini is a distillate of GPT-4; DeepSeek R1 uses cascaded reasoning distillation
- Pipeline: seed generation → teacher inference (temperature=0) → quality filtering → JSONL
- Student achieves 90-95% quality with 10x latency reduction and USD 0 inference cost
- Combined approach (synthetic + human with oversampling) beats each method individually
- ToS: can't distill into a competing LLM service; internal use is allowed
What's Next
A distilled model needs to be deployed. The next lessons cover local deployment and production serving.
- Local LLM — Distilled GGUF → Ollama for local inference
- Model Serving — Production deployment of a distilled model - TGI, vLLM, autoscaling
- Cost Management — Distillation as a cost reduction strategy alongside caching and routing
Связанные уроки
- aie-36-fine-tuning — Distillation builds on fine-tuning workflow
- aie-37-open-source-models — Student models are usually open weights
- aie-39-local-models — Small distilled models run locally
- aie-40-model-serving — Smaller models serve faster and cheaper
- ml-41-transfer-learning — Transfer teacher knowledge into a smaller student
- ml-07