AI Engineering

Model Distillation: Making a Small Model as Smart as a Large One

Цели урока

Understand knowledge distillation: teacher-student paradigm, soft labels, temperature scaling
Build the pipeline: seed generation → teacher inference → quality filtering → JSONL
Implement end-to-end distillation: GPT-4o → Llama 8B with QLoRA
Know when distillation beats fine-tuning and how to combine them

DeepSeek R1 distilled knowledge from a larger model - and got 90% quality at 10% inference cost. Distillation is when a large model teaches a small one to think the same way. GPT-4o-mini is a distillate of GPT-4. Stanford Alpaca 2023: USD 500 and one weekend - LLaMA 7B at GPT-3.5 level. Microsoft Orca: systematic distillation from GPT-4 produced a 13B model competing with ChatGPT. One engineer with a dinner budget can build a model that handles a specific production task better than the flagship.

GPT-4o-mini (2024) - officially a distillate of GPT-4, 16x cheaper with comparable quality on most tasks
DeepSeek R1 (2025) - cascaded distillation of reasoning from larger models: 90% quality, 10% cost
Stanford Alpaca: 52K examples from GPT-3.5 for USD 500 → LLaMA 7B at GPT-3.5 level
Enterprise pattern: distilled models handle 80% of requests, GPT-4o as fallback for the hard 20%

From Hinton to DeepSeek

**2015: Hinton, Vinyals, Dean** - "Distilling the Knowledge in a Neural Network". Soft labels with temperature scaling: instead of a hard label, the teacher passes a probability distribution. One idea - and the small model gets 10x more signal. **2023: Stanford Alpaca** - 52K synthetic examples for USD 500, LLaMA 7B competes with GPT-3.5. The era of mass LLM distillation opens. **2024: GPT-4o-mini** - OpenAI officially calls it a distillate. 16x cheaper than GPT-4o at 90%+ quality on standard tasks. **2025: DeepSeek R1** - cascaded distillation of reasoning. 671B → 70B → 8B. Each step retains 85-90% of the previous model's quality. In 10 years, distillation went from an academic technique to a core production AI strategy.

Предварительные знания

Knowledge Distillation: Transferring Knowledge Between Models

2015. Geoffrey Hinton publishes "Distilling the Knowledge in a Neural Network". The idea is simple and explosive: instead of training a small model on correct answers - train it to **think like a large one**. Teacher-student. The teacher (large model) passes the student (small model) not just "the correct class" but a probability distribution across all classes.

What's the difference? A hard label says: "this is a cat". A soft label from the teacher says: "cat 0.85, lynx 0.09, dog 0.04". These **soft probabilities** carry information about the structure of the problem - how similar the classes are to each other. The student gets 10x more signal from the same dataset. This is the magic of Hinton's distillation.

In 2024-2025, distillation reached a new level. GPT-4o-mini is a distillate of GPT-4. DeepSeek R1 distilled knowledge from larger reasoning models and achieved 90% quality at 10% inference cost. In the LLM context the mechanics differ: there's no access to a closed model's logits, so the student learns from **teacher-generated texts** via Supervised Fine-Tuning.

Why does distillation work? GPT-4o has ~1.8T parameters. Most of them encode "general knowledge" - grammar, logic, world facts. For a specific task (ticket classification, summarization), only a small fraction is needed. A small model can't store everything GPT-4o knows - but it can learn the patterns for one task. Distillation is the art of **squeezing what's needed from a large model into a small one**.

Metric	GPT-4o (teacher)	Llama 8B (base)	Llama 8B (distilled)	Savings
Accuracy	94%	72%	89%	-
Latency (p50)	800ms	80ms	80ms	10x
Cost per 1K req	USD 2.50	USD 0 (local)	USD 0 (local)	100%
Monthly (100K req/day)	USD 7,500	~USD 800 (GPU)	~USD 800 (GPU)	9x

**Real case: Stanford Alpaca (2023).** 52K examples generated by text-davinci-003 for USD 500. LLaMA 7B fine-tuned on this data showed quality comparable to GPT-3.5. One weekend, 500 dollars - and an open-source model competes with OpenAI's flagship. This opened the era of mass distillation.

What's the difference between LLM distillation (2024-2026) and classical knowledge distillation (Hinton 2015)?

Teacher-Student Pipeline: Generating Synthetic Data

The most critical part of distillation is generating high-quality training data. The principle is simple: ask the teacher the same questions the student will face in production, collect high-quality answers. Garbage in, garbage out. The student imitates the teacher - including its mistakes.

Volume	GPT-4o cost (generation)	GPT-4o-mini cost (filtering)	Time
1,000 examples	~USD 6	~USD 0.70	~15 min
5,000 examples	~USD 30	~USD 3.40	~1 hr
10,000 examples	~USD 60	~USD 6.80	~2 hrs
50,000 examples	~USD 300	~USD 34	~10 hrs

**Terms of Service.** OpenAI Terms prohibit using GPT-4 output to train **competing** models. A specialized model for internal use is allowed. Always check the provider's ToS before distillation.

The teacher (GPT-4o) generated 5,000 responses. 15% contain errors. What happens if the student trains without filtering?

End-to-End Pipeline: From GPT-4o to Llama 8B

Full pipeline - from task definition to deployed student model. Example: classifying support tickets into 12 categories. GPT-4o processes one ticket for ~USD 0.003. At 100K requests per day that's USD 300 per day - USD 9,000/month. A distilled Llama 8B on one A100 GPU (USD 2 per hour), with 50ms latency, handles the same load for USD 1,440/month. At 89% accuracy vs 94%.

**Step 3: Fine-tuning the student** with QLoRA via Unsloth:

In the pipeline, synthetic data comes from GPT-4o, and the student is fine-tuned via QLoRA on Llama 8B. Why QLoRA?

When Distillation Beats Fine-Tuning and Vice Versa

Distillation and fine-tuning are different tools. Fine-tuning uses **real data** (human-labeled), distillation uses **synthetic data** from the teacher. The choice depends on resources and the task. Neither approach always wins - but there's a clear decision framework.

Criterion	Fine-tuning (real data)	Distillation (synthetic)
Data requirements	Labeled examples from humans	Only teacher + prompts
Data cost	USD 0.10-5/example (annotation)	~USD 0.003 per example (API)
Quality ceiling	Limited by human labels	Limited by teacher model
Iteration speed	Slow (annotation)	Fast (hours)
Domain expertise	Need domain experts	Teacher must understand the domain
Scaling	Expensive (more annotators)	Cheap (more API calls)
Unique tasks	Better (experts know nuances)	Worse (teacher may not know)

The **combined approach** often yields the best results:

Approach	Accuracy	Cost	Time to deploy
GPT-4o zero-shot	94%	USD 0 (pay per use)	1 day
Fine-tuning only (300 human)	85%	~USD 1,600	3 weeks
Distillation only (5K synthetic)	89%	~USD 80	3 days
Combined (5K + 300 human)	92%	~USD 1,680	3 weeks
Combined + alignment stage	93%	~USD 1,700	3.5 weeks

**Progressive Distillation** - an advanced technique: GPT-4o → Llama 70B → Llama 8B. The intermediate model generates more "relevant" training data for the small model because it's architecturally closer. DeepSeek R1 used a similar cascading approach.

Task: summarizing medical reports. There are 200 expert-labeled examples and a budget for the GPT-4o API. Best approach?

Distillation is just fine-tuning on teacher outputs

Classical distillation (Hinton 2015) is fundamentally different: the student learns from soft labels with temperature scaling, not from correct answers

A hard label "cat" carries 1 bit of information. A soft label "cat: 0.85, lynx: 0.09, tiger: 0.04" carries information about the **structure of the class space** - how similar the classes are to each other. Temperature scaling (T=4) softens the distribution further, making small probabilities significant. The student gains knowledge about semantic proximity between concepts - knowledge that's completely lost in hard labels. In the LLM context, when there's no access to logits, SFT on texts is used - this is closer to imitation than to classical distillation. GPT-4o-mini was trained with access to soft labels - hence its advantage over naive SFT.

Key Takeaways

Hinton 2015: soft labels with temperature carry task structure, not just the correct answer
LLM distillation (2024+): teacher generates synthetic data → student fine-tuned via SFT
GPT-4o-mini is a distillate of GPT-4; DeepSeek R1 uses cascaded reasoning distillation
Pipeline: seed generation → teacher inference (temperature=0) → quality filtering → JSONL
Student achieves 90-95% quality with 10x latency reduction and USD 0 inference cost
Combined approach (synthetic + human with oversampling) beats each method individually
ToS: can't distill into a competing LLM service; internal use is allowed

What's Next

A distilled model needs to be deployed. The next lessons cover local deployment and production serving.

Local LLM — Distilled GGUF → Ollama for local inference
Model Serving — Production deployment of a distilled model - TGI, vLLM, autoscaling
Cost Management — Distillation as a cost reduction strategy alongside caching and routing

Связанные уроки

aie-36-fine-tuning — Distillation builds on fine-tuning workflow
aie-37-open-source-models — Student models are usually open weights
aie-39-local-models — Small distilled models run locally
aie-40-model-serving — Smaller models serve faster and cheaper
ml-41-transfer-learning — Transfer teacher knowledge into a smaller student
ml-07