AI Engineering

Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

Цели урока

Understand why real labeled data is always scarce and what it costs
Learn Self-Instruct and Evol-Instruct as dataset generation methods
See how to filter synthetic data: deduplication, LLM-as-judge, diversity metrics
Build a production pipeline with Distilabel and fine-tuning formatting

Alpaca (Stanford, 2023): 52,000 instruction-following examples generated for USD 500. LLaMA-7B fine-tuned on this data was indistinguishable from text-davinci-003 in blind evaluation. WizardLM with Evol-Instruct beat ChatGPT. Phi-1 with 1.3B parameters outperformed 13B models - on purely synthetic data. The scaling law principle was turned on its head: data quality matters more than model size.

Alpaca (Stanford) - first public distillation: USD 500 for generation, result comparable to GPT-3.5
WizardLM-70B with Evol-Instruct outperformed ChatGPT on the Vicuna benchmark in 2023
Phi-1.5 (Microsoft) - 1.3B parameters, synthetic data, outperforms 13B models on coding tasks
Meta Llama 3 uses synthetically generated data as a substantial portion of its training set

From Web Crawl to Synthetic Curricula

**2022 - Self-Instruct** (Wang et al., ACL 2023): bootstrapping from 175 seed tasks, instruction following without RLHF. **March 2023 - Alpaca** (Stanford): 52K examples for USD 500 via text-davinci-003, first public open-source distillation. **April 2023 - WizardLM** (Microsoft): Evol-Instruct with five complexity operators, 70B model surpasses ChatGPT. **June 2023 - Phi-1** (Microsoft Research): 1.3B parameters on 'textbook-quality' synthetics outperforms 13B models. **2024 - Distilabel 1.0**: production-grade framework for synthetic pipelines. **2025**: synthetic data is a standard component of all frontier model training, including Llama 3 and Gemini.

Предварительные знания

Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data

Why Real Data Is Always Scarce

Fine-tuning requires data. Domain data is scarce. Labeling is expensive. This is the fundamental triangle: a model is needed for a narrow domain, data is unavailable, annotators are costly. The solution came from an unexpected direction: ask a strong model to generate data for training a weaker one.

Concrete numbers: Llama 2 Chat required roughly 40,000 question-answer pairs with RLHF annotation. Each pair represents human annotator effort: USD 1-2 per example. Total USD 40,000-80,000 just for the instruction-following dataset, before preference data. For most companies - out of reach.

Data Type	Labeling Cost	Availability	Alternative
Instruction following	USD 1-2 / example	Moderate	Self-Instruct generation
RLHF preference pairs	USD 3-5 / pair	Low	Constitutional AI
Domain Q&A	USD 2-5 / question	Low (no experts)	GPT-4 distillation
Code reviews	USD 5-20 / review	Very low	Evol-Instruct + code LLM

Phi-1 (Microsoft, 2023) - a 1.3B parameter model trained on 'textbook-quality' synthetic data. On coding benchmarks it outperformed CodeLlama 13B and GPT-3.5-turbo. The researchers' conclusion: **data quality matters more than data volume or model size**. This reshaped the industry's approach to fine-tuning.

Alpaca (Stanford, 2023) was the first public distillation demonstration: 52,000 instruction-following examples generated by GPT-3 text-davinci-003 for USD 500. Fine-tuned LLaMA-7B behaved as an instruction-following model, indistinguishable from text-davinci-003 in blind evaluations. The data distillation era had begun.

The key finding from Microsoft's Phi-1 research on synthetic data was:

Self-Instruct and Evol-Instruct: Bootstrapping Through LLMs

**Self-Instruct** (Wang et al., 2022) is a bootstrap method: start with a small seed set of tasks (175 examples) and ask an LLM to generate new tasks in the same spirit. Each new task is checked for dissimilarity to existing ones (ROUGE similarity below 0.7), and low-quality outputs are filtered out.

**Evol-Instruct** (Xu et al., WizardLM, 2023) uses evolution: take an existing instruction and make it harder. Five complexity operators: add constraints, increase reasoning, concretize, increase input length, complicate. Result: WizardLM-70B outperformed ChatGPT (GPT-3.5-turbo) on the Vicuna benchmark.

WizardCoder (2023, WizardLM team) applied Evol-Instruct to code generation: StarCoder 15B with evolved data scored 57.3% on HumanEval, beating GPT-3.5 (48.1%) and Claude-v1 (56.0%). A model 4x smaller than GPT-3.5 - but with better data.

What distinguishes Evol-Instruct from Self-Instruct?

Quality vs Quantity: How to Avoid Generating Garbage

One million synthetic examples sounds good. In practice, without filtering that is 60-80% garbage: repeated patterns, wrong answers, overly easy tasks, generator hallucinations. A model trained on such a dataset learns to be a confident garbage producer.

**Phi-1 and 'textbook quality'**: Microsoft filtered 6 trillion Common Crawl tokens down to 7 billion 'textbook-quality' tokens using an LLM classifier. Then synthetically generated 1 billion more. Result: a 1.3B model outperformed 13B models. Quality beats quantity.

Diversity metrics: after filtering, check diversity via embedding clustering. If 70% of the dataset clusters into 3 groups, the data is homogeneous and the model will train on a narrow distribution. Tool: Argilla + UMAP visualization of dataset embeddings.

Why did Microsoft Phi-1 outperform models 10x its size using synthetic data?

Synthetic Data Pipeline in Practice

The full pipeline: seed examples -> generation -> filtering -> deduplication -> formatting for fine-tuning. Each stage affects the final model quality.

**Ecosystem tools** for production-grade synthetic data:

Tool	Purpose	Strengths
Distilabel (Argilla)	Pipeline for synthetic data with built-in filtering	Declarative API, supports any LLM
Argilla	Annotation + review of synthetic data	UI for human review, HuggingFace integration
LabelStudio	Universal annotation tool	Supports QA, NER, preferences - all in one
DataDreamer	Synthetic dataset generation framework	Research-oriented, reproducible

Distillation from OpenAI/Anthropic API is prohibited by their Terms of Service (using GPT-4 output to train a competing model is not allowed). For commercial use: generate data via open-source models (Llama 3, Mixtral) or obtain a license. Meta permits using Llama for distillation into other models under the Meta Community License terms.

Why is it not permitted to use GPT-4 output to train a commercial model?

Synthetic data is worse than real data - models learn the generator's hallucinations

With proper filtering, synthetic data outperforms unfiltered real data in training efficiency

Phi-1 and WizardLM demonstrated this empirically. Key conditions: a strong generator (GPT-4 or Llama 70B), multi-layer filtering, diversity control. Unfiltered synthetic data is indeed harmful - which is why filtering is critical.

More data equals a better model; millions of examples must be generated

10,000 high-quality examples often outperform 1,000,000 low-quality ones

LIMA (Less Is More for Alignment, 2023): 1,000 high-quality instruction-following examples produced a model comparable to an RLHF-trained Llama on 50x more data. Quality over quantity is not a metaphor - it is a measured effect.

Key Takeaways

Manual data labeling costs USD 1-5 per example - LLM distillation is 100x cheaper
Self-Instruct generates new tasks from seed; Evol-Instruct makes existing ones harder via 5 operators
Filtering quality matters more than volume: deduplication + LLM-as-judge + diversity metrics
Phi-1 (1.3B) outperforms 13B models on synthetics - quality beats scale
Commercial restriction: OpenAI/Anthropic ToS prohibits using their output to train competing models

Вопросы для размышления

For what task in the current project could a synthetic dataset be created? What seed set would be needed to start?
How can one verify that a synthetically trained model has not overfit to the generator's distribution?
If GPT-4 cannot be used for generation due to ToS, which open-source alternatives would suit the domain?

Связанные уроки

AI Engineering

Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

Цели урока

Understand why real labeled data is always scarce and what it costs
Learn Self-Instruct and Evol-Instruct as dataset generation methods
See how to filter synthetic data: deduplication, LLM-as-judge, diversity metrics
Build a production pipeline with Distilabel and fine-tuning formatting

Alpaca (Stanford) - first public distillation: USD 500 for generation, result comparable to GPT-3.5
WizardLM-70B with Evol-Instruct outperformed ChatGPT on the Vicuna benchmark in 2023
Phi-1.5 (Microsoft) - 1.3B parameters, synthetic data, outperforms 13B models on coding tasks
Meta Llama 3 uses synthetically generated data as a substantial portion of its training set

From Web Crawl to Synthetic Curricula

Предварительные знания

Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data

Why Real Data Is Always Scarce

Data Type	Labeling Cost	Availability	Alternative
Instruction following	USD 1-2 / example	Moderate	Self-Instruct generation
RLHF preference pairs	USD 3-5 / pair	Low	Constitutional AI
Domain Q&A	USD 2-5 / question	Low (no experts)	GPT-4 distillation
Code reviews	USD 5-20 / review	Very low	Evol-Instruct + code LLM

The key finding from Microsoft's Phi-1 research on synthetic data was:

Self-Instruct and Evol-Instruct: Bootstrapping Through LLMs

What distinguishes Evol-Instruct from Self-Instruct?

Quality vs Quantity: How to Avoid Generating Garbage

Why did Microsoft Phi-1 outperform models 10x its size using synthetic data?

Synthetic Data Pipeline in Practice

The full pipeline: seed examples -> generation -> filtering -> deduplication -> formatting for fine-tuning. Each stage affects the final model quality.

**Ecosystem tools** for production-grade synthetic data:

Tool	Purpose	Strengths
Distilabel (Argilla)	Pipeline for synthetic data with built-in filtering	Declarative API, supports any LLM
Argilla	Annotation + review of synthetic data	UI for human review, HuggingFace integration
LabelStudio	Universal annotation tool	Supports QA, NER, preferences - all in one
DataDreamer	Synthetic dataset generation framework	Research-oriented, reproducible

Why is it not permitted to use GPT-4 output to train a commercial model?

Synthetic data is worse than real data - models learn the generator's hallucinations

With proper filtering, synthetic data outperforms unfiltered real data in training efficiency

More data equals a better model; millions of examples must be generated

10,000 high-quality examples often outperform 1,000,000 low-quality ones

Key Takeaways

Manual data labeling costs USD 1-5 per example - LLM distillation is 100x cheaper
Self-Instruct generates new tasks from seed; Evol-Instruct makes existing ones harder via 5 operators
Filtering quality matters more than volume: deduplication + LLM-as-judge + diversity metrics
Phi-1 (1.3B) outperforms 13B models on synthetics - quality beats scale
Commercial restriction: OpenAI/Anthropic ToS prohibits using their output to train competing models

Вопросы для размышления

For what task in the current project could a synthetic dataset be created? What seed set would be needed to start?
How can one verify that a synthetically trained model has not overfit to the generator's distribution?
If GPT-4 cannot be used for generation due to ToS, which open-source alternatives would suit the domain?

Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

Цели урока

From Web Crawl to Synthetic Curricula

Предварительные знания

Why Real Data Is Always Scarce

Self-Instruct and Evol-Instruct: Bootstrapping Through LLMs

Quality vs Quantity: How to Avoid Generating Garbage

Synthetic Data Pipeline in Practice

Key Takeaways

Вопросы для размышления

Related Topics

Связанные уроки

Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

Цели урока

From Web Crawl to Synthetic Curricula

Предварительные знания

Why Real Data Is Always Scarce

Self-Instruct and Evol-Instruct: Bootstrapping Through LLMs

Quality vs Quantity: How to Avoid Generating Garbage

Synthetic Data Pipeline in Practice

Key Takeaways

Вопросы для размышления

Related Topics

Связанные уроки