AI Engineering

Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

Цели урока

  • Understand why real labeled data is always scarce and what it costs
  • Learn Self-Instruct and Evol-Instruct as dataset generation methods
  • See how to filter synthetic data: deduplication, LLM-as-judge, diversity metrics
  • Build a production pipeline with Distilabel and fine-tuning formatting

Alpaca (Stanford, 2023): 52,000 instruction-following examples generated for USD 500. LLaMA-7B fine-tuned on this data was indistinguishable from text-davinci-003 in blind evaluation. WizardLM with Evol-Instruct beat ChatGPT. Phi-1 with 1.3B parameters outperformed 13B models - on purely synthetic data. The scaling law principle was turned on its head: data quality matters more than model size.

  • Alpaca (Stanford) - first public distillation: USD 500 for generation, result comparable to GPT-3.5
  • WizardLM-70B with Evol-Instruct outperformed ChatGPT on the Vicuna benchmark in 2023
  • Phi-1.5 (Microsoft) - 1.3B parameters, synthetic data, outperforms 13B models on coding tasks
  • Meta Llama 3 uses synthetically generated data as a substantial portion of its training set

From Web Crawl to Synthetic Curricula

**2022 - Self-Instruct** (Wang et al., ACL 2023): bootstrapping from 175 seed tasks, instruction following without RLHF. **March 2023 - Alpaca** (Stanford): 52K examples for USD 500 via text-davinci-003, first public open-source distillation. **April 2023 - WizardLM** (Microsoft): Evol-Instruct with five complexity operators, 70B model surpasses ChatGPT. **June 2023 - Phi-1** (Microsoft Research): 1.3B parameters on 'textbook-quality' synthetics outperforms 13B models. **2024 - Distilabel 1.0**: production-grade framework for synthetic pipelines. **2025**: synthetic data is a standard component of all frontier model training, including Llama 3 and Gemini.

Предварительные знания

  • Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data

Why Real Data Is Always Scarce

Fine-tuning requires data. Domain data is scarce. Labeling is expensive. This is the fundamental triangle: a model is needed for a narrow domain, data is unavailable, annotators are costly. The solution came from an unexpected direction: ask a strong model to generate data for training a weaker one.

Concrete numbers: Llama 2 Chat required roughly 40,000 question-answer pairs with RLHF annotation. Each pair represents human annotator effort: USD 1-2 per example. Total USD 40,000-80,000 just for the instruction-following dataset, before preference data. For most companies - out of reach.

Data TypeLabeling CostAvailabilityAlternative
Instruction followingUSD 1-2 / exampleModerateSelf-Instruct generation
RLHF preference pairsUSD 3-5 / pairLowConstitutional AI
Domain Q&AUSD 2-5 / questionLow (no experts)GPT-4 distillation
Code reviewsUSD 5-20 / reviewVery lowEvol-Instruct + code LLM

Phi-1 (Microsoft, 2023) - a 1.3B parameter model trained on 'textbook-quality' synthetic data. On coding benchmarks it outperformed CodeLlama 13B and GPT-3.5-turbo. The researchers' conclusion: **data quality matters more than data volume or model size**. This reshaped the industry's approach to fine-tuning.

Alpaca (Stanford, 2023) was the first public distillation demonstration: 52,000 instruction-following examples generated by GPT-3 text-davinci-003 for USD 500. Fine-tuned LLaMA-7B behaved as an instruction-following model, indistinguishable from text-davinci-003 in blind evaluations. The data distillation era had begun.

The key finding from Microsoft's Phi-1 research on synthetic data was:

Self-Instruct and Evol-Instruct: Bootstrapping Through LLMs

**Self-Instruct** (Wang et al., 2022) is a bootstrap method: start with a small seed set of tasks (175 examples) and ask an LLM to generate new tasks in the same spirit. Each new task is checked for dissimilarity to existing ones (ROUGE similarity below 0.7), and low-quality outputs are filtered out.

**Evol-Instruct** (Xu et al., WizardLM, 2023) uses evolution: take an existing instruction and make it harder. Five complexity operators: add constraints, increase reasoning, concretize, increase input length, complicate. Result: WizardLM-70B outperformed ChatGPT (GPT-3.5-turbo) on the Vicuna benchmark.

WizardCoder (2023, WizardLM team) applied Evol-Instruct to code generation: StarCoder 15B with evolved data scored 57.3% on HumanEval, beating GPT-3.5 (48.1%) and Claude-v1 (56.0%). A model 4x smaller than GPT-3.5 - but with better data.

What distinguishes Evol-Instruct from Self-Instruct?

Quality vs Quantity: How to Avoid Generating Garbage

One million synthetic examples sounds good. In practice, without filtering that is 60-80% garbage: repeated patterns, wrong answers, overly easy tasks, generator hallucinations. A model trained on such a dataset learns to be a confident garbage producer.

**Phi-1 and 'textbook quality'**: Microsoft filtered 6 trillion Common Crawl tokens down to 7 billion 'textbook-quality' tokens using an LLM classifier. Then synthetically generated 1 billion more. Result: a 1.3B model outperformed 13B models. Quality beats quantity.

Diversity metrics: after filtering, check diversity via embedding clustering. If 70% of the dataset clusters into 3 groups, the data is homogeneous and the model will train on a narrow distribution. Tool: Argilla + UMAP visualization of dataset embeddings.

Why did Microsoft Phi-1 outperform models 10x its size using synthetic data?

Synthetic Data Pipeline in Practice

The full pipeline: seed examples -> generation -> filtering -> deduplication -> formatting for fine-tuning. Each stage affects the final model quality.

**Ecosystem tools** for production-grade synthetic data:

ToolPurposeStrengths
Distilabel (Argilla)Pipeline for synthetic data with built-in filteringDeclarative API, supports any LLM
ArgillaAnnotation + review of synthetic dataUI for human review, HuggingFace integration
LabelStudioUniversal annotation toolSupports QA, NER, preferences - all in one
DataDreamerSynthetic dataset generation frameworkResearch-oriented, reproducible

Distillation from OpenAI/Anthropic API is prohibited by their Terms of Service (using GPT-4 output to train a competing model is not allowed). For commercial use: generate data via open-source models (Llama 3, Mixtral) or obtain a license. Meta permits using Llama for distillation into other models under the Meta Community License terms.

Why is it not permitted to use GPT-4 output to train a commercial model?

Synthetic data is worse than real data - models learn the generator's hallucinations

With proper filtering, synthetic data outperforms unfiltered real data in training efficiency

Phi-1 and WizardLM demonstrated this empirically. Key conditions: a strong generator (GPT-4 or Llama 70B), multi-layer filtering, diversity control. Unfiltered synthetic data is indeed harmful - which is why filtering is critical.

More data equals a better model; millions of examples must be generated

10,000 high-quality examples often outperform 1,000,000 low-quality ones

LIMA (Less Is More for Alignment, 2023): 1,000 high-quality instruction-following examples produced a model comparable to an RLHF-trained Llama on 50x more data. Quality over quantity is not a metaphor - it is a measured effect.

Key Takeaways

  • Manual data labeling costs USD 1-5 per example - LLM distillation is 100x cheaper
  • Self-Instruct generates new tasks from seed; Evol-Instruct makes existing ones harder via 5 operators
  • Filtering quality matters more than volume: deduplication + LLM-as-judge + diversity metrics
  • Phi-1 (1.3B) outperforms 13B models on synthetics - quality beats scale
  • Commercial restriction: OpenAI/Anthropic ToS prohibits using their output to train competing models

Вопросы для размышления

  • For what task in the current project could a synthetic dataset be created? What seed set would be needed to start?
  • How can one verify that a synthetically trained model has not overfit to the generator's distribution?
  • If GPT-4 cannot be used for generation due to ToS, which open-source alternatives would suit the domain?

Related Topics

Synthetic data is the fuel for fine-tuning. The next question is how to deploy and evaluate the trained model.

  • Fine-tuning — Applying synthetic data - the full fine-tuning cycle
  • Local Models — Distilled models are often run locally for cost efficiency
  • Evaluation — How to measure the quality of a model trained on synthetic data

Связанные уроки

  • aie-36-fine-tuning
  • aie-31-evaluation
  • aie-39-local-models
  • aie-40-model-serving
  • aie-65-alignment-rlhf-dpo
Synthetic Data: GPT-4 Trains Llama - How Data Distillation Works

0

1

Sign In