AI Engineering
Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data
Цели урока
- Determine when fine-tuning is justified versus prompt engineering or RAG
- Prepare and validate training data in JSONL format
- Run fine-tuning via the OpenAI API (USD 8 per 1M tokens for gpt-4o-mini) and monitor the process
- Evaluate fine-tuned model quality through automated metrics and LLM-as-Judge
- Understand LoRA (Hu et al. 2022) and QLoRA (Dettmers 2023) - efficient fine-tuning of open-source models on consumer GPUs
Fine-tuning is the last resort, not the first. RAG is cheaper, prompt engineering is faster. Fine-tuning is for when specific style, format, or behavior can't be achieved through a prompt. And the price range: from USD 50 (LoRA on llama-3-8b, 2 hours on an RTX 4090) to USD 100K+ (full GPT-4 fine-tune). A 2000x gap. One startup cut LLM expenses from USD 47,000 to USD 6,000 per month - just by switching to a fine-tuned mini model instead of gpt-4o.
- Bloomberg fine-tuned an LLM on 50 years of financial data - BloombergGPT outperforms GPT-4 on financial tasks at a smaller model size
- Stripe fine-tuned a model for fraud detection - 30% precision improvement at the same recall, with no latency increase
- LoRA (Hu et al. 2022) makes it possible to fine-tune a 70B model on a single A100 - something that previously required a cluster of 8 machines
- OpenAI reports: 40% of enterprise clients use fine-tuning of gpt-4o-mini to reduce costs while maintaining quality
LoRA and the Fine-tuning Revolution
**2022**: Edward Hu et al. publish LoRA: Low-Rank Adaptation of Large Language Models. The key insight - a weight matrix doesn't need to be updated in full. Two small low-rank matrices via rank decomposition are enough. Trainable parameters drop to 0.5% of the model. Fine-tuning a 7B model on a single GPU goes from impossible to routine. **2023**: Dettmers et al. publish QLoRA - the same idea plus 4-bit quantization of the base model. A 70B model fits in 48 GB VRAM. Llama 3.1 8B fits on a single RTX 4090 in 2 hours. The entire open-source ecosystem restructures around PEFT: Unsloth, Axolotl, LLaMA-Factory. Fine-tuning is no longer a BigTech exclusive.
Предварительные знания
When Fine-tuning Is Justified: Decision Framework
Fine-tuning is the last resort, not the first. RAG is cheaper. Prompt engineering is faster. Few-shot covers 80% of cases. But there are tasks where a prompt physically can't do the job: enforcing JSON format compliance at 100%, corporate tone of voice at scale, classifying into 47 categories with examples that don't fit the context window. That's when fine-tuning earns its place.
The cost question: LoRA on llama-3-8b runs about USD 50 and 2 hours on an RTX 4090. Full GPT-4 fine-tuning - USD 100K+. A gap of 2000x. So before launching a job, it's worth confirming the problem can't be solved at a lower level.
**Fine-tuning solves three categories of tasks** that prompt engineering cannot cover:
- **Style and format** - the model must generate responses in a strictly defined format (JSON schema, tone of voice, corporate style). Few-shot works but is unstable with complex formats: sometimes it adds an extra field, sometimes it drops a required one.
- **Specialized knowledge** - medical terminology, legal language, a company's internal jargon. RAG provides facts, but the model doesn't truly "understand" the domain - it doesn't know that "force majeure" in a contract means a specific legal construct.
- **Latency and cost** - a fine-tuned GPT-4o-mini can replace GPT-4o with few-shot, delivering comparable quality at 10-20x cost reduction. At 100K requests per day, that's thousands of dollars in difference.
| Scenario | Approach | Why |
|---|---|---|
| Answering FAQ from documentation | RAG | Up-to-date facts are needed, not model retraining |
| Model always responds in JSON with exact schema | Fine-tuning | Format is locked into the weights, 100% compliance |
| Classifying tickets into 47 categories | Fine-tuning | Few-shot can't fit 47 examples into a prompt |
| Chatbot knows about the product | RAG + few-shot | Product data changes, fine-tuning will go stale |
| Model writes like a lawyer | Fine-tuning | Style and terminology are patterns in the weights |
| Summarize in 2 sentences | Few-shot | 2-3 examples in the prompt are sufficient |
**Anti-pattern: fine-tuning for facts.** If a model needs to "know" current information (prices, availability, schedules) - that's RAG, not fine-tuning. A fine-tuned model doesn't update when data changes, and retraining costs money and time.
Fine-tuning means training a model from scratch on custom data
Data Preparation: JSONL Format and Best Practices
Fine-tuning quality is 80% determined by data quality. Not quantity - quality. The model learns exactly what's in the training set: including errors, inconsistencies, and bad patterns. Garbage in - garbage into the weights.
OpenAI, Anthropic, and other providers use the **JSONL format** (JSON Lines) - each line of the file is one training example in the chat completion format:
**Data preparation script** - automating conversion and validation:
**Data volume recommendations** depend on the task:
| Task | Minimum Examples | Recommended | Note |
|---|---|---|---|
| Classification (2-5 classes) | 50 | 200-500 | Class balance is critical |
| Classification (10+ classes) | 200 | 500-2000 | At least 20 examples per class |
| Format generation | 100 | 500-1000 | Diversity of input data |
| Tone of voice / style | 200 | 1000+ | The subtler the style, the more examples needed |
| Summarization | 50 | 300-500 | Longer examples cost more in tokens |
**The 10x rule:** if after a 10x increase in dataset size the metrics don't improve - the problem is not data quantity. Most likely the quality of examples needs improvement, more diversity is needed, or the task needs to be reconsidered.
**Common mistakes in data preparation:**
- **Monotonous examples** - 500 examples of one pattern are worse than 100 diverse ones. The model will overfit to a single template and break on any deviation.
- **Inconsistent labels** - if in some examples an email is classified as "spam" while a similar one is "marketing" - the model won't learn the rule, it will memorize the noise.
- **Test set leakage** - if validation examples end up in the training set, metrics will be inflated, and the model won't generalize. This is a classic trap, easy to miss with auto-generated datasets.
- **Too long system prompt** - the system message is duplicated in every example and charged during training. A shorter system prompt saves money.
A JSONL file for fine-tuning has 200 email classification examples. 190 of them are 'general', 5 are 'urgent', 5 are 'spam'. What will happen?
OpenAI Fine-tuning API: Practical Pipeline
OpenAI provides a managed fine-tuning service: no GPU, no infrastructure, no CUDA. Upload a JSONL file, start a job, and in 30-60 minutes there's a fine-tuned model with its own ID. Inference runs through the same Chat Completions API.
Training cost for gpt-4o-mini is USD 3 per million tokens. 500 examples at 500 tokens each, 3 epochs - about USD 2.25 for the entire job. Less than a cup of coffee for a model that then runs at USD 0.30/USD 1.20 per inference instead of USD 2.50/USD 10.00 for base gpt-4o.
**Complete fine-tuning pipeline** via OpenAI API:
**Fine-tuning cost** depends on the model and data volume:
| Model | Training (per 1M tokens) | Inference Input | Inference Output | Example: 500 examples x 500 tokens, 3 epochs |
|---|---|---|---|---|
| gpt-4o-mini | USD 3.00 | USD 0.30 | USD 1.20 | ~USD 2.25 |
| gpt-4o | USD 25.00 | USD 3.75 | USD 15.00 | ~USD 18.75 |
| gpt-3.5-turbo | USD 8.00 | USD 3.00 | USD 6.00 | ~USD 6.00 |
**Fine-tuning economics:** a fine-tuned gpt-4o-mini costs USD 0.30/USD 1.20 for inference (input/output per 1M tokens). Regular gpt-4o costs USD 2.50/USD 10.00. If the fine-tuned mini delivers comparable quality - that's **8x savings on input and 8x on output** per call. At 100K requests per day, that's thousands of dollars per month.
**Rate limits for fine-tuned models** differ from base models. A new fine-tuned model starts with low limits (around 100 RPM). For production workloads, an increase must be requested through the OpenAI dashboard.
A fine-tuned gpt-4o-mini model was trained on 500 legal classification examples. How is it used in production?
Evaluation After Fine-tuning: How to Know the Model Improved
Training loss drops - that doesn't mean the model got better. Training loss measures how well the model memorized the training set. But the goal isn't memorization - it's generalization.
The evaluation pipeline must be set up **before** fine-tuning starts. Not after - there will be nothing to compare against. Three levels, from fast to precise:
**Practical evaluation pipeline** - automating Level 1 and Level 2:
**Key metrics to monitor:**
- **Training loss vs Validation loss** - if training loss drops while validation loss rises, the model is overfitting. The best checkpoint is end of epoch 1, where validation loss is at its minimum.
- **Format compliance rate** - what % of fine-tuned model responses match the expected format. Target: >98%.
- **Regression on base tasks** - fine-tuning can degrade general capabilities. Test not only the target task but also general abilities.
- **Latency** - fine-tuned models usually have the same latency, but it's worth checking, especially under high load.
**Catastrophic forgetting** - a phenomenon from the PEFT literature: fine-tuning on a narrow task can degrade the model in other areas. If a fine-tuned gpt-4o-mini stopped doing math correctly - that's it. Solution: add general task examples to the training set (10-20% of the volume).
A fine-tuning job's training loss steadily drops from 0.5 to 0.05 over 3 epochs. Validation loss dropped to 0.15 in the first epoch, then started rising to 0.3. What does this mean?
LoRA and QLoRA: Efficient Fine-tuning Without Full Retraining
Full fine-tuning updates **all** model weights. For Llama 3.1 70B - 70 billion parameters, 8x A100 80GB GPUs, weeks of training. Then in 2022, Edward Hu et al. published LoRA - and the rules changed.
**LoRA (Low-Rank Adaptation, Hu et al. 2022)** is a PEFT (Parameter-Efficient Fine-Tuning) method that trains only small "adapters" attached to frozen model weights. Instead of updating a 4096x4096 matrix (16M parameters), LoRA trains two small matrices 4096x16 and 16x4096 (131K parameters) via rank decomposition. That's **125x fewer** parameters - and 95-99% of full fine-tuning quality.
In 2023, Dettmers et al. went further: **QLoRA** = LoRA + 4-bit quantization of the base model. Llama 3.1 8B in 4-bit takes 3.5 GB instead of 28 GB. Add LoRA adapters in float16 - another 80 MB. Total: fine-tuning an 8B model fits on a single RTX 4090 in 2 hours.
**LoRA parameters** and their impact on results:
| Parameter | Description | Typical Value | Impact |
|---|---|---|---|
| rank (r) | Adapter size (rank decomposition) | 8-64 | Higher r → more capacity, but slower |
| alpha | Scaling factor | 16-128 (usually 2xr) | Controls the strength of adaptation |
| target_modules | Which layers to adapt | q_proj, v_proj | More layers → better quality, more VRAM |
| dropout | Regularization | 0.05-0.1 | Protection against overfitting |
**Practical example: LoRA fine-tuning with Hugging Face PEFT** (frameworks like Unsloth and Axolotl simplify this further, but PEFT is the foundation):
**Comparison of fine-tuning approaches:**
| Method | VRAM (7B model) | Trainable params | Speed | Quality |
|---|---|---|---|---|
| Full fine-tuning | ~60 GB | 100% | Slow | Best |
| LoRA (r=16) | ~18 GB | 0.5% | 3-5x faster | 95-99% of full |
| QLoRA (r=16, 4bit) | ~6 GB | 0.5% | 2-3x faster | 93-97% of full |
| OpenAI API fine-tuning | 0 (managed) | Unknown | 30-60 min |
Fine-tuning = training from scratch - requires lots of data and GPUs
Fine-tuning adapts already pre-trained weights; LoRA/QLoRA does this with hundreds of examples on a single GPU
A pre-trained model already contains knowledge of language, logic, and formats - the result of billions of tokens of pre-training. Fine-tuning just nudges the weights in the right direction. With LoRA, only 0.5% of parameters are updated; with QLoRA the entire base model is compressed to 4-bit - leaving room for adapters. That's why thousands of examples aren't needed: 200-500 diverse ones are often enough.
Fine-tuning replaces RAG - a fine-tuned model knows everything it needs
Fine-tuning and RAG address fundamentally different problems and combine well
Fine-tuning changes how a model behaves: how it responds, in what format, with what style. RAG changes what it knows: it pulls current facts from a source into context. Train a model on a price list today - next month the prices are stale. RAG always fetches fresh data. The optimal architecture: RAG for facts, fine-tuning for the style of processing those facts.
Key Takeaways
- Fine-tuning is the last resort: start with zero-shot, few-shot, RAG. Fine-tuning only when style/format/behavior can't be achieved with a prompt
- Data quality matters more than quantity: 200 diverse examples beat 2,000 monotonous ones. Garbage in - garbage into the weights
- OpenAI fine-tuning API: JSONL → upload → job → 30-60 min → model with the same API. USD 3 per 1M tokens for gpt-4o-mini
- Evaluation is mandatory before launch: training loss != quality. Validation loss, format compliance, LLM-as-Judge
- LoRA (Hu et al. 2022) trains 0.5% of parameters; QLoRA (Dettmers 2023) adds 4-bit - fine-tuning 8B on RTX 4090 for USD 50
- Fine-tuning != RAG: one changes behavior, the other changes knowledge. The best systems combine both
What's Next
Fine-tuning is one tool for model customization. Next steps: open-source models for full control, distillation for extreme optimization, and local deployment via Ollama/vLLM.
- Open Source Models — LoRA/QLoRA is applied to open-source models - Llama, Mistral, Qwen
- Model Distillation — Distillation + fine-tuning = a small model with the quality of a large one
- Local LLM — A fine-tuned open-source model can be run locally via Ollama/vLLM
Связанные уроки
- aie-03-llm-fundamentals — Fine-tuning rests on model architecture basics
- aie-31-evaluation — Eval proves fine-tuning improved quality
- aie-37-open-source-models — Open weights are needed for full fine-tuning
- aie-38-distillation — Distillation is a cheaper alternative path
- ml-41-transfer-learning — Adapting a pretrained model to a new task
- ml-07