AI Engineering

Fine-tuning: The Last Resort, Not the First - Training a Model on Custom Data

Цели урока

Determine when fine-tuning is justified versus prompt engineering or RAG
Prepare and validate training data in JSONL format
Run fine-tuning via the OpenAI API (USD 8 per 1M tokens for gpt-4o-mini) and monitor the process
Evaluate fine-tuned model quality through automated metrics and LLM-as-Judge
Understand LoRA (Hu et al. 2022) and QLoRA (Dettmers 2023) - efficient fine-tuning of open-source models on consumer GPUs

Fine-tuning is the last resort, not the first. RAG is cheaper, prompt engineering is faster. Fine-tuning is for when specific style, format, or behavior can't be achieved through a prompt. And the price range: from USD 50 (LoRA on llama-3-8b, 2 hours on an RTX 4090) to USD 100K+ (full GPT-4 fine-tune). A 2000x gap. One startup cut LLM expenses from USD 47,000 to USD 6,000 per month - just by switching to a fine-tuned mini model instead of gpt-4o.

Bloomberg fine-tuned an LLM on 50 years of financial data - BloombergGPT outperforms GPT-4 on financial tasks at a smaller model size
Stripe fine-tuned a model for fraud detection - 30% precision improvement at the same recall, with no latency increase
LoRA (Hu et al. 2022) makes it possible to fine-tune a 70B model on a single A100 - something that previously required a cluster of 8 machines
OpenAI reports: 40% of enterprise clients use fine-tuning of gpt-4o-mini to reduce costs while maintaining quality

LoRA and the Fine-tuning Revolution

**2022**: Edward Hu et al. publish LoRA: Low-Rank Adaptation of Large Language Models. The key insight - a weight matrix doesn't need to be updated in full. Two small low-rank matrices via rank decomposition are enough. Trainable parameters drop to 0.5% of the model. Fine-tuning a 7B model on a single GPU goes from impossible to routine. **2023**: Dettmers et al. publish QLoRA - the same idea plus 4-bit quantization of the base model. A 70B model fits in 48 GB VRAM. Llama 3.1 8B fits on a single RTX 4090 in 2 hours. The entire open-source ecosystem restructures around PEFT: Unsloth, Axolotl, LLaMA-Factory. Fine-tuning is no longer a BigTech exclusive.

Предварительные знания

When Fine-tuning Is Justified: Decision Framework

Fine-tuning is the last resort, not the first. RAG is cheaper. Prompt engineering is faster. Few-shot covers 80% of cases. But there are tasks where a prompt physically can't do the job: enforcing JSON format compliance at 100%, corporate tone of voice at scale, classifying into 47 categories with examples that don't fit the context window. That's when fine-tuning earns its place.

The cost question: LoRA on llama-3-8b runs about USD 50 and 2 hours on an RTX 4090. Full GPT-4 fine-tuning - USD 100K+. A gap of 2000x. So before launching a job, it's worth confirming the problem can't be solved at a lower level.

**Fine-tuning solves three categories of tasks** that prompt engineering cannot cover:

**Style and format** - the model must generate responses in a strictly defined format (JSON schema, tone of voice, corporate style). Few-shot works but is unstable with complex formats: sometimes it adds an extra field, sometimes it drops a required one.
**Specialized knowledge** - medical terminology, legal language, a company's internal jargon. RAG provides facts, but the model doesn't truly "understand" the domain - it doesn't know that "force majeure" in a contract means a specific legal construct.
**Latency and cost** - a fine-tuned GPT-4o-mini can replace GPT-4o with few-shot, delivering comparable quality at 10-20x cost reduction. At 100K requests per day, that's thousands of dollars in difference.

Scenario	Approach	Why
Answering FAQ from documentation	RAG	Up-to-date facts are needed, not model retraining
Model always responds in JSON with exact schema	Fine-tuning	Format is locked into the weights, 100% compliance
Classifying tickets into 47 categories	Fine-tuning	Few-shot can't fit 47 examples into a prompt
Chatbot knows about the product	RAG + few-shot	Product data changes, fine-tuning will go stale
Model writes like a lawyer	Fine-tuning	Style and terminology are patterns in the weights
Summarize in 2 sentences	Few-shot	2-3 examples in the prompt are sufficient

**Anti-pattern: fine-tuning for facts.** If a model needs to "know" current information (prices, availability, schedules) - that's RAG, not fine-tuning. A fine-tuned model doesn't update when data changes, and retraining costs money and time.

Fine-tuning means training a model from scratch on custom data

Data Preparation: JSONL Format and Best Practices

Fine-tuning quality is 80% determined by data quality. Not quantity - quality. The model learns exactly what's in the training set: including errors, inconsistencies, and bad patterns. Garbage in - garbage into the weights.

OpenAI, Anthropic, and other providers use the **JSONL format** (JSON Lines) - each line of the file is one training example in the chat completion format:

**Data preparation script** - automating conversion and validation:

**Data volume recommendations** depend on the task:

Task	Minimum Examples	Recommended	Note
Classification (2-5 classes)	50	200-500	Class balance is critical
Classification (10+ classes)	200	500-2000	At least 20 examples per class
Format generation	100	500-1000	Diversity of input data
Tone of voice / style	200	1000+	The subtler the style, the more examples needed
Summarization	50	300-500	Longer examples cost more in tokens

**The 10x rule:** if after a 10x increase in dataset size the metrics don't improve - the problem is not data quantity. Most likely the quality of examples needs improvement, more diversity is needed, or the task needs to be reconsidered.

**Common mistakes in data preparation:**

**Monotonous examples** - 500 examples of one pattern are worse than 100 diverse ones. The model will overfit to a single template and break on any deviation.
**Inconsistent labels** - if in some examples an email is classified as "spam" while a similar one is "marketing" - the model won't learn the rule, it will memorize the noise.
**Test set leakage** - if validation examples end up in the training set, metrics will be inflated, and the model won't generalize. This is a classic trap, easy to miss with auto-generated datasets.
**Too long system prompt** - the system message is duplicated in every example and charged during training. A shorter system prompt saves money.

A JSONL file for fine-tuning has 200 email classification examples. 190 of them are 'general', 5 are 'urgent', 5 are 'spam'. What will happen?

OpenAI Fine-tuning API: Practical Pipeline

OpenAI provides a managed fine-tuning service: no GPU, no infrastructure, no CUDA. Upload a JSONL file, start a job, and in 30-60 minutes there's a fine-tuned model with its own ID. Inference runs through the same Chat Completions API.

Training cost for gpt-4o-mini is USD 3 per million tokens. 500 examples at 500 tokens each, 3 epochs - about USD 2.25 for the entire job. Less than a cup of coffee for a model that then runs at USD 0.30/USD 1.20 per inference instead of USD 2.50/USD 10.00 for base gpt-4o.

**Complete fine-tuning pipeline** via OpenAI API:

**Fine-tuning cost** depends on the model and data volume:

Model	Training (per 1M tokens)	Inference Input	Inference Output	Example: 500 examples x 500 tokens, 3 epochs
gpt-4o-mini	USD 3.00	USD 0.30	USD 1.20	~USD 2.25
gpt-4o	USD 25.00	USD 3.75	USD 15.00	~USD 18.75
gpt-3.5-turbo	USD 8.00	USD 3.00	USD 6.00	~USD 6.00

**Fine-tuning economics:** a fine-tuned gpt-4o-mini costs USD 0.30/USD 1.20 for inference (input/output per 1M tokens). Regular gpt-4o costs USD 2.50/USD 10.00. If the fine-tuned mini delivers comparable quality - that's **8x savings on input and 8x on output** per call. At 100K requests per day, that's thousands of dollars per month.

**Rate limits for fine-tuned models** differ from base models. A new fine-tuned model starts with low limits (around 100 RPM). For production workloads, an increase must be requested through the OpenAI dashboard.

A fine-tuned gpt-4o-mini model was trained on 500 legal classification examples. How is it used in production?

Evaluation After Fine-tuning: How to Know the Model Improved

Training loss drops - that doesn't mean the model got better. Training loss measures how well the model memorized the training set. But the goal isn't memorization - it's generalization.

The evaluation pipeline must be set up **before** fine-tuning starts. Not after - there will be nothing to compare against. Three levels, from fast to precise:

**Practical evaluation pipeline** - automating Level 1 and Level 2:

**Key metrics to monitor:**

**Training loss vs Validation loss** - if training loss drops while validation loss rises, the model is overfitting. The best checkpoint is end of epoch 1, where validation loss is at its minimum.
**Format compliance rate** - what % of fine-tuned model responses match the expected format. Target: >98%.
**Regression on base tasks** - fine-tuning can degrade general capabilities. Test not only the target task but also general abilities.
**Latency** - fine-tuned models usually have the same latency, but it's worth checking, especially under high load.

**Catastrophic forgetting** - a phenomenon from the PEFT literature: fine-tuning on a narrow task can degrade the model in other areas. If a fine-tuned gpt-4o-mini stopped doing math correctly - that's it. Solution: add general task examples to the training set (10-20% of the volume).

A fine-tuning job's training loss steadily drops from 0.5 to 0.05 over 3 epochs. Validation loss dropped to 0.15 in the first epoch, then started rising to 0.3. What does this mean?

LoRA and QLoRA: Efficient Fine-tuning Without Full Retraining

Full fine-tuning updates **all** model weights. For Llama 3.1 70B - 70 billion parameters, 8x A100 80GB GPUs, weeks of training. Then in 2022, Edward Hu et al. published LoRA - and the rules changed.

**LoRA (Low-Rank Adaptation, Hu et al. 2022)** is a PEFT (Parameter-Efficient Fine-Tuning) method that trains only small "adapters" attached to frozen model weights. Instead of updating a 4096x4096 matrix (16M parameters), LoRA trains two small matrices 4096x16 and 16x4096 (131K parameters) via rank decomposition. That's **125x fewer** parameters - and 95-99% of full fine-tuning quality.

In 2023, Dettmers et al. went further: **QLoRA** = LoRA + 4-bit quantization of the base model. Llama 3.1 8B in 4-bit takes 3.5 GB instead of 28 GB. Add LoRA adapters in float16 - another 80 MB. Total: fine-tuning an 8B model fits on a single RTX 4090 in 2 hours.

**LoRA parameters** and their impact on results:

Parameter	Description	Typical Value	Impact
rank (r)	Adapter size (rank decomposition)	8-64	Higher r → more capacity, but slower
alpha	Scaling factor	16-128 (usually 2xr)	Controls the strength of adaptation
target_modules	Which layers to adapt	q_proj, v_proj	More layers → better quality, more VRAM
dropout	Regularization	0.05-0.1	Protection against overfitting

**Practical example: LoRA fine-tuning with Hugging Face PEFT** (frameworks like Unsloth and Axolotl simplify this further, but PEFT is the foundation):

**Comparison of fine-tuning approaches:**

Method	VRAM (7B model)	Trainable params	Speed	Quality
Full fine-tuning	~60 GB	100%	Slow	Best
LoRA (r=16)	~18 GB	0.5%	3-5x faster	95-99% of full
QLoRA (r=16, 4bit)	~6 GB	0.5%	2-3x faster	93-97% of full
OpenAI API fine-tuning	0 (managed)	Unknown	30-60 min

Fine-tuning = training from scratch - requires lots of data and GPUs

Fine-tuning adapts already pre-trained weights; LoRA/QLoRA does this with hundreds of examples on a single GPU

A pre-trained model already contains knowledge of language, logic, and formats - the result of billions of tokens of pre-training. Fine-tuning just nudges the weights in the right direction. With LoRA, only 0.5% of parameters are updated; with QLoRA the entire base model is compressed to 4-bit - leaving room for adapters. That's why thousands of examples aren't needed: 200-500 diverse ones are often enough.

Fine-tuning replaces RAG - a fine-tuned model knows everything it needs

Fine-tuning and RAG address fundamentally different problems and combine well

Fine-tuning changes how a model behaves: how it responds, in what format, with what style. RAG changes what it knows: it pulls current facts from a source into context. Train a model on a price list today - next month the prices are stale. RAG always fetches fresh data. The optimal architecture: RAG for facts, fine-tuning for the style of processing those facts.

Key Takeaways

Fine-tuning is the last resort: start with zero-shot, few-shot, RAG. Fine-tuning only when style/format/behavior can't be achieved with a prompt
Data quality matters more than quantity: 200 diverse examples beat 2,000 monotonous ones. Garbage in - garbage into the weights
OpenAI fine-tuning API: JSONL → upload → job → 30-60 min → model with the same API. USD 3 per 1M tokens for gpt-4o-mini
Evaluation is mandatory before launch: training loss != quality. Validation loss, format compliance, LLM-as-Judge
LoRA (Hu et al. 2022) trains 0.5% of parameters; QLoRA (Dettmers 2023) adds 4-bit - fine-tuning 8B on RTX 4090 for USD 50
Fine-tuning != RAG: one changes behavior, the other changes knowledge. The best systems combine both

What's Next

Fine-tuning is one tool for model customization. Next steps: open-source models for full control, distillation for extreme optimization, and local deployment via Ollama/vLLM.

Open Source Models — LoRA/QLoRA is applied to open-source models - Llama, Mistral, Qwen
Model Distillation — Distillation + fine-tuning = a small model with the quality of a large one
Local LLM — A fine-tuned open-source model can be run locally via Ollama/vLLM

Связанные уроки

aie-03-llm-fundamentals — Fine-tuning rests on model architecture basics
aie-31-evaluation — Eval proves fine-tuning improved quality
aie-37-open-source-models — Open weights are needed for full fine-tuning
aie-38-distillation — Distillation is a cheaper alternative path
ml-41-transfer-learning — Adapting a pretrained model to a new task
ml-07