AI Engineering

Open Source Models: Llama, Mistral, Qwen, Gemma - Choosing an Alternative to GPT

Lesson Objectives

Navigate the open-source LLM landscape: Llama, Mistral, Qwen, Gemma, DeepSeek
Compare models using benchmarks and understand benchmark limitations
Understand licenses: Apache 2.0, MIT, Llama License - what's permitted commercially
Run open-source models via Ollama and integrate with TypeScript
Make the open-source vs closed API decision for a specific project

Llama 3.1 405B outperformed GPT-4 on several benchmarks. And shipped open-source. Meta spent billions - and gave it away for free. Six months later Llama became the foundation for hundreds of products. This isn't philanthropy - it's strategy: the more developers build on Llama, the stronger the ecosystem, the more fine-tune data flows back, the better the next version. OpenAI created the market. Meta made it open.

Meta: Llama downloaded 300+ million times, used by 50K+ companies - the largest open-source LLM release in history
Uber moved AI services to self-hosted Llama - saving USD 10M/year while maintaining quality
DeepSeek R1 - the first open-source reasoning model on par with o1, shipped 3 months after o1
EU AI Act encourages open-source: easier transparency, weight auditing, compliance without vendor lock-in
Chatbot Arena 2026: gap between top open-source (Llama 405B) and GPT-4o - under 50 Elo points

How open-source LLMs went from toy to production

**February 2023**: Meta releases Llama 1. The weights leak within about a week, and soon after the community has the model running on a MacBook.
**July 2023**: Llama 2 - officially open, commercial-friendly, downloaded millions of times.
**December 2023**: Mistral releases Mixtral 8x7B - the first open MoE model to compete with GPT-3.5 at roughly three times lower compute.
**April 2024**: Llama 3 - the gap with GPT-4o shrinks to 10-15%.
**January 2025**: DeepSeek R1 - open-source reasoning at o1 level.
**2026**: open-source LLMs are the standard for compliance-constrained industries and high-load systems.

Prerequisites

How LLMs Work: Tokens, Embeddings, Attention

Open-source LLM Landscape in 2026

July 2023. Meta drops Llama 2. Within 48 hours - 100,000 access requests. Within a month the model is running on MacBooks, Raspberry Pis, and decade-old gaming GPUs. This isn't just a release - it's the moment the monopoly on production-grade LLMs ended.

In 2026, open-source LLMs are not "almost as good as GPT-4". Llama 3.1 405B, DeepSeek V3, and Qwen 2.5 72B compete with GPT-4o on Chatbot Arena by real user preferences. The gap has shrunk from a chasm to measurement noise.

Key families dominating in 2026:

| Family | Company | Sizes | Key Feature |
|---|---|---|---|
| Llama 3.x / 4 | Meta | 8B, 70B, 405B | Most popular, huge fine-tune and tooling ecosystem |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B, Large | MoE architecture, strong reasoning, European company |
| Qwen 2.5 / 3 | Alibaba | 0.5B-72B | Best Chinese + multilingual support, strong coding |
| Gemma 2 / 3 | Google | 2B, 9B, 27B | Compact, efficient, good for edge deployment |
| DeepSeek V3 / R1 | DeepSeek | 7B, 67B, 671B (MoE) | State-of-the-art reasoning, MoE, open training details |
| Phi-3 / Phi-4 | Microsoft | 3.8B, 14B | Best in their size class, SLM (Small Language Models) |
| Command R+ | Cohere | 35B, 104B | Optimized for RAG and enterprise |

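The architectural breakthrough that changed the rules: **Mixture of Experts (MoE)**. The model doesn't use all parameters for every token - a router selects 2-4 "experts" from N. Mixtral 8x7B picks 2 of its 8 experts per token, activating ~13B of 47B parameters. The result is the per-token compute of a small model with the quality of a large one. That's why DeepSeek V3, with 671B total parameters but only about 37B active per token, is cheaper to run per token than a dense Llama 405B, even though all of its weights still have to be held in memory.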
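The routing step described above can be sketched in a few lines. This is an illustrative toy, not Mixtral's or DeepSeek's actual code: the names `routeToken` and `moeForward` are hypothetical, experts are plain functions, and the router logits are passed in directly, whereas a real model computes them from the token with a learned linear layer.

```typescript
// Top-k expert routing for a single token (illustrative sketch only).

/** Numerically stable softmax over a list of scores. */
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

interface ExpertChoice {
  expert: number; // index of the selected expert
  weight: number; // mixing weight; weights of the selected experts sum to 1
}

/** Pick the k highest-scoring experts and normalize their weights with softmax. */
function routeToken(routerLogits: number[], k: number): ExpertChoice[] {
  const topK = routerLogits
    .map((logit, expert) => ({ logit, expert }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const weights = softmax(topK.map((e) => e.logit));
  return topK.map((e, i) => ({ expert: e.expert, weight: weights[i] }));
}

/** Combine the outputs of the selected experts only; the other experts never run. */
function moeForward(
  token: number[],
  routerLogits: number[],
  experts: Array<(x: number[]) => number[]>,
  k: number,
): number[] {
  const output = new Array<number>(token.length).fill(0);
  for (const { expert, weight } of routeToken(routerLogits, k)) {
    const y = experts[expert](token);
    y.forEach((v, i) => {
      output[i] += weight * v;
    });
  }
  return output;
}
```

With 8 experts and `k = 2`, only two expert functions are called per token, which is where the compute saving comes from: the cost scales with the active experts, not with the total parameter count.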

The pace is staggering: every 3-4 months a new generation appears that outperforms the previous by 10-20% on benchmarks. A model that was state-of-the-art in January is behind 2-3 competitors by summer. **Hugging Face Hub** is the center of this ecosystem: 800,000+ models, each with a model card, benchmarks, and discussion threads. URL: huggingface.co/models.

Mixtral 8x7B has 47B total parameters. How many parameters are activated when processing a single token?

Comparing Models: Benchmarks and Real-World Performance

MMLU 87% vs 84% - Model A wins. Then in production on real data, it turns out Model B is twice as fast, cheaper, and follows instructions better. Benchmarks are a first-pass filter, not a verdict.

| Benchmark | What It Tests | Format | Limitation |
|---|---|---|---|
| MMLU (5-shot) | General knowledge (57 subjects) | Multiple choice | Tests memorization, not reasoning |
| MMLU-Pro | Harder knowledge questions | 10 answer options | Better than MMLU, but still multiple choice |
| HumanEval / MBPP | Code generation (Python) | Write a function | Python only, short functions |
| GSM8K | Math (school-level) | Word problems | Too easy for new models |
| MATH | Math (competition-level) | Formal problems | Good, but doesn't cover applied math |
| MT-Bench | Conversational ability | LLM-as-judge, GPT-4 | Depends on judge model |
| Arena Elo (Chatbot Arena) | Real human preferences | A/B comparisons | Gold standard, but expensive and slow |
| IFEval | Instruction following | Strict format compliance | Critical for production |

**Approximate ranking** (early 2026, changes with every release):

A smaller model can beat a larger one - not as an exception, but as the rule for specialized tasks:

**Specific domain** - a fine-tuned Llama 8B on medical data can outperform GPT-4o on medical tasks
**Latency** - an 8B model responds in 200ms, a 70B in 2s, GPT-4o in 1-5s. For realtime applications latency is critical
**Language** - Qwen 2.5 is significantly better than Llama on Chinese, Mistral is better on French
**Contamination** - some models "saw" benchmark tasks during training, which inflates their scores

A model should not be chosen based on a single benchmark. MMLU 85% for Model A and 83% for Model B - the difference is within the margin of error. Always test on production data: 50-100 real examples from a real task provide more insight than all benchmarks combined.
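To make "within the margin of error" concrete for an in-house eval set, here is a rough sketch. It assumes exact-match scoring and treats the two accuracies as independent proportions under a normal approximation (about 95% confidence); the helper names `accuracy`, `standardError` and `isGapMeaningful` are illustrative, and a real evaluation may need a task-specific scorer or a paired test.

```typescript
// Rough check of whether an accuracy gap on a small eval set exceeds sampling noise.

/** Share of outputs that exactly match the expected answer (after trimming whitespace). */
function accuracy(outputs: string[], expected: string[]): number {
  if (outputs.length === 0 || outputs.length !== expected.length) {
    throw new Error("outputs and expected must be non-empty and of equal length");
  }
  const correct = outputs.filter((o, i) => o.trim() === expected[i].trim()).length;
  return correct / outputs.length;
}

/** Standard error of a proportion p measured on n examples. */
function standardError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

/** True if the gap between two accuracies is larger than roughly 1.96 combined standard errors. */
function isGapMeaningful(accA: number, accB: number, n: number): boolean {
  const combined = Math.sqrt(standardError(accA, n) ** 2 + standardError(accB, n) ** 2);
  return Math.abs(accA - accB) > 1.96 * combined;
}
```

On 100 examples, a gap of 85% vs 83% does not pass this check, so a small eval set is best used for reading the actual failures and spotting large differences rather than for ranking models that are a couple of points apart.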

**Myth** - Open-source models are lower quality - otherwise why pay for GPT-4o?

**Reality** - Llama 3.1 405B and DeepSeek V3 are neck and neck with GPT-4o on Chatbot Arena. On specialized tasks, fine-tuned open-source often wins.

Chatbot Arena measures real user preferences in blind comparisons. The gap between top open-source and GPT-4o shrank from 100+ Elo in 2023 to under 50 in 2026. On domain-specific tasks (medical records, legal documents, specific languages) fine-tuned Llama 8B regularly beats GPT-4o - specialization beats generalization.

Model A: MMLU=87%, HumanEval=82%. Model B: MMLU=84%, HumanEval=79%. What conclusion is correct?

Licenses: Llama License vs Apache 2.0 vs MIT

"Open-source" in the LLM context is an imprecise term. Most models have **open weights**, but that's not the same as open-source software. Apache 2.0 and MIT give near-unlimited freedom. Llama License is permissive, but with nuances. Knowing the difference matters before production deployment, not after.

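| License | Models | Commercial use | Fine-tuning | Restrictions |
|---|---|---|---|---|
| Apache 2.0 | Mistral 7B, Qwen 2.5 (most sizes) | Yes | Yes | Minimal - standard open-source |
| MIT | Phi-3, Phi-4 | Yes | Yes | Minimal, even more permissive than Apache |
| Gemma Terms of Use | Gemma 2 / 3 | Yes | Yes | Custom Google terms with a prohibited-use policy, not Apache 2.0 |
| Llama License | Llama 3.x / 4 | Yes (with restrictions) | Yes | 700M MAU limit |
| DeepSeek License | DeepSeek V3/R1 | Yes | Yes | Very permissive, similar to MIT |
| Cohere C4AI (CC-BY-NC 4.0) | Command R+ | No - open weights are non-commercial | Yes | Non-commercial use only, plus an acceptable use policy |
| Proprietary API | GPT-4o, Claude | Via API | Via API | No access to weights, vendor lock-in |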

**Startup / SMB** - Llama License works fine. Apache 2.0 (Mistral, Qwen) is even simpler - no MAU restrictions
**Enterprise** - legal departments prefer Apache 2.0 or MIT. Llama License requires review
**Healthcare / Finance** - open-source on your own hardware may be the only option for compliance
**EU AI Act** - open-source models are easier for compliance: weight auditing, deployment control

**"Open-source" does not equal "open training data".** Most models do NOT disclose training data. EU AI Act compliance may require transparency about training data. Exception - OLMo from AI2 (fully open: weights + data + code).

A startup (10K users) is building a SaaS. The regulator prohibits sending client data to third parties. What approach?

Running Open-source Models via Ollama

**Ollama** is the simplest way to run an open-source model locally. A single binary, automatic model downloading, an OpenAI-compatible API. Setup: 2 minutes. Then `ollama run llama3.1:8b`, and the model responds locally.

The key insight: **Ollama automatically starts an HTTP server** on port 11434 with an OpenAI-compatible API. Code written against the OpenAI chat completions API works with Ollama - just change `baseURL` and the model name. No migration, no new SDK.
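A minimal sketch of what "just change `baseURL`" means at the HTTP level. It assumes Ollama is running locally with `llama3.1:8b` already pulled and uses the `/v1/chat/completions` route of its OpenAI-compatible API. The names `ChatTarget`, `buildChatRequest` and `chat` are illustrative, and `fetch` is injected as a parameter so the sketch does not depend on any particular runtime typings.

```typescript
// Minimal client for an OpenAI-compatible chat endpoint (sketch; error handling is basic).

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatTarget {
  baseURL: string; // API prefix, e.g. Ollama's OpenAI-compatible /v1
  apiKey: string; // Ollama ignores the key; OpenAI requires a real one
  model: string;
}

// Structural type for fetch, so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

const OLLAMA: ChatTarget = {
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
  model: "llama3.1:8b",
};

/** Build the HTTP request; only baseURL, apiKey and model differ between providers. */
function buildChatRequest(target: ChatTarget, messages: ChatMessage[]) {
  return {
    url: `${target.baseURL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${target.apiKey}`,
      },
      body: JSON.stringify({ model: target.model, messages }),
    },
  };
}

/** Send the request and return the assistant's reply text. */
async function chat(
  target: ChatTarget,
  messages: ChatMessage[],
  fetchImpl: FetchLike,
): Promise<string> {
  const { url, init } = buildChatRequest(target, messages);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`LLM request failed with status ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

In Node 18+ this could be called as `chat(OLLAMA, [{ role: "user", content: "Hello" }], fetch)`. Pointing the same code at OpenAI only requires a different `ChatTarget`.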

For production, put an abstraction over providers so that a single `LLM_PROVIDER=openai` line in `.env` switches the backend. This isn't just convenience - it's protection against vendor lock-in.
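One possible shape for such an abstraction is a small function that turns environment variables into a provider config. Only `LLM_PROVIDER` comes from the lesson; `OPENAI_API_KEY`, `OLLAMA_BASE_URL`, `LLM_MODEL` and the default model names are assumptions made for this sketch.

```typescript
// Resolve the LLM backend from environment variables (illustrative sketch).

interface ProviderConfig {
  baseURL: string;
  apiKey: string;
  model: string;
}

type Env = Record<string, string | undefined>;

function resolveProvider(env: Env): ProviderConfig {
  const provider = env.LLM_PROVIDER ?? "openai";
  switch (provider) {
    case "openai":
      return {
        baseURL: "https://api.openai.com/v1",
        apiKey: env.OPENAI_API_KEY ?? "",
        model: env.LLM_MODEL ?? "gpt-4o",
      };
    case "ollama":
      return {
        // Ollama's OpenAI-compatible API lives under /v1 on port 11434.
        baseURL: env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1",
        apiKey: "ollama", // placeholder; Ollama does not check it
        model: env.LLM_MODEL ?? "llama3.1:8b",
      };
    default:
      // Fail loudly on typos instead of silently falling back to a paid API.
      throw new Error(`Unknown LLM_PROVIDER: ${provider}`);
  }
}
```

Application code would call `resolveProvider(process.env)` once at startup and pass the result to whatever client it uses, so swapping providers never touches business logic.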

**Ollama on macOS Apple Silicon** uses the Metal API for GPU acceleration. Llama 3.1 8B on MacBook Pro M3 Pro: ~40-60 tok/s - sufficient for realtime chat. On CPU: ~5-10 tok/s. Perfect for dev and testing. For production under load, use vLLM or llama.cpp server on a dedicated GPU.

A NestJS project uses the OpenAI SDK for GPT-4o. An Ollama fallback needs to be added. What changes?

Decision Framework: Open-source vs Closed Models

Choosing between open-source and closed (GPT-4o, Claude) is an architectural decision for years ahead. It affects cost, latency, privacy, vendor lock-in. There's no universally correct answer - only the correct one for a specific context.

| Criterion | Open-source (self-hosted) | Closed (API) |
|---|---|---|
| Cost (low volume, <1K req/day) | More expensive - GPU server USD 500-2000/mo | Cheaper - pay per token, USD 10-50/mo |
| Cost (high volume, >100K req/day) | Cheaper - fixed cost GPU | More expensive - USD 5K-50K/mo |
| Quality (general) | 5-15% below GPT-4o | Best (GPT-4o, Claude 3.5) |
| Quality (fine-tuned) | Can outperform GPT-4o on a specific task | Limited by provider |
| Latency | 50-200ms (local GPU) | 200-2000ms (depends on load) |
| Data privacy | Full control | Data goes through a third party |
| Uptime / SLA | Team responsibility | 99.9% SLA from provider |
| Vendor lock-in | None | High |
| Team needs | ML engineer for GPU infra | Backend developer only |

**The hybrid approach** is often the optimal strategy. 80% of requests are simple: classification, short answers, routing. Those go to local Llama 8B via Ollama. The top 5% of complex tasks - reasoning, code generation - go to GPT-4o.
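A hybrid router can start as a plain lookup by task type. The sketch below assumes the application already knows what kind of task each request is; the `TaskKind` values mirror the examples in the paragraph above, and the function names are illustrative.

```typescript
// Route simple tasks to a local model and complex ones to a closed API (sketch).

type TaskKind =
  | "classification"
  | "short_answer"
  | "routing"
  | "reasoning"
  | "code_generation";

interface Route {
  provider: "ollama" | "openai";
  model: string;
}

// Task kinds considered cheap enough for a local 8B model (assumption for this sketch).
const SIMPLE_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>([
  "classification",
  "short_answer",
  "routing",
]);

function chooseRoute(kind: TaskKind): Route {
  return SIMPLE_TASKS.has(kind)
    ? { provider: "ollama", model: "llama3.1:8b" }
    : { provider: "openai", model: "gpt-4o" };
}

/** Share of a traffic sample that would be served locally, between 0 and 1. */
function localShare(kinds: TaskKind[]): number {
  if (kinds.length === 0) return 0;
  const local = kinds.filter((k) => chooseRoute(k).provider === "ollama").length;
  return local / kinds.length;
}
```

Running `localShare` over a sample of logged requests is a quick way to check how much traffic would actually stay local before committing to the architecture.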

**Together AI, Fireworks, Groq** are cloud providers of open-source models. No need to manage GPUs - pay per token, but cheaper than OpenAI. Llama 3.1 70B via Together: USD 0.88/M tokens vs GPT-4o: USD 2.50/M input tokens (output tokens cost more). A middle-ground option between "everything local" and "everything on OpenAI".

**Myth** - Self-hosting is always cheaper than paying for API.

**Reality** - At low traffic (<1K req/day) a GPU server at USD 500-2000/mo is more expensive than USD 10-50/mo in API tokens.

GPU rental is a fixed cost - it runs regardless of traffic. At 100 requests/day, GPT-4o-mini costs a few dollars a month. A dedicated GPU breaks even only at high, stable traffic (>50K req/day), strict latency requirements, or data sovereignty constraints. Always calculate: monthly_gpu_cost / (daily_requests * 30) vs token_cost_per_request.
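The comparison in the formula above can be written out as a small calculator. All prices and token counts in the usage note are hypothetical inputs chosen for illustration; substitute real numbers from a provider's price list and from measured traffic.

```typescript
// Break-even between a fixed-cost GPU server and per-token API pricing (sketch).

/** Fixed GPU cost spread over a 30-day month of traffic. */
function gpuCostPerRequest(monthlyGpuCostUsd: number, dailyRequests: number): number {
  return monthlyGpuCostUsd / (dailyRequests * 30);
}

/** API cost of one request, given token counts and prices per million tokens. */
function apiCostPerRequest(
  inputTokens: number,
  outputTokens: number,
  inputPricePerMillionUsd: number,
  outputPricePerMillionUsd: number,
): number {
  return (
    (inputTokens * inputPricePerMillionUsd + outputTokens * outputPricePerMillionUsd) /
    1_000_000
  );
}

/** Daily request volume at which the GPU server and the API cost the same. */
function breakEvenDailyRequests(monthlyGpuCostUsd: number, apiCostPerRequestUsd: number): number {
  return monthlyGpuCostUsd / (30 * apiCostPerRequestUsd);
}
```

For example, with an assumed USD 1,500/mo GPU server and an assumed API cost of USD 0.001 per request, `breakEvenDailyRequests(1500, 0.001)` comes out at roughly 50,000 requests per day. Below that volume the API is cheaper on cost alone, before latency or data-sovereignty requirements are taken into account.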

A fintech startup: 50K requests/day. 80% are transaction classification, 20% are fraud analysis. The regulator requires EU data residency. What's the optimal architecture?

**Myth** - Open-source = lower quality, closed = better.

**Reality** - Llama 3.1 405B and DeepSeek V3 are neck and neck with GPT-4o. On specialized tasks fine-tuned open-source wins.

In 2023 the gap was real - 100+ Elo on Arena. In 2026 the gap is under 50 Elo - within noise. On domain-specific tasks (legal documents, medical records, specific languages) fine-tuned Llama 8B regularly beats GPT-4o - specialization beats generalization.

**Myth** - Self-hosting always saves money compared to API.

**Reality** - At <1K req/day, API is cheaper. GPU pays off only at high traffic or data sovereignty requirements.

GPU server costs USD 500-2000/mo regardless of traffic. At 100 requests/day GPT-4o-mini costs a few dollars a month. Break-even comes at ~50K req/day for an average request. Always calculate: monthly_gpu_cost / (daily_requests * 30) vs token_cost_per_request.

Key Takeaways

Open-source LLMs in 2026 are production-ready: Llama 3.1 405B, DeepSeek V3, Qwen 2.5 72B compete with GPT-4o on Chatbot Arena
MoE (Mixtral, DeepSeek) - large model quality at small model speed: router activates 2-4 of N experts
Benchmarks are a first-level filter. The final eval is 50-100 examples from a production task
Apache 2.0 / MIT (Mistral, Qwen, Phi) - maximum freedom. Llama License - permissive, but with 700M MAU restriction
Ollama: one command, OpenAI-compatible API - no code changes, just baseURL and model name
Self-hosting pays off at >50K req/day or data sovereignty. At low traffic, API is cheaper
Hybrid routing: 80% → local Llama 8B, 5% → GPT-4o - 80-90% cost savings with maintained quality

Questions for Reflection

In which scenario would a fine-tuned Llama 8B beat GPT-4o? What would it take to verify that?
Project: 10K req/day, client data under GDPR, team of 3 backend developers. Open-source or API? Why?
Mixtral 8x7B activates 2 of 8 experts per token. How does the router "know" which experts to pick? What happens when a task requires knowledge from multiple domains?

What's Next

Open-source models unlock new possibilities: distillation, local deployment, custom serving.

Model Distillation — GPT-4o generates training data → fine-tune open-source → 90% quality at 1% cost
Local LLM — Details of running models: GGUF quantization, GPU requirements, production serving via vLLM and llama.cpp
Fine-tuning — LoRA/QLoRA for fine-tuning open-source models on a single GPU

Related Lessons

aie-03-llm-fundamentals — Open models share the same transformer basis
aie-39-local-models — Open weights make local inference possible
aie-36-fine-tuning — Open weights unlock full fine-tuning
aie-38-distillation — Distill big open models into small ones
ml-31-transformers — Same architecture under different licenses
ml-01