AI Engineering
Open Source Models: Llama, Mistral, Qwen, Gemma - Choosing an Alternative to GPT
Lesson Objectives
- Navigate the open-source LLM landscape: Llama, Mistral, Qwen, Gemma, DeepSeek
- Compare models using benchmarks and understand benchmark limitations
- Understand licenses: Apache 2.0, MIT, Llama License - what's permitted commercially
- Run open-source models via Ollama and integrate with TypeScript
- Make the open-source vs closed API decision for a specific project
Llama 3.1 405B outperformed GPT-4 on several benchmarks. And shipped open-source. Meta spent billions - and gave it away for free. Six months later Llama became the foundation for hundreds of products. This isn't philanthropy - it's strategy: the more developers build on Llama, the stronger the ecosystem, the more fine-tune data flows back, the better the next version. OpenAI created the market. Meta made it open.
- Meta: Llama downloaded 300+ million times, used by 50K+ companies - the largest open-source LLM release in history
- Uber moved AI services to self-hosted Llama - saving USD 10M/year while maintaining quality
- DeepSeek R1 - the first open-source reasoning model on par with o1, shipped 3 months after o1
- EU AI Act encourages open-source: easier transparency, weight auditing, compliance without vendor lock-in
- Chatbot Arena 2026: gap between top open-source (Llama 405B) and GPT-4o - under 50 Elo points
How open-source LLMs went from toy to production
- **February 2023**: Meta releases Llama 1 - weights leak in 48 hours, community runs the model on a MacBook within a week.
- **July 2023**: Llama 2 - officially open, commercial-friendly, downloaded millions of times.
- **December 2023**: Mistral releases Mixtral 8x7B - the first MoE model competing with GPT-3.5 at three times lower compute.
- **April 2024**: Llama 3, gap with GPT-4o shrinks to 10-15%.
- **January 2025**: DeepSeek R1 - open-source reasoning at o1 level.
- **2026**: open-source LLMs are the standard for compliance-constrained industries and high-load systems.
Prerequisites
Open-source LLM Landscape in 2026
July 2023. Meta drops Llama 2. Within 48 hours - 100,000 access requests. Within a month the model is running on MacBooks, Raspberry Pis, and decade-old gaming GPUs. This isn't just a release - it's the moment the monopoly on production-grade LLMs ended.
In 2026, open-source LLMs are not "almost as good as GPT-4". Llama 3.1 405B, DeepSeek V3, and Qwen 2.5 72B compete with GPT-4o on Chatbot Arena by real user preferences. The gap has shrunk from a chasm to measurement noise.
Key families dominating in 2026:
| Family | Company | Sizes | Key Feature |
|---|---|---|---|
| Llama 3.x / 4 | Meta | 8B, 70B, 405B | Most popular, huge fine-tune and tooling ecosystem |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B, Large | MoE architecture, strong reasoning, European company |
| Qwen 2.5 / 3 | Alibaba | 0.5B-72B | Best Chinese + multilingual support, strong coding |
| Gemma 2 / 3 | Google | 2B, 9B, 27B | Compact, efficient, good for edge deployment |
| DeepSeek V3 / R1 | DeepSeek | 7B, 67B, 671B (MoE) | State-of-the-art reasoning, MoE, open training details |
| Phi-3 / Phi-4 | Microsoft | 3.8B, 14B | Best in their size class, SLM (Small Language Models) |
| Command R+ | Cohere | 35B, 104B | Optimized for RAG and enterprise |
The architectural breakthrough that changed the rules: **Mixture of Experts (MoE)**. The model doesn't use all parameters for every token - a router selects 2-4 "experts" from N. Mixtral 8x7B activates ~13B of 47B parameters. Speed of a small model, quality of a large one. That's why DeepSeek V3, with 671B total parameters, needs less compute per token than a dense Llama 405B: only about 37B parameters are active for each token, although all 671B weights must still fit in memory.
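The routing step can be sketched in a few lines. This is a minimal illustration of one common formulation (softmax over the top-k router logits), not the implementation of any particular model. The names `routeTopK` and `moeForward` are hypothetical, the experts are plain functions on a single number, and the router logits are passed in rather than computed by a learned layer.

```typescript
// Illustrative top-k gating for a Mixture of Experts layer.
// In a real model the logits come from a learned router; here the caller supplies them.

interface ExpertChoice {
  expert: number; // index of the selected expert
  weight: number; // normalized gate weight for that expert
}

function softmax(values: number[]): number[] {
  const max = Math.max(...values); // subtract the max for numerical stability
  const exps = values.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the k highest-scoring experts and normalize their weights to sum to 1.
function routeTopK(routerLogits: number[], k: number): ExpertChoice[] {
  const top = routerLogits
    .map((logit, expert) => ({ logit, expert }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const weights = softmax(top.map((t) => t.logit));
  return top.map((t, i) => ({ expert: t.expert, weight: weights[i] }));
}

// Evaluate only the selected experts and mix their outputs by gate weight.
function moeForward(
  x: number,
  experts: Array<(x: number) => number>,
  routerLogits: number[],
  k: number,
): number {
  return routeTopK(routerLogits, k).reduce(
    (acc, choice) => acc + choice.weight * experts[choice.expert](x),
    0,
  );
}
```

With 8 experts and `k = 2`, six experts are never evaluated for a given token, which is where the compute saving comes from.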
The pace is staggering: every 3-4 months a new generation appears that outperforms the previous by 10-20% on benchmarks. A model that was state-of-the-art in January is behind 2-3 competitors by summer. **Hugging Face Hub** is the center of this ecosystem: 800,000+ models, each with a model card, benchmarks, and discussion threads. URL: huggingface.co/models.
Mixtral 8x7B has 47B total parameters. How many parameters are activated when processing a single token?
Comparing Models: Benchmarks and Real-World Performance
MMLU 87% vs 84% - Model A wins. Then in production on real data, it turns out Model B is twice as fast, cheaper, and follows instructions better. Benchmarks are a first-pass filter, not a verdict.
| Benchmark | What It Tests | Format | Limitation |
|---|---|---|---|
| MMLU (5-shot) | General knowledge (57 subjects) | Multiple choice | Tests memorization, not reasoning |
| MMLU-Pro | Harder knowledge questions | 10 answer options | Better than MMLU, but still multiple choice |
| HumanEval / MBPP | Code generation (Python) | Write a function | Python only, short functions |
| GSM8K | Math (school-level) | Word problems | Too easy for new models |
| MATH | Math (competition-level) | Formal problems | Good, but doesn't cover applied math |
| MT-Bench | Conversational ability | LLM-as-judge, GPT-4 | Depends on judge model |
| Arena Elo (Chatbot Arena) | Real human preferences | A/B comparisons | Gold standard, but expensive and slow |
| IFEval | Instruction following | Strict format compliance | Critical for production |
**Approximate ranking** (early 2026, changes with every release):
A smaller model can beat a larger one - not as an exception, but as the rule for specialized tasks:
- **Specific domain** - a fine-tuned Llama 8B on medical data can outperform GPT-4o on medical tasks
- **Latency** - an 8B model responds in 200ms, a 70B in 2s, GPT-4o in 1-5s. For realtime applications latency is critical
- **Language** - Qwen 2.5 is significantly better than Llama on Chinese, Mistral is better on French
- **Contamination** - some models "saw" benchmark tasks during training, which inflates their scores
A model should not be chosen based on a single benchmark. MMLU 85% for Model A and 83% for Model B - the difference is within the margin of error. Always test on production data: 50-100 real examples from a real task provide more insight than all benchmarks combined.
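A small evaluation harness is enough to run that kind of check. The sketch below assumes exact-match scoring after normalization, which fits classification-style tasks. Free-form answers would need a different scorer. `EvalExample`, `scoreOutputs`, and `evaluate` are hypothetical names, and a model is treated as any async function from prompt to text.

```typescript
// A labelled example taken from real production traffic.
interface EvalExample {
  input: string;
  expected: string;
}

// For this sketch, a model is just "prompt in, text out".
type Model = (input: string) => Promise<string>;

interface EvalResult {
  correct: number;
  total: number;
  accuracy: number;
}

// Normalize before comparing so casing and stray whitespace do not count as errors.
function normalize(text: string): string {
  return text.trim().toLowerCase();
}

// Pure scoring step: compare model outputs with expected answers, position by position.
function scoreOutputs(outputs: string[], examples: EvalExample[]): EvalResult {
  let correct = 0;
  examples.forEach((example, i) => {
    if (normalize(outputs[i] ?? "") === normalize(example.expected)) correct += 1;
  });
  const total = examples.length;
  return { correct, total, accuracy: total === 0 ? 0 : correct / total };
}

// Run one model over all examples sequentially, then score the collected outputs.
async function evaluate(model: Model, examples: EvalExample[]): Promise<EvalResult> {
  const outputs: string[] = [];
  for (const example of examples) {
    outputs.push(await model(example.input));
  }
  return scoreOutputs(outputs, examples);
}
```

Running `evaluate` once per candidate model on the same 50-100 examples gives directly comparable accuracy numbers for the actual task.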
Myth: Open-source models are lower quality - otherwise why pay for GPT-4o?
Reality: Llama 3.1 405B and DeepSeek V3 are neck and neck with GPT-4o on Chatbot Arena. On specialized tasks, fine-tuned open-source often wins.
Chatbot Arena measures real user preferences in blind comparisons. The gap between top open-source and GPT-4o shrank from 100+ Elo in 2023 to under 50 in 2026. On domain-specific tasks (medical records, legal documents, specific languages) fine-tuned Llama 8B regularly beats GPT-4o - specialization beats generalization.
Model A: MMLU=87%, HumanEval=82%. Model B: MMLU=84%, HumanEval=79%. What conclusion is correct?
Licenses: Llama License vs Apache 2.0 vs MIT
"Open-source" in the LLM context is an imprecise term. Most models have **open weights**, but that's not the same as open-source software. Apache 2.0 and MIT give near-unlimited freedom. Llama License is permissive, but with nuances. Knowing the difference matters before production deployment, not after.
| License | Models | Commercial use | Fine-tuning | Restrictions |
|---|---|---|---|---|
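| Apache 2.0 | Mistral 7B, Qwen 2.5 (most sizes) | Yes | Yes | Minimal - standard open-source |
| Gemma Terms of Use | Gemma 2 / 3 | Yes | Yes | Custom license with a prohibited-use policy |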
| MIT | Phi-3, Phi-4 | Yes | Yes | Minimal, even more permissive than Apache |
| Llama License | Llama 3.x / 4 | Yes (with restrictions) | Yes | 700M MAU limit |
| DeepSeek License | DeepSeek V3/R1 | Yes | Yes | Very permissive, similar to MIT |
| Cohere C4AI | Command R+ | Not for >USD 1M revenue | Yes | Revenue limit |
| Proprietary API | GPT-4o, Claude | Via API | Via API | No access to weights, vendor lock-in |
- **Startup / SMB** - Llama License works fine. Apache 2.0 (Mistral, Qwen) is even simpler - no MAU restrictions
- **Enterprise** - legal departments prefer Apache 2.0 or MIT. Llama License requires review
- **Healthcare / Finance** - open-source on own hardware may be the only option for compliance
- **EU AI Act** - open-source models are easier for compliance: weight auditing, deployment control
**"Open-source" does not equal "open training data".** Most models do NOT disclose training data. EU AI Act compliance may require transparency about training data. Exception - OLMo from AI2 (fully open: weights + data + code).
A startup (10K users) is building a SaaS. The regulator prohibits sending client data to third parties. What approach?
Running Open-source Models via Ollama
**Ollama** is the simplest way to run an open-source model locally. A single binary, automatic model downloading, an OpenAI-compatible API. Setup: 2 minutes. Then `ollama run llama3.1:8b`, and the model responds locally.
The key insight: **Ollama automatically starts an HTTP server** on port 11434 with an OpenAI-compatible API. Any code written for OpenAI works with Ollama - just change `baseURL`. Zero migration, zero new SDK.
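A minimal sketch of such a call, written without any SDK so that it stays self-contained. It assumes Ollama is running locally with `llama3.1:8b` already pulled. `buildChatRequest`, `chat`, and the `FetchLike` type are hypothetical helpers, and only the response fields used here are modelled. Ollama does not check the API key, so a placeholder is sent.

```typescript
// Minimal shapes for an OpenAI-style chat completions call (only the fields used here).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// The subset of the fetch API this sketch relies on; the global fetch in Node 18+ satisfies it.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

// Ollama serves its OpenAI-compatible API under /v1 on the default port.
const OLLAMA_BASE_URL = "http://localhost:11434/v1";

// Build the HTTP request; only baseURL, apiKey, and model differ between providers.
function buildChatRequest(
  baseURL: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  return {
    url: `${baseURL}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Send the request and return the text of the first choice.
async function chat(request: ChatRequest, fetchImpl: FetchLike): Promise<string> {
  const response = await fetchImpl(request.url, {
    method: "POST",
    headers: request.headers,
    body: request.body,
  });
  if (!response.ok) throw new Error(`LLM request failed: ${response.status}`);
  const data = await response.json();
  return data.choices[0].message.content;
}

// Usage (requires a running Ollama server):
// const req = buildChatRequest(OLLAMA_BASE_URL, "ollama", "llama3.1:8b", [
//   { role: "user", content: "Say hello in one sentence." },
// ]);
// chat(req, fetch).then(console.log);
```

With the official `openai` SDK the same switch amounts to passing this base URL as `baseURL` in the client constructor and changing the model name.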
For production - an abstraction over providers. One `LLM_PROVIDER=openai` in `.env` changes everything. This isn't just convenience - it's protection against vendor lock-in:
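One possible shape for that abstraction is a function that maps environment variables to a provider config. This is a sketch under assumptions: apart from `LLM_PROVIDER`, the variable names (`LLM_MODEL`, `OPENAI_API_KEY`, `OLLAMA_BASE_URL`) and the default models are illustrative choices, not part of the lesson's project.

```typescript
interface ProviderConfig {
  baseURL: string;
  apiKey: string;
  model: string;
}

// Environment variables as a plain map, for example process.env in Node.
type Env = Record<string, string | undefined>;

// Translate LLM_PROVIDER (plus optional overrides) into connection settings.
function resolveProvider(env: Env): ProviderConfig {
  const provider = env.LLM_PROVIDER ?? "openai";
  switch (provider) {
    case "openai":
      return {
        baseURL: "https://api.openai.com/v1",
        apiKey: env.OPENAI_API_KEY ?? "",
        model: env.LLM_MODEL ?? "gpt-4o",
      };
    case "ollama":
      return {
        baseURL: env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1",
        apiKey: "ollama", // placeholder, Ollama does not validate it
        model: env.LLM_MODEL ?? "llama3.1:8b",
      };
    default:
      // Fail fast on typos instead of silently falling back to a paid API.
      throw new Error(`Unknown LLM_PROVIDER: ${provider}`);
  }
}
```

The rest of the application only ever sees a `ProviderConfig`, so adding a hosted open-source provider later means adding one more `case`, not touching call sites.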
**Ollama on macOS Apple Silicon** uses the Metal API for GPU acceleration. Llama 3.1 8B on MacBook Pro M3 Pro: ~40-60 tok/s - sufficient for realtime chat. On CPU: ~5-10 tok/s. Perfect for dev and testing. For production under load, use vLLM or llama.cpp server on a dedicated GPU.
A NestJS project uses the OpenAI SDK for GPT-4o. An Ollama fallback needs to be added. What changes?
Decision Framework: Open-source vs Closed Models
Choosing between open-source and closed (GPT-4o, Claude) is an architectural decision for years ahead. It affects cost, latency, privacy, vendor lock-in. There's no universally correct answer - only the correct one for a specific context.
| Criterion | Open-source (self-hosted) | Closed (API) |
|---|---|---|
| Cost (low volume, <1K req/day) | More expensive - GPU server USD 500-2000/mo | Cheaper - pay per token, USD 10-50/mo |
| Cost (high volume, >100K req/day) | Cheaper - fixed cost GPU | More expensive - USD 5K-50K/mo |
| Quality (general) | 5-15% below GPT-4o | Best (GPT-4o, Claude 3.5) |
| Quality (fine-tuned) | Can outperform GPT-4o on a specific task | Limited by provider |
| Latency | 50-200ms (local GPU) | 200-2000ms (depends on load) |
| Data privacy | Full control | Data goes through a third party |
| Uptime / SLA | Team responsibility | 99.9% SLA from provider |
| Vendor lock-in | None | High |
| Team needs | ML engineer for GPU infra | Backend developer only |
**The hybrid approach** is often the optimal strategy. Around 80% of requests are simple: classification, short answers, routing. Those go to local Llama 8B via Ollama. The complex remainder - reasoning, code generation - goes to GPT-4o.
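A routing rule of this kind can be as simple as a lookup on the task type. The sketch below assumes each request is already labelled with a task kind. The `TaskKind` values mirror the examples in the paragraph above, while `routeTask`, `localShare`, and the model identifiers are hypothetical.

```typescript
type TaskKind =
  | "classification"
  | "short_answer"
  | "routing"
  | "reasoning"
  | "code_generation";

interface Route {
  provider: "local" | "hosted";
  model: string;
}

// Task kinds considered simple enough for the small local model.
const SIMPLE_TASKS: TaskKind[] = ["classification", "short_answer", "routing"];

// Simple tasks stay on the local model; everything else goes to the hosted one.
function routeTask(kind: TaskKind): Route {
  return SIMPLE_TASKS.indexOf(kind) !== -1
    ? { provider: "local", model: "llama3.1:8b" }
    : { provider: "hosted", model: "gpt-4o" };
}

// Share of a batch of requests that would be served locally.
function localShare(kinds: TaskKind[]): number {
  if (kinds.length === 0) return 0;
  const local = kinds.filter((k) => routeTask(k).provider === "local").length;
  return local / kinds.length;
}
```

Measuring `localShare` on a sample of real traffic shows how much of the load the local model would actually absorb before any infrastructure is bought.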
**Together AI, Fireworks, Groq** are cloud providers of open-source models. No need to manage GPUs - pay per token, but cheaper than OpenAI. Llama 3.1 70B via Together: USD 0.88/M tokens vs GPT-4o: USD 2.50/M. A middle-ground option between "everything local" and "everything on OpenAI".
Myth: Self-hosting is always cheaper than paying for API.
Reality: At low traffic (<1K req/day) a GPU server at USD 500-2000/mo is more expensive than USD 10-50/mo in API tokens.
GPU rental is a fixed cost - it runs regardless of traffic. At 100 requests/day, GPT-4o-mini costs a few dollars a month. A dedicated GPU breaks even only at high, stable traffic (>50K req/day), strict latency requirements, or data sovereignty constraints. Always calculate: `monthly_gpu_cost / (daily_requests * 30)` vs `token_cost_per_request`.
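The comparison above is plain arithmetic and can be written down directly. In this sketch the function names are hypothetical, a month is taken as 30 days as in the formula, and the GPU is assumed to have enough capacity for the given traffic.

```typescript
const DAYS_PER_MONTH = 30;

// Fixed GPU cost spread over every request served in a month.
function gpuCostPerRequest(monthlyGpuCost: number, dailyRequests: number): number {
  return monthlyGpuCost / (dailyRequests * DAYS_PER_MONTH);
}

// Daily request volume at which self-hosting and the API cost the same.
function breakEvenDailyRequests(monthlyGpuCost: number, tokenCostPerRequest: number): number {
  return monthlyGpuCost / (DAYS_PER_MONTH * tokenCostPerRequest);
}

// Which option is cheaper at the given traffic level (cost only, nothing else considered).
function cheaperOption(
  monthlyGpuCost: number,
  dailyRequests: number,
  tokenCostPerRequest: number,
): "self-hosted" | "api" {
  return gpuCostPerRequest(monthlyGpuCost, dailyRequests) < tokenCostPerRequest
    ? "self-hosted"
    : "api";
}
```

As an illustration with assumed numbers: a USD 1500/mo GPU server and an API cost of USD 0.001 per request break even at about 50,000 requests/day. At 1,000 requests/day the same server works out to USD 0.05 per request, fifty times the API price.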
A fintech startup: 50K requests/day. 80% are transaction classification, 20% are fraud analysis. The regulator requires EU data residency. What's the optimal architecture?
Myth: Open-source = lower quality, closed = better.
Reality: Llama 3.1 405B and DeepSeek V3 are neck and neck with GPT-4o. On specialized tasks fine-tuned open-source wins.
In 2023 the gap was real - 100+ Elo on Arena. In 2026 the gap is under 50 Elo - within noise. On domain-specific tasks (legal documents, medical records, specific languages) fine-tuned Llama 8B regularly beats GPT-4o - specialization beats generalization.
Myth: Self-hosting always saves money compared to API.
Reality: At <1K req/day, API is cheaper. GPU pays off only at high traffic or data sovereignty requirements.
GPU server costs USD 500-2000/mo regardless of traffic. At 100 requests/day GPT-4o-mini costs a few dollars a month. Break-even comes at ~50K req/day for an average request. Always calculate: `monthly_gpu_cost / (daily_requests * 30)` vs `token_cost_per_request`.
Key Takeaways
- Open-source LLMs in 2026 are production-ready: Llama 3.1 405B, DeepSeek V3, Qwen 2.5 72B compete with GPT-4o on Chatbot Arena
- MoE (Mixtral, DeepSeek) - large model quality at small model speed: router activates 2-4 of N experts
- Benchmarks are a first-level filter. The final eval is 50-100 examples from a production task
- Apache 2.0 / MIT (Mistral, Qwen, Phi) - maximum freedom. Llama License - permissive, but with 700M MAU restriction
- Ollama: one command, OpenAI-compatible API - no code changes, just baseURL and model name
- Self-hosting pays off at >50K req/day or data sovereignty. At low traffic, API is cheaper
- Hybrid routing: ~80% of requests (the simple ones) → local Llama 8B, complex ones → GPT-4o - 80-90% cost savings with maintained quality
Questions for Reflection
- In which scenario would a fine-tuned Llama 8B beat GPT-4o? What would it take to verify that?
- Project: 10K req/day, client data under GDPR, team of 3 backend developers. Open-source or API? Why?
- Mixtral 8x7B activates 2 of 8 experts per token. How does the router "know" which experts to pick? What happens when a task requires knowledge from multiple domains?
What's Next
Open-source models unlock new possibilities: distillation, local deployment, custom serving.
- Model Distillation — GPT-4o generates training data → fine-tune open-source → 90% quality at 1% cost
- Local LLM — Details of running models: GGUF quantization, GPU requirements, production serving via vLLM and llama.cpp
- Fine-tuning — LoRA/QLoRA for fine-tuning open-source models on a single GPU
Related Lessons
- aie-03-llm-fundamentals — Open models share the same transformer basis
- aie-39-local-models — Open weights make local inference possible
- aie-36-fine-tuning — Open weights unlock full fine-tuning
- aie-38-distillation — Distill big open models into small ones
- ml-31-transformers — Same architecture under different licenses
- ml-01