AI Engineering

The AI Landscape in 2026: Who's Who, What's Where, and Where It's All Going

Lesson objectives

Know the key players in the AI industry and their strengths
Distinguish model types: chat, reasoning, embedding, speech, vision
Calculate AI API costs for production
Choose a stack for an AI project based on requirements

Prerequisites

AI for Backend Devs

By January 2023, two months after launch, ChatGPT had reached 100 million users - faster than any product in history. Instagram needed 2.5 years to get there, TikTok 9 months. The GPT API drew 2 million developers in its first year. The AI engineering profession was born in a single quarter. Stripe migrated from GPT-4 to Claude in one sprint - cutting costs while improving quality. Without a map of this landscape, architectural decisions go stale faster than the code they are buried in.

**Stripe** switched from GPT-4 to Claude for code generation - quality improved 15% and total cost dropped, even at a comparable per-token price
**DoorDash** uses GPT-4o-mini for classification, GPT-4o for complex analysis - saving USD 200K/year through model routing
**Notion** tested 5 different models before landing on the optimal mix for different features of the application
**Perplexity AI** handles 100M+ requests per month through a RAG stack spanning multiple providers simultaneously
**Character.ai** - 20 billion messages per day, conversation memory at scale with self-hosted models
**GitHub Copilot** - embeddings + code-specific LLM, roughly 3 cents per completion on a USD 10/month subscription

How a Whole Market Appeared in 3 Years

**Brown et al. 2020** - the GPT-3 paper: 175 billion parameters, few-shot learning without fine-tuning. No public API.
**June 2020** - Sam Altman and Greg Brockman launched the OpenAI API - the first step toward democratizing AI access, giving dozens of companies access to GPT-3 through a REST endpoint.
**November 2022** - ChatGPT: 1 million users in 5 days.
**March 2023** - GPT-4 API opens to developers. The role of AI Backend Engineer is born.
**July 2023** - Meta releases Llama 2: open weights, runs locally.
**2024** - tool calling, multimodal, reasoning models (o1). Claude 3 surpasses GPT-4 on code benchmarks.
**2025** - MCP protocol, USD 206K average AI engineer salary in the US per LinkedIn.
**2026** - what started as a one-company show is now a competitive market.

The Major Players: Who Makes AI Models

| Company | Key Models | Strength | API Access |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | Ecosystem, ChatGPT, broad functionality | api.openai.com |
| Anthropic | Claude Sonnet 4.6, Claude Haiku 4.5, Claude Opus 4.7 | Safety, long context (200K-1M), code | api.anthropic.com |
| Google | Gemini 2.5 Pro, Gemini Flash | Multimodal, Google Cloud integration, 1M context | ai.google.dev |
| Meta | Llama 3.1 (8B, 70B, 405B) | Open source, run locally, no vendor lock-in | Free (self-hosted) |
| Mistral | Mistral Large, Mixtral, Codestral | European alternative, open weights, GDPR-friendly | api.mistral.ai |
| xAI | Grok | X/Twitter integration, real-time data | api.x.ai |

**Closed-source vs Open-source:** OpenAI and Anthropic provide access only through APIs - model weights are locked away. Meta and Mistral publish weights - the model can be run on a private server. That changes everything: cost, privacy, compliance, and vendor lock-in risk.

There is a non-obvious freedom here for backend engineers: **no lock-in to a single provider**. The OpenAI SDK and Anthropic SDK are structurally similar: both take a list of messages and return generated text. With a thin adapter layer between application code and the SDK, switching providers becomes a configuration change rather than a rewrite. The same discipline pays off inside a single provider: DoorDash routes classification to GPT-4o-mini and complex cases to GPT-4o, and that routing saves them USD 200K a year.
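A thin adapter layer of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `ChatProvider` interface, the `LlmConfig` shape, and both adapter classes are hypothetical names, the model strings are placeholders, and the adapters return stub text instead of calling the real OpenAI or Anthropic SDKs.

```typescript
// Hypothetical provider-agnostic types; none of these names come from a vendor SDK.
type ProviderName = "openai" | "anthropic";

interface ChatRequest {
  system?: string;
  user: string;
  maxTokens: number;
}

interface ChatResponse {
  text: string;
  provider: ProviderName;
  model: string;
}

interface ChatProvider {
  complete(req: ChatRequest): Promise<ChatResponse>;
}

// Stub adapters. In a real project each one would call its vendor SDK
// and map the vendor-specific response into ChatResponse.
class OpenAIAdapter implements ChatProvider {
  private readonly model: string;
  constructor(model: string) {
    this.model = model;
  }
  async complete(req: ChatRequest): Promise<ChatResponse> {
    return { text: `[openai stub] ${req.user}`, provider: "openai", model: this.model };
  }
}

class AnthropicAdapter implements ChatProvider {
  private readonly model: string;
  constructor(model: string) {
    this.model = model;
  }
  async complete(req: ChatRequest): Promise<ChatResponse> {
    return { text: `[anthropic stub] ${req.user}`, provider: "anthropic", model: this.model };
  }
}

// The only place in the codebase that knows which vendors exist.
const providerFactories: Record<ProviderName, (model: string) => ChatProvider> = {
  openai: (model) => new OpenAIAdapter(model),
  anthropic: (model) => new AnthropicAdapter(model),
};

interface LlmConfig {
  provider: ProviderName;
  model: string;
}

function createProvider(config: LlmConfig): ChatProvider {
  return providerFactories[config.provider](config.model);
}

// Switching vendors is a config change; application code only sees ChatProvider.
const llm = createProvider({ provider: "anthropic", model: "claude-haiku-placeholder" });
llm
  .complete({ user: "Classify this ticket: refund not received", maxTokens: 50 })
  .then((res) => console.log(res.provider, res.model, res.text));
```

Application code depends only on `ChatProvider`, so the vendor-specific request and response shapes stay inside the adapter classes.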

Which company allows running their models on private infrastructure (open weights)?

Types of Models: Which One for What

Using GPT-4o to classify support tickets is like hiring a cardiac surgeon to slice bread. It'll work. It'll also cost roughly 16x more than necessary: GPT-4o input is USD 2.50 per million tokens against USD 0.15 for GPT-4o-mini. **Model routing** - picking the right tool for the task - is one of the first things experienced AI Backend Engineers get right.

| Model Type | What It Does | Examples | When to Use |
|---|---|---|---|
| Chat / General-purpose | Generates text, reasons, writes code | GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro | Chatbots, analysis, generation |
| Small / Fast | Same thing, cheaper and faster | GPT-4o-mini, Claude Haiku 4.5, Gemini Flash | Classification, simple tasks, high RPS |
| Reasoning | Deep step-by-step reasoning | o1, o3, Claude with extended thinking | Math, complex logic, planning |
| Embedding | Converts text into a vector | text-embedding-3-small (dim=1536), text-embedding-3-large (dim=3072) | Semantic search, RAG, clustering |
| Speech-to-Text | Recognizes speech | Whisper, Deepgram Nova-2 | Voice interfaces, transcription |
| Text-to-Speech | Converts text to speech, latency <300ms | OpenAI TTS, ElevenLabs | Voice assistants, audio content |
| Image Generation | Generates images | DALL-E 3, Midjourney, Stable Diffusion | Content, design, avatars |
| Vision | Understands images | GPT-4o (vision), Claude (vision) | OCR, screenshot analysis, moderation |

Embedding models deserve special attention: `text-embedding-3-small` converts text into a 1536-dimensional vector for USD 0.02 per million tokens. That is **125x cheaper** than GPT-4o input (USD 2.50 per million) and still 7.5x cheaper than GPT-4o-mini input (USD 0.15 per million). Every semantic search in a RAG system, every clustering pipeline - that is all embeddings, not expensive chat models.
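What semantic search does with those vectors can be sketched in a few lines: compare the query vector with each document vector by cosine similarity and keep the closest ones. The helper names (`cosineSimilarity`, `topK`, `EmbeddedDoc`) and the toy 3-dimensional vectors are assumptions for illustration; real vectors would come from an embedding API and usually live in a vector store such as pgvector.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Score every document against the query and return the k best matches.
function topK(query: number[], docs: EmbeddedDoc[], k: number): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors standing in for real 1536-dimensional embeddings.
const sampleDocs: EmbeddedDoc[] = [
  { id: "refund-policy", vector: [0.9, 0.1, 0.0] },
  { id: "api-rate-limits", vector: [0.0, 0.2, 0.9] },
  { id: "billing-faq", vector: [0.7, 0.3, 0.1] },
];

// Ranks "refund-policy" first and "billing-faq" second for this query.
console.log(topK([1, 0, 0], sampleDocs, 2));
```

A linear scan like this is fine for a few thousand documents; a vector index takes over when the collection grows.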

**The reasoning model trap:** o1 and o3 look impressive on benchmarks - but they are 10-50x more expensive and 3-5x slower than standard models. Deploying o3 for ticket classification is Ferrari-to-the-grocery-store engineering. Reasoning models exist for multi-step logic, not for fast inference at scale.
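Model routing can start as nothing more than a lookup table, as in the sketch below. The task names, the three tiers, and the task-to-tier mapping are assumptions for illustration; the model names are taken from the table above and would normally come from configuration.

```typescript
// Hypothetical task categories and model tiers.
type Task =
  | "classification"
  | "extraction"
  | "chat"
  | "code-generation"
  | "multi-step-planning";

type Tier = "small" | "standard" | "reasoning";

// Which tier each task needs. Only multi-step logic goes to a reasoning model.
const tierForTask: Record<Task, Tier> = {
  classification: "small",
  extraction: "small",
  chat: "standard",
  "code-generation": "standard",
  "multi-step-planning": "reasoning",
};

// Which concrete model currently backs each tier.
const modelForTier: Record<Tier, string> = {
  small: "gpt-4o-mini",
  standard: "gpt-4o",
  reasoning: "o3",
};

function routeModel(task: Task): string {
  return modelForTier[tierForTask[task]];
}

console.log(routeModel("classification")); // gpt-4o-mini
console.log(routeModel("multi-step-planning")); // o3
```

Keeping the two tables separate means a new model replaces one entry in `modelForTier` without touching the task mapping.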

For mass classification of 100K support tickets by category, the best choice is:

Pricing: How Much AI Costs in Production

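All LLM APIs charge by **tokens** - units of text. Roughly 1 token = 4 characters in English. That figure is an average: a common word such as "classification" can be a single token, while rare words and non-English text split into several. Pricing is split between input (prompt) and output (generated text). Output is usually 4-5x more expensive in the table below (8x for Gemini 2.5 Pro), because output tokens are generated one at a time while the whole prompt is processed in a single parallel pass.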
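For quick budgeting before a real tokenizer is wired in, the 4-characters-per-token rule of thumb can be turned into a small estimator. This is only a heuristic sketch: `estimateTokens` and `fitsContext` are hypothetical helpers, and exact counts require the provider's own tokenizer.

```typescript
// Rule-of-thumb ratio for English text; real tokenizers will differ.
const CHARS_PER_TOKEN = 4;

// Approximate token count from character length.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Checks whether a prompt plus the tokens reserved for the answer
// fits into a model's context window.
function fitsContext(text: string, contextWindow: number, reservedForOutput: number): boolean {
  return estimateTokens(text) + reservedForOutput <= contextWindow;
}

const samplePrompt =
  "Classify the following support ticket into billing, bug, or feature request.";
console.log(estimateTokens(samplePrompt), fitsContext(samplePrompt, 128000, 1000));
```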

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context window (tokens) |
|---|---|---|---|
| GPT-4o | USD 2.50 | USD 10.00 | 128K |
| GPT-4o-mini | USD 0.15 | USD 0.60 | 128K |
| Claude Opus 4.7 | USD 5.00 | USD 25.00 | 1M |
| Claude Sonnet 4.6 | USD 3.00 | USD 15.00 | 200K |
| Claude Haiku 4.5 | USD 1.00 | USD 5.00 | 200K |
| Gemini 2.5 Pro | USD 1.25 | USD 10.00 | 1M |
| Gemini Flash | USD 0.075 | USD 0.30 | 1M |
| DeepSeek V3 | USD 0.014 | USD 0.028 | 128K |
| Llama 3.1 70B (self-hosted) | ~USD 0.50 | ~USD 1.00 | 128K |

**Example cost calculation for a chatbot:**
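A calculation of this kind can be sketched as follows. The prices are the GPT-4o and GPT-4o-mini rows from the table above; the traffic profile (25,000 conversations per day, 600 input and 300 output tokens each) and the helper name `dailyCostUsd` are assumptions chosen for illustration.

```typescript
interface ModelPrice {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

type PricedModel = "gpt-4o" | "gpt-4o-mini";

// Prices copied from the pricing table above.
const PRICES: Record<PricedModel, ModelPrice> = {
  "gpt-4o": { inputPerMillion: 2.5, outputPerMillion: 10.0 },
  "gpt-4o-mini": { inputPerMillion: 0.15, outputPerMillion: 0.6 },
};

// Daily cost = (input tokens / 1M) * input price + (output tokens / 1M) * output price.
function dailyCostUsd(
  price: ModelPrice,
  conversationsPerDay: number,
  inputTokensPerConversation: number,
  outputTokensPerConversation: number,
): number {
  const inputCost =
    ((conversationsPerDay * inputTokensPerConversation) / 1_000_000) * price.inputPerMillion;
  const outputCost =
    ((conversationsPerDay * outputTokensPerConversation) / 1_000_000) * price.outputPerMillion;
  return Math.round((inputCost + outputCost) * 100) / 100; // round to cents
}

// 25,000 conversations/day -> 15M input tokens and 7.5M output tokens per day.
console.log(dailyCostUsd(PRICES["gpt-4o"], 25_000, 600, 300)); // 112.5
console.log(dailyCostUsd(PRICES["gpt-4o-mini"], 25_000, 600, 300)); // 6.75
```

On this profile the flagship costs about USD 3,375 per 30-day month against about USD 200 for the small model, which is the roughly 16x gap between the two price rows.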

These are not hypothetical numbers. Stripe moved from GPT-4 to Claude for code generation: quality went up 15%, cost went down. Notion tried five different models before locking in their optimal mix. **Model selection is an engineering decision that gets revisited every few months** as new SOTA drops and prices fall.

**The rule:** start with the cheapest model (mini/haiku/flash), upgrade only when quality is not meeting the bar. Most production tasks do not need flagship models. Gemini Flash is USD 0.075/1M input, Gemini 2.5 Pro is USD 1.25/1M. For many tasks the quality delta is zero.
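The rule can be made mechanical: evaluate each candidate on the same test set, then take the cheapest one that clears the quality bar. In the sketch below the `evalScore` values are made-up placeholders, not measurements, and `pickCheapestPassing` is a hypothetical helper; input prices are taken from the pricing table.

```typescript
interface Candidate {
  model: string;
  inputPricePerMillion: number; // USD per 1M input tokens
  evalScore: number; // share of eval cases passed, 0..1
}

// Returns the cheapest candidate whose eval score meets the bar,
// or undefined when no candidate is good enough.
function pickCheapestPassing(candidates: Candidate[], qualityBar: number): Candidate | undefined {
  return [...candidates]
    .sort((a, b) => a.inputPricePerMillion - b.inputPricePerMillion)
    .find((c) => c.evalScore >= qualityBar);
}

// evalScore values are placeholders; real ones come from running each
// model against your own labelled evaluation set.
const candidates: Candidate[] = [
  { model: "gemini-2.5-pro", inputPricePerMillion: 1.25, evalScore: 0.95 },
  { model: "gemini-flash", inputPricePerMillion: 0.075, evalScore: 0.93 },
  { model: "gpt-4o-mini", inputPricePerMillion: 0.15, evalScore: 0.91 },
];

console.log(pickCheapestPassing(candidates, 0.92)?.model); // gemini-flash
console.log(pickCheapestPassing(candidates, 0.94)?.model); // gemini-2.5-pro
```

Re-running the same selection when prices change or a new model ships keeps the decision tied to measured quality rather than to habit.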

A chatbot processes 10,000 conversations per day. Average conversation: 400 input + 200 output tokens. How much does a day of operation cost on GPT-4o-mini (input USD 0.15/1M, output USD 0.60/1M)? Answer in dollars, rounded to cents.

How to Choose a Stack for an AI Project

Choosing a model is not about "which one is the smartest." It is an engineering decision: privacy, budget, latency, context window, compliance. Here is the decision matrix:
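One way to encode such a decision is as an ordered list of rules, sketched below. The requirement fields, the rule order, and the 200K-token threshold are assumptions for illustration built from the criteria named above; a real project would add budget and latency targets.

```typescript
// Hypothetical project requirements; field names are illustrative.
interface ProjectRequirements {
  dataMustStayOnPrem: boolean;
  needsMultiStepReasoning: boolean;
  highVolumeSimpleTasks: boolean;
  maxPromptTokens: number;
}

interface StackChoice {
  hosting: "self-hosted" | "api";
  tier: "open-weights" | "long-context" | "reasoning" | "small" | "standard";
  why: string;
}

// Rules are checked in order; hard constraints (privacy, context size)
// come before cost and quality preferences.
function chooseStack(req: ProjectRequirements): StackChoice {
  if (req.dataMustStayOnPrem) {
    return { hosting: "self-hosted", tier: "open-weights", why: "data cannot leave private infrastructure" };
  }
  if (req.maxPromptTokens > 200_000) {
    return { hosting: "api", tier: "long-context", why: "prompt exceeds a 200K context window" };
  }
  if (req.needsMultiStepReasoning) {
    return { hosting: "api", tier: "reasoning", why: "multi-step logic justifies the higher price and latency" };
  }
  if (req.highVolumeSimpleTasks) {
    return { hosting: "api", tier: "small", why: "simple tasks at high volume favor the cheapest tier" };
  }
  return { hosting: "api", tier: "standard", why: "general-purpose default" };
}

console.log(
  chooseStack({
    dataMustStayOnPrem: false,
    needsMultiStepReasoning: false,
    highVolumeSimpleTasks: true,
    maxPromptTokens: 2_000,
  }),
);
```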

**Typical AI Backend Engineer stack in 2026:**

**Runtime:** Node.js / TypeScript (or Python for ML-heavy tasks)
**Framework:** NestJS / Fastify / Express
**LLM:** OpenAI SDK + Anthropic SDK (fallback)
**Embeddings:** text-embedding-3-small + pgvector, 1536 dim
**STT/TTS:** Whisper + ElevenLabs streaming (latency <300ms)
**Vector DB:** pgvector (when PostgreSQL is already in the stack) or Qdrant (when speed matters)
**Orchestration:** LangChain.js or custom pipeline
**Monitoring:** Langfuse / Helicone - cost per request, p95 latency

One idea threads through all of this: **abstraction over the provider**. The AI industry moves every 3-6 months. A new SOTA lasts about 2 months. Claude Sonnet can beat GPT-4o on code tasks - and teams that hard-coded a single SDK throughout their codebase end up rewriting integrations. Teams that built a thin adapter layer change one config line.
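The fallback entry in the stack above follows from the same abstraction: once both providers sit behind one function signature, a second provider can take over when the first one fails. The sketch below uses stub functions in place of real SDK calls; `Complete` and `withFallback` are hypothetical names.

```typescript
// A provider reduced to its simplest shape: text in, text out.
type Complete = (input: string) => Promise<string>;

// Wraps a primary and a secondary provider: if the primary call rejects
// (outage, rate limit, timeout), the same input is sent to the secondary.
function withFallback(primary: Complete, secondary: Complete): Complete {
  return async (input) => {
    try {
      return await primary(input);
    } catch (err) {
      console.warn("primary provider failed, falling back:", err);
      return secondary(input);
    }
  };
}

// Stubs standing in for real SDK calls.
const flakyPrimary: Complete = async () => {
  throw new Error("503 from primary");
};
const stableSecondary: Complete = async (input) => `[secondary] ${input}`;

const completeWithFallback = withFallback(flakyPrimary, stableSecondary);
completeWithFallback("Summarize this ticket").then((text) => console.log(text));
```

A production version would usually add a timeout and restrict the fallback to retryable errors, so that a malformed request is not simply sent twice.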

**The standing rule in AI Engineering:** never pick a stack forever. Design so that swapping a model or provider takes hours, not weeks. That gap - hours vs weeks - is one of the clearest signals between junior and senior AI Backend Engineers.

An AI product is being built for a European bank. Client data cannot leave the bank's infrastructure. What approach to model selection?

**Myth:** Pick one model and use it for everything

**Reality:** Production AI systems use different models for different tasks - that is called model routing

Classifying a support ticket and writing complex code have different error costs and different optimal quality/price tradeoffs. GPT-4o for classification is 16x overspend with no quality gain. Claude Sonnet for code generation over GPT-4o is +15% quality at comparable cost. The engineering answer: a router that picks the model for the task.

**Myth:** Open-source models are a quality compromise

**Reality:** Llama 3.1 405B competes with GPT-4o on many tasks, and self-hosted costs 5-10x less

Llama 3.1 70B served through a hosted inference provider such as Groq costs around USD 0.05/1M tokens vs USD 2.50/1M input tokens for GPT-4o. On classification, summarization, and structured extraction tasks the quality gap is minimal or zero, while the cost gap at those prices is 50x. Financial and medical organizations go self-hosted not for savings but for compliance - and they do not compromise on quality to get there.

Key Takeaways

OpenAI and Anthropic lead closed-source. Meta and Mistral lead open-source with full data control and no vendor lock-in
Flagship models for complex tasks. GPT-4o-mini / Claude Haiku / Gemini Flash for high-volume and simple ones. Price gap: 10-30x
text-embedding-3-small (USD 0.02/1M) for semantic search in RAG - a separate category from chat models, 125x cheaper than GPT-4o input
Cost is an engineering variable: DoorDash saves USD 200K/year through model routing
Abstract over the provider - new SOTA lasts about 2 months, and swapping should take hours, not weeks
Sam Altman and Greg Brockman launched the OpenAI API in June 2020 - that moment democratized AI access and created the AI Backend Engineer profession

Questions for reflection

You are building a support chatbot for a SaaS product with 50K requests per day. Which model would you start with, and what reasoning leads you there?
What is the difference between a compliance reason and an economic reason for going self-hosted? Which dominates in which kind of project?
If GPT-4o is the best model today, why design for the ability to switch providers at all?

What's Next

The landscape is mapped. Now we go under the hood: how LLMs work internally - tokens, embeddings, attention - and why that matters for engineering decisions.

How LLMs Work Internally — What happens inside a model when a prompt is sent
Cost and Optimization — Detailed lesson on managing AI API expenses

Related lessons

aie-03-llm-fundamentals — Understanding the landscape opens the inner mechanics of LLMs
aie-29-cost-management — Deep cost optimization after the baseline pricing table
aie-09-embeddings — Embedding models are a separate category in the landscape
aie-22-model-routing — Model routing is the next step after grasping the landscape
ml-37-bert-gpt — Architecture history: from BERT/GPT to modern production models
aie-37-open-source-models — A close look at open-source alternatives and self-hosted stacks
ml-01-intro