AI Engineering

The AI Landscape in 2026: Who's Who, What's Where, and Where It's All Going

Learning Objectives

  • Know the key players in the AI industry and their strengths
  • Distinguish model types: chat, reasoning, embedding, speech, vision
  • Calculate AI API costs for production
  • Choose a stack for an AI project based on requirements

Prerequisites

  • AI for Backend Devs

January 2023. ChatGPT hit 100 million users in two months - faster than any product in history. Instagram needed 2.5 years to reach the same mark; TikTok needed 9 months. The GPT API drew 2 million developers in its first year. The AI engineering profession was born in a single quarter. Stripe migrated from GPT-4 to Claude in one sprint - cutting costs while improving quality. Without a map of this landscape, architectural decisions go stale faster than the code they are buried in.

  • **Stripe** switched from GPT-4 to Claude for code generation - quality improved 15% and total cost dropped, at a comparable per-token price
  • **DoorDash** uses GPT-4o-mini for classification, GPT-4o for complex analysis - saving USD 200K/year through model routing
  • **Notion** tested 5 different models before landing on the optimal mix for different features of the application
  • **Perplexity AI** handles 100M+ requests per month through a RAG stack spanning multiple providers simultaneously
  • **Character.ai** - 20 billion messages per day, conversation memory at scale with self-hosted models
  • **GitHub Copilot** - embeddings + code-specific LLM, roughly 3 cents per completion on a USD 10/month subscription

How a Whole Market Appeared in 3 Years

  • **Brown et al. 2020** - the GPT-3 paper: 175 billion parameters, few-shot learning without fine-tuning. No public API yet.
  • **June 2020** - Sam Altman and Greg Brockman launch the OpenAI API, the first step toward democratizing AI access: dozens of companies get GPT-3 through a REST endpoint.
  • **November 2022** - ChatGPT reaches 1 million users in 5 days.
  • **March 2023** - the GPT-4 API opens to developers. The role of AI Backend Engineer is born.
  • **July 2023** - Meta releases Llama 2: open weights, runs locally.
  • **2024** - tool calling, multimodal, reasoning models (o1). Claude 3 surpasses GPT-4 on code benchmarks.
  • **2025** - MCP (Model Context Protocol); the average AI engineer salary in the US reaches USD 206K per LinkedIn.
  • **2026** - what started as a one-company show is now a competitive market.

The Major Players: Who Makes AI Models

| Company | Key Models | Strength | API Access |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | Ecosystem, ChatGPT, broad functionality | api.openai.com |
| Anthropic | Claude Sonnet 4.6, Claude Haiku 4.5, Claude Opus 4.7 | Safety, long context (200K-1M), code | api.anthropic.com |
| Google | Gemini 2.5 Pro, Gemini Flash | Multimodal, Google Cloud integration, 1M context | ai.google.dev |
| Meta | Llama 3.1 (8B, 70B, 405B) | Open source, run locally, no vendor lock-in | Free (self-hosted) |
| Mistral | Mistral Large, Mixtral, Codestral | European alternative, open weights, GDPR-friendly | api.mistral.ai |
| xAI | Grok | X/Twitter integration, real-time data | api.x.ai |

**Closed-source vs Open-source:** OpenAI and Anthropic provide access only through APIs - model weights are locked away. Meta and Mistral publish weights - the model can be run on a private server. That changes everything: cost, privacy, compliance, and vendor lock-in risk.

There is a non-obvious freedom here for backend engineers: **no lock-in to a single provider**. The OpenAI SDK and Anthropic SDK are structurally similar. A proper architecture - a thin adapter layer between application code and the vendor SDK - turns a provider switch into a configuration change. The same layer makes model routing straightforward. DoorDash routes this way: GPT-4o-mini for classification, GPT-4o for complex cases, and that routing saves them USD 200K a year.
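A minimal sketch of such an adapter layer, under stated assumptions: `ChatProvider`, `OpenAIAdapter`, `AnthropicAdapter`, and `build_provider` are hypothetical names, the model strings are placeholder labels, and the adapters return canned text instead of calling any vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatProvider(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str:
        ...


@dataclass
class OpenAIAdapter:
    """Stand-in for a wrapper around the OpenAI SDK (no network calls here)."""

    model: str

    def complete(self, prompt: str) -> str:
        # A real adapter would call the vendor SDK and normalize the response.
        return f"[openai:{self.model}] {prompt}"


@dataclass
class AnthropicAdapter:
    """Stand-in for a wrapper around the Anthropic SDK (no network calls here)."""

    model: str

    def complete(self, prompt: str) -> str:
        return f"[anthropic:{self.model}] {prompt}"


# Registry of available adapters, keyed by the name used in configuration.
ADAPTERS = {"openai": OpenAIAdapter, "anthropic": AnthropicAdapter}


def build_provider(config: dict) -> ChatProvider:
    """Pick the adapter from configuration, so a swap is a config change."""
    return ADAPTERS[config["provider"]](model=config["model"])


def summarize_ticket(llm: ChatProvider, ticket: str) -> str:
    """Application code depends only on ChatProvider, never on a vendor SDK."""
    return llm.complete(f"Summarize this ticket: {ticket}")


config = {"provider": "openai", "model": "small-model"}  # placeholder label
print(summarize_ticket(build_provider(config), "Login fails after password reset"))
```

Changing `"provider"` to `"anthropic"` in `config` swaps the backend without touching `summarize_ticket`.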

Which company allows running their models on private infrastructure (open weights)?

Types of Models: Which One for What

Using GPT-4o to classify support tickets is like hiring a cardiac surgeon to slice bread. It'll work. It'll cost 30x more than necessary. **Model routing** - picking the right tool for the task - is one of the first things experienced AI Backend Engineers get right.

| Model Type | What It Does | Examples | When to Use |
|---|---|---|---|
| Chat / Reasoning | Generates text, reasons, writes code | GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro | Chatbots, analysis, generation |
| Small / Fast | Same thing, cheaper and faster | GPT-4o-mini, Claude Haiku 4.5, Gemini Flash | Classification, simple tasks, high RPS |
| Reasoning | Deep step-by-step reasoning | o1, o3, Claude with extended thinking | Math, complex logic, planning |
| Embedding | Converts text into a vector | text-embedding-3-small (dim=1536), text-embedding-3-large | Semantic search, RAG, clustering |
| Speech-to-Text | Recognizes speech | Whisper, Deepgram Nova-2 | Voice interfaces, transcription |
| Text-to-Speech | Vocalizes text, latency <300ms | OpenAI TTS, ElevenLabs | Voice assistants, audio content |
| Image Generation | Generates images | DALL-E 3, Midjourney, Stable Diffusion | Content, design, avatars |
| Vision | Understands images | GPT-4o (vision), Claude (vision) | OCR, screenshot analysis, moderation |

Embedding models deserve special attention: `text-embedding-3-small` converts text into a 1536-dimensional vector for USD 0.02 per million tokens. That is **125x cheaper** than GPT-4o input tokens (USD 2.50 per million), and still 7.5x cheaper than GPT-4o-mini input. Every semantic search in a RAG system, every clustering pipeline - that is all embeddings, not expensive chat models.
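To make "semantic search is all embeddings" concrete, here is a small sketch of ranking documents by cosine similarity. The 3-dimensional vectors and document names are made-up toy values, not real model output; a real system would obtain 1536-dimensional vectors from an embedding model and usually delegate the search to a vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: dict[str, list[float]]):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings (assumption: hand-picked values).
DOCS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "api-rate-limits": [0.1, 0.9, 0.2],
    "office-hours": [0.0, 0.2, 0.9],
}
QUERY = [0.8, 0.2, 0.1]  # pretend embedding of "how do I get my money back?"

for doc_id, score in rank_documents(QUERY, DOCS):
    print(f"{doc_id}: {score:.3f}")
```

The chat model only enters the picture after this step, when the top-ranked documents are placed into its prompt.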

**The reasoning model trap:** o1 and o3 look impressive on benchmarks - but they are 10-50x more expensive and 3-5x slower than standard models. Deploying o3 for ticket classification is Ferrari-to-the-grocery-store engineering. Reasoning models exist for multi-step logic, not for fast inference at scale.
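Model routing can start as nothing more than a lookup table. The sketch below is illustrative: `TaskKind`, `ROUTES`, and `pick_model` are hypothetical names, and the model strings are labels taken from the table above rather than verified API identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    CLASSIFICATION = "classification"
    GENERATION = "generation"
    MULTI_STEP_REASONING = "multi_step_reasoning"


# Cheapest adequate model per task type (assumed mapping for illustration).
ROUTES = {
    TaskKind.CLASSIFICATION: "GPT-4o-mini",
    TaskKind.GENERATION: "GPT-4o",
    TaskKind.MULTI_STEP_REASONING: "o3",
}


def pick_model(task: TaskKind) -> str:
    """Return the model label configured for this kind of task."""
    return ROUTES[task]


print(pick_model(TaskKind.CLASSIFICATION))
print(pick_model(TaskKind.MULTI_STEP_REASONING))
```

Keeping the mapping in data rather than in branching code means a price change or a new model release only touches `ROUTES`.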

For mass classification of 100K support tickets by category, the best choice is:

Pricing: How Much AI Costs in Production

All LLM APIs charge by **tokens** - units of text. Roughly 1 token = 4 characters in English. The word "classification" is 1 token. Pricing is split between input (prompt) and output (generated text). Output is usually several times more expensive (4-5x for most models in the table below) because output tokens are generated one at a time, while the prompt is processed in a single parallel pass.
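For budgeting before any code talks to an API, the 4-characters-per-token rule of thumb can be turned into a quick estimator. This is only the heuristic from the paragraph above, not a tokenizer; `estimate_tokens` is a hypothetical helper, and exact counts come from the provider's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rough average for English text, per the rule of thumb


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length (English only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


prompt = "Classify this support ticket into billing, bug, or feature request."
print(estimate_tokens(prompt))
```

The estimate drifts for code, non-English text, and unusual formatting, so it is best treated as an order-of-magnitude check.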

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
|---|---|---|---|
| GPT-4o | USD 2.50 | USD 10.00 | 128K |
| GPT-4o-mini | USD 0.15 | USD 0.60 | 128K |
| Claude Opus 4.7 | USD 5.00 | USD 25.00 | 1M |
| Claude Sonnet 4.6 | USD 3.00 | USD 15.00 | 200K |
| Claude Haiku 4.5 | USD 1.00 | USD 5.00 | 200K |
| Gemini 2.5 Pro | USD 1.25 | USD 10.00 | 1M |
| Gemini Flash | USD 0.075 | USD 0.30 | 1M |
| DeepSeek V3 | USD 0.014 | USD 0.028 | 128K |
| Llama 3.1 70B (self-hosted) | ~USD 0.50 | ~USD 1.00 | 128K |

**Example cost calculation for a chatbot:**
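A sketch of the arithmetic, using prices from the table above. The workload (5,000 conversations per day, 1,000 input and 300 output tokens each) is an assumed example, and `daily_cost` is a hypothetical helper.

```python
# (input, output) price in USD per 1M tokens, copied from the pricing table.
PRICES_PER_1M = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
}


def daily_cost(model: str, conversations_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """Daily spend in USD: tokens / 1M * price, summed over input and output."""
    input_price, output_price = PRICES_PER_1M[model]
    input_cost = conversations_per_day * input_tokens / 1_000_000 * input_price
    output_cost = conversations_per_day * output_tokens / 1_000_000 * output_price
    return round(input_cost + output_cost, 2)


# Assumed workload: 5,000 conversations/day, 1,000 input + 300 output tokens.
for model in PRICES_PER_1M:
    print(model, daily_cost(model, 5_000, 1_000, 300))
```

For this workload GPT-4o comes to USD 27.50 per day (12.50 input + 15.00 output) and GPT-4o-mini to USD 1.65 per day (0.75 + 0.90), a gap of roughly 16.7x.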

These are not hypothetical numbers. Stripe moved from GPT-4 to Claude for code generation: quality went up 15%, cost went down. Notion tried five different models before locking in their optimal mix. **Model selection is an engineering decision that gets revisited every few months** as new SOTA drops and prices fall.

**The rule:** start with the cheapest model (mini/haiku/flash), upgrade only when quality is not meeting the bar. Most production tasks do not need flagship models. Gemini Flash is USD 0.075/1M input, Gemini 2.5 Pro is USD 1.25/1M - a roughly 17x difference. For many tasks the quality difference is negligible.

A chatbot processes 10,000 conversations per day. Average conversation: 400 input + 200 output tokens. How much does a day of operation cost on GPT-4o-mini (input USD 0.15/1M, output USD 0.60/1M)? Answer in dollars, rounded to cents.

How to Choose a Stack for an AI Project

Choosing a model is not about "which one is the smartest." It is an engineering decision: privacy, budget, latency, context window, compliance. Here is the decision matrix:
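One way such requirements might be encoded as ordered rules is sketched below. The fields, the rule order, and the 200K threshold are illustrative assumptions rather than the lesson's own matrix; the model names are examples from the tables above.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    data_must_stay_on_premises: bool  # privacy / compliance constraint
    max_context_tokens: int           # largest prompt the feature must handle
    needs_flagship_quality: bool      # small model already failed the quality bar


def recommend(req: Requirements) -> str:
    """Apply hard constraints first, then default to the cheapest option."""
    if req.data_must_stay_on_premises:
        return "self-hosted open-weights model (e.g. Llama 3.1)"
    if req.max_context_tokens > 200_000:
        return "long-context API model (e.g. Gemini 2.5 Pro)"
    if req.needs_flagship_quality:
        return "flagship API model (e.g. GPT-4o)"
    return "small API model (e.g. GPT-4o-mini)"


print(recommend(Requirements(True, 8_000, False)))
print(recommend(Requirements(False, 8_000, False)))
```

Hard constraints such as data residency come first because no price or quality advantage can override them; budget and latency only decide among the options that remain.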

**Typical AI Backend Engineer stack in 2026:**

  • **Runtime:** Node.js / TypeScript (or Python for ML-heavy tasks)
  • **Framework:** NestJS / Fastify / Express
  • **LLM:** OpenAI SDK + Anthropic SDK (fallback)
  • **Embeddings:** text-embedding-3-small + pgvector, 1536 dim
  • **STT/TTS:** Whisper + ElevenLabs streaming (latency <300ms)
  • **Vector DB:** pgvector (when PostgreSQL is already in the stack) or Qdrant (when speed matters)
  • **Orchestration:** LangChain.js or custom pipeline
  • **Monitoring:** Langfuse / Helicone - cost per request, p95 latency
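The "OpenAI SDK + Anthropic SDK (fallback)" line in the stack can be sketched as a simple fallback chain. Everything here is hypothetical: `complete_with_fallback` is an illustrative helper, and the two provider functions are stand-ins that simulate an outage and a healthy backup instead of calling real SDKs.

```python
from typing import Callable, Sequence


def complete_with_fallback(prompt: str,
                           providers: Sequence[Callable[[str], str]]) -> str:
    """Try providers in order and return the first successful completion."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # real code would catch narrower error types
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


def flaky_primary(prompt: str) -> str:
    """Stand-in for the primary provider during an outage."""
    raise TimeoutError("primary provider timed out")


def stable_fallback(prompt: str) -> str:
    """Stand-in for the secondary provider."""
    return f"fallback answer to: {prompt}"


print(complete_with_fallback("ping", [flaky_primary, stable_fallback]))
```

A production version would typically add timeouts, retry limits, and a metric for how often the fallback path is taken.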

One idea threads through all of this: **abstraction over the provider**. The AI industry moves every 3-6 months. A new SOTA lasts about 2 months. Claude Sonnet can beat GPT-4o on code tasks - and teams that hard-coded a single SDK throughout their codebase end up rewriting integrations. Teams that built a thin adapter layer change one config line.

**The standing rule in AI Engineering:** never pick a stack forever. Design so that swapping a model or provider takes hours, not weeks. That gap - hours vs weeks - is one of the clearest signals between junior and senior AI Backend Engineers.

An AI product is being built for a European bank. Client data cannot leave the bank's infrastructure. What approach to model selection?

**Myth:** Pick one model and use it for everything

**Reality:** Production AI systems use different models for different tasks - that is called model routing

Classifying a support ticket and writing complex code have different error costs and different optimal quality/price tradeoffs. GPT-4o for classification is a roughly 16x overspend versus GPT-4o-mini with no quality gain. Claude Sonnet for code generation over GPT-4o is +15% quality at comparable cost. The engineering answer: a router that picks the model for the task.

**Myth:** Open-source models are a quality compromise

**Reality:** Llama 3.1 405B competes with GPT-4o on many tasks, and self-hosted costs 5-10x less

Llama 3.1 70B through a hosted inference provider such as Groq costs around USD 0.05/1M tokens vs USD 2.50/1M for GPT-4o input. On classification, summarization, and structured extraction tasks the quality gap is minimal or zero. The cost gap is 50x. Financial and medical organizations go self-hosted not for savings but for compliance - and they do not compromise on quality to get there.

Key Takeaways

  • OpenAI and Anthropic lead closed-source. Meta and Mistral lead open-source with full data control and no vendor lock-in
  • Flagship models for complex tasks. GPT-4o-mini / Claude Haiku / Gemini Flash for high-volume and simple ones. Price gap: 10-30x
  • text-embedding-3-small (USD 0.02/1M) for semantic search in RAG - a separate category from chat models, 125x cheaper than GPT-4o input
  • Cost is an engineering variable: DoorDash saves USD 200K/year through model routing
  • Abstract over the provider - new SOTA lasts about 2 months, and swapping should take hours, not weeks
  • Sam Altman and Greg Brockman launched the OpenAI API in June 2020 - that moment democratized AI access and created the AI Backend Engineer profession

Questions for Reflection

  • You are building a support chatbot for a SaaS product that handles 50K requests per day. Which model would you start with, and what reasoning leads you there?
  • What is the difference between a compliance reason and an economic reason for going self-hosted? Which dominates in which kind of project?
  • If GPT-4o is the best model today, why does provider abstraction still matter? Why design for the ability to switch?

What's Next

The landscape is mapped. Now we go under the hood: how LLMs work internally - tokens, embeddings, attention - and why that matters for engineering decisions.

  • How LLMs Work Internally — What happens inside a model when a prompt is sent
  • Cost and Optimization — Detailed lesson on managing AI API expenses

Related Lessons

  • aie-03-llm-fundamentals — Understanding the landscape opens the inner mechanics of LLMs
  • aie-29-cost-management — Deep cost optimization after the baseline pricing table
  • aie-09-embeddings — Embedding models are a separate category in the landscape
  • aie-22-model-routing — Model routing is the next step after grasping the landscape
  • ml-37-bert-gpt — Architecture history: from BERT/GPT to modern production models
  • aie-37-open-source-models — A close look at open-source alternatives and self-hosted stacks
  • ml-01-intro