AI Engineering

Model Routing: Choosing Models on the Fly - GPT-4 vs Claude vs Local Models

Lesson Objectives

  • Understand why model routing saves 60-80% of budget without quality loss
  • Implement complexity-based routing: keyword routing, LLM classifier, embedding KNN
  • Build a cost-based router with cost per request tracking via Redis
  • Implement latency-based routing with sliding window p95 and fallback chains
  • Assemble a universal Model Router: complexity x cost x latency → weighted score

70% of LLM requests are simple: rephrase, translate, extract a field. GPT-4o on them is like a Ferrari fetching groceries. gpt-4o-mini handles them for roughly 17x less. Model routing is the director that knows the difference. One TypeScript class with routing logic cut an AI startup's bill from `USD 47,000` to `USD 18,000` per month. The RouteLLM work from LMSYS points the same way: 60-80% savings are realistic with only minor quality degradation.

  • OpenRouter - USD 10M+ ARR on a single idea: unified API + model routing for 100+ models
  • Cursor (IDE) - routing: autocomplete → fast local model, code generation → GPT-4o/Claude Sonnet
  • ChatGPT Plus - internal routing: simple questions → GPT-4o-mini, complex → GPT-4o (o1)
  • LiteLLM - open-source router with fallback chains, cost tracking, and budget limits out of the box

Prerequisites

  • Orchestration Patterns: routing, fallback, chain, map-reduce, branching

Model routing and the multi-model era

By 2024 engineers had a choice of dozens of models at different price and quality points: the GPT, Claude, and Gemini families plus open models. Paying for a top-tier model on every request is expensive, so the idea of routing emerged: send simple requests to a cheap model and hard ones to a strong model. In June 2024 the LMSYS team (the group behind Chatbot Arena) published the RouteLLM paper (arXiv 2406.18665), and in July an open-source framework and blog post followed. RouteLLM trains routers on preference data and dynamically picks between a strong and a weak model; by their measurements this cuts cost sharply while keeping most of the top model's quality on popular benchmarks. That turned cost and quality routing into its own engineering topic of the multi-model era.

Why Model Routing Matters

70% of LLM requests are simple: rephrase this, translate that, extract a field from JSON, answer an FAQ. GPT-4o on them is like a Ferrari fetching groceries. gpt-4o-mini handles the same for roughly 17x less. Model routing is the traffic director that knows the difference.

January 2025. An AI startup opens an OpenAI invoice: **`USD 47,000`**. An analyst digs through the logs: 68% of requests are "What's the status of order #1234?". GPT-4o processes each one for about `USD 0.01`. At the list prices in the table below, GPT-4o-mini would do the same for roughly `USD 0.0006` - about 17x cheaper. After implementing routing, the bill drops to `USD 18,000`. One TypeScript class saves `USD 29,000` per month.

LLMs differ along three axes - cost, latency, and quality - and the gaps between them are **enormous**:

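| Model | Cost (input, per 1M tokens) | Cost (output, per 1M tokens) | Latency (TTFT) | Quality (MMLU) |
| --- | --- | --- | --- | --- |
| GPT-4o | USD 2.50 | USD 10.00 | ~500ms | 88.7% |
| GPT-4o-mini | USD 0.15 | USD 0.60 | ~200ms | 82.0% |
| Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | ~600ms | 88.7% |
| Claude 3.5 Haiku | USD 0.80 | USD 4.00 | ~300ms | 84.0% |
| Llama 3.1 70B (local) | USD 0.00 | USD 0.00 | ~150ms | 82.0% |
| Llama 3.1 8B (local) | USD 0.00 | USD 0.00 | ~50ms | 68.0% |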

GPT-4o costs **17x more** than GPT-4o-mini. At 100K requests per day, the math is merciless:
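A rough sketch of that math, using the list prices from the table above. The token counts per request (500 input, 200 output) and the 70% share of simple traffic are illustrative assumptions, not measurements; the function and constant names are made up for this example.

```typescript
// Prices in USD per 1M tokens, taken from the table above.
interface ModelPrice {
  inputPerM: number;
  outputPerM: number;
}

const PRICES: Record<string, ModelPrice> = {
  "gpt-4o": { inputPerM: 2.5, outputPerM: 10.0 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
};

// Cost of one request for a given token usage.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1_000_000;
}

// Assumed workload: 100K requests/day, 500 input + 200 output tokens each.
const REQUESTS_PER_DAY = 100_000;
const SIMPLE_SHARE = 0.7; // assumed share of requests a cheap model can handle

const strongCost = requestCost("gpt-4o", 500, 200);
const cheapCost = requestCost("gpt-4o-mini", 500, 200);

const allStrongPerDay = REQUESTS_PER_DAY * strongCost;
const routedPerDay =
  REQUESTS_PER_DAY * (SIMPLE_SHARE * cheapCost + (1 - SIMPLE_SHARE) * strongCost);
const savingsShare = 1 - routedPerDay / allStrongPerDay;

console.log(`All GPT-4o: USD ${allStrongPerDay.toFixed(2)}/day`);
console.log(`Routed:     USD ${routedPerDay.toFixed(2)}/day`);
console.log(`Savings:    ${(savingsShare * 100).toFixed(1)}%`);
```

Under these assumptions the all-GPT-4o bill is about USD 325 per day, the routed bill about USD 111 per day, and the saving lands near 66%. Different token counts change the absolute numbers but not the ratio, because both models are priced per token.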

Model routing is not a plain cost-vs-quality trade-off. Done well, it is **optimization with little measurable quality loss**. GPT-4o-mini handles simple requests with practically the same result as GPT-4o. The difference only shows up on complex tasks: multi-step reasoning, intricate code, nuanced communication. Results reported for RouteLLM (LMSYS) and experience with routers such as LiteLLM point the same way: savings in the 60-80% range are achievable with only minor benchmark degradation.

Three routing strategies - they do not compete, they complement each other:

  1. **Complexity-based** - assess query complexity and choose a model (covered next)
  2. **Cost-based** - per-user/session budget limit, switch when approaching the cap
  3. **Latency-based** - for real-time apps choose a fast model; for batch - the highest quality one

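With 100K requests per day, 70% of which are simple, model routing saves ~66% of the budget. How? The simple 70% move to a model that costs about 6% as much, so the bill becomes 0.30 + 0.70 × 0.06 ≈ 0.34 of the original.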

Complexity-Based Routing: Assessing Query Complexity

Query complexity is inherently subjective. "Translate 'hello' to French" - simple. "Analyze trade-offs between microservices and monolith for a startup with 3 engineers" - complex. The challenge: teach the system to tell the difference without human input, in under a millisecond.

Three approaches to complexity assessment. They form a cascade - from fast to accurate:

Approach 1: Heuristic-Based (rule-based) - keyword routing
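A minimal sketch of a keyword-based scorer that returns a 0-100 complexity score. The keyword lists, weights, and word-count thresholds are illustrative assumptions and would need tuning on real traffic.

```typescript
// Hypothetical keyword lists; tune them on your own request logs.
const COMPLEX_HINTS = [
  "analyze", "trade-off", "compare", "design",
  "architecture", "prove", "refactor", "step by step",
];
const SIMPLE_HINTS = [
  "translate", "rephrase", "extract", "summarize", "what is", "status of",
];

// Returns a score from 0 (trivial) to 100 (hard). Pure string work, no I/O.
function heuristicComplexity(query: string): number {
  const text = query.toLowerCase();
  let score = 30; // neutral starting point

  for (const hint of COMPLEX_HINTS) if (text.includes(hint)) score += 25;
  for (const hint of SIMPLE_HINTS) if (text.includes(hint)) score -= 15;

  // Longer prompts tend to carry more context and more constraints.
  const words = text.split(/\s+/).filter(Boolean).length;
  if (words > 100) score += 25;
  else if (words > 40) score += 10;
  else if (words < 12) score -= 10;

  if (text.includes("```")) score += 15; // embedded code block

  return Math.max(0, Math.min(100, score));
}

console.log(heuristicComplexity("Translate 'hello' to French"));
console.log(
  heuristicComplexity(
    "Analyze trade-offs between microservices and monolith for a startup with 3 engineers",
  ),
);
```

With these weights the translation request scores below 10 and the architecture question scores above 70. Requests that land in between are the ones worth passing to a more accurate classifier.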

Approach 2: LLM-Based Classifier - LLM routing

This is the question RouteLLM (LMSYS, 2024) studied: a cheap router decides whether the top-tier model is needed. Here a small model such as gpt-4o-mini plays the classifier. The ROI of classification is roughly 400x - the call costs about `USD 0.000024`, while correctly routing a single request saves `USD 0.01` or more.
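A minimal sketch of an LLM-based classifier. The provider call is injected as `callModel`, a hypothetical function type, so the example does not depend on any specific SDK. The prompt wording and the fail-safe default are assumptions.

```typescript
type Complexity = "simple" | "complex";

// Injected transport: send a prompt to a cheap model, get raw text back.
// Hypothetical signature; wire it to whatever provider client you use.
type CallModel = (prompt: string) => Promise<string>;

function buildClassifierPrompt(query: string): string {
  return [
    "Classify the user query by the reasoning effort it needs.",
    'Answer with exactly one word: "simple" or "complex".',
    "",
    `Query: ${query}`,
  ].join("\n");
}

// Parse the reply defensively. Anything unexpected falls back to "complex",
// so a classifier glitch never downgrades a hard request.
function parseClassifierReply(reply: string): Complexity {
  const word = reply.trim().toLowerCase().replace(/[^a-z]/g, "");
  return word === "simple" ? "simple" : "complex";
}

async function classifyWithLlm(query: string, callModel: CallModel): Promise<Complexity> {
  try {
    const reply = await callModel(buildClassifierPrompt(query));
    return parseClassifierReply(reply);
  } catch {
    return "complex"; // fail safe on network or provider errors
  }
}
```

Forcing a one-word answer keeps the classifier's output tokens, and therefore its cost, close to the minimum.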

Approach 3: Embedding-Based Classifier
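The embedding approach avoids a per-request LLM call: embed the query, find the k nearest labeled examples, and take a majority vote. A minimal sketch, assuming the embeddings come from some external embedding model (not shown) and that a small set of hand-labeled example queries exists. The type and function names are hypothetical.

```typescript
interface LabeledExample {
  embedding: number[];
  label: "simple" | "complex";
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// k-nearest-neighbours vote over labeled example embeddings.
function knnClassify(
  query: number[],
  examples: LabeledExample[],
  k = 3,
): "simple" | "complex" {
  const nearest = examples
    .map((ex) => ({ label: ex.label, sim: cosineSimilarity(query, ex.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k);
  const complexVotes = nearest.filter((n) => n.label === "complex").length;
  // Ties and an empty example set resolve to "complex", the safer default.
  return complexVotes * 2 >= nearest.length ? "complex" : "simple";
}
```

A linear scan like this is fine for a few hundred examples; a larger example set would call for a vector index.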

In practice, a **cascade** works best: the heuristic is the fast path (free, ~0ms). If score < 10 or score > 70, the decision is instant; if the score falls in the 10-70 range, call the LLM classifier. LiteLLM's router supplies the surrounding plumbing out of the box - fallback chains with cost tracking at every step.
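The cascade can be sketched as a small function that only pays for the LLM classifier when the heuristic is undecided. Both stages are injected as functions here so the example stays self-contained; the `CascadeDeps` shape and the tier names are assumptions, while the 10 and 70 thresholds come from the text.

```typescript
type Tier = "cheap" | "strong";

interface CascadeDeps {
  // Free and instant, returns 0..100.
  heuristicScore: (query: string) => number;
  // Paid and slower; only called for ambiguous scores.
  llmClassify: (query: string) => Promise<"simple" | "complex">;
}

interface CascadeResult {
  tier: Tier;
  via: "heuristic" | "llm";
}

async function routeByComplexity(query: string, deps: CascadeDeps): Promise<CascadeResult> {
  const score = deps.heuristicScore(query);

  // Fast path: clear-cut cases never reach the LLM classifier.
  if (score < 10) return { tier: "cheap", via: "heuristic" };
  if (score > 70) return { tier: "strong", via: "heuristic" };

  // Ambiguous band (10..70): ask the classifier.
  const verdict = await deps.llmClassify(query);
  return { tier: verdict === "simple" ? "cheap" : "strong", via: "llm" };
}
```

Returning `via` alongside the tier makes it easy to log how often the paid classifier is actually needed.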

The query 'What is the capital of France?' goes through the complexity router. Which approach determines its complexity fastest and cheapest?

Cost-Based Routing: Budget Control

Complexity routing optimizes individual requests. But there are scenarios where the **budget at the user, session, or daily level** is what matters. A freemium service gives 10 GPT-4o requests for free, then switches to mini. An enterprise client has a monthly cap of `USD 500`, after which degradation kicks in.

Cursor does exactly this: the free plan gets a local model, Pro gets GPT-4o-mini, Business gets GPT-4o. The code is the same - only the budget limit passed to the router differs. Tracking cost per request in Redis is a common choice for B2C AI SaaS.

In production, cost per request tracking runs through Redis - atomic and fast:
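A minimal sketch of per-user daily spend tracking. To keep it runnable without a Redis server, the store is a small interface with an in-memory stand-in. With a real Redis client, `incrByFloat` would map to the `INCRBYFLOAT` command and `expire` to `EXPIRE`. The interface, the key format, and the two-day TTL are assumptions, not a specific client library's API.

```typescript
// Hypothetical adapter over a Redis client.
interface SpendStore {
  incrByFloat(key: string, amount: number): Promise<number>;
  get(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemorySpendStore implements SpendStore {
  private data = new Map<string, number>();

  async incrByFloat(key: string, amount: number): Promise<number> {
    const next = (this.data.get(key) ?? 0) + amount;
    this.data.set(key, next);
    return next;
  }
  async get(key: string): Promise<number> {
    return this.data.get(key) ?? 0;
  }
  async expire(_key: string, _seconds: number): Promise<void> {
    // TTL handling is a no-op in memory.
  }
}

// One counter per user per UTC day, e.g. "spend:u1:2025-01-15".
function dailyKey(userId: string, now: Date): string {
  return `spend:${userId}:${now.toISOString().slice(0, 10)}`;
}

// Pure decision: stay on the strong model only if the request still fits.
function pickModelForBudget(
  spent: number,
  limit: number,
  strongCost: number,
): "gpt-4o" | "gpt-4o-mini" {
  return spent + strongCost <= limit ? "gpt-4o" : "gpt-4o-mini";
}

// Record the actual cost after the call completes.
async function recordSpend(
  store: SpendStore,
  userId: string,
  cost: number,
  now: Date = new Date(),
): Promise<number> {
  const key = dailyKey(userId, now);
  const total = await store.incrByFloat(key, cost); // one atomic increment in Redis
  await store.expire(key, 2 * 24 * 60 * 60); // let old day counters age out
  return total;
}
```

The increment itself is atomic, but the read-then-decide step in `pickModelForBudget` is not, so concurrent requests can overshoot the limit by a small amount. For most budgets that tolerance is acceptable; a hard cap would need a reservation step or a server-side script.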

Cost routing is critical for **B2C SaaS**: ChatGPT, Notion AI, Cursor. A user pays `USD 20/month`, but GPT-4o costs `USD 0.03` per request. At 100 requests per day that is `USD 90/month` in model spend - a `USD 70/month` loss per user. A cost router switches to mini after a threshold and keeps unit economics from going underwater.

A user on the Pro plan (USD 5/day) has spent USD 4.85. A complex request costing USD 0.03 on GPT-4o comes in. What should the cost router do?

Latency-Based Routing: Choosing by Speed

In real-time applications - chatbots, IDE assistants, voice helpers - latency beats quality. Users accept a slightly weaker response at 200ms over a brilliant one at 3 seconds. **Latency routing** selects the model based on the acceptable response window.

| Use Case | Target Latency | Optimal Model | Why |
| --- | --- | --- | --- |
| Autocomplete in IDE | <200ms | Local Llama 8B | Instant response, acceptable quality |
| Chat reply (first token) | <500ms | GPT-4o-mini / Haiku | Fast TTFT, solid quality |
| Document analysis | <5s | GPT-4o / Claude Sonnet | Quality over speed |
| Batch processing | No limit | GPT-4o / Claude Opus | Max quality, time is not critical |
| Voice assistant | <300ms | GPT-4o-mini + streaming | UX demands instant reaction |

Provider latency is **unstable**. GPT-4o can respond in 500ms in the morning and 3000ms in the evening (US peak traffic). The router must use a **sliding window p95** of recent requests - not fixed numbers from documentation. A percentile is also more robust than the arithmetic mean: a single timeout barely moves the p95 of a large window, but it drags the mean up noticeably.
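A minimal sketch of a sliding-window p95 tracker plus a fallback chain that picks the first model whose observed p95 fits the target. The window size of 100, the nearest-rank percentile method, and the class and function names are assumptions.

```typescript
class LatencyWindow {
  private samples: number[] = [];
  private readonly size: number;

  constructor(size = 100) {
    this.size = size;
  }

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.size) this.samples.shift(); // drop the oldest
  }

  // Nearest-rank percentile; undefined until at least one sample exists.
  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(0, rank - 1)];
  }
}

// Walk the chain (best quality first) and return the first model whose
// observed p95 fits the target. Models with no data yet are skipped.
// If nothing fits, fall back to the last entry in the chain.
function pickByLatency(
  chain: string[],
  windows: Map<string, LatencyWindow>,
  targetMs: number,
): string {
  if (chain.length === 0) throw new Error("Fallback chain is empty");
  for (const model of chain) {
    const p95 = windows.get(model)?.percentile(95);
    if (p95 !== undefined && p95 <= targetMs) return model;
  }
  return chain[chain.length - 1];
}
```

With 19 samples at 200ms and one timeout at 3000ms, the nearest-rank p95 stays at 200ms while the arithmetic mean jumps to 340ms, which is the robustness argument made above.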

The advanced strategy is **Adaptive Routing** based on **UCB1 (Upper Confidence Bound)** from multi-armed bandit theory. The system automatically balances exploitation (use the best provider) and exploration (try others - they might have improved):
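A minimal sketch of UCB1 selection over providers. Each provider is an arm; the score is the mean reward plus the exploration bonus sqrt(2 ln N / n), where N is the total number of pulls and n the pulls of that arm. Rewards must lie in [0, 1]. The latency-to-reward mapping at the bottom is one possible choice and an assumption of this example.

```typescript
interface ArmStats {
  pulls: number;
  totalReward: number;
}

class Ucb1Selector {
  private arms = new Map<string, ArmStats>();
  private totalPulls = 0;

  constructor(providers: string[]) {
    for (const p of providers) this.arms.set(p, { pulls: 0, totalReward: 0 });
  }

  select(): string {
    let best = "";
    let bestScore = -Infinity;
    for (const [provider, stats] of this.arms) {
      if (stats.pulls === 0) return provider; // try every arm once first
      const mean = stats.totalReward / stats.pulls;
      const bonus = Math.sqrt((2 * Math.log(this.totalPulls)) / stats.pulls);
      if (mean + bonus > bestScore) {
        bestScore = mean + bonus;
        best = provider;
      }
    }
    return best;
  }

  // reward is expected in [0, 1]
  update(provider: string, reward: number): void {
    const stats = this.arms.get(provider);
    if (!stats) throw new Error(`Unknown provider: ${provider}`);
    stats.pulls += 1;
    stats.totalReward += reward;
    this.totalPulls += 1;
  }
}

// Assumed reward mapping: faster responses score closer to 1,
// anything at or beyond worstMs scores 0.
function latencyReward(latencyMs: number, worstMs = 3000): number {
  return Math.max(0, 1 - latencyMs / worstMs);
}
```

The bonus term shrinks as an arm accumulates pulls, so a provider that looked slow early on still gets retried occasionally and can win back traffic if it improves.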

A voice assistant requires TTFT < 300ms. GPT-4o shows p95 = 600ms, GPT-4o-mini shows p95 = 250ms. What is the router's decision?

Implementing a Universal Model Router

All three strategies - complexity, cost, latency - combine into one **Model Router**. It is the central component: it receives a request with metadata and returns the optimal model. LiteLLM router and Portkey do exactly this - with a SaaS layer on top. An in-house implementation avoids vendor lock-in.
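A minimal sketch of such a router, not a production implementation. Budget and target latency act as hard filters; the remaining candidates are ranked by a weighted score. The 0..1 quality scale, the weights, the shortfall penalty, and every name below are assumptions made for illustration.

```typescript
interface ModelProfile {
  id: string;
  quality: number;        // 0..1, relative capability (assumed scale)
  costPerRequest: number; // USD, estimated
  p95Ms: number;          // observed sliding-window p95
}

interface RouteRequest {
  complexity: number;       // 0..100 from the complexity stage
  budgetRemaining: number;  // USD left for this user or session
  targetLatencyMs?: number; // optional hard latency ceiling
}

interface Weights {
  quality: number;
  cost: number;
  latency: number;
}

const DEFAULT_WEIGHTS: Weights = { quality: 0.6, cost: 0.25, latency: 0.15 };
const SHORTFALL_PENALTY = 4; // how hard to punish a model weaker than needed

function routeModel(
  req: RouteRequest,
  models: ModelProfile[],
  w: Weights = DEFAULT_WEIGHTS,
): ModelProfile {
  if (models.length === 0) throw new Error("No models configured");

  // Hard constraints first: remaining budget and latency ceiling.
  const feasible = models.filter(
    (m) =>
      m.costPerRequest <= req.budgetRemaining &&
      (req.targetLatencyMs === undefined || m.p95Ms <= req.targetLatencyMs),
  );
  if (feasible.length === 0) {
    // Nothing fits: degrade to the cheapest model rather than fail.
    return models.reduce((a, b) => (b.costPerRequest < a.costPerRequest ? b : a));
  }

  const maxCost = Math.max(...feasible.map((m) => m.costPerRequest)) || 1;
  const maxLatency = Math.max(...feasible.map((m) => m.p95Ms)) || 1;
  const need = req.complexity / 100;

  const score = (m: ModelProfile): number => {
    // Full marks when the model is strong enough; no bonus for overshooting.
    const shortfall = Math.max(0, need - m.quality);
    const qualityFit = Math.max(0, 1 - shortfall * SHORTFALL_PENALTY);
    const costScore = 1 - m.costPerRequest / maxCost;
    const latencyScore = 1 - m.p95Ms / maxLatency;
    return w.quality * qualityFit + w.cost * costScore + w.latency * latencyScore;
  };

  return feasible.reduce((best, m) => (score(m) > score(best) ? m : best));
}
```

Because quality earns no bonus beyond "good enough", a simple request lets cost and latency decide, while a complex request makes the quality term dominate.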

Using the router in a NestJS service:
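A minimal sketch of the service shape. To stay self-contained it is written as a plain class with constructor injection; in a NestJS app the class would carry the `@Injectable()` decorator from `@nestjs/common` and the two dependencies would be registered as providers. `ModelRouterPort`, `LlmClientPort`, and `ChatService` are hypothetical names.

```typescript
interface RoutingDecision {
  model: string;
}

// Hypothetical port for the router described above.
interface ModelRouterPort {
  route(input: { query: string; userId: string }): Promise<RoutingDecision>;
}

// Hypothetical port for whichever provider SDK is in use.
interface LlmClientPort {
  complete(model: string, prompt: string): Promise<string>;
}

// In NestJS: add @Injectable() and let the framework inject both ports.
class ChatService {
  private readonly router: ModelRouterPort;
  private readonly llm: LlmClientPort;
  private readonly fallbackModel: string;

  constructor(router: ModelRouterPort, llm: LlmClientPort, fallbackModel = "gpt-4o-mini") {
    this.router = router;
    this.llm = llm;
    this.fallbackModel = fallbackModel;
  }

  async reply(userId: string, query: string): Promise<{ model: string; text: string }> {
    const { model } = await this.router.route({ query, userId });
    try {
      return { model, text: await this.llm.complete(model, query) };
    } catch (err) {
      if (model === this.fallbackModel) throw err;
      // One-step fallback chain: retry once on the cheaper model.
      const text = await this.llm.complete(this.fallbackModel, query);
      return { model: this.fallbackModel, text };
    }
  }
}
```

Returning the model id with the text lets the caller log which model actually served the request, which is what cost tracking needs.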

Services that implement model routing as a product - worth knowing:

| Service | What It Does | Pricing Model |
| --- | --- | --- |
| OpenRouter | Unified API for 100+ models, routing by price/quality | Pass-through + markup |
| Martian | AI router with automatic model selection | SaaS, per-request |
| Portkey | Gateway with fallback, load balancing, caching | Open-source + cloud |
| LiteLLM | Unified API + budget management + routing | Open-source |

Start simple: heuristic complexity scorer + 2 models (GPT-4o and GPT-4o-mini). This covers 90% of model routing value. Embedding classifiers, adaptive UCB1, multi-provider fallback chains - those are second-order optimizations. Specialization + routing beats one "best" model for everything.

The Model Router receives a request: complexity=15, user budget is 95% exhausted, no target latency set. Which model will a correct implementation choose?

Myth: one best model for everything - it will handle any request well enough

Reality: specialization + routing beats universality - different models are optimal for different tasks

GPT-4o-mini on simple questions achieves 98% of GPT-4o quality at 6% of the cost. RouteLLM (LMSYS), the LiteLLM router, and real-world cases all point the same way: 60-80% savings with proper routing and no perceptible degradation. 'One best model' means paying for a Ferrari to fetch groceries.

Summary

  • Model routing - a request director: simple → cheap model, complex → top-tier. Savings: 60-80%
  • Complexity routing: keyword routing (0ms, free) → LLM classifier (accurate, USD 0.00002) → embedding KNN
  • Cost routing: cost per request tracking in Redis, downgrade as the budget limit approaches
  • Latency routing: sliding window p95, fallback chains. Latency instability is the main enemy
  • Universal Router: complexity x cost x latency → weighted score → optimal model
  • Start with heuristic + 2 models (GPT-4o/mini). That is 90% of the value. Specialization + routing beats one best model

What's Next

Model routing optimizes model selection. The next topics cover optimizing the calls themselves: response caching, prompt compression, rate limit management.

  • Caching and Optimization — Semantic cache for LLM responses, prompt compression, KV cache - another 30-50% savings
  • Cost Management — A complete cost strategy: routing + caching + prompt optimization
  • Rate Limiting for AI — Token bucket, sliding window, per-user limits - protection against overspending

Related Lessons

  • aie-21-orchestration-patterns — Routing is a specific orchestration pattern
  • aie-28-caching-optimization — Routing and caching together cut cost
  • aie-29-cost-management — Routing is the core lever of cost management
  • aie-30-rate-limiting-ai — Routing decisions interact with per-model limits
  • net-37-load-balancing — Model routing is load balancing across model backends
  • ml-05-evaluation — Routing by difficulty needs a query classifier