AI Engineering

LLM API Integration: OpenAI, Anthropic, Open-Source Models

Цели урока

Learn to call the Chat Completions API from Node.js
Understand message roles (system, user, assistant) and context management
Implement streaming for real-time response display
Master error handling and retry strategies for LLM APIs
Design an abstraction for switching between providers

Предварительные знания

How LLMs Work Internally

LLM Fundamentals

The first OpenAI API call is 10 lines of code. The first production-ready call with retry, timeout, and fallback is 200 lines. That gap is what separates a demo from a product that survives a HackerNews spike. And the API key sitting in that code? Not just a string - it's direct access to a payment account. One accidental commit to a public repo, and bots will drain it within hours. This lesson is about the gap between "it works on my machine" and "it works at 3 AM when nobody's watching."

Notion AI - integrated GPT into an existing product, +USD 100M ARR in the first year, zero ML engineers on the team
Stripe uses GPT-4 for automatic support ticket classification - 40% less load on the support team
Duolingo switched error explanations to GPT-4 - saved months of content team work
Cursor (AI IDE) - USD 100M ARR built on solid LLM API architecture, not a proprietary model

The API That Opened an Industry

**2020**: OpenAI launched the world's first commercial LLM API on GPT-3. Models became accessible over HTTP - no GPU, no training, no PhD required. That was the moment AI stopped being a research tool and became a developer tool. **November 30, 2022**: ChatGPT as a public demo - 1 million users in 5 days. **March 1, 2023**: API for GPT-3.5-turbo at `USD 0.002` per 1K tokens - the true start of AI-as-a-Service as a market. Hundreds of thousands of developers signed up in the first week. By the end of 2023, more requests were going through the API than through ChatGPT itself. OpenAI didn't just build a model - it built a market.

Chat Completions API - The Core Interface

The first OpenAI API call is 10 lines of code. The first production-ready call with retry, timeout, and fallback is 200 lines. That gap is exactly what separates a demo from a product that doesn't crash at 2 AM.

All modern LLMs operate through the **Chat Completions API** - send an array of messages, receive a response. OpenAI, Anthropic, a local model via Ollama - the interface is nearly identical. Not a coincidence: OpenAI set the standard, and everyone else adopted it.

The response is not just text - it's an object with metadata. The most important field is `usage`:

**usage.total_tokens** - that's what appears on the bill. At 100,000 requests per day averaging 800 tokens each - that's 80M tokens per month. gpt-4o costs `USD 200`. gpt-4o-mini costs `USD 12`. Log usage in production from day one.

Which field in the Chat Completions API response shows the number of tokens spent?

Message Roles: system, user, assistant

The Chat API accepts an array of **messages**. Each message has a **role** - and that determines how the model interprets it.

Role	Purpose	When to Use
system	Instructions for the model. Sets behavior, tone, constraints	Once, as the first message
user	Message from the user	Each user request
assistant	Model's response (or example response)	Conversation history, few-shot examples

The model is **stateless**. It remembers nothing between requests - just like HTTP. The developer sends the full context every single time. That's not a bug - it's the architectural decision that makes the API horizontally scalable.

**The system prompt is the main lever.** A well-crafted system prompt turns a generic model into a specialized assistant. A sloppy one produces hallucinations and off-topic responses. Character.ai builds entire AI personas on system prompts - 20 billion messages per day. More on this in the prompt patterns lesson.

The model "remembers" previous messages in a conversation because:

Streaming: Real-Time Token-by-Token Responses

Without streaming, the user stares at a blank screen for 5-15 seconds while the model generates its full answer. With streaming, text appears token by token - exactly like ChatGPT. TTFT (time-to-first-token) drops from 10 seconds to 200-500ms. One boolean parameter changes how the entire product feels.

In **NestJS**, streaming is delivered to the client via Server-Sent Events (SSE) - a standard HTTP-based way to push a data stream without WebSocket:

**Streaming doesn't change cost** - token count is identical. But TTFT is the metric that separates a product that feels alive from one that feels frozen. Notion AI, Linear, Cursor - all use streaming exactly for that sense of immediate response.

Streaming in the LLM API is needed for:

API Errors: Retry, Rate Limits, Fallbacks

2023. A startup ships an AI feature. First 500 users - everything works. That night the product gets posted on HackerNews. Within an hour, 5,000 concurrent requests hit OpenAI directly. 429s start pouring in. The frontend shows a blank screen. The product goes offline. At 7 AM the founders open the logs and have no idea what happened.

An LLM API is an **external service**. It goes down, slows down, and returns errors. Production-ready integration must account for this from the very first commit.

HTTP Code	Cause	What to Do
400	Invalid request (bad prompt)	Fix the request. Don't retry
401	Invalid API key	Check .env. Don't retry
429	Rate limit - too many requests	Retry with exponential backoff
500	Error on OpenAI's side	Retry 2-3 times with delay
503	Service overloaded	Retry or fallback to another model

**OpenAI rate limits** depend on tier: from 500 RPM (requests per minute) on Tier 1 to 10,000 RPM on Tier 5. If the product is scaling - add a BullMQ request queue from the start, not after the first incident.

When receiving a 429 (Rate Limit) error from the LLM API, the correct strategy is:

OpenAI vs Anthropic vs Open Source: Choosing a Provider

Locking into a single provider is **vendor lock-in**. When OpenAI changed pricing in 2023, hundreds of companies discovered their unit economics had broken overnight. A smart AI engineer designs the system so switching providers takes 30 minutes, not 30 days.

**Key API differences:**

Aspect	OpenAI	Anthropic
System prompt	In messages array	Separate system field
Response	choices[0].message.content	content[0].text
Output token limit	max_tokens (optional)	max_tokens (required!)
Streaming	stream: true	stream: true
Function calling	tools + tool_choice	tools + tool_choice
Models	GPT-4o, GPT-4o-mini	Claude Sonnet, Claude Haiku

Start with OpenAI - more docs, more examples, more Stack Overflow answers. But write the abstraction immediately. Adding Anthropic or a local llama via Ollama will take 30 minutes instead of days of refactoring. This isn't overengineering - it's vendor lock-in insurance.

Why create an abstract LLMProvider interface instead of calling the OpenAI SDK directly?

An API key is just an authorization string

An API key is direct access to a payment account. A leak to a public repository means a bill for thousands of dollars within hours

Bots scan GitHub every few minutes for exposed API keys. OpenAI does not reimburse charges from leaks. The rule: API keys only in .env, .env in .gitignore, never in code, never in logs. For production - environment variables through a secret manager (AWS Secrets Manager, Vault, or similar).

Summary

Chat Completions API - the universal interface: messages in, text out. OpenAI set the standard; Anthropic and open-source adopted it
System prompt sets behavior, user = request, assistant = history. The model is stateless: every call is independent
Streaming via SSE gives TTFT of 200-500ms instead of 5-15s of waiting - the difference between a product that feels alive and one that feels frozen
429/500/503 - retry with exponential backoff (2s - 4s - 8s). 400/401 - fix the code, don't retry
LLMProvider abstraction - switch providers in 30 minutes, not days of refactoring
API key in .env, usage.total_tokens in logs - not optional, basic production hygiene

Вопросы для размышления

If the LLM API returns 429 three times in a row with exponential backoff - what next? Fallback to another model or return an error to the user?
At what daily request volume does it make sense to add a queue (BullMQ) between the API and LLM?
How is an API key protected in a Dockerized backend on a VPS - what's the path from .env to the container without baking the key into the image?

What's Next

The API is connected, responses are coming in, errors are handled. Next step - learn to write prompts that work reliably in production: not ad-hoc, but architecturally.

Prompt Patterns for Production — Architectural prompt patterns - few-shot, chain-of-thought, structured output
Streaming in depth — Advanced streaming patterns - backpressure, cancellation, WebSocket
Cost and Optimization — How not to go broke on LLM APIs - caching, model routing, prompt compression

Связанные уроки

aie-03-llm-fundamentals — How LLMs work underlies every API call
aie-08-streaming — Streaming responses build on the chat completions API
aie-07-structured-output — Function calling here unlocks typed JSON output
aie-29-cost-management — API usage drives token cost that needs control
net-21-http-basics — LLM APIs are HTTP request-response over JSON
sd-10-microservices — LLM provider is just another external service dependency