AI Engineering
LLM API Integration: OpenAI, Anthropic, Open-Source Models
Lesson Objectives
- Learn to call the Chat Completions API from Node.js
- Understand message roles (system, user, assistant) and context management
- Implement streaming for real-time response display
- Master error handling and retry strategies for LLM APIs
- Design an abstraction for switching between providers
Prerequisites
- How LLMs Work Internally
The first OpenAI API call is 10 lines of code. The first production-ready call with retry, timeout, and fallback is 200 lines. That gap is what separates a demo from a product that survives a HackerNews spike. And the API key sitting in that code? Not just a string - it's direct access to a payment account. One accidental commit to a public repo, and bots will drain it within hours. This lesson is about the gap between "it works on my machine" and "it works at 3 AM when nobody's watching."
- Notion AI - integrated GPT into an existing product, +USD 100M ARR in the first year, zero ML engineers on the team
- Stripe uses GPT-4 for automatic support ticket classification - 40% less load on the support team
- Duolingo switched error explanations to GPT-4 - saved months of content team work
- Cursor (AI IDE) - USD 100M ARR built on solid LLM API architecture, not a proprietary model
The API That Opened an Industry
**2020**: OpenAI launched the world's first commercial LLM API on GPT-3. Models became accessible over HTTP - no GPU, no training, no PhD required. That was the moment AI stopped being a research tool and became a developer tool. **November 30, 2022**: ChatGPT as a public demo - 1 million users in 5 days. **March 1, 2023**: API for GPT-3.5-turbo at `USD 0.002` per 1K tokens - the true start of AI-as-a-Service as a market. Hundreds of thousands of developers signed up in the first week. By the end of 2023, more requests were going through the API than through ChatGPT itself. OpenAI didn't just build a model - it built a market.
Chat Completions API - The Core Interface
Most modern LLM providers expose a chat-style API: send an array of messages, receive a response. OpenAI calls it the **Chat Completions API**; Anthropic's Messages API and a local model served via Ollama follow nearly the same shape. Not a coincidence: OpenAI set the pattern, and everyone else adopted it with small variations.
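As a rough sketch of that "messages in, response out" shape, the snippet below builds a Chat Completions request and sends it with the `fetch` built into Node.js 18+. The endpoint and body fields follow OpenAI's public HTTP API; the helper names (`buildChatRequest`, `chat`) and the default model are illustrative assumptions, and retries and timeouts are deliberately left out.

```typescript
// Message shape shared by most chat-style LLM APIs.
type Role = "system" | "user" | "assistant";
interface ChatMessage {
  role: Role;
  content: string;
}

const OPENAI_URL = "https://api.openai.com/v1/chat/completions";

// Hypothetical helper: builds the HTTP request without sending it.
function buildChatRequest(apiKey: string, messages: ChatMessage[], model = "gpt-4o-mini") {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Sends the request and returns the assistant's text.
// No retry or timeout here: this is the "10 lines" version, not the production one.
async function chat(apiKey: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch(OPENAI_URL, buildChatRequest(apiKey, messages));
  if (!res.ok) throw new Error(`LLM API error: HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

A call would look like `await chat(apiKey, [{ role: "user", content: "Hello" }])`, with `apiKey` read from the environment rather than written into the source.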
The response is not just text - it's an object with metadata. The most important field is `usage`:
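A trimmed-down sketch of that object, for orientation. The field names follow OpenAI's Chat Completions response, but the real object carries more fields than listed here; `summarizeUsage` is a hypothetical helper and the sample values are made up.

```typescript
// Subset of the Chat Completions response used in this lesson.
interface ChatCompletion {
  id: string;
  model: string;
  choices: {
    index: number;
    message: { role: string; content: string };
    finish_reason: string;
  }[];
  usage: {
    prompt_tokens: number;     // tokens in the request (all messages)
    completion_tokens: number; // tokens generated by the model
    total_tokens: number;      // prompt_tokens + completion_tokens
  };
}

// Hypothetical helper: one log-friendly line per request.
function summarizeUsage(res: ChatCompletion): string {
  const u = res.usage;
  return `model=${res.model} prompt=${u.prompt_tokens} completion=${u.completion_tokens} total=${u.total_tokens}`;
}

// Made-up sample, only to show the shape.
const sample: ChatCompletion = {
  id: "chatcmpl-example",
  model: "gpt-4o-mini",
  choices: [
    { index: 0, message: { role: "assistant", content: "Hello!" }, finish_reason: "stop" },
  ],
  usage: { prompt_tokens: 12, completion_tokens: 30, total_tokens: 42 },
};

const usageLine = summarizeUsage(sample);
```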
**usage.total_tokens** - that's what drives the bill. At 100,000 requests per day averaging 800 tokens each, that's 80M tokens per day, or roughly 2.4B tokens per month. For 80M tokens, gpt-4o costs about `USD 200` and gpt-4o-mini about `USD 12`. Log usage in production from day one.
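The arithmetic can be checked with a few lines. The per-million-token prices below are assumptions chosen to match the dollar figures in the text (they correspond to input-token rates; output tokens are typically billed higher), so treat them as placeholders and look up the provider's current price list.

```typescript
// Total tokens processed per day for a given traffic level.
function tokensPerDay(requestsPerDay: number, avgTokensPerRequest: number): number {
  return requestsPerDay * avgTokensPerRequest;
}

// Cost in USD for a token volume at a flat price per 1M tokens.
function costUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// ASSUMED prices in USD per 1M tokens; not stated in the lesson.
const ASSUMED_PRICE_PER_MILLION = { "gpt-4o": 2.5, "gpt-4o-mini": 0.15 };

const daily = tokensPerDay(100000, 800); // 80,000,000 tokens per day
const monthly = daily * 30;              // 2,400,000,000 tokens per 30 days

// Daily cost at the assumed prices: about 200 and about 12 USD.
const gpt4oDailyUsd = costUsd(daily, ASSUMED_PRICE_PER_MILLION["gpt-4o"]);
const miniDailyUsd = costUsd(daily, ASSUMED_PRICE_PER_MILLION["gpt-4o-mini"]);
```

The same two functions work for any model once real input and output prices are plugged in separately.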
Which field in the Chat Completions API response shows the number of tokens spent?
Message Roles: system, user, assistant
The Chat API accepts an array of **messages**. Each message has a **role** - and that determines how the model interprets it.
| Role | Purpose | When to Use |
|---|---|---|
| system | Instructions for the model. Sets behavior, tone, constraints | Once, as the first message |
| user | Message from the user | Each user request |
| assistant | Model's response (or example response) | Conversation history, few-shot examples |
The model is **stateless**. It remembers nothing between requests - just like HTTP. The developer sends the full context every single time. That's not a bug - it's the architectural decision that makes the API horizontally scalable.
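Because the model keeps no state, the conversation lives in the application. A minimal sketch of that bookkeeping follows; the helper names and the "keep the last N messages" trimming rule are assumptions for illustration (real context management usually counts tokens, not messages).

```typescript
type Role = "system" | "user" | "assistant";
interface ChatMessage {
  role: Role;
  content: string;
}

// Adds the new user message; the returned array is what gets sent to the API.
function withUserMessage(history: ChatMessage[], userInput: string): ChatMessage[] {
  return [...history, { role: "user", content: userInput }];
}

// Stores the model's reply so the next request includes it as context.
function withAssistantReply(history: ChatMessage[], reply: string): ChatMessage[] {
  return [...history, { role: "assistant", content: reply }];
}

// Crude context management: keep every system message plus the last
// `maxMessages` user/assistant messages.
function trimHistory(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  const kept = maxMessages > 0 ? rest.slice(-maxMessages) : [];
  return [...system, ...kept];
}
```

Each request sends the whole array returned by `withUserMessage`; dropping it between calls is exactly what makes the model "forget".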
**The system prompt is the main lever.** A well-crafted system prompt turns a generic model into a specialized assistant. A sloppy one produces hallucinations and off-topic responses. Character.ai builds entire AI personas on system prompts - 20 billion messages per day. More on this in the prompt patterns lesson.
Why does the model appear to "remember" previous messages in a conversation?
Streaming: Real-Time Token-by-Token Responses
Without streaming, the user stares at a blank screen for 5-15 seconds while the model generates its full answer. With streaming, text appears token by token - exactly like ChatGPT. TTFT (time-to-first-token) drops from 10 seconds to 200-500ms. One boolean parameter changes how the entire product feels.
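With `stream: true`, OpenAI-style APIs send the answer as a sequence of `data: {...}` lines, each carrying a small text delta, and finish with `data: [DONE]`. The sketch below parses such lines; the function names are hypothetical, and it assumes each line arrives whole (a real client must buffer, since network chunks can split a line).

```typescript
// Extracts the text delta from one line of an OpenAI-style stream.
// Returns null for blank lines, non-data lines, and the "[DONE]" terminator.
function parseStreamLine(line: string): string | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith("data:")) return null;
  const payload = trimmed.slice("data:".length).trim();
  if (payload === "" || payload === "[DONE]") return null;
  const chunk = JSON.parse(payload) as {
    choices?: { delta?: { content?: string } }[];
  };
  return chunk.choices?.[0]?.delta?.content ?? null;
}

// Concatenates all deltas; a UI would instead append each one as it arrives.
function collectText(lines: string[]): string {
  let text = "";
  for (const line of lines) {
    const delta = parseStreamLine(line);
    if (delta !== null) text += delta;
  }
  return text;
}
```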
In **NestJS**, streaming is delivered to the client via Server-Sent Events (SSE) - a standard HTTP-based way to push a data stream without WebSocket:
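The sketch below is framework-agnostic rather than NestJS-specific: it shows only the SSE wire format that any such endpoint ends up producing. The `write` callback stands in for something like a response object's write method, and the `{ token }` payload and the `[DONE]` end marker are conventions assumed here, not part of the SSE standard.

```typescript
// Response headers an SSE endpoint typically sets.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  Connection: "keep-alive",
};

// One SSE frame is "data: <payload>" followed by a blank line.
// JSON-encoding the token keeps newlines inside it from breaking the frame.
function toSseFrame(token: string): string {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Forwards a token stream to the client as SSE frames.
async function pipeTokensAsSse(
  tokens: AsyncIterable<string>,
  write: (frame: string) => void,
): Promise<void> {
  for await (const token of tokens) {
    write(toSseFrame(token));
  }
  write("data: [DONE]\n\n"); // assumed end-of-stream marker
}
```

On the browser side, an `EventSource` or a streaming `fetch` reader consumes these frames and appends each token to the page.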
**Streaming doesn't change cost** - token count is identical. But TTFT is the metric that separates a product that feels alive from one that feels frozen. Notion AI, Linear, Cursor - all use streaming exactly for that sense of immediate response.
What is streaming in the LLM API needed for?
API Errors: Retry, Rate Limits, Fallbacks
2023. A startup ships an AI feature. First 500 users - everything works. That night the product gets posted on HackerNews. Within an hour, 5,000 concurrent requests hit OpenAI directly. 429s start pouring in. The frontend shows a blank screen. The product goes offline. At 7 AM the founders open the logs and have no idea what happened.
An LLM API is an **external service**. It goes down, slows down, and returns errors. Production-ready integration must account for this from the very first commit.
| HTTP Code | Cause | What to Do |
|---|---|---|
| 400 | Invalid request (bad prompt) | Fix the request. Don't retry |
| 401 | Invalid API key | Check .env. Don't retry |
| 429 | Rate limit - too many requests | Retry with exponential backoff |
| 500 | Error on OpenAI's side | Retry 2-3 times with delay |
| 503 | Service overloaded | Retry or fallback to another model |
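The table translates into a small amount of code. This is a minimal sketch: `HttpError` and `withRetry` are hypothetical names, the 2-second base delay and 30-second cap are assumed defaults, and production code would usually add random jitter and honor a `Retry-After` header when the server sends one.

```typescript
// Per the table: 429, 500, and 503 are worth retrying; 400 and 401 are not.
const RETRYABLE_STATUS = new Set([429, 500, 503]);

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUS.has(status);
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 2000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Hypothetical error type carrying the HTTP status of a failed call.
class HttpError extends Error {
  status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// Runs `fn`, retrying retryable HTTP errors up to `maxRetries` times.
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3, baseMs = 2000): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const canRetry =
        err instanceof HttpError && isRetryable(err.status) && attempt < maxRetries;
      if (!canRetry) throw err; // 400/401, unknown errors, or retries exhausted
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
    }
  }
}
```

When `withRetry` finally throws, the caller decides between falling back to another model and surfacing an error to the user.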
**OpenAI rate limits** depend on the account tier and on the model: roughly from 500 RPM (requests per minute) on Tier 1 to 10,000 RPM on Tier 5. If the product is scaling, add a BullMQ request queue from the start, not after the first incident.
When the LLM API returns a 429 (Rate Limit) error, what is the correct strategy?
OpenAI vs Anthropic vs Open Source: Choosing a Provider
Locking into a single provider is **vendor lock-in**. When OpenAI changed pricing in 2023, hundreds of companies discovered their unit economics had broken overnight. A smart AI engineer designs the system so switching providers takes 30 minutes, not 30 days.
**Key API differences:**
| Aspect | OpenAI | Anthropic |
|---|---|---|
| System prompt | In messages array | Separate system field |
| Response | choices[0].message.content | content[0].text |
| Output token limit | max_tokens (optional) | max_tokens (required!) |
| Streaming | stream: true | stream: true |
| Function calling | tools + tool_choice | tools + tool_choice |
| Models | GPT-4o, GPT-4o-mini | Claude Sonnet, Claude Haiku |
Start with OpenAI - more docs, more examples, more Stack Overflow answers. But write the abstraction immediately. Adding Anthropic or a local llama via Ollama will take 30 minutes instead of days of refactoring. This isn't overengineering - it's vendor lock-in insurance.
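One possible shape for that abstraction, as a sketch: each provider knows how to build its own request body and how to pull text out of its own response, following the differences in the table. The `LLMProvider` interface here is a hypothetical design, HTTP transport and auth headers are omitted, and model names are passed in by the caller rather than assumed.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical interface: application code depends on this, not on an SDK.
interface LLMProvider {
  name: string;
  buildBody(model: string, messages: ChatMessage[], maxTokens: number): Record<string, unknown>;
  extractText(response: any): string; // `any` keeps the sketch short
}

const openAIProvider: LLMProvider = {
  name: "openai",
  // System prompt stays inside the messages array.
  buildBody: (model, messages, maxTokens) => ({ model, messages, max_tokens: maxTokens }),
  extractText: (r) => r.choices[0].message.content,
};

const anthropicProvider: LLMProvider = {
  name: "anthropic",
  // System prompt moves to a separate field; max_tokens is required.
  buildBody: (model, messages, maxTokens) => ({
    model,
    max_tokens: maxTokens,
    system: messages
      .filter((m) => m.role === "system")
      .map((m) => m.content)
      .join("\n"),
    messages: messages.filter((m) => m.role !== "system"),
  }),
  extractText: (r) => r.content[0].text,
};
```

Switching providers then means swapping one object, and a fallback chain is just an ordered list of `LLMProvider` values.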
Why create an abstract `LLMProvider` interface instead of calling the OpenAI SDK directly?
An API key is not just an authorization string. It is direct access to a payment account: a leak to a public repository can mean a bill for thousands of dollars within hours.
Bots scan GitHub every few minutes for exposed API keys. OpenAI does not reimburse charges from leaks. The rule: API keys only in .env, .env in .gitignore, never in code, never in logs. For production - environment variables through a secret manager (AWS Secrets Manager, Vault, or similar).
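A small sketch of the "never in code, never in logs" rule. Both helper names are hypothetical; the environment is passed in as a plain object so the function is easy to test, and in a Node.js service the argument would be `process.env`.

```typescript
// Reads a required secret from an environment map and fails fast if it is absent,
// so a misconfigured deploy crashes at startup instead of at the first request.
function requireSecret(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (!value || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Masks a secret for logs: only the last 4 characters stay visible.
function redact(secret: string): string {
  return secret.length <= 4 ? "****" : `****${secret.slice(-4)}`;
}
```

Typical use would be `const apiKey = requireSecret(process.env, "OPENAI_API_KEY")` at startup, with `redact(apiKey)` being the only form that ever reaches a log line.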
Summary
- Chat Completions API - the universal interface: messages in, text out. OpenAI set the standard; Anthropic and open-source adopted it
- System prompt sets behavior, user = request, assistant = history. The model is stateless: every call is independent
- Streaming via SSE gives TTFT of 200-500ms instead of 5-15s of waiting - the difference between a product that feels alive and one that feels frozen
- 429/500/503 - retry with exponential backoff (2s - 4s - 8s). 400/401 - fix the code, don't retry
- LLMProvider abstraction - switch providers in 30 minutes, not days of refactoring
- API key in .env, usage.total_tokens in logs - not optional, basic production hygiene
Questions for Reflection
- If the LLM API returns 429 three times in a row with exponential backoff - what next? Fallback to another model or return an error to the user?
- At what daily request volume does it make sense to add a queue (BullMQ) between the API and LLM?
- How is an API key protected in a Dockerized backend on a VPS - what's the path from .env to the container without baking the key into the image?
What's Next
The API is connected, responses are coming in, errors are handled. Next step - learn to write prompts that work reliably in production: not ad-hoc, but architecturally.
- Prompt Patterns for Production — Architectural prompt patterns - few-shot, chain-of-thought, structured output
- Streaming in depth — Advanced streaming patterns - backpressure, cancellation, WebSocket
- Cost and Optimization — How not to go broke on LLM APIs - caching, model routing, prompt compression
Related Lessons
- aie-03-llm-fundamentals — How LLMs work underlies every API call
- aie-08-streaming — Streaming responses build on the chat completions API
- aie-07-structured-output — Function calling here unlocks typed JSON output
- aie-29-cost-management — API usage drives token cost that needs control
- net-21-http-basics — LLM APIs are HTTP request-response over JSON
- sd-10-microservices — LLM provider is just another external service dependency