AI Engineering

AI Backend on Node.js/NestJS: architecture patterns, best practices

Lesson objectives

  • Understand why Node.js is the ideal AI backend platform - and where its limits are
  • Design a NestJS module for AI functionality with proper encapsulation
  • Implement LLM provider abstraction through an injectable service
  • Build a middleware pipeline: validation → rate limit → AI → output validation
  • Configure BullMQ for asynchronous AI tasks with priorities and retries
  • Master testing AI components: mocks, snapshot tests, integration tests

Node.js is the ideal backend for AI: the event loop handles long LLM requests well, async/await is native, and the ecosystem is massive. While GPT-4o spends 800 ms generating a response, the thread isn't blocked - it's handling 50 more requests. But there's a catch: CPU-bound tasks block the event loop, and tokenization with tiktoken is CPU-bound.

It's 2025, and every other startup is an AI wrapper. Most of them are built like prototypes: one file, a direct OpenAI call, no rate limiting. Then the first traffic spike hits: a USD 10K bill, the server down, users seeing 500 errors. The difference between a prototype and a production AI backend isn't the model - it's the architecture around it.

  • ChatGPT Plus handles millions of requests through BullMQ-style queues with prioritization - paid users get served first
  • Vercel AI SDK uses exactly this provider abstraction pattern - switch between OpenAI and Anthropic with one config line
  • Linear uses queues for AI ticket classification - 100K+ tasks per day with zero impact on the main application
  • tiktoken in a Worker Thread: on Node.js, tokenization is a CPU-bound operation that blocks the event loop - so it runs in a worker

How the JavaScript AI ecosystem came together

For a long time serious LLM work meant Python: that is where the SDKs, examples, and frameworks lived. JavaScript and TypeScript ran the frontend and the web backend, but had almost no AI tooling. That changed in 2022-2023. LangChain appeared in Python in late 2022, and in early 2023 LangChain.js ported its ideas of orchestration, chains, and retrieval into the Node.js ecosystem so agent logic could be written in TypeScript. In 2023 Vercel, the creators of Next.js, released the Vercel AI SDK (installed with `npm i ai`). It solved a common task: streaming an LLM response into the UI and handling tool calling and structured output without a heavy framework, with ready-made helpers for React, Next.js, and Svelte. In parallel, the official OpenAI and Anthropic TypeScript SDKs made model calls native to Node. JavaScript ended up with its own AI stack: the Vercel AI SDK for user-facing streaming interfaces and LangChain.js for orchestration and RAG. Node.js, with its event loop and native async/await, turned out to be a comfortable environment for a backend that spends much of its time waiting on a model response.

Prerequisites

  • AI System Design: production AI application architecture from zero to scale

NestJS module architecture for AI services

Node.js is the ideal backend for AI. The event loop handles long LLM requests brilliantly: while GPT-4o spends 800 ms thinking, the thread isn't blocked - it's processing the next request. Async/await is native, the ecosystem is massive. But there's a catch: CPU-bound tasks block the event loop. And tokenization is CPU-bound. We'll get back to that.

AI functionality in a NestJS application lives in a dedicated **AiModule**. One module encapsulates everything: LLM providers, configuration, rate limiting, usage logging. Other modules get AI capabilities through service injection - without knowing the implementation details. This isn't formalism for its own sake: this boundary is what makes it possible to swap OpenAI for Anthropic tomorrow in ten minutes.

  • **One public service** - AiService. All internal providers are hidden
  • **@Global()** - AI is needed everywhere: chat, moderation, recommendations. One import in AppModule
  • **Configuration is isolated** - AiConfigService reads env, validates keys, selects the default model
  • **Queue is built-in** - BullMQ queue is registered inside the module, not externally
  • **Providers are interchangeable** - OpenAI, Anthropic, Ollama implement the same interface

**Configuration validation at startup** is critical. If an API key is missing, the application should crash immediately on start - not on the first user request 3 hours after deployment.
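A minimal sketch of fail-fast configuration loading. The environment variable names (`OPENAI_API_KEY`, `AI_DEFAULT_MODEL`, `AI_TIMEOUT_MS`) and the `AiConfig` shape are assumptions made for illustration, not names taken from the lesson's project.

```typescript
// Hypothetical config shape; field and variable names are assumptions.
interface AiConfig {
  openaiApiKey: string;
  defaultModel: string;
  requestTimeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Collects every problem first, then throws once, so a broken deployment
// reports all missing settings in a single startup error.
function loadAiConfig(env: Env): AiConfig {
  const problems: string[] = [];

  const openaiApiKey = (env["OPENAI_API_KEY"] ?? "").trim();
  if (openaiApiKey === "") {
    problems.push("OPENAI_API_KEY is missing");
  }

  const defaultModel = (env["AI_DEFAULT_MODEL"] ?? "gpt-4o-mini").trim();
  if (defaultModel === "") {
    problems.push("AI_DEFAULT_MODEL must not be empty");
  }

  const rawTimeout = env["AI_TIMEOUT_MS"] ?? "60000";
  const requestTimeoutMs = Number(rawTimeout);
  if (!Number.isInteger(requestTimeoutMs) || requestTimeoutMs <= 0) {
    problems.push(`AI_TIMEOUT_MS must be a positive integer, got "${rawTimeout}"`);
  }

  if (problems.length > 0) {
    throw new Error(`Invalid AI configuration: ${problems.join("; ")}`);
  }
  return { openaiApiKey, defaultModel, requestTimeoutMs };
}
```

In a NestJS application a function like this would typically run in the constructor of AiConfigService, so that an exception aborts bootstrap instead of surfacing on the first user request.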

Why is AiModule marked with the @Global() decorator?

Injectable AI Service: provider abstraction pattern

AiService is a facade that hides the details of working with a specific LLM provider. External code calls `aiService.complete()` - and has no idea whether it's OpenAI or Anthropic. The provider is selected based on configuration, task type, or even request cost. Content moderation goes through gpt-4o-mini (USD 0.15 per 1M input tokens), complex reasoning through claude-3-5-sonnet (USD 3 per 1M input tokens). A 20x gap - and users notice nothing.

**Strategy + Facade pattern:** AiService acts as the Facade (single entry point), while providers are the Strategy (interchangeable algorithms). This is exactly the pattern Vercel AI SDK uses: one config line switches the provider. Linear switched from OpenAI to their own models - users never noticed.
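The Strategy + Facade split can be sketched without any framework. Everything below is illustrative: `FakeProvider` stands in for real OpenAI and Anthropic clients so the example runs offline, and the routing table by task type is an assumed design, not the lesson's exact implementation.

```typescript
type TaskType = "moderation" | "reasoning";

interface CompletionRequest {
  task: TaskType;
  prompt: string;
}

interface CompletionResult {
  text: string;
  provider: string;
  model: string;
}

// Strategy: every provider implements the same contract.
interface LlmProvider {
  readonly name: string;
  readonly model: string;
  complete(prompt: string): Promise<string>;
}

// Stand-in provider (hypothetical); a real one would call the vendor SDK.
class FakeProvider implements LlmProvider {
  readonly name: string;
  readonly model: string;

  constructor(name: string, model: string) {
    this.name = name;
    this.model = model;
  }

  async complete(prompt: string): Promise<string> {
    return `[${this.model}] ${prompt}`;
  }
}

// Facade: the only class the rest of the application talks to.
class AiService {
  private readonly routes: Record<TaskType, LlmProvider>;

  constructor(routes: Record<TaskType, LlmProvider>) {
    this.routes = routes;
  }

  async complete(request: CompletionRequest): Promise<CompletionResult> {
    const provider = this.routes[request.task];
    const text = await provider.complete(request.prompt);
    return { text, provider: provider.name, model: provider.model };
  }
}

// Cheap model for moderation, stronger model for reasoning.
const aiService = new AiService({
  moderation: new FakeProvider("openai", "gpt-4o-mini"),
  reasoning: new FakeProvider("anthropic", "claude-3-5-sonnet"),
});
```

Callers only see `aiService.complete({ task, prompt })`. Swapping a provider means changing one entry in the routing table. In NestJS the facade would carry `@Injectable()` and receive the providers through dependency injection.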

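**Logging every call** isn't optional. Without `tokens`, `latencyMs`, `provider` metrics, there's no way to optimize costs. 100K requests per month through gpt-4o instead of gpt-4o-mini is a difference of about USD 188 on input tokens alone, assuming roughly 800 input tokens per request at USD 2.50 vs USD 0.15 per 1M input tokens. Without logs, it's invisible.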
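A small sketch of both halves of this point: a wrapper that records `provider`, `tokens`, and `latencyMs` for each call, and the cost arithmetic behind the USD 188 figure. The prices and the 800-tokens-per-request figure are assumptions used to reproduce that number; check current provider pricing before relying on them.

```typescript
interface UsageRecord {
  provider: string;
  model: string;
  tokens: number;
  latencyMs: number;
}

interface LlmCallResult {
  text: string;
  tokens: number;
}

// Wraps any LLM call, measures wall-clock latency and forwards a usage record.
async function withUsageLogging(
  meta: { provider: string; model: string },
  call: () => Promise<LlmCallResult>,
  log: (record: UsageRecord) => void,
): Promise<LlmCallResult> {
  const startedAt = Date.now();
  const result = await call();
  log({
    provider: meta.provider,
    model: meta.model,
    tokens: result.tokens,
    latencyMs: Date.now() - startedAt,
  });
  return result;
}

// Assumed list prices in USD per 1M input tokens (illustrative values).
const ASSUMED_INPUT_PRICE_USD_PER_MILLION: Record<string, number> = {
  "gpt-4o": 2.5,
  "gpt-4o-mini": 0.15,
};

function monthlyInputCostUsd(model: string, requests: number, tokensPerRequest: number): number {
  const price = ASSUMED_INPUT_PRICE_USD_PER_MILLION[model];
  if (price === undefined) {
    throw new Error(`No price configured for model "${model}"`);
  }
  return (requests * tokensPerRequest * price) / 1000000;
}

// 100K requests per month at an assumed 800 input tokens each.
const monthlyDifferenceUsd =
  monthlyInputCostUsd("gpt-4o", 100000, 800) - monthlyInputCostUsd("gpt-4o-mini", 100000, 800);
```

Under these assumptions `monthlyDifferenceUsd` comes out at about 188. The wrapper logs only successful calls. A production version would probably record failures as well.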

In the example, GPT-4o-mini is used for content moderation instead of GPT-4o. Why?

Middleware Pipeline: request → validate → rate-limit → AI → validate-output → response

An AI endpoint isn't just a proxy to OpenAI. Think of it as a border crossing: every layer does its job. Input validation cuts off prompt injection. Rate limiting stops abuse. Model routing picks the right provider. Output validation catches empty responses and oversized content. Without this pipeline, the first attacker with curl can run up a bill of thousands of dollars.
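The layers above can be sketched as one plain function. The limits, the `PipelineError` class, and the `PipelineDeps` interface are hypothetical names chosen for this example; rate limiting and the model call are injected so the flow itself stays easy to test.

```typescript
interface AiRequest {
  userId: string;
  prompt: string;
}

// Error carrying the HTTP status the endpoint should answer with.
class PipelineError extends Error {
  readonly status: number;

  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

// Illustrative limits, not recommendations.
const MAX_PROMPT_CHARS = 4000;
const MAX_OUTPUT_CHARS = 8000;

function validateInput(request: AiRequest): AiRequest {
  const prompt = request.prompt.trim();
  if (prompt.length === 0) throw new PipelineError(400, "Prompt is empty");
  if (prompt.length > MAX_PROMPT_CHARS) throw new PipelineError(400, "Prompt is too long");
  return { userId: request.userId, prompt };
}

function validateOutput(text: string): string {
  if (text.trim().length === 0) throw new PipelineError(502, "Model returned an empty response");
  if (text.length > MAX_OUTPUT_CHARS) throw new PipelineError(502, "Model response is too large");
  return text;
}

interface PipelineDeps {
  allow(userId: string): Promise<boolean>; // rate limiter
  complete(prompt: string): Promise<string>; // model call
}

// request -> validate -> rate limit -> AI -> validate output -> response
async function handleAiRequest(request: AiRequest, deps: PipelineDeps): Promise<string> {
  const valid = validateInput(request);
  if (!(await deps.allow(valid.userId))) {
    throw new PipelineError(429, "Rate limit exceeded");
  }
  const raw = await deps.complete(valid.prompt);
  return validateOutput(raw);
}
```

The order matters: cheap checks run before the expensive model call, so a rejected request costs nothing. In NestJS these stages would usually map to a validation pipe, a guard, the service call, and an interceptor.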

**Rate limiting for AI endpoints is mandatory.** Without it, a single user can generate a bill of thousands of dollars in minutes. Redis/KeyDB gives atomic counters with TTL - the only reliable approach in a multi-instance deployment.
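A fixed-window limiter built on an "atomic counter with TTL" can be sketched as follows. `CounterStore` is a hypothetical interface that mirrors what a Redis-backed store would provide (an increment that also sets an expiry on the first hit); the in-memory implementation exists only so the example runs without Redis and is not suitable for multi-instance deployments.

```typescript
// Hypothetical store contract: increment a counter and return the new value.
// The TTL starts when the key is first created.
interface CounterStore {
  incrementWithTtl(key: string, ttlMs: number): Promise<number>;
}

// Single-process stand-in for Redis, with an injectable clock for tests.
class InMemoryCounterStore implements CounterStore {
  private readonly entries = new Map<string, { count: number; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async incrementWithTtl(key: string, ttlMs: number): Promise<number> {
    const time = this.now();
    const current = this.entries.get(key);
    if (current === undefined || current.expiresAt <= time) {
      // First hit in a new window: start counting and set the expiry.
      this.entries.set(key, { count: 1, expiresAt: time + ttlMs });
      return 1;
    }
    current.count += 1;
    return current.count;
  }
}

class FixedWindowRateLimiter {
  private readonly store: CounterStore;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(store: CounterStore, limit: number, windowMs: number) {
    this.store = store;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // True while the user is within the limit for the current window.
  async allow(userId: string): Promise<boolean> {
    const count = await this.store.incrementWithTtl(`ai:rate:${userId}`, this.windowMs);
    return count <= this.limit;
  }
}
```

With Redis, the increment and the expiry have to happen atomically (for example in a single script or transaction). Otherwise a crash between the two steps can leave a counter that never expires.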

Why is model OUTPUT validation (AiOutputValidationInterceptor) needed if the model generates text anyway?

BullMQ for asynchronous AI tasks

Not all AI tasks need an immediate response. Report generation, batch processing of 1000 texts, creating embeddings for a RAG system - these are all **background tasks**. BullMQ queues them with priorities, retry logic, and concurrency control. Tiered products such as ChatGPT Plus follow the same idea: paid users get jobs with priority=1 (in BullMQ a lower number means higher priority), while free users wait at the back of the queue.

| BullMQ parameter | Recommendation for AI | Why |
| --- | --- | --- |
| concurrency | 3-10 | Limited by the LLM provider's rate limits |
| attempts | 2-3 | LLM API may return 429/503. More than 3 is pointless |
| backoff | exponential, 5000 ms | Gives the API time to recover |
| timeout | 60000 ms | LLM can take a while, but not forever |
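The values from the table, written out as plain objects plus the arithmetic behind them. This is a sketch under assumptions: the object shapes imitate the `concurrency`, `attempts`, and `backoff` options as I understand BullMQ's API, the doubling formula is my reading of its exponential backoff, and the timeout is enforced inside the job handler with a hypothetical `withTimeout` helper rather than by the queue itself.

```typescript
// Assumed option shapes, kept dependency-free so the sketch runs on its own.
const aiWorkerOptions = { concurrency: 5 };
const aiJobOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 5000 },
};
const JOB_TIMEOUT_MS = 60000;

// Delay before each retry, assuming the delay doubles after every failure.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < attempts; failed++) {
    delays.push(baseDelayMs * Math.pow(2, failed - 1));
  }
  return delays;
}

// Lower number = served first; the concrete values are illustrative.
function priorityFor(plan: "paid" | "free"): number {
  return plan === "paid" ? 1 : 10;
}

// Rejects if the work does not settle within `ms` milliseconds.
// It only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

const plannedRetryDelays = retryDelaysMs(aiJobOptions.attempts, aiJobOptions.backoff.delay);
```

With 3 attempts and a 5000 ms base, a failing job is retried after 5 s and then 10 s before it is marked as failed. To actually cancel a slow HTTP call, the handler would also need to pass an abort signal to the LLM client.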

Why is BullMQ concurrency for an AI queue set to 3-10, not 100?

Testing an AI Backend: mocks, snapshots, integration tests

AI components are hard to test: LLM responses are non-deterministic, API calls cost money, latency is high. But that's no reason to skip tests. The solution is **multi-level testing**: unit tests with mocks cover 80% of the logic, snapshot tests lock down prompts, and integration tests with a real API run once a day in CI.

  • **Unit tests with mocks** - 80% of tests. Mock LlmProvider, test the logic around AI
  • **Snapshot tests for prompts** - catch accidental system prompt changes. The prompt is code
  • **Integration tests** - once a day in CI with a real API. Verify response format, not content
  • **Contract tests** - verify the response parses into the expected structure (JSON schema)
  • **Cost guard** - CI check that integration tests haven't exceeded the limit (e.g., USD 1 per run)
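A sketch of the first and fourth levels from the list: a mocked LlmProvider and a contract-style check that the response parses into the expected structure. The `moderateComment` function, the verdict format, and the tiny `runModerationTests` runner are hypothetical. With Jest the same checks would be written as `it(...)` blocks with `expect`.

```typescript
interface LlmProvider {
  complete(prompt: string): Promise<string>;
}

type Verdict = "allow" | "block" | "review";

// Contract check: anything that is not the expected JSON falls back to "review".
function parseVerdict(raw: string): Verdict {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null && "verdict" in parsed) {
      const verdict = (parsed as { verdict: unknown }).verdict;
      if (verdict === "allow" || verdict === "block") return verdict;
    }
  } catch {
    // Not JSON at all: handled by the fallback below.
  }
  return "review";
}

// The logic under test: prompt construction plus response handling.
async function moderateComment(provider: LlmProvider, comment: string): Promise<Verdict> {
  const raw = await provider.complete(
    `Classify the comment as allow or block. Reply as JSON {"verdict": "..."}.\n\n${comment}`,
  );
  return parseVerdict(raw);
}

// Mock: returns a canned reply and records the prompts it receives.
class MockProvider implements LlmProvider {
  readonly prompts: string[] = [];
  private readonly reply: string;

  constructor(reply: string) {
    this.reply = reply;
  }

  async complete(prompt: string): Promise<string> {
    this.prompts.push(prompt);
    return this.reply;
  }
}

function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

async function runModerationTests(): Promise<void> {
  const blocking = new MockProvider('{"verdict": "block"}');
  expectEqual(await moderateComment(blocking, "spam"), "block", "valid JSON verdict");
  expectEqual(blocking.prompts.length, 1, "provider called once");
  expectEqual(blocking.prompts[0].includes("spam"), true, "comment is part of the prompt");

  const chatty = new MockProvider("Sure! I think this is fine.");
  expectEqual(await moderateComment(chatty, "hello"), "review", "free-text reply falls back");
}
```

No network call and no API cost are involved, so tests like these can run on every commit. Only the integration level needs a real key.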

**Snapshot tests for prompts are an underrated practice.** An accidental change to a system prompt can break the behavior of all AI functionality. `toMatchSnapshot()` guarantees that prompt changes go through code review - not slip into production unnoticed.
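What a prompt snapshot test boils down to, written without a test framework. The prompt builder and the stored snapshot are hypothetical examples. With Jest, `expect(buildModerationSystemPrompt("en")).toMatchSnapshot()` plays the role of `assertMatchesSnapshot`, and the stored text lives in a `__snapshots__` file that is committed with the code.

```typescript
// Hypothetical system prompt builder.
function buildModerationSystemPrompt(locale: string): string {
  return [
    "You are a content moderation assistant.",
    `Answer in ${locale}.`,
    'Reply with JSON: {"verdict": "allow" | "block"}.',
  ].join("\n");
}

// The reviewed, committed version of the prompt.
const STORED_SNAPSHOT =
  "You are a content moderation assistant.\n" +
  "Answer in en.\n" +
  'Reply with JSON: {"verdict": "allow" | "block"}.';

// Fails loudly when the rendered prompt drifts from the committed snapshot.
function assertMatchesSnapshot(actual: string, stored: string): void {
  if (actual !== stored) {
    throw new Error(
      `Prompt changed.\n--- stored ---\n${stored}\n--- actual ---\n${actual}\n` +
        "Update the snapshot only if the change is intentional.",
    );
  }
}

assertMatchesSnapshot(buildModerationSystemPrompt("en"), STORED_SNAPSHOT);
```

Any edit to the prompt now shows up as a failing test and a snapshot diff in the pull request, which is the code-review step the paragraph above describes.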

Why use snapshot tests for system prompts?

Myth: Node.js isn't suitable for AI because of its single-threaded nature - everything will be slow.

Reality: the event loop is ideal for I/O-bound LLM calls: while the model generates a response, the thread is free. The problem is only with CPU-bound preprocessing.

An LLM call is just an HTTP request with a slow response. Node.js handles thousands of such requests in parallel without any blocking. The real problem is tokenization (tiktoken), JSON parsing of large contexts, and embedding computations. These are CPU-bound operations that actually block the event loop. The solution is Worker Threads: `new Worker('./tokenizer.worker.js')` moves heavy computation to a separate thread. The event loop stays free.
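A runnable sketch of moving tokenization off the main thread with Node's `worker_threads` module. Two simplifications are assumed: the worker source is passed inline through the `eval` option instead of living in `./tokenizer.worker.js`, and a whitespace split stands in for a real tokenizer such as tiktoken.

```typescript
import { Worker } from "node:worker_threads";

// Plain JavaScript executed in the worker thread.
// The whitespace split is a placeholder for real, CPU-heavy tokenization.
const TOKENIZER_WORKER_SOURCE = `
  const { parentPort, workerData } = require("node:worker_threads");
  const tokens = workerData.text.split(/\\s+/).filter(Boolean);
  parentPort.postMessage(tokens.length);
`;

// Counts tokens in a separate thread; the event loop stays free meanwhile.
function countTokensInWorker(text: string): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    const worker = new Worker(TOKENIZER_WORKER_SOURCE, {
      eval: true,
      workerData: { text },
    });
    worker.once("message", (count: number) => resolve(count));
    worker.once("error", reject);
    worker.once("exit", (code: number) => {
      // A non-zero exit before any message means the worker crashed.
      if (code !== 0) reject(new Error(`Tokenizer worker exited with code ${code}`));
    });
  });
}
```

Starting a worker per call keeps the example short but adds startup overhead on every request. A service under load would more likely keep a small pool of long-lived workers and send texts to them as messages.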

Summary

  • Node.js is the ideal AI backend platform: the event loop handles thousands of parallel LLM requests without blocking
  • CPU-bound tasks (tokenization via tiktoken, embedding computations) go into Worker Threads
  • AiModule with @Global() - a single entry point, AiService is the only public API, providers are hidden
  • LlmProvider interface enables switching OpenAI/Anthropic/Ollama without changing client code - and model routing cuts cost by up to 20x
  • BullMQ: concurrency 3-10 (rate limits), exponential backoff, priorities (moderation > generation)
  • Testing: 80% unit with mocks, snapshots for prompts - the prompt is code

What's Next

The AI backend is built. Next: how to connect AI to external systems in a standardized way via the Model Context Protocol (MCP), and how AI coding assistants use these same patterns.

  • MCP (Model Context Protocol) — A standard protocol for connecting AI to external systems - tools, resources, prompts
  • AI Coding Assistants from the Inside — How Copilot, Cursor, and Claude Code use AI backend architectural patterns

Related lessons

  • aie-42-ai-system-design — Backend implements the designed AI system
  • aie-45-mcp-protocol — Backend exposes tools via MCP
  • aie-43-realtime-ai — Node backend serves streaming responses
  • aie-35-observability — Instrument the backend for AI tracing
  • sd-10-microservices — Same service decomposition for AI features