Natural Language Processing

NLP System Design

The gap between a working NLP model and a reliable production NLP system is enormous. Gmail's spam filter serves 1.8 billion users with 99.9% uptime and sub-100ms latency - not because the ML model is perfect, but because the serving infrastructure, fallbacks, monitoring, and human review pipeline are engineered for production. The same principle applies to every search engine, chatbot, and moderation system. FAANG NLP system design interviews test exactly this: can a candidate reason through the full stack, not just the model?

  • **Airbnb Search** uses a multi-stage NLP pipeline - query understanding, semantic retrieval (dense + sparse fusion), listing reranking, and explanation generation - serving 150 million users with p99 latency under 200ms and A/B tested model changes every week.
  • **Meta's content moderation** system applies 3-tier classification to 100 billion pieces of content per day across Facebook and Instagram, with 350+ specialized classifiers for different violation types and languages, handling 99.5% automatically before any human review.
  • **Stripe's fraud detection** uses NLP on transaction descriptions and merchant names in addition to structured signals - a BERT-based merchant category classifier runs on every transaction and feeds into the risk scoring pipeline, processing 250 million transactions per day.

Prerequisites

  • Embeddings, bi-encoders, and approximate nearest neighbor search
  • Retrieval-Augmented Generation: retrieval, reranking, and grounding
  • Text classification, used for intent routing and content moderation tiers
  • Basic systems thinking: latency percentiles (p50/p99), throughput, caching, and fallbacks

  • RAG: Retrieval-Augmented Generation
  • Sentence and Document Embeddings
  • Text Classification

From feature engineering to LLM-as-a-service

Production NLP changed shape three times in under a decade. Through the mid-2010s a typical pipeline was a chain of task-specific stages: tokenizer, hand-built features (TF-IDF, n-grams, POS tags), a classifier such as logistic regression or an SVM, and a separate model for each task. Every new task meant new feature engineering and a new labeled dataset. The first shift arrived with BERT in 2018: one pretrained transformer fine-tuned per task replaced most of the feature engineering, and a single architecture started covering classification, NER, and question answering. The second shift was scale and serving infrastructure, as GPU inference, ONNX, distillation (DistilBERT), and dynamic batching turned research models into low-latency services. The third shift, roughly 2020-2023, was LLM-as-a-service: instead of training and hosting a model per task, teams call a hosted API (OpenAI, Anthropic) or self-host an open model (Llama, Mistral) and shape behavior with prompts and RAG. The hard engineering moved up the stack from feature design to retrieval quality, context management, evaluation, cost control, and serving throughput, which is exactly what production system design is about today.

Semantic Search System Design

A production semantic search system combines sparse (BM25) and dense (bi-encoder) retrieval via Reciprocal Rank Fusion (RRF), followed by a cross-encoder reranker. The offline pipeline: document ingestion -> chunking -> embedding (bi-encoder) -> indexing into a vector store (Qdrant/Weaviate). The online pipeline: query -> parallel BM25 + ANN search -> RRF fusion -> rerank top-50 -> return top-10.

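The RRF score for a document d is the sum of 1/(k + rank_i(d)) over every retrieval system i that returned d, where rank_i(d) is the document's 1-based position in system i's list and k = 60 is the commonly used smoothing constant. Documents that rank near the top in both sparse and dense results get the highest fused scores. Because RRF uses only ranks, it needs no score calibration or training data, which is why it often matches or beats learned fusion weights in low-training-data regimes and holds up under distribution shift.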
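The fusion rule is small enough to sketch directly. The following is a minimal illustration, not a production implementation: the function name `rrf_fuse` and the document IDs are made up for the example, and ties are broken by document ID only to keep the output deterministic.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    Returns document IDs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the sparse and dense retrievers.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Here `d1` and `d3` appear in both lists, so they outrank `d2` and `d4`, which each appear in only one.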

Query expansion improves recall: generate 3-5 paraphrases of the original query using an LLM, run retrieval for each, and merge results before reranking. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query and retrieves using the answer's embedding - improving recall@10 by 8-12% on BEIR.

Why does combining BM25 and dense retrieval with RRF outperform either method alone?

Production Chatbot Design

A production LLM-based chatbot requires more than a raw API call: intent classification (route to specialized handlers), session context management (token budget), RAG for factual grounding, output guardrails, latency SLAs (p99 < 3s for conversational), and cost controls. Streaming (SSE/WebSocket) is essential for perceived responsiveness - users tolerate 3s total but not 3s of silence.

Context window management is the core engineering challenge for multi-turn conversations. Naive approach: send the full history on every turn. This fails in long sessions: input tokens are re-billed on each request, so cost and latency grow with every turn, and the history eventually exceeds the model's context window (for example 128k tokens). Better: sliding window (last N turns) + RAG for older turns + summary compression of old history. Production systems often maintain three tiers: recent turns (verbatim), summary (LLM-compressed), and long-term memory (user preference KV store).
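One way to picture the three tiers is as a prompt assembler that works under a token budget. This is a rough sketch under stated assumptions: `build_context` and `count_tokens` are hypothetical names, the whitespace token count stands in for the model's real tokenizer, and the summary is assumed to have been produced elsewhere by an LLM.

```python
def count_tokens(text):
    # Crude whitespace estimate; a real system would use the model's tokenizer.
    return len(text.split())

def build_context(turns, summary, memory, budget, keep_recent=4):
    """Assemble prompt lines from three tiers under a token budget.

    turns: list of (role, text), oldest first.
    summary: compressed text covering older history (may be empty).
    memory: dict of long-term user preferences (may be empty).
    Returns (lines, tokens_used).
    """
    parts = []
    if memory:
        prefs = ", ".join(f"{k}={v}" for k, v in sorted(memory.items()))
        parts.append("User preferences: " + prefs)
    if summary:
        parts.append("Summary of earlier conversation: " + summary)
    used = sum(count_tokens(p) for p in parts)

    # Walk the recent turns newest-first, keeping as many as fit the budget.
    recent = []
    for role, text in reversed(turns[-keep_recent:]):
        line = f"{role}: {text}"
        cost = count_tokens(line)
        if used + cost > budget:
            break
        recent.append(line)
        used += cost
    return parts + list(reversed(recent)), used

turns = [("user", "a b c"), ("assistant", "d e"), ("user", "f g h i")]
lines, used = build_context(turns, summary="", memory={}, budget=8)
```

With a budget of 8 estimated tokens, the oldest turn is dropped and the two most recent turns are kept verbatim in their original order.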

Output tokens typically cost several times more per token than input tokens, so response length is a major cost lever. Using gpt-4o-mini for intent classification and routine responses, reserving gpt-4o for complex reasoning, reduces cost by 60-80% with negligible quality loss for common user intents. Prompt caching (Anthropic, OpenAI) further discounts repeated prompt prefixes such as system prompts, by up to 90% on the cached input tokens depending on the provider.
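The savings from routing follow from simple arithmetic. The sketch below uses placeholder per-request costs, not real vendor prices, and assumes the small model is 10x cheaper and handles 80% of traffic; `blended_cost` is a name invented for the example.

```python
def blended_cost(requests, share_small, cost_small, cost_large):
    """Total cost when a share of requests is routed to the small model."""
    small = requests * share_small * cost_small
    large = requests * (1 - share_small) * cost_large
    return small + large

# Placeholder per-request costs (assumptions, not real vendor prices).
COST_SMALL, COST_LARGE = 0.001, 0.010

baseline = blended_cost(1000, 0.0, COST_SMALL, COST_LARGE)  # everything on the large model
routed = blended_cost(1000, 0.8, COST_SMALL, COST_LARGE)    # 80% routed to the small model
savings = 1 - routed / baseline                             # about 0.72
```

Under these assumptions the bill drops by roughly 72%. The actual figure depends on the real price ratio and on how much traffic the small model can handle without hurting quality.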

Why is context window management a critical engineering challenge in production multi-turn chatbots?

Content Moderation at Scale

Content moderation at platform scale (YouTube: 500 hours uploaded per minute, Twitter: 6,000 tweets/second) requires multi-tier ML systems: a fast keyword/regex blocklist (microseconds), a lightweight binary classifier (BERT-mini, milliseconds), a heavy classifier for borderline cases (BERT-large, 50-100ms), and human review for high-stakes or contested decisions.
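The tiering can be sketched as a cascade in which each stage either decides or escalates. Everything here is illustrative: the blocklist terms, the `light_score` and `heavy_score` stand-ins (which fake classifier outputs with string checks), and the `low`/`high` thresholds are all assumptions chosen to make the control flow visible.

```python
import re

BLOCKLIST = re.compile(r"\b(badword|slur)\b", re.IGNORECASE)  # hypothetical terms

def light_score(text):
    # Stand-in for a small, fast classifier returning a violation probability.
    t = text.lower()
    if "attack" in t:
        return 0.9
    return 0.5 if "fight" in t else 0.05

def heavy_score(text):
    # Stand-in for a large classifier, called only on borderline cases.
    t = text.lower()
    if "fight you" in t:
        return 0.95
    return 0.5 if "fight club" in t else 0.02

def moderate(text, low=0.1, high=0.85):
    """Return (decision, tier), where tier names the stage that decided."""
    if BLOCKLIST.search(text):
        return "remove", "blocklist"
    for tier, scorer in (("light", light_score), ("heavy", heavy_score)):
        p = scorer(text)
        if p < low:
            return "allow", tier
        if p > high:
            return "remove", tier
        # Otherwise the score is borderline: escalate to the next tier.
    return "review", "human"
```

Most traffic exits at the cheap tiers, so the expensive model and human reviewers only see the small share of content that earlier stages could not decide confidently.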

False positive rate is the critical product metric: incorrectly removing legitimate content creates trust erosion and regulatory liability. False negative rate is the safety metric: missing harmful content creates harm at scale. These objectives conflict - the operating threshold is set based on regulatory context and platform values, not ML metrics alone.
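The trade-off becomes concrete when the threshold is chosen from a labeled validation set under a policy cap on false positives. This is a sketch with made-up scores and labels; `pick_threshold` is a hypothetical helper and assumes both classes are present in the data.

```python
def pick_threshold(scores, labels, max_fpr):
    """Return (threshold, fpr, fnr) for the lowest threshold within max_fpr.

    scores: violation probabilities; labels: 1 = violating, 0 = legitimate.
    Content is removed when score >= threshold. Returns None if even the
    strictest threshold exceeds the cap.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    best = None
    # Lowering the threshold can only raise FPR, so stop at the first violation.
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fpr = fp / negatives
        if fpr > max_fpr:
            break
        best = (t, fpr, fn / positives)
    return best

# Made-up validation data for illustration.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
strict = pick_threshold(scores, labels, max_fpr=0.0)
lenient = pick_threshold(scores, labels, max_fpr=0.25)
```

On this toy data, allowing no false positives leaves half the violating items undetected, while tolerating a 25% false positive rate catches all of them. The cap itself is the policy decision; the code only finds the best threshold under it.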

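Adversarial users exploit moderation systems via character substitution ('h@te', 'h3te'), switching languages, and code-switching within sentences. Robust moderation models are trained on these adversarial patterns and normalize text before classification: Unicode normalization folds compatibility variants such as fullwidth letters, and a separate substitution or confusables mapping collapses look-alike symbols and homoglyphs that Unicode normalization alone leaves untouched.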
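A pre-classification normalizer might look like the sketch below. It uses only the standard library's `unicodedata` module; the `LEET` table is a deliberately tiny, hypothetical mapping, and cross-script homoglyphs (for example Cyrillic letters that look like Latin ones) would need a fuller confusables table that is not shown here.

```python
import unicodedata

# Hypothetical, deliberately tiny substitution table; real tables are much larger.
LEET = str.maketrans({"@": "a", "3": "e", "1": "i", "0": "o", "$": "s"})

def normalize_for_moderation(text):
    # NFKC folds compatibility forms (fullwidth letters, ligatures) to plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Decompose accented characters and drop the combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(c)
    )
    # Case-fold, then map common symbol and digit substitutions back to letters.
    return text.casefold().translate(LEET)
```

Mapping digits to letters also rewrites legitimate numbers, so a design like this would typically feed the normalized string to the classifier as an additional view of the text and keep the original alongside it.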

Why does a multi-tier content moderation system use progressively larger models in later tiers?

Production NLP Pipeline Design

Production NLP pipelines must address model serving, versioning, monitoring, and failure modes. Key serving patterns: batching (group requests to raise GPU utilization, which is often only 20-30% with single-request serving), caching (an LRU cache for common queries, plus prompt-prefix caching for shared system prompts), and model routing (small model for easy queries, large model for complex ones).
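Caching and routing compose naturally, as this small sketch suggests. The two model functions are stand-ins that only tag their input, `is_easy` is a hypothetical heuristic (a real router would more likely use a classifier), and the `calls` counter exists only to make the cache effect observable.

```python
from functools import lru_cache

calls = {"small": 0, "large": 0}  # counters so the cache effect is visible

def small_model(query):
    calls["small"] += 1
    return f"small:{query}"

def large_model(query):
    calls["large"] += 1
    return f"large:{query}"

def is_easy(query):
    # Hypothetical routing heuristic: short queries go to the small model.
    return len(query.split()) <= 3

@lru_cache(maxsize=1024)
def answer(query):
    # Route first, then let the LRU cache absorb repeated queries.
    model = small_model if is_easy(query) else large_model
    return model(query)

answer("reset password")
answer("reset password")  # served from the cache, no model call
answer("compare the refund policies for annual and monthly plans")
```

After these three calls each model has been invoked once, because the repeated query never reaches a model.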

Model monitoring for NLP requires distribution shift detection: embedding drift (cosine distance between current query embeddings and training distribution), output distribution monitoring (answer length, top token probability, refusal rate), and periodic A/B evaluation against human labels. Silent failures are the hardest - a model degradation may not manifest in errors but only in user satisfaction metrics.
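Embedding drift can be approximated by comparing the centroid of recent query embeddings with the centroid of a reference sample. The sketch below is one simple variant, written in pure Python for clarity: the function names and the `threshold=0.1` default are assumptions, and production systems often use richer statistics than a single centroid distance.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def drift_alert(reference, current, threshold=0.1):
    """Return (distance, alert) comparing current embeddings to the reference."""
    d = cosine_distance(centroid(reference), centroid(current))
    return d, d > threshold

# Toy 2-d embeddings: the second batch points in a different direction.
reference = [[1.0, 0.0], [1.0, 0.0]]
same = drift_alert(reference, [[1.0, 0.0], [1.0, 0.0]])
shifted = drift_alert(reference, [[0.0, 1.0], [0.0, 1.0]])
```

A check like this runs on a schedule over a sample of recent traffic and only signals that inputs have moved; whether quality has dropped still has to be confirmed against human labels.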

vLLM's PagedAttention manages the KV cache as paged memory (like OS virtual memory), which cuts memory fragmentation and lets more sequences share the GPU. Combined with continuous batching, this raises GPU utilization to 70-80% vs. 20-30% for naive serving. At production scale (1000+ QPS), this 3-4x throughput improvement directly reduces GPU costs.
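A toy simulation shows where the utilization gain from continuous batching comes from. It is not how vLLM is implemented: it assumes every active sequence emits one token per decode step, ignores prefill and memory limits, and requires every output length to be at least 1.

```python
def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    queue = list(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        # Fill free slots from the queue before each decode step.
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        steps += 1
        # Each sequence emits one token; those that just finished leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps

def utilization(lengths, slots, steps):
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (slots * steps)

lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # output lengths in tokens (toy workload)
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

On this workload static batching needs 16 steps because each batch waits for its one long sequence, while continuous batching finishes in 10 by refilling freed slots, lifting slot utilization from about 44% to 70%.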

Misconception: Deploying an NLP model to production is primarily a model quality problem. Once accuracy is good enough, deployment is straightforward.

Reality: Production NLP systems spend most engineering effort on serving infrastructure, monitoring, fallback strategies, and silent failure detection. Model quality is necessary but rarely sufficient.

Example: A model with 95% accuracy at 500ms p99 latency, no fallback for GPU failures, and no drift monitoring will fail silently in production within weeks as query distribution shifts or infrastructure degrades.

What is continuous batching (used in vLLM) and why does it improve GPU utilization compared to static batching?

Key Ideas

  • **Semantic search** combines BM25 + dense retrieval via RRF fusion, followed by cross-encoder reranking - a 3-stage pipeline that outperforms either method alone by 10-15% NDCG@10.
  • **Production chatbots** require intent routing, 3-tier context management (recent turns + summary + long-term memory, with RAG for older turns), streaming, and cost-aware model routing - raw LLM API calls are only the beginning.
  • **NLP system maturity** is measured by latency SLAs, fallback chains, embedding drift monitoring, and human evaluation sampling - not just offline benchmark accuracy.

Related Topics

NLP system design synthesizes the full curriculum:

  • RAG: Retrieval-Augmented Generation — The retrieval pipeline in search engines and chatbots implements RAG - the system design lesson operationalizes the RAG architecture at production scale
  • NLP at the Interview (FAANG) — System design questions at FAANG interviews expect candidates to specify the full architecture described here, including latency, scale, and monitoring

Questions for Reflection

  • Design a content moderation system for a platform with 50 million daily posts that must comply with EU DSA (Digital Services Act) audit requirements - what changes does regulatory compliance impose on the architecture?
  • A production semantic search system shows declining user satisfaction (measured by click-through rate) over 6 months without any model changes. What are the most likely causes and how would each be diagnosed?
  • When does it make sense to train a domain-specific BERT-based classifier vs. use GPT-4 via API for a production NLP task? List the decision factors.

Related Lessons

  • nlp-17 — RAG is a core building block of NLP systems
  • nlp-23 — System design questions dominate NLP interviews
  • nlp-12 — Serving BERT models trades latency for accuracy
  • ml-55-ml-system-design — Same design tradeoffs as general ML systems
  • aie-42-ai-system-design — LLM application architecture in engineering