Computer Vision

CV System Design: From Prototype to Production at Scale

Tesla Full Self-Driving processes 8 camera feeds at 36 Hz - 288 images per second per car, across 2 million vehicles on the road. That is 576 million images per second fleet-wide, fed through detection, segmentation, and depth networks in real time. CV system design is the discipline that makes that possible without burning 10 kW per car.

  • Waymo Data Engine: 1% of flagged uncertain frames labeled daily by contractors, 14 000 new training examples per day, weekly retraining cycle
  • Pinterest Visual Search: 600M images per month through CLIP embeddings + ANN search - 150ms p95 latency from upload to results
  • Triton at LinkedIn: 4 model instances per GPU, dynamic batching with a 5ms window - 3x higher GPU utilization vs Flask-based serving
  • Apple Vision Pro: on-device CV pipeline processes 12 cameras at 90 Hz on M2 chip Neural Engine with 15W total power budget

Prerequisites

  • Image embeddings and CLIP from the vision-language lesson
  • Convolutional and transformer backbones for image encoding
  • Basic distributed systems ideas: caching, batching, latency budgets
  • Vision-Language Models
  • Image Classification with CNNs

How production CV pipelines took shape

Production computer vision changed shape over about a decade. Before 2015 most systems used hand-engineered features and per-task models trained from scratch. After deep CNNs won ImageNet, the dominant pattern became pretrained backbones plus fine-tuning: take a network trained on a large dataset, then adapt it to the target task with far less labeled data. The second shift was about where inference runs. Early systems sent every frame to a server, but mobile NPUs and edge accelerators made on-device inference practical, so latency-sensitive features like AR filters moved to the phone while heavy retrieval and moderation stayed in the cloud. By the early 2020s a production CV system was understood as a pipeline (ingestion, preprocessing, batched inference, post-processing, monitoring, retraining) rather than a single model, and serving stacks like Triton and TensorRT became standard tooling.

Visual Search System

Pinterest: a user uploads a photo of a dress and wants to find similar items across 500 million pins. Response latency has to stay under 200ms. How does that work? Not by brute force, but with embeddings plus approximate nearest neighbor search.

Visual search architecture has two halves. **Offline** (batch): generate CLIP embeddings for every image and index them in FAISS/Milvus. **Online** (real-time): for a query image, resize -> embedding -> ANN search -> re-ranking -> return top-K.

| Component | Technology | Role |
|---|---|---|
| Embedding | CLIP ViT-L/14 | Image representation (768-dim for ViT-L/14; 512-1024 dim across CLIP variants) |
| ANN Index | FAISS IVF-PQ / HNSW | Find nearest neighbors among 500M vectors in 5-10ms |
| Re-ranker | Cross-encoder / fine-tuned model | Refine top-200 down to top-20 |
| Storage | S3 + Redis cache | Originals + embeddings |
| Serving | Triton Inference Server | GPU batching for the encoder |
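
The online path (cheap candidate retrieval, then a more expensive re-ranking pass) can be sketched in a few lines of plain Python. This is a toy illustration and not how FAISS or Milvus are implemented: `ToyIVFIndex`, `rerank`, and `score_fn` are hypothetical names, the centroids are a random sample rather than a k-means result, and the re-ranker is whatever scoring function the caller passes in.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


class ToyIVFIndex:
    """Inverted-file index: each vector is bucketed under its nearest centroid."""

    def __init__(self, vectors, n_lists, seed=0):
        rng = random.Random(seed)
        self.vectors = [normalize(v) for v in vectors]
        # Assumption: centroids are a random sample; a real IVF index trains k-means.
        self.centroids = rng.sample(self.vectors, n_lists)
        self.lists = [[] for _ in range(n_lists)]
        for idx, v in enumerate(self.vectors):
            self.lists[self._nearest_centroids(v, 1)[0]].append(idx)

    def _nearest_centroids(self, q, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: -dot(q, self.centroids[c]))
        return order[:n]

    def search(self, query, k, nprobe):
        """Stage 1: scan only the nprobe closest buckets, not the whole corpus."""
        q = normalize(query)
        candidates = [i for c in self._nearest_centroids(q, nprobe)
                      for i in self.lists[c]]
        candidates.sort(key=lambda i: -dot(q, self.vectors[i]))
        return candidates[:k]


def rerank(query, candidate_ids, score_fn, top_k):
    """Stage 2: apply an expensive scorer to the short candidate list only."""
    return sorted(candidate_ids, key=lambda i: -score_fn(query, i))[:top_k]
```

The point of the split is cost: stage 1 touches a fraction of the index per query, so the expensive scorer in stage 2 runs on a few hundred candidates instead of hundreds of millions of images. Raising `nprobe` trades latency for recall.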

Why does visual search use a two-stage search, ANN (top-200) + re-ranker (top-20), instead of an exact search up front?

Content Moderation System

Instagram receives 100 million photos a day. Manual moderation is impossible, so the system has to be automatic. Requirements: latency under 500ms, precision over 99% for prohibited content (CSAM), recall over 95% for nudity, and an audit trail for every decision.

Content moderation pipeline: a **multi-label classifier** (NSFW, violence, spam, hate) + a **perceptual hash** (PhotoDNA for CSAM) + a **VLM** for context (a weapon in a news photo vs a scene of violence). Fall back to a human moderator when the model's confidence is low.
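
A minimal sketch of the routing logic described above, under several assumptions: the thresholds `block_at` and `allow_below` are made-up values, the hash check is modeled as exact set membership (real perceptual hash matching tolerates small differences), and the VLM context step is left out. The `reason` field stands in for the audit trail.

```python
from dataclasses import dataclass

BLOCK, ALLOW, HUMAN_REVIEW = "block", "allow", "human_review"


@dataclass
class Decision:
    action: str
    reason: str  # recorded for the audit trail


def moderate(image_hash, scores, known_hashes, block_at=0.9, allow_below=0.2):
    """Route one image. `scores` maps label -> classifier probability."""
    # 1. A match against known prohibited content blocks immediately,
    #    regardless of what the classifier says.
    if image_hash in known_hashes:
        return Decision(BLOCK, "hash_match")

    # 2. Otherwise decide from the highest-scoring label.
    label, top = max(scores.items(), key=lambda kv: kv[1])
    if top >= block_at:
        return Decision(BLOCK, f"classifier:{label}")
    if top < allow_below:
        return Decision(ALLOW, "all_scores_low")

    # 3. Anything in between is low confidence: fall back to a human.
    return Decision(HUMAN_REVIEW, f"uncertain:{label}")
```

For example, `moderate("x", {"nsfw": 0.5, "spam": 0.1}, set())` lands in the uncertain band and is sent to human review. In practice each label would likely get its own thresholds, since the precision and recall targets differ by category.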

Why does the moderation system run a separate PhotoDNA hash match alongside the ML classifier?

AR Filters System

The Snapchat dog filter places the ears and nose precisely on the face at any head angle, in real time, on a phone. 60 fps on an iPhone 12. No server request, everything runs on the device. How?

AR filter pipeline: **face detection** (BlazeFace, 200 fps on mobile) -> **face landmark detection** (468 points, MediaPipe Face Mesh) -> **pose estimation** (head rotation angle) -> **rendering** (overlay a 3D mask through OpenGL/Metal with correct perspective).

| Component | Model | Latency (mobile) |
|---|---|---|
| Face Detection | BlazeFace (MobileNet-based) | <1ms |
| Face Landmark | MediaPipe Face Mesh (468 pts; 478 with iris refinement) | ~3ms |
| Head Pose | solvePnP (OpenCV) | <1ms |
| Rendering | OpenGL ES / Metal | ~5ms (GPU) |
| Total | End-to-end pipeline | <16ms (60fps) |
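
The 16ms figure is the per-frame budget at 60 fps (1000 / 60 ≈ 16.7ms). The small sketch below checks the stage latencies from the table against that budget. The numbers are the table's upper bounds treated as exact values, which is an assumption, and the function names are illustrative.

```python
# Stage latencies in milliseconds, taken from the table above
# ("<1ms" is rounded up to 1.0 as a pessimistic assumption).
STAGE_LATENCY_MS = {
    "face_detection": 1.0,
    "face_landmarks": 3.0,
    "head_pose": 1.0,
    "rendering": 5.0,
}


def frame_budget_ms(fps):
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def headroom_ms(stage_latency_ms, fps):
    """Budget left after all stages run sequentially; negative means dropped frames."""
    return frame_budget_ms(fps) - sum(stage_latency_ms.values())


if __name__ == "__main__":
    for fps in (30, 60, 120):
        print(fps, "fps ->", round(headroom_ms(STAGE_LATENCY_MS, fps), 2), "ms headroom")
```

With these numbers the stages sum to 10ms, leaving about 6.7ms of headroom at 60 fps for camera capture, scheduling jitter, and thermal throttling. At 120 fps the same pipeline would already be over budget.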

Why do Snapchat AR filters run entirely on-device with no server requests?

Production CV Pipeline Design

These are the common principles of production CV systems, independent of the task (search, moderation, AR). Which components always show up, where the typical bottlenecks are, and how to set up monitoring.

Key trade-offs in CV system design: **accuracy vs latency** (a heavier model is more accurate but slower), **batch vs real-time** (batch is cheaper, real-time needs GPU on-demand), **on-device vs server** (on-device for latency and privacy; server for quality and updates), **model size vs recall** (MobileNet vs ViT-L).

**Serving detail.** Dynamic batching is the single highest-impact throughput lever: a GPU runs at roughly 60% efficiency at batch_size=1 and roughly 98% at batch_size=32, so Triton accumulates requests for a few milliseconds and runs them as one batch. INT8 TensorRT calibration on YOLOv8n moves 37.3 mAP to 36.8 mAP (a drop of 0.5 mAP points) for roughly 3x faster inference. Calibration runs 500-1000 representative images through the model to compute per-tensor scale factors.
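
The batching idea can be illustrated with a small simulation: open a window when the first request arrives, keep adding requests until the window expires or the batch is full, then dispatch. This is a simplified sketch and not Triton's actual scheduler; `form_batches`, `window_ms`, and `max_batch` are hypothetical names chosen for the example.

```python
def form_batches(arrivals_ms, window_ms=5.0, max_batch=32):
    """Group request arrival times (in ms) into batches.

    A batch is closed when the next request arrives after the window
    deadline, or when the batch already holds max_batch requests.
    """
    batches = []
    current = []
    deadline = None
    for t in sorted(arrivals_ms):
        if current and (t > deadline or len(current) >= max_batch):
            batches.append(current)
            current = []
        if not current:
            # The first request of a batch starts the accumulation window.
            deadline = t + window_ms
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(form_batches([0, 1, 2, 4.9, 5.1, 20]))
    # Three batches: four requests inside the first 5ms window,
    # then two requests that each end up alone.
```

The trade-off is visible in the parameters: a longer window yields larger batches and better GPU utilization, but every request in the batch waits up to `window_ms` longer before inference starts.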

Misconception: for production CV it is enough to take a pretrained model and deploy it.

Production CV requires choosing the right accuracy/latency trade-off, optimizing the model (TensorRT, quantization), two-stage pipelines (ANN + re-ranker), data drift monitoring, a retraining loop, and a fallback strategy.

A common mistake is picking ViT-L/14 (300ms inference) for a task that needs 50ms. Another is ignoring data drift: a model trained on summer photos can lose 15% accuracy in winter. Production CV is system design, not just architecture selection.
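
One way to notice this kind of drift before accuracy visibly drops is to compare a one-dimensional signal (for example prediction confidences, or one PCA component of the embeddings) between a reference window and the current window. The sketch below uses a two-sample Kolmogorov-Smirnov statistic with the usual large-sample critical value. The function names and the choice of signal are assumptions for illustration.

```python
import math


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        # Advance both samples past x, then compare CDF values at x.
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d


def ks_drift(reference, current, alpha=0.05):
    """True if the two samples differ at significance level alpha.

    Uses the asymptotic critical value, so it is only a rough guide
    for small samples.
    """
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical
```

A monitoring job could run `ks_drift(last_month_confidences, today_confidences)` on a schedule and raise an alert when it returns True. With very large windows the test flags even tiny shifts, so the raw statistic is often tracked alongside the boolean.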

Which CV model monitoring method catches degradation before users start complaining?

Key ideas

  • CV pipeline: ingestion -> preprocessing (letterbox) -> dynamic batching -> inference server -> NMS -> monitoring
  • Triton: multi-framework, dynamic batching, model ensembles - production standard at Waymo, LinkedIn, Booking.com
  • TensorRT INT8 calibration: 500-1000 representative images to compute per-tensor scales - typically about 0.5 mAP points lost for a 3x speedup
  • Monitoring: embedding drift (KS test on PCA) + confidence histograms + detection count per image via Prometheus
  • Active learning: route uncertain predictions (confidence 0.3-0.6) to human review - 60% annotation cost reduction vs random sampling
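
To make the "per-tensor scale" in the INT8 bullet concrete, here is a minimal sketch of symmetric quantization using the simplest max-abs rule. This is an assumption for illustration: TensorRT's own calibrators use more elaborate methods (such as entropy-based calibration), and the function names below are hypothetical.

```python
def calibrate_scale(calibration_batches):
    """One scale for the whole tensor, from the largest magnitude seen.

    calibration_batches: iterable of lists of activation values collected
    while running representative images through the model.
    """
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return (max_abs / 127.0) or 1.0  # avoid a zero scale for all-zero data


def quantize(values, scale):
    """Map floats to integers in [-127, 127], clipping outliers."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    scale = calibrate_scale([[0.5, -1.27], [1.0]])
    q = quantize([1.27, -0.5, 5.0], scale)
    print(q, dequantize(q, scale))
```

Values inside the calibrated range come back with an error of at most half a scale step, while values outside it (5.0 in the example) are clipped. That is why the calibration images need to be representative of production traffic.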

Related topics

CV system design integrates model quality, hardware optimization, and operational monitoring.

  • Vision-Language Models — VLMs serve as high-accuracy fallbacks in cascade pipelines for rare or ambiguous detection cases
  • Self-Supervised Pretraining — DINO and MAE backbone quality determines the accuracy ceiling before any serving optimization

Questions for reflection

  • Design a cascade detector for medical X-ray pathology screening: Stage 1 must have near-zero false negatives, Stage 2 minimizes false positives - what architectures and thresholds?
  • How would an active learning system handle annotation for a model that detects rare events (1 positive per 10 000 frames) without drowning annotators in negatives?
  • A CV model is deployed in 15 countries with different camera hardware. How would the monitoring system differentiate device-specific drift from genuine distribution shift?

Related lessons

  • cv-17 — VLMs and detection models are the inference components deployed in this pipeline
  • cv-16 — Self-supervised pretraining determines the backbone quality of deployed models
  • dl-05 — Transformer inference optimization (KV-cache, quantization) applies to CV transformer backbones
  • ml-01-intro — Model evaluation metrics (mAP, precision-recall) are shared between classical ML and CV deployment
  • sd-01-intro