Computer Vision

CV System Design: From Prototype to Production at Scale

Tesla Full Self-Driving processes 8 camera feeds at 36 Hz: 288 images per second per car. Across 2 million vehicles on the road, that is up to 576 million images per second fleet-wide (if every car were driving at once), fed through detection, segmentation, and depth networks in real time. CV system design is the discipline that makes that possible without burning 10 kW per car.

Waymo Data Engine: 1% of flagged uncertain frames labeled daily by contractors, 14 000 new training examples per day, weekly retraining cycle
Pinterest Visual Search: 600M images per month through CLIP embeddings + ANN search - 150ms p95 latency from upload to results
Triton at LinkedIn: 4 model instances per GPU, dynamic batching 5ms window - 3x higher GPU utilization vs Flask-based serving
Apple Vision Pro: on-device CV pipeline processes 12 cameras at 90 Hz on M2 chip Neural Engine with 15W total power budget

Prerequisites

Image embeddings and CLIP from the vision-language lesson
Convolutional and transformer backbones for image encoding
Basic distributed systems ideas: caching, batching, latency budgets

How production CV pipelines took shape

Production computer vision changed shape over about a decade. Before 2015 most systems used hand-engineered features and per-task models trained from scratch. After deep CNNs won ImageNet, the dominant pattern became pretrained backbones plus fine-tuning: take a network trained on a large dataset, then adapt it to the target task with far less labeled data. The second shift was about where inference runs. Early systems sent every frame to a server, but mobile NPUs and edge accelerators made on-device inference practical, so latency-sensitive features like AR filters moved to the phone while heavy retrieval and moderation stayed in the cloud. By the early 2020s a production CV system was understood as a pipeline (ingestion, preprocessing, batched inference, post-processing, monitoring, retraining) rather than a single model, and serving stacks like Triton and TensorRT became standard tooling.

Visual Search System

Pinterest: a user uploads a photo of a dress and wants to find similar items across 500 million pins. Response latency has to stay under 200ms. How does that work? Not by brute force, but with embeddings plus approximate nearest neighbor search.

Visual search architecture has two halves. **Offline** (batch): generate CLIP embeddings for every image and index them in FAISS/Milvus. **Online** (real-time): for a query image, resize -> embedding -> ANN search -> re-ranking -> return top-K.

Component	Technology	Job
Embedding	CLIP ViT-L/14	768-dim image representation (512-1024 across CLIP variants)
ANN Index	FAISS IVF-PQ / HNSW	Find nearest of 500M in 5-10ms
Re-ranker	Cross-encoder / fine-tuned	Refine top-200 down to top-20
Storage	S3 + Redis cache	Originals + embeddings
Serving	Triton Inference Server	GPU batching for the encoder
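
The offline/online split and the two stages in the table can be sketched in plain Python. This is a toy illustration, not FAISS: `build_ivf`, `ann_search`, and `rerank` are hypothetical helper names, the centroids are sampled rather than trained with k-means, the vectors are random rather than CLIP embeddings, and an exact dot product stands in for the re-ranker model.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_ivf(vectors, n_lists, rng):
    """Offline: assign every vector to its most similar centroid (inverted lists)."""
    centroids = rng.sample(vectors, n_lists)  # assumption: sampled, not k-means
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        nearest = max(range(n_lists), key=lambda c: dot(v, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ann_search(query, vectors, centroids, lists, n_probe, k):
    """Online stage 1: scan only the n_probe most similar lists, keep top-k ids."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]

def rerank(query, candidate_ids, score_fn, k):
    """Online stage 2: run a costlier scorer on the shortlist only."""
    return sorted(candidate_ids, key=lambda i: score_fn(query, i), reverse=True)[:k]

# Toy data: 1000 random unit vectors instead of 500M CLIP embeddings.
rng = random.Random(0)
db = [normalize([rng.gauss(0, 1) for _ in range(32)]) for _ in range(1000)]
centroids, lists = build_ivf(db, n_lists=16, rng=rng)

target = 123
query = db[target]  # query with a vector that is known to be in the index
shortlist = ann_search(query, db, centroids, lists, n_probe=4, k=200)
top = rerank(query, shortlist, lambda q, i: dot(q, db[i]), k=20)
```

Only 4 of the 16 lists are scanned in stage 1, and the costlier scorer in stage 2 touches at most 200 items. In a real IVF-PQ index the stage 1 scores are also computed on compressed codes, so they are approximate, which is one reason a second pass over the shortlist is useful.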

Why does visual search use a two-stage search, ANN (top-200) + re-ranker (top-20), instead of an exact search up front?

Content Moderation System

Instagram receives 100 million photos a day. Manual moderation is impossible, so the system has to be automatic. Requirements: latency under 500ms, precision over 99% for prohibited content (CSAM), recall over 95% for nudity, and an audit trail for every decision.

Content moderation pipeline: a **multi-label classifier** (NSFW, violence, spam, hate) + a **perceptual hash** (PhotoDNA for CSAM) + a **VLM** for context (a weapon in a news photo vs a scene of violence). Fall back to a human moderator when the model's confidence is low.
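
The routing logic of this pipeline can be sketched as a single decision function. Everything here is an assumption for illustration: PhotoDNA is proprietary, so a simple 64-bit average hash stands in for it, and the thresholds (`max_hamming`, `block_at`, `allow_below`) are hypothetical values, not ones used by any real platform. The VLM context step is left out.

```python
from dataclasses import dataclass

def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale grid (stand-in for PhotoDNA)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

@dataclass
class Decision:
    action: str  # "block", "allow", or "human_review"
    reason: str  # stored with the decision for the audit trail

def moderate(image_hash, scores, known_hashes,
             max_hamming=4, block_at=0.95, allow_below=0.20):
    """Hash match first, then classifier thresholds, then human fallback."""
    # 1. Near-duplicate of known prohibited content: block without the model.
    for known in known_hashes:
        if hamming(image_hash, known) <= max_hamming:
            return Decision("block", "hash_match")
    # 2. Multi-label classifier: act only when the model is confident.
    label, score = max(scores.items(), key=lambda kv: kv[1])
    if score >= block_at:
        return Decision("block", f"classifier:{label}")
    if score < allow_below:
        return Decision("allow", "all_scores_low")
    # 3. Low confidence either way: send to a human moderator.
    return Decision("human_review", f"uncertain:{label}")
```

For example, `moderate(h, {"nsfw": 0.5, "violence": 0.1}, known_hashes=[])` returns a `human_review` decision, because the top score falls between the two hypothetical thresholds.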

Why does the moderation system run a separate PhotoDNA hash match alongside the ML classifier?

AR Filters System

The Snapchat dog filter places the ears and nose precisely on the face at any head angle, in real time, on a phone. 60 fps on an iPhone 12. No server request, everything runs on the device. How?

AR filter pipeline: **face detection** (BlazeFace, 200+ fps on flagship phones) -> **face landmark detection** (MediaPipe Face Mesh: 468 points, or 478 with iris landmarks) -> **pose estimation** (head rotation angle) -> **rendering** (overlay a 3D mask through OpenGL/Metal with correct perspective).

Component	Model	Latency (mobile)
Face Detection	BlazeFace (MobileNet-based)	<1ms
Face Landmark	MediaPipe Face Mesh (468 pts, 478 with iris)	~3ms
Head Pose	solvePnP (OpenCV)	<1ms
Rendering	OpenGL ES / Metal	~5ms (GPU)
Total	End-to-end pipeline	<16ms (60fps)
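
The "Total" row is a frame-budget calculation, which can be checked with a few lines of Python. This is a minimal sketch: `budget_report` is a hypothetical helper, and the "<1ms" entries from the table are treated as an upper bound of 1.0 ms.

```python
# Stage latencies from the table above, in milliseconds.
# Assumption: "<1ms" is counted as 1.0 and "~3ms"/"~5ms" as exact values.
STAGES_MS = {
    "face_detection": 1.0,
    "face_landmarks": 3.0,
    "head_pose": 1.0,
    "rendering": 5.0,
}

def budget_report(stages_ms, fps):
    """Compare summed stage latency against the per-frame budget at a given fps."""
    budget = 1000.0 / fps              # e.g. 16.67 ms per frame at 60 fps
    used = sum(stages_ms.values())
    return {
        "budget_ms": budget,
        "used_ms": used,
        "headroom_ms": budget - used,
        "fits": used <= budget,
    }

report_60 = budget_report(STAGES_MS, fps=60)
report_120 = budget_report(STAGES_MS, fps=120)
```

With these numbers the stages sum to 10 ms, which fits the 16.67 ms budget at 60 fps with headroom to spare, but would not fit the 8.33 ms budget at 120 fps.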

Why do Snapchat AR filters run entirely on-device with no server requests?

Production CV Pipeline Design

This section covers the principles common to production CV systems regardless of the task (search, moderation, AR): which components always show up, where the typical bottlenecks are, and how to set up monitoring.

Key trade-offs in CV system design: **accuracy vs latency** (a heavier model is more accurate but slower), **batch vs real-time** (batch is cheaper, real-time needs GPU on-demand), **on-device vs server** (on-device for latency and privacy; server for quality and updates), **model size vs recall** (MobileNet vs ViT-L).

**Serving detail.** Dynamic batching is the single highest-impact throughput lever: a GPU runs near 60% efficiency at batch_size=1 and near 98% at batch_size=32, so Triton accumulates requests for a few milliseconds and runs them as one batch. INT8 TensorRT calibration on YOLOv8n moves mAP from 37.3 to 36.8 (a drop of 0.5 points) for roughly 3x faster inference. Calibration runs 500-1000 representative images through the model to compute per-tensor scale factors.
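
The batching behaviour described above can be simulated offline. This is a simplified sketch, not Triton's scheduler: `dynamic_batches` is a hypothetical function that groups request arrival times, closing a batch when the wait window measured from its first request has elapsed or when the batch is full.

```python
def dynamic_batches(arrivals_ms, window_ms=5.0, max_batch=32):
    """Group request arrival times (in ms) into batches.

    A batch opens with its first request and closes when the next request
    arrives more than window_ms later, or when it already holds max_batch items.
    """
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        window_expired = bool(current) and (t - current[0] > window_ms)
        batch_full = len(current) == max_batch
        if current and (window_expired or batch_full):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Seven requests with a 5 ms window collapse into three GPU calls.
example = dynamic_batches([0, 1, 2, 4.9, 6, 7, 20], window_ms=5.0)
```

The trade-off is visible in the parameters: a longer window gives larger batches and better GPU utilization, but every request in the batch can wait up to `window_ms` longer. In Triton the corresponding setting is the `max_queue_delay_microseconds` field of the `dynamic_batching` block in the model configuration.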

Misconception: for production CV it is enough to take a pretrained model and deploy it.

Reality: production CV requires choosing the right accuracy/latency trade-off, optimizing the model (TensorRT, quantization), building two-stage pipelines where they fit (for example ANN + re-ranker), monitoring data drift, running a retraining loop, and defining a fallback strategy.

A common mistake is picking ViT-L/14 (300ms inference) for a task that needs 50ms. Another is ignoring data drift: a model trained on summer photos can lose 15% accuracy in winter. Production CV is system design, not just architecture selection.
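
Drift of the summer-to-winter kind can be flagged without labels by comparing a reference window of some scalar signal (for example model confidence, or one PCA component of the embeddings) against a recent window. The sketch below implements the two-sample Kolmogorov-Smirnov statistic in plain Python. `ks_statistic` and `drift_detected` are hypothetical names, and the 1.36 coefficient is the standard large-sample critical value for a 5% significance level.

```python
import bisect
import math

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_cur = bisect.bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def drift_detected(reference, current, coeff=1.36):
    """Flag drift when the KS statistic exceeds the large-sample threshold.

    coeff=1.36 corresponds to alpha = 0.05; it is an approximation that
    assumes both windows are reasonably large.
    """
    n, m = len(reference), len(current)
    threshold = coeff * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold
```

A monitoring job might call `drift_detected(last_month_confidences, todays_confidences)` on a schedule and raise an alert when it returns `True`. The window sizes and the choice of signal are deployment decisions that this sketch does not settle.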

Which CV model monitoring method catches degradation before users start complaining?

Key ideas

CV pipeline: ingestion -> preprocessing (letterbox) -> dynamic batching -> inference server -> NMS -> monitoring
Triton: multi-framework, dynamic batching, model ensembles - production standard at Waymo, LinkedIn, Booking.com
TensorRT INT8 calibration: 500-1000 representative images to compute per-tensor scales - typically about -0.5 mAP points for 3x speedup
Monitoring: embedding drift (KS test on PCA) + confidence histograms + detection count per image via Prometheus
Active learning: route uncertain predictions (confidence 0.3-0.6) to human review - 60% annotation cost reduction vs random sampling
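
The active learning rule in the last item can be sketched as a filter over predictions. This is a minimal illustration: `select_for_review` is a hypothetical function, the 0.3-0.6 band comes from the item above, and ordering by distance to the middle of the band is an assumption made here for prioritizing a limited annotation budget.

```python
def select_for_review(predictions, low=0.3, high=0.6, budget=None):
    """Pick frames whose confidence falls in the uncertain band.

    predictions: iterable of (frame_id, confidence) pairs.
    budget: optional cap on how many frames annotators can handle.
    """
    uncertain = [(fid, conf) for fid, conf in predictions if low <= conf <= high]
    # Assumption: frames closest to the centre of the band are reviewed first.
    centre = (low + high) / 2
    uncertain.sort(key=lambda item: abs(item[1] - centre))
    if budget is not None:
        uncertain = uncertain[:budget]
    return [fid for fid, _ in uncertain]

preds = [("a", 0.95), ("b", 0.45), ("c", 0.31), ("d", 0.10), ("e", 0.60)]
queue = select_for_review(preds)
capped = select_for_review(preds, budget=2)
```

Confident predictions ("a" and "d" in the example) never reach annotators, which is where the cost saving over random sampling comes from.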

Questions for reflection

Design a cascade detector for medical X-ray pathology screening: Stage 1 must have near-zero false negatives, Stage 2 minimizes false positives - what architectures and thresholds?
How would an active learning system handle annotation for a model that detects rare events (1 positive per 10 000 frames) without drowning annotators in negatives?
A CV model is deployed in 15 countries with different camera hardware. How would the monitoring system differentiate device-specific drift from genuine distribution shift?

Related lessons

cv-17 — VLMs and detection models are the inference components deployed in this pipeline
cv-16 — Self-supervised pretraining determines the backbone quality of deployed models
dl-05 — Transformer inference optimization (KV-cache, quantization) applies to CV transformer backbones
ml-01-intro — Model evaluation metrics (mAP, precision-recall) are shared between classical ML and CV deployment
sd-01-intro
