Computer Vision

Vision-Language Models: CLIP, Flamingo and Grounded Understanding

CLIP reaches about 76% zero-shot top-1 accuracy on ImageNet without training on any ImageNet labels. Its supervision came from the web: images with alt text, product photos with descriptions, memes with captions. That adds up to 400 million image-text pairs and no human labeler. This is what happens when contrastive learning meets web-scale data.

Google Lens: CLIP-descendant model grounds image regions to product queries - 1 billion monthly users finding products by pointing a camera
Bing Image Search: semantic CLIP embeddings replace keyword-only search - 'cozy coffee shop in rain' returns relevant images without exact tag matches
Waymo: VLMs flag rare edge cases in dashcam footage where closed-vocabulary detectors return empty boxes - VLM identifies 'debris on road partially obscured by shadows'
Scale AI: GroundingDINO + SAM automates 60% of annotation for object detection datasets - human annotators only review edge cases

Prerequisites

Vision Transformers and how patch embeddings turn an image into tokens
Transformer attention and cross-attention between two token sequences
Embedding spaces, cosine similarity, and softmax-based classification

Five years that built vision-language models

CLIP landed in January 2021 from a team led by Alec Radford at OpenAI, proving that contrastive training on 400 million web image-text pairs could match a supervised ResNet-50 on ImageNet without using ImageNet labels. In January 2022 Salesforce released BLIP, followed in 2023 by BLIP-2, which introduced the Q-Former to bridge a frozen vision encoder and a frozen language model. In April 2022 DeepMind published Flamingo, the first model to handle interleaved image-text sequences for few-shot visual learning. In 2023 the field moved fast: OpenAI added vision to GPT-4 (GPT-4V), and the open-source LLaVA showed that a simple projection connector, tuned on GPT-4-generated instructions, could join a frozen CLIP encoder to a LLaMA-family model and yield a capable visual assistant in about a day of training.

CLIP: Contrastive Language-Image Pretraining

Google Images finds photos from a text query. How? Before 2021 the answer was tags and alt text. OpenAI's CLIP (2021) changed the approach: it trained on 400 million (image, text) pairs scraped from the internet, with no manual labeling. CLIP learned to place images and text in one shared embedding space, so the embedding of a cat photo lands close to the text embeddings for 'cat' and 'kitten' and far from the one for 'car'.
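A minimal sketch of how the shared space can be used for zero-shot classification, assuming the embeddings have already been computed. The 3-dimensional vectors, the prompt strings, and the helper names `cosine` and `zero_shot_classify` are hypothetical stand-ins; real CLIP embeddings have hundreds of dimensions and come from the trained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs):
    """Return the prompt whose text embedding is most similar to the image."""
    scores = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings, one per class prompt (assumption: toy values).
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding standing in for an encoded cat photo.
image_emb = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_emb, text_embs)
print(label)  # a photo of a cat
```

Adding a new class only requires encoding one more prompt; nothing is retrained, which is why the class list can change at inference time.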

CLIP trains two encoders jointly: an ImageEncoder (ViT or ResNet) and a TextEncoder (Transformer). The objective is a contrastive loss (InfoNCE): an image embedding should sit close to the embedding of its own caption and far from the captions of other images in the batch. Given a batch of N pairs, the matching diagonal is maximized and all N^2 minus N off-diagonal pairs are pushed apart. The result is a general-purpose multimodal embedding space.
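The objective described above can be sketched as a symmetric cross-entropy over the N by N similarity matrix. This is an illustrative pure-Python version, not OpenAI's implementation: it assumes the embeddings are already L2-normalized, and the fixed `temperature=0.07` is an assumption (in CLIP the temperature is a learned parameter).

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized (image, text) pairs."""
    n = len(image_embs)
    # logits[i][j]: similarity of image i and caption j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_embs] for img in image_embs]
    # Image-to-text direction: row i should assign its mass to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign its mass to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of N=2 pairs (hypothetical embeddings).
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each image matches its own caption
swapped = [[0.0, 1.0], [1.0, 0.0]]   # captions assigned to the wrong images

print(clip_contrastive_loss(aligned, aligned))  # near zero
print(clip_contrastive_loss(aligned, swapped))  # large
```

Each row of the matrix treats the other N-1 captions as negatives, so a larger batch supplies more negatives per image without any extra mining step.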

Why does CLIP not need labeled data to do zero-shot classification on new categories?

BLIP and BLIP-2: Image Captioning and VQA

CLIP compares, but it does not generate text. Instagram auto-generates alt text for images (accessibility). Google Lens answers questions about a photo. These are image captioning and Visual Question Answering tasks, and they need a generative language decoder.

**BLIP-2** (Salesforce, 2023) adds a Q-Former (Querying Transformer) as a bridge between a frozen image encoder (ViT-g, roughly 1B params) and a frozen LLM (FlanT5 or OPT). Only the Q-Former weights train (~188M): freeze the big models, train the small bridge.

What is the Q-Former in BLIP-2 and why is it needed?

Flamingo: Few-Shot Visual Learning

A doctor wants to ask a question about an X-ray and supply 3 examples of correct answers as context: few-shot learning with images. DeepMind's Flamingo (2022) was the first model to support interleaved image-text sequences: text, image, text, image, then a question and answer.

**Flamingo** combines a frozen Chinchilla LLM, a frozen vision encoder pretrained with a CLIP-style contrastive objective, and two trainable parts: a Perceiver Resampler and cross-attention layers inserted into the LLM. The Perceiver Resampler compresses a feature map of any size into 64 fixed tokens. The largest variant has 80B parameters in total.
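The compression step can be illustrated with a single cross-attention pass from a fixed set of latent queries to a variable number of image features. This is a heavily simplified sketch, not Flamingo's architecture: it assumes one head, no learned projections, no feed-forward layers, and random latents in place of trained ones. The names `cross_attend` and `resample` and the toy dimension `DIM = 8` are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query returns an attention-weighted average of the features."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, features))
                    for d in range(dim)])
    return out

random.seed(0)
DIM = 8            # toy feature width (assumption)
NUM_LATENTS = 64   # the fixed token count from the text
# Random stand-ins for the latent queries, which are learned in a real model.
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

def resample(features):
    """Map any number of feature vectors to exactly NUM_LATENTS tokens."""
    return cross_attend(latents, features)

small = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(49)]   # e.g. a 7x7 map
large = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(576)]  # e.g. a 24x24 map
print(len(resample(small)), len(resample(large)))  # 64 64
```

The output length depends only on the number of queries, so the cost inside the LLM stays constant no matter how large the image feature map is.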

Why does the Perceiver Resampler compress image features down to 64 fixed tokens?

GPT-4V and Modern Multimodal LLMs

OpenAI's GPT-4V (2023), Anthropic's Claude 3, and Google's Gemini Ultra are multimodal LLMs that look at an image and reason about it the way they reason about text. A developer uploads a screenshot of an error and GPT-4V reads the code, parses the stack trace, and suggests a fix. That is not just a caption, it is reasoning.

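Modern open-source architecture: **LLaVA** is CLIP ViT + an MLP projection + Vicuna/LLaMA. **InstructBLIP** is BLIP-2 + instruction tuning. The key difference from Flamingo is instruction-following, which these models get from supervised visual instruction tuning rather than from RLHF/DPO. The original LLaVA keeps the CLIP encoder frozen and fine-tunes the projection and the LLM on about 150K GPT-4-generated image-instruction pairs. LLaVA-1.5 replaces the linear projection with an MLP, adds academic VQA data, and reaches about 80% on VQAv2 with its 13B model.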
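A rough sketch of the connector idea, assuming a two-layer MLP with a GELU in between: patch features from the vision encoder are projected into the LLM's embedding width and placed in front of the text token embeddings. All dimensions, weights, and helper names here are toy, hypothetical values; real models use thousands of dimensions and trained weights.

```python
import math
import random

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y = W x + b for one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (stand-ins for trained parameters)."""
    weight = [[rng.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project(patch_features, layer1, layer2):
    """Map each vision feature to one token in the LLM embedding space."""
    tokens = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens

rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 10   # toy widths (assumption)
layer1 = init_linear(VISION_DIM, LLM_DIM, rng)
layer2 = init_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for 4 image patch features and 3 text token embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

visual_tokens = project(patch_features, layer1, layer2)
llm_input = visual_tokens + text_embeddings   # visual tokens first, then text
print(len(llm_input), len(llm_input[0]))  # 7 10
```

From the LLM's point of view the projected patches are just extra tokens, which is why no change to the language model's architecture is needed.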

Model	Company	Open source	Distinguishing trait
CLIP	OpenAI	Weights	Embeddings only, zero-shot
BLIP-2	Salesforce	Yes	Q-Former, captioning + VQA
Flamingo	DeepMind	No	Few-shot, interleaved sequences
LLaVA-1.6	Community	Yes	CLIP + LLaMA, instruction-following
GPT-4V	OpenAI	No	Reasoning, best accuracy
Gemini Ultra	Google	No	Native multimodal pretraining

Common misconception: vision-language models understand images the way a human does

VLMs do statistical pattern matching. GPT-4V can misread the numbers on a chart, hallucinate objects that are not there, and miss an anomaly in a medical image.

CLIP is trained on internet data where cats far outnumber X-rays. GPT-4V scores around 90% on OCR of natural images but around 60% on handwritten diagrams. Production use requires validation on domain-specific data.
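One simple way to run that validation is to report accuracy per data domain instead of a single aggregate number. The sketch below assumes predictions have already been collected; the record format, domain names, labels, and the 0.7 threshold are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """records: iterable of (domain, prediction, ground_truth) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, pred, truth in records:
        total[domain] += 1
        correct[domain] += int(pred == truth)
    return {d: correct[d] / total[d] for d in total}

def failing_domains(records, threshold):
    """Domains whose accuracy falls below the given threshold."""
    return sorted(d for d, acc in accuracy_by_domain(records).items()
                  if acc < threshold)

# Hypothetical evaluation records (made-up data for illustration only).
records = [
    ("natural_photo", "cat", "cat"),
    ("natural_photo", "dog", "dog"),
    ("natural_photo", "car", "car"),
    ("natural_photo", "cat", "dog"),
    ("chest_xray", "normal", "pneumonia"),
    ("chest_xray", "normal", "normal"),
]

print(accuracy_by_domain(records))              # {'natural_photo': 0.75, 'chest_xray': 0.5}
print(failing_domains(records, threshold=0.7))  # ['chest_xray']
```

An aggregate score over these six records would be 4/6, which hides the fact that the under-represented domain is the one failing.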

What is the core difference between GPT-4V and CLIP when working with an image?

Key ideas

CLIP: contrastive loss on N image-text pairs per batch - N-1 negatives per anchor, 400M web pairs, no task labels
Zero-shot classification: encode class names as text, pick class with highest cosine similarity to image embedding
GroundingDINO: DINO detector + language-grounded pretraining with a BERT text encoder - detect any object described by natural language without retraining
LLaVA: frozen CLIP encoder + projection connector + LLM - connector pretrained on image-caption pairs, then connector and LLM fine-tuned on about 150K GPT-4-generated instruction pairs
VLMs fail at precise spatial counting and geometry - specialized CV models still required for those tasks

Questions for reflection

Why does CLIP use InfoNCE loss with large batch sizes rather than triplet loss with hard negative mining?
Design a retrieval system for a medical imaging archive using CLIP embeddings - what are the failure modes compared to supervised classifiers?
When would a 7B LLaVA model be preferred over GPT-4V for a production CV task - cost, latency, or capability?

Related lessons

cv-16 — Self-supervised vision representations from lesson 16 are the image encoder in CLIP
cv-18 — VLMs are deployed in CV production systems covered in lesson 18
dl-05 — Transformer attention is the shared architecture for both image and text encoders in CLIP
ml-01-intro — Contrastive learning is a form of metric learning - shared conceptual foundation
dl-01