Computer Vision
Vision-Language Models: CLIP, Flamingo and Grounded Understanding
CLIP reaches about 76% zero-shot top-1 accuracy on ImageNet without any training on ImageNet data. The internet was its labeled dataset: every image with alt text, every product photo with a description, every meme with a caption. That is 400 million examples and no human labeler. This is what happens when contrastive learning meets web-scale data.
- Google Lens: CLIP-descendant model grounds image regions to product queries - 1 billion monthly users finding products by pointing a camera
- Bing Image Search: semantic CLIP embeddings replace keyword-only search - 'cozy coffee shop in rain' returns relevant images without exact tag matches
- Waymo: VLMs flag rare edge cases in dashcam footage where closed-vocabulary detectors return empty boxes - VLM identifies 'debris on road partially obscured by shadows'
- Scale AI: GroundingDINO + SAM automates 60% of annotation for object detection datasets - human annotators only review edge cases
Prerequisites
- Vision Transformers and how patch embeddings turn an image into tokens
- Transformer attention and cross-attention between two token sequences
- Embedding spaces, cosine similarity, and softmax-based classification
Five years that built vision-language models
CLIP landed in January 2021 from a team led by Alec Radford at OpenAI, proving that contrastive training on 400 million web image-text pairs could match a supervised ResNet-50 without task labels. In January 2022 Salesforce released BLIP, then BLIP-2 in 2023, which introduced the Q-Former to bridge a frozen vision encoder and a frozen language model. In April 2022 DeepMind published Flamingo, the first model to handle interleaved image-text sequences for few-shot visual learning. In 2023 the field moved fast: OpenAI added vision to GPT-4 (GPT-4V), and the open-source LLaVA showed that a single MLP connector trained on GPT-4-generated instructions could turn a frozen CLIP encoder and a LLaMA model into a capable visual assistant on a one-day budget.
CLIP: Contrastive Language-Image Pretraining
Google Images finds photos from a text query. How? Before 2021 the answer was tags and alt text. OpenAI's CLIP (2021) changed the approach by training on 400 million (image, text) pairs scraped from the internet, with no manual labeling. CLIP learned to place images and text in one shared embedding space, so a photo of a cat lands close to the text vectors for 'cat' and 'kitten'.
CLIP trains two encoders jointly: an image encoder (ViT or ResNet) and a text encoder (Transformer). The objective is a contrastive loss (InfoNCE): an image embedding should sit close to the embedding of its own caption and far from the captions of other images in the batch. Given a batch of N pairs, the encoders produce an N x N similarity matrix; training raises the N diagonal entries (matching pairs) and pushes down the N^2 - N off-diagonal entries (mismatched pairs). The result is a general-purpose multimodal embedding space.
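The batch-level objective can be sketched in a few lines. This is a minimal, framework-free illustration and not OpenAI's training code: the function names are hypothetical, the embeddings are toy vectors, and the temperature is fixed at 0.07 (CLIP learns it during training, starting from that value).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of N (image, caption) pairs.

    Assumption: temperature is a fixed constant here.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine(image i, caption j) / temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

# Toy batch of 3 pairs: captions either line up with their images or are shuffled.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_matched = clip_contrastive_loss(aligned, aligned)
loss_mismatched = clip_contrastive_loss(aligned, shuffled)
print(loss_matched < loss_mismatched)
```

Each row of the similarity matrix is treated as an N-way classification problem whose correct answer is the diagonal entry, which is why every anchor gets N-1 negatives for free from the rest of the batch.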
Why does CLIP not need labeled data to do zero-shot classification on new categories?
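One way to see the answer is that the class list is just text supplied at inference time. The sketch below assumes a hypothetical `toy_encode_text` lookup standing in for CLIP's text encoder and a hand-written image vector standing in for the image encoder output; only the 'a photo of a {}' prompt style comes from common CLIP usage.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}"):
    """Return the class whose prompt embedding is closest to the image embedding."""
    prompts = [template.format(name) for name in class_names]
    scores = [cosine(image_emb, encode_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores

# Hypothetical stand-in for a text encoder: a fixed lookup table of toy vectors.
TOY_TEXT_VECTORS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def toy_encode_text(prompt):
    return TOY_TEXT_VECTORS[prompt]

# Hypothetical image embedding that happens to point mostly in the "cat" direction.
toy_image_emb = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(toy_image_emb, ["cat", "dog", "car"], toy_encode_text)
print(label)
```

Adding a category means adding one more string to `class_names`; no weights change, which is the sense in which the classifier is zero-shot.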
BLIP and BLIP-2: Image Captioning and VQA
CLIP compares, but it does not generate text. Instagram auto-generates alt text for images (accessibility). Google Lens answers questions about a photo. These are image captioning and Visual Question Answering tasks, and they need a generative language decoder.
**BLIP-2** (Salesforce, 2023) adds a Q-Former (Querying Transformer) as a bridge between a frozen image encoder (ViT-g, about 1B params) and a frozen LLM (FlanT5 or OPT). Only the Q-Former weights train (~188M): freeze the big models, train the small bridge.
What is the Q-Former in BLIP-2 and why is it needed?
Flamingo: Few-Shot Visual Learning
A doctor wants to ask a question about an X-ray and supply 3 examples of correct answers as context: few-shot learning with images. DeepMind's Flamingo (2022) was the first model to support interleaved image-text sequences: text, image, text, image, then a question and answer.
**Flamingo** combines a frozen Chinchilla LLM, a frozen vision encoder pretrained with a CLIP-style contrastive objective, and two trainable parts: a Perceiver Resampler and gated cross-attention layers inserted into the LLM. The Perceiver Resampler compresses a feature map of any size into 64 fixed tokens. The largest variant has 80B parameters in total.
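The fixed-size output follows from the shape of cross-attention: the number of output tokens equals the number of queries, not the number of inputs. The sketch below is a heavily simplified, single-head version with random vectors in place of learned latents and image features; it omits the learned projections, feed-forward blocks, and stacked layers of the real module, and all names and sizes other than the 64 latents are illustrative assumptions.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def resample(latents, features):
    """Single-head cross-attention: each latent query attends over all image features.

    Output has one vector per latent, regardless of how many features come in.
    """
    dim = len(latents[0])
    out = []
    for query in latents:
        scores = [sum(a * b for a, b in zip(query, f)) / math.sqrt(dim)
                  for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, features))
                    for d in range(dim)])
    return out

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 64  # DIM is arbitrary; 64 matches the token count in the text

# In a trained model the latents are learned parameters; here they are random.
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

def fake_features(n):
    """Hypothetical stand-in for n patch features from a vision encoder."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]

small = resample(latents, fake_features(49))    # e.g. a 7x7 feature map
large = resample(latents, fake_features(256))   # e.g. a 16x16 feature map
print(len(small), len(large))
```

Because the LLM always sees the same number of visual tokens per image, the cost of its cross-attention layers does not grow with image resolution or the number of video frames fed to the encoder. BLIP-2's Q-Former relies on the same learned-query idea.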
Why does the Perceiver Resampler compress image features down to 64 fixed tokens?
GPT-4V and Modern Multimodal LLMs
OpenAI's GPT-4V (2023), Anthropic's Claude 3, and Google's Gemini Ultra are multimodal LLMs that look at an image and reason about it the way they reason about text. A developer uploads a screenshot of an error and GPT-4V reads the code, parses the stack trace, and suggests a fix. That is not just a caption, it is reasoning.
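Modern open-source architecture: **LLaVA** is CLIP ViT + a projection connector (linear in the original, an MLP in LLaVA-1.5) + Vicuna/LLaMA. **InstructBLIP** is BLIP-2 + instruction tuning. The key difference from Flamingo is instruction-following, learned through supervised visual instruction tuning rather than few-shot prompting. The original LLaVA first trains only the projection connector to align image features with the LLM, then fine-tunes the connector and the LLM on roughly 150K GPT-4-generated image-instruction pairs. LLaVA-1.5 extends the data mix and reports about 80% on VQAv2.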
| Model | Company | Open source | Distinguishing trait |
|---|---|---|---|
| CLIP | OpenAI | Weights | Embeddings only, zero-shot |
| BLIP-2 | Salesforce | Yes | Q-Former, captioning + VQA |
| Flamingo | DeepMind | No | Few-shot, interleaved sequences |
| LLaVA-1.6 | Community | Yes | CLIP + LLaMA, instruction-following |
| GPT-4V | OpenAI | No | Reasoning, best accuracy |
| Gemini Ultra | Google | No | Native multimodal pretraining |
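The LLaVA row reduces to a simple data flow: project each image patch feature into the LLM's embedding width, then place the resulting visual tokens in the same sequence as the text embeddings. The sketch below assumes a two-layer MLP with GELU in the style of LLaVA-1.5, random weights, and tiny made-up dimensions; every name and size here is hypothetical.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer; weight has shape [out_dim][in_dim]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_linear(rng, in_dim, out_dim):
    weight = [[rng.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project_patches(patch_features, w1, b1, w2, b2):
    """Map each vision feature to the LLM embedding width with a 2-layer MLP."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected

rng = random.Random(0)
# Toy sizes; real models use on the order of a thousand or more dimensions.
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 6, 10, 4, 3

w1, b1 = init_linear(rng, VISION_DIM, LLM_DIM)
w2, b2 = init_linear(rng, LLM_DIM, LLM_DIM)

# Stand-ins for frozen vision encoder outputs and LLM text token embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)]
                  for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)]
                   for _ in range(NUM_TEXT_TOKENS)]

visual_tokens = project_patches(patch_features, w1, b1, w2, b2)
llm_input = visual_tokens + text_embeddings  # image tokens first, then the prompt
print(len(llm_input), len(llm_input[0]))
```

The connector is the only new component; the LLM treats projected patches as ordinary tokens, which is why this design needs far less new machinery than Flamingo's inserted cross-attention layers.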
Misconception: vision-language models understand images the way a human does
VLMs do statistical pattern matching. GPT-4V can misread the numbers on a chart, hallucinate objects that are not there, and miss an anomaly in a medical image.
CLIP is trained on internet data where cats far outnumber X-rays. GPT-4V scores around 90% on OCR of natural images but around 60% on handwritten diagrams. Production use requires validation on domain-specific data.
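Domain validation can start as simply as breaking accuracy down by data source rather than reporting one aggregate number. The sketch below uses a handful of made-up records purely to show the bookkeeping; the domain names, labels, and resulting figures are illustrative and not measurements of any model.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """records: iterable of (domain, predicted_label, expected_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, expected in records:
        total[domain] += 1
        correct[domain] += int(predicted == expected)
    return {domain: correct[domain] / total[domain] for domain in total}

# Made-up toy records, not real model outputs.
toy_records = [
    ("natural_photo", "cat", "cat"),
    ("natural_photo", "dog", "dog"),
    ("natural_photo", "car", "car"),
    ("natural_photo", "cat", "dog"),
    ("chest_xray", "normal", "pneumonia"),
    ("chest_xray", "normal", "normal"),
]

per_domain = accuracy_by_domain(toy_records)
overall = sum(p == e for _, p, e in toy_records) / len(toy_records)
print(per_domain, round(overall, 3))
```

An aggregate score can hide a weak domain when that domain contributes few samples, so the per-domain view is the one to check before deployment.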
What is the core difference between GPT-4V and CLIP when working with an image?
Key ideas
- CLIP: contrastive loss on N image-text pairs per batch - N-1 negatives per anchor, 400M web pairs, no task labels
- Zero-shot classification: encode class names as text, pick class with highest cosine similarity to image embedding
- GroundingDINO: DINO detector + language-grounded pretraining - detects any object described by natural language without retraining
- LLaVA: frozen CLIP encoder + projection connector + LLM - connector trained first, then instruction-tuned on about 150K GPT-4-generated pairs
- VLMs fail at precise spatial counting and geometry - specialized CV models still required for those tasks
Related topics
Vision-language models combine self-supervised vision pretraining with transformer language models.
- Self-Supervised Vision Pretraining: DINO and MAE learn image representations without labels, the same role CLIP's contrastively trained encoder plays as the vision backbone in LLaVA
- CV System Design: VLMs are deployed as components in production CV systems for rare class detection and annotation
Questions for reflection
- Why does CLIP use InfoNCE loss with large batch sizes rather than triplet loss with hard negative mining?
- Design a retrieval system for a medical imaging archive using CLIP embeddings - what are the failure modes compared to supervised classifiers?
- When would a 7B LLaVA model be preferred over GPT-4V for a production CV task - cost, latency, or capability?
Related lessons
- cv-16: self-supervised vision representations from lesson 16 are the label-free counterpart to CLIP's contrastively trained image encoder
- cv-18: VLMs are deployed in the CV production systems covered in lesson 18
- dl-05: Transformer attention is the shared architecture for both the image and text encoders in CLIP
- ml-01-intro: contrastive learning is a form of metric learning, a shared conceptual foundation
- dl-01