Computer Vision

Self-Supervised Vision: MAE, DINO, CLIP

Models like GPT-4V build on roughly a decade of supervised vision research and only a few years of large-scale self-supervised learning. MAE (2021) pretrained a ViT-L on ImageNet without a single label in about 31 hours on 128 TPU-v3 cores. DINOv2 (2023), trained without any annotation, outperformed many supervised models on depth estimation, segmentation, and retrieval. Self-supervised learning is not merely an alternative to supervised learning. It is the next level.

  • Meta DINOv2: produces segmentation masks through attention with zero labeled examples - used in Meta AR glasses
  • OpenAI CLIP: foundation of DALL-E 2 and GPT-4V - joint image-text embedding trained on 400M pairs
  • Google ALIGN: 1.8 billion noisy image-text pairs - reported stronger image-text retrieval results than CLIP
  • Tesla FSD v12: self-supervised pretraining on 1 million hours of dashcam video without manual labels

Prerequisites

  • Vision Transformer: patch tokenization, the [CLS] token, self-attention
  • Autoencoders and reconstruction loss (pixel MSE)
  • Cross-entropy, softmax with temperature, cosine similarity for contrastive losses
  • Image augmentations and the idea of transfer learning / fine-tuning
  • Vision Transformers
  • Object Tracking

The year self-supervised vision caught up with supervised

The modern self-supervised breakthrough in vision started in 2020 with two contrastive methods: MoCo (Kaiming He and co-authors, Facebook AI) with a queue of negatives and a momentum encoder, and SimCLR (Ting Chen and co-authors, Google) with large batches and strong augmentations. Both showed that features competitive with supervised training can be learned without a single label. In 2021 the approaches branched into families. Alec Radford and the OpenAI team released CLIP, training image and text encoders contrastively on 400 million internet pairs and unlocking zero-shot classification. Mathilde Caron and co-authors at Facebook AI introduced DINO: self-distillation without labels, whose ViT attention maps unexpectedly produced object segmentation masks. In late 2021 Kaiming He and co-authors proposed MAE (Masked Autoencoders, published at CVPR 2022): mask 75% of patches and reconstruct them, with the encoder processing only the visible 25%, yielding a 3-4x training speedup. By 2023 DINOv2 had narrowed the gap with supervised models to a couple of percentage points on ImageNet and surpassed them on several transfer tasks.

MAE: Mask 75% of the Image and Reconstruct It

2021. OpenAI trained DALL-E on 250 million captioned image-text pairs. Meta's response: MAE (Masked Autoencoders) - transfer quality competitive with supervised pretraining, zero manual labeling. The idea: mask 75% of image patches and make a ViT reconstruct them.

**MAE architecture.** The ViT encoder sees only the visible patches (25% of the image). A lightweight decoder reconstructs the pixels of the masked patches from context. Key insight: the encoder never sees mask tokens. Masked positions are dropped from its input rather than padded or filled with placeholders, and learnable mask tokens are added only at the decoder input. Masking 75% makes the task hard enough that the model must learn semantics, not textures.
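The masking step and the reconstruction loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the MAE reference code: the function names `random_masking` and `masked_mse`, the 224x224 image size, and the 16x16 patch size are assumptions chosen for the example, and random arrays stand in for real patches and decoder output.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Split patches into a visible subset and a boolean mask of hidden ones.

    patches: array of shape (num_patches, patch_dim).
    Returns (visible, keep_idx, mask); mask[i] is True where patch i is hidden.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    # A random permutation yields a uniform subset without replacement.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    # Only `visible` would be fed to the encoder: no mask tokens, no padding.
    return patches[keep_idx], keep_idx, mask

def masked_mse(pred, target, mask):
    """Pixel MSE averaged over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

# Assumed setup: 224x224 RGB image, 16x16 patches -> 14 * 14 = 196 patches,
# each flattened to 16 * 16 * 3 = 768 values.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))
visible, keep_idx, mask = random_masking(patches, 0.75, rng)
print(visible.shape, int(mask.sum()))
```

Because the loss is computed only where `mask` is True, errors on visible patches do not contribute: the model is scored on what it had to infer, not on what it could copy.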

**Why 75% and not 50%?** MAE authors (He et al. 2021) tested multiple ratios. At 50% the task is too easy - the model learns texture extrapolation. At 85% too little context remains to reconstruct details. 75% forces semantic understanding rather than texture copying.
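To put numbers on the ratios, the short sketch below counts how many patches the encoder processes at each masking ratio. It assumes 196 patches per image (224x224 input, 16x16 patches) and uses the fact that self-attention compares every token with every other token. It only illustrates the compute side of the trade-off and says nothing about feature quality; `visible_patches` is a hypothetical helper.

```python
def visible_patches(num_patches, mask_ratio):
    """Number of patches the encoder actually processes."""
    return int(round(num_patches * (1.0 - mask_ratio)))

num_patches = 196  # assumed: 224x224 image, 16x16 patches

for ratio in (0.50, 0.75, 0.85):
    n_vis = visible_patches(num_patches, ratio)
    # Attention cost grows with the square of the token count.
    rel_attention = (n_vis / num_patches) ** 2
    print(f"mask {ratio:.0%}: {n_vis:3d} visible patches, "
          f"attention pairs = {rel_attention:.1%} of the unmasked cost")
```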

Why does the encoder in MAE never see mask tokens (masked patches)?

DINO: Self-Distillation Without Labels

DINO (2021) discovered: training a ViT through self-distillation makes its attention maps produce accurate object segmentation masks - without a single mask in the training data. The [CLS] token encodes object semantics rather than scene statistics.

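**Mode collapse and centering.** Without countermeasures DINO collapses: teacher and student both predict the same distribution for every image. The fix has two parts. Centering subtracts an EMA of the batch-mean teacher logits before the softmax, which stops any single dimension from dominating. Sharpening uses a low teacher temperature, which stops the output from drifting toward a uniform distribution. Centering is loosely analogous to BatchNorm, but applied to the class distribution. DINOv2 (2023) added SwiGLU FFN and LayerScale, among other changes such as a larger curated pretraining set, improving ImageNet top-1 from 79.9 to 84.5.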
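A compact NumPy sketch of the loss may help here. It shows centering of the teacher logits, a low teacher temperature (the sharpening used in the DINO paper alongside centering), the cross-entropy between teacher and student, and the two EMA updates. The function names are hypothetical, the temperatures and momenta are illustrative values to be treated as assumptions, and random arrays stand in for real network outputs.

```python
import numpy as np

def softmax(x, temperature):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy from the centered, sharpened teacher to the student."""
    # Centering: subtract the running mean of the teacher logits.
    # Sharpening: a low teacher temperature makes the target peaked.
    p_teacher = softmax(teacher_logits - center, tau_t)
    log_p_student = np.log(softmax(student_logits, tau_s) + 1e-12)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    """EMA of the batch mean of the teacher logits."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

def ema_update(teacher_params, student_params, momentum=0.996):
    """The teacher tracks the student as an EMA; it receives no gradients."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 16))
teacher_logits[:, 0] += 5.0  # one output dimension dominates every sample
student_logits = rng.normal(size=(8, 16))

center = update_center(np.zeros(16), teacher_logits)
loss = dino_loss(student_logits, teacher_logits, center)
print(float(loss))
```

In a real training loop the student would be updated by backpropagating `loss`, after which `ema_update` and `update_center` would run once per step.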

Why is the teacher network in DINO updated via EMA rather than backpropagation?

CLIP: Image and Text in the Same Space

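OpenAI CLIP (2021) was trained on 400 million image-text pairs from the internet. Result: zero-shot classification, where each class name is turned into a prompt such as 'a photo of a cat' and the image is assigned to the class whose text embedding has the highest cosine similarity with the image embedding. ImageNet top-1: 76.2% for the largest model (ViT-L/14 at 336 px), without a single ImageNet example during training.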
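The zero-shot decision rule itself is small enough to sketch directly. In the example below, hand-made 3-dimensional vectors stand in for the outputs of CLIP's image and text encoders, so the embeddings, the `zero_shot_classify` helper, and the `logit_scale` value are all assumptions for illustration rather than CLIP's actual API.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, text_embs, prompts, logit_scale=100.0):
    """Return the prompt whose embedding is closest to the image embedding."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    # After normalization the dot product is the cosine similarity.
    logits = logit_scale * (text_embs @ image_emb)
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return prompts[int(np.argmax(probs))], probs

# Toy embeddings: one text vector per class prompt, one image vector.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.3, 0.1])

label, probs = zero_shot_classify(image_emb, text_embs, prompts)
print(label, probs.round(3))
```

Adding a new class requires no retraining: append one more prompt and its text embedding, and the same similarity comparison applies.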

**CLIP limitations.** CLIP is weak on fine-grained tasks: telling a Toyota Camry from a Honda Accord is often unreliable. A likely reason: internet captions rarely describe car make and model in detail - they just say 'car' or 'vehicle'. ALIGN (Google, 2021) applied the same contrastive approach to 1.8 billion pairs - stronger reported image-text retrieval, at the cost of noisier training data.

How does CLIP classify an image in zero-shot mode without any examples of the new class?

Contrastive Learning: SimCLR, MoCo, and Negative Mining

SimCLR (Google, 2020) and MoCo (Meta, 2020) laid the foundation for self-supervised contrastive vision before DINO and MAE. The principle: two augmented views of the same image form a positive pair. All other images in the batch are negatives. The task: pull positives together, push negatives apart in embedding space.
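The pull-together, push-apart objective can be written as an InfoNCE-style loss. The sketch below is a simplified variant: positives sit on the diagonal of a view-1 by view-2 similarity matrix, and negatives are the other images in the batch. SimCLR's NT-Xent loss additionally treats other samples within the same view as negatives, so this is not a line-by-line reproduction of either paper. The function name, the temperature, and the random embeddings are assumptions.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE loss for two batches of paired embeddings (N, D)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N); diagonal entries are positives

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the view1->view2 and view2->view1 directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
z2_aligned = z1 + 0.01 * rng.normal(size=(8, 32))  # views of the same images
z2_random = rng.normal(size=(8, 32))               # unrelated embeddings

print(info_nce(z1, z2_aligned), info_nce(z1, z2_random))
```

If every embedding is identical, all logits are equal and the loss equals log(N), which is far from its minimum. That is why negatives discourage the trivial constant solution.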

**Collapse and solutions.** Without negatives, a SimCLR-style objective collapses: the model outputs the same vector for all images. BYOL (Bootstrap Your Own Latent) avoids this without negatives by using an asymmetric predictor on the online branch and an EMA target network. Barlow Twins uses feature decorrelation. MAE and DINO avoid contrastive pairs altogether, using reconstruction and self-distillation respectively. Collapse is the central challenge in SSL.
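As one concrete example of a non-contrastive fix, here is a rough sketch of a Barlow Twins style decorrelation loss: the cross-correlation matrix between the two views is pushed toward the identity, so each feature must agree across views while staying decorrelated from the others. The function name, the weight `lam`, and the random inputs are assumptions for illustration.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3, eps=1e-8):
    """Push the cross-correlation of two views toward the identity matrix."""
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    n = z1.shape[0]
    c = z1.T @ z2 / n  # (D, D) cross-correlation matrix
    diag = np.diag(c)
    on_diag = ((1.0 - diag) ** 2).sum()             # invariance term
    off_diag = (c ** 2).sum() - (diag ** 2).sum()   # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
collapsed = np.ones((256, 8))  # every image mapped to the same vector

print(barlow_twins_loss(z, z), barlow_twins_loss(collapsed, collapsed))
```

A collapsed output has zero variance in every feature, so its cross-correlation matrix is all zeros and the invariance term is maximal. The loss penalizes collapse without using any negative pairs.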

Misconception: self-supervised models are inferior to supervised ones

DINOv2 on ImageNet: 86.5% top-1 (linear probe on frozen, label-free features) vs 88.6% for supervised ViT-L - a gap of about 2 percentage points, not 10. On transfer tasks DINOv2 often surpasses supervised models: pretrained on a large unlabeled image collection, it learns richer features than a model optimized for 1000 ImageNet classes.

Labels are a bottleneck: they encode only what the annotator considered important. Self-supervised learning on more than a hundred million unlabeled images (142M for DINOv2) produces richer features for downstream tasks.

Why does MoCo use a queue of negative examples rather than a large batch?

Key ideas

  • MAE: encoder sees 25% of patches, decoder reconstructs the other 75% - 3-4x faster training
  • DINO: student-teacher self-distillation with EMA teacher - attention maps reveal segmentation without labels
  • CLIP: contrastive image-text training on 400M pairs - zero-shot classification through text prompts
  • SimCLR/MoCo: augmentations of the same image = positive, all others = negatives - collapse without countermeasures
  • SSL vs Supervised: gap narrowed to about 2 percentage points on ImageNet - self-supervised produces richer features for transfer

Related topics

Self-supervised learning builds on ViT architecture and leads to multimodal models.

  • Modern Architectures — ViT is the backbone for MAE and DINO: patch tokenization, position encoding, attention
  • Object Tracking — DeepSORT re-ID embeddings improve through SSL contrastive pretraining
  • Vision-Language Models — CLIP embeddings are the foundation for VLMs: DALL-E 2, GPT-4V, LLaVA

Questions for reflection

  • MAE reconstructs pixels; DINO predicts distributions - which pretext task better suits downstream segmentation and why?
  • CLIP is trained on internet text. How does this affect model bias and which downstream tasks suffer most?
  • DINOv2 outperforms supervised models on depth estimation. Why does a label-free model handle a task that requires 3D understanding better?

Related lessons

  • cv-15 — Re-ID embeddings from tracking improve through self-supervised contrastive learning
  • cv-05 — ViT architecture is the backbone for MAE and DINO
  • cv-17 — CLIP embeddings are the foundation for Vision-Language Models in the next lesson
  • cv-11 — Diffusion models use masked denoising - a reconstruction task analogous to MAE
  • dl-17 — Self-supervised pretraining is formalized in the DL track
  • ml-01