Deep Learning

Self-Supervised Learning

Humans learn to understand the visual world without labels - a child learns what a dog is by seeing thousands of dogs in various contexts, not from a dataset of (image, 'dog') pairs. Self-supervised learning is deep learning's attempt to replicate this. In 2020-2023, SSL went from a research curiosity to the dominant pretraining paradigm for vision: MAE, DINO, and SimCLR models now outperform supervised ResNets as feature extractors. DINOv2 features are used in production at Meta, in robotics at Boston Dynamics, and in medical imaging - without any labeled pretraining data. The shift from supervised to self-supervised pretraining is the biggest change in computer vision since convolutional networks.

**Meta's foundation models** use DINOv2-based pretraining as the visual backbone for content understanding across Instagram, Facebook, and WhatsApp - 4 billion users' images are understood by a model that learned from unlabeled web images.
**Tesla's Full Self-Driving** uses contrastive pretraining on dashcam footage (unlabeled) to learn robust road scene representations, reducing the labeled data needed for downstream tasks like lane detection and pedestrian recognition by 10x.
**Medical imaging AI** (PathAI, Paige AI) applies MAE-style pretraining on millions of unlabeled histopathology slides, then fine-tunes on hundreds of labeled examples per task - making AI-assisted cancer diagnosis practical at hospitals without large annotation budgets.

From word2vec to a contrastive explosion

Self-supervision started winning in language first. Tomas Mikolov's word2vec (2013) learned word meanings purely from context prediction, with no labels. Devlin and colleagues at Google generalized this to masked language modeling with BERT (2018), and the idea then crossed into vision. The year 2020 was the inflection point: SimCLR (Chen et al., Google) and MoCo (He et al., Facebook) made contrastive learning work on images, and BYOL (Grill et al., DeepMind) showed it even works without negative pairs. Within two years, label-free pretraining went from a curiosity to the default.

Prerequisites

Contrastive Learning

Contrastive learning trains encoders to produce similar representations for different views of the same instance (positive pairs) and dissimilar representations for different instances (negative pairs). SimCLR (Chen et al., Google 2020) applies two random augmentations to each image, turning a batch of N images into 2N augmented views, and trains with the NT-Xent loss (normalized temperature-scaled cross-entropy). Each of the N positive pairs is pulled together, while every view is pushed away from the other 2(N-1) views in the batch, which act as its negatives.
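The loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not SimCLR's implementation: the function names, the interleaved layout of the views, and the temperature of 0.5 are assumptions made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(views, temperature=0.5):
    """NT-Xent loss over 2N embeddings.

    Assumed layout (illustrative): views[2k] and views[2k + 1] are the two
    augmented views of image k, so each view has exactly one positive.
    """
    z = [l2_normalize(v) for v in views]
    n_views = len(z)
    total = 0.0
    for i in range(n_views):
        positive = i + 1 if i % 2 == 0 else i - 1
        # Similarity of view i to every other view (self-similarity excluded).
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n_views)
            if j != i
        }
        # Cross-entropy with the positive as the correct "class".
        log_denominator = math.log(sum(math.exp(s) for s in logits.values()))
        total += log_denominator - logits[positive]
    return total / n_views


# Two images, two views each. Well-aligned positives should give a lower loss.
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_aligned = nt_xent_loss(aligned)
loss_mismatched = nt_xent_loss(mismatched)
```

When the two views of each image point in nearly the same direction, the loss is lower than when the paired views are unrelated, which is the signal the encoder is trained on.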

SimCLR's critical finding: larger batches are better because they supply more negatives per step. With a batch of 4096 images, each view is contrasted against 2(4096 - 1) = 8190 negatives, which is far more than fits in the memory of a single GPU. MoCo (He et al., Facebook AI Research) addressed this with a momentum encoder and a queue of past embeddings that serve as negatives, decoupling the number of negatives from the batch size so that training fits on a single machine.
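A rough sketch of the queue idea, assuming a tiny queue of 4 entries and 2-dimensional embeddings purely for illustration (the function name and the stand-in embeddings are hypothetical, and the momentum update of the key encoder is left out to keep the example short):

```python
from collections import deque
import math


def info_nce_loss(query, positive_key, queue, temperature=0.07):
    """Contrast one query against its positive key and all queued negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, negative) / temperature for negative in queue]
    # Cross-entropy where index 0 (the positive) is the correct class.
    return math.log(sum(math.exp(s) for s in logits)) - logits[0]


# FIFO queue of past key embeddings; the oldest entries fall out automatically.
queue = deque(maxlen=4)
for step in range(6):
    key_embedding = [math.cos(step), math.sin(step)]  # stand-in for encoder output
    queue.append(key_embedding)

loss = info_nce_loss([1.0, 0.0], [1.0, 0.0], queue)
```

The number of negatives is set by the queue length rather than by the batch size, so it can grow without requiring more accelerator memory per step.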

Why do contrastive learning methods like SimCLR benefit from larger batch sizes?

Masked Image Modeling

Masked Image Modeling (MIM) extends BERT's masked token prediction to vision: randomly mask a large fraction of image patches and train the model to reconstruct them. MAE (Masked Autoencoders, He et al., Meta 2021) masks 75% of patches, encodes only the visible 25% with a ViT encoder, and uses a lightweight decoder to reconstruct the masked patches. The high masking ratio forces the encoder to learn rich semantic representations.
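The masking step can be sketched as follows. The image size (224x224), patch size (16x16), and the function name are assumptions for the example; the idea is only that patch indices are shuffled and a fixed fraction is kept visible.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set (fed to the encoder) and a masked set."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)
```

Under these assumptions the encoder processes 49 of the 196 patches, and the decoder is asked to reconstruct the remaining 147.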

MAE is also compute-efficient: because the encoder sees only the visible 25% of patches, the MAE paper reports pretraining speedups of 3x or more compared with encoding the full patch sequence. ViT-H pretrained with MAE and then fine-tuned on ImageNet achieves 87.8% top-1, the best accuracy reported at the time among methods using only ImageNet-1K data. The reconstruction target also matters: BEiT (Microsoft) predicts discrete visual tokens produced by a pretrained image tokenizer rather than raw pixels, which pushes the target toward semantic content.

MAE's 75% masking ratio is much higher than BERT's 15% token masking. This high ratio makes image reconstruction a hard task that cannot be solved by interpolating adjacent visible patches - forcing the encoder to learn semantic content. Lower masking (25-50%) produces noticeably worse representations.

Why does MAE use a 75% masking ratio rather than BERT's 15%?

BYOL: No Negative Pairs

BYOL (Bootstrap Your Own Latent, Grill et al., DeepMind 2020) showed that self-supervised learning does not need negative pairs - eliminating the large-batch requirement of SimCLR. BYOL uses two networks: an online network (trained by gradient descent) and a target network (exponential moving average of the online network). The online network predicts the target network's representation of a different view - without any negatives.
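The two ingredients described above, a regression loss between the online prediction and the target projection and an EMA update of the target weights, might look like this in plain Python. The function names and the flat parameter lists are simplifications assumed for the example; a real implementation would use an autograd framework and block gradients through the target branch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def byol_loss(online_prediction, target_projection):
    """Squared distance between unit-normalized vectors: 2 - 2 * cosine similarity.

    In a real training loop the target projection is treated as a constant
    (no gradient flows into the target network).
    """
    return 2.0 - 2.0 * cosine_similarity(online_prediction, target_projection)


def ema_update(target_params, online_params, tau=0.996):
    """Target weights track the online weights as an exponential moving average."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


loss_same = byol_loss([1.0, 2.0], [2.0, 4.0])        # same direction
loss_orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])  # unrelated directions
target_params = ema_update([0.0, 0.0], [1.0, -1.0], tau=0.99)
```

The loss depends only on the direction of the two vectors, and no term in it refers to any other image in the batch, which is why no negatives are needed.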

The mystery of BYOL: without negatives, why doesn't the model collapse to a constant representation where everything maps to the same vector? The answer: the momentum target network (which lags behind) and the asymmetric predictor head (only on the online network) create enough asymmetry to prevent collapse. Batch normalization in the projector head also plays a role.

SimSiam (Chen & He, Meta 2021) simplified BYOL further: remove the EMA target, share one encoder between both branches, apply a stop-gradient on one branch, and keep a predictor on the other. This works without momentum and without negative pairs, achieving 71.3% linear eval - demonstrating that the stop-gradient, together with the predictor, is enough to prevent collapse.

How does BYOL avoid representational collapse (all embeddings mapping to the same vector) without using negative pairs?

DINO: Self-Distillation with No Labels

DINO (Self-Distillation with No Labels, Caron et al., Meta 2021) applies knowledge distillation without a supervised teacher: a student network (online) is trained to predict the output of a teacher network (EMA), using multiple local and global crops of the same image. The key novelty: centering (subtracting a running mean) and sharpening (low-temperature softmax) prevent collapse without negative pairs.
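Centering and sharpening can be sketched as below. The temperatures (0.1 for the student, 0.04 for the teacher), the center momentum of 0.9, the 3-dimensional outputs, and all function names are illustrative assumptions rather than a faithful reproduction of the DINO code.

```python
import math


def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)  # sharpening: low temperature
    student_probs = softmax(student_logits, student_temp)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


def update_center(center, teacher_batch, momentum=0.9):
    """Running mean of teacher outputs, used to center future teacher logits."""
    dim = len(center)
    batch_mean = [sum(row[d] for row in teacher_batch) / len(teacher_batch)
                  for d in range(dim)]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]


center = [0.0, 0.0, 0.0]
teacher_batch = [[2.0, 0.5, 0.1], [1.8, 0.4, 0.3]]
loss_agree = dino_loss([2.0, 0.5, 0.1], teacher_batch[0], center)
loss_disagree = dino_loss([0.1, 0.5, 2.0], teacher_batch[0], center)
center = update_center(center, teacher_batch)
```

Centering discourages any single output dimension from dominating, while the low teacher temperature discourages a uniform output; the two pull in opposite directions, which is what keeps the representation from collapsing.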

DINO with ViT-S produces qualitatively remarkable attention maps: the attention heads learn to segment foreground objects without any segmentation supervision. When visualized, the attention maps look like semantic segmentation masks - a property that does not emerge as clearly in ViTs trained with supervision. DINOv2 (Oquab et al., 2023) scales this recipe to a ViT-g model trained on 142M curated images, producing general-purpose vision features that work without fine-tuning.

DINOv2 features are used as frozen backbones in robotics (Meta's project), SLAM (simultaneous localization and mapping), and medical imaging - domains where labeled data is scarce. The features generalize across distribution shifts that supervised models fail on, suggesting genuine scene understanding rather than dataset-specific pattern matching.

**Misconception:** Self-supervised learning requires massive datasets (>100M images) to produce useful representations.
**Reality:** SimCLR with a ResNet-50 (4x) encoder achieves 76.5% linear probing accuracy on ImageNet-1k (about 1.3M images) after 1000 epochs of pretraining, matching a supervised ResNet-50 - SSL is competitive at moderate data scales.
**Why:** SSL does require more training epochs than supervised learning (to compensate for the weaker per-sample signal), but it does not require structurally more data - the advantage appears when labeled data is scarce in the target domain.

What emergent capability of DINO-ViT models was surprising and not present in supervised ViT models?

Key Ideas

**Contrastive learning** (SimCLR, MoCo) learns representations by attracting positive (same image, different augmentation) and repelling negative (different image) pairs - effective but requires large batches or memory banks.
**Masked Image Modeling** (MAE, BEiT) reconstructs masked patches from visible ones - compute-efficient (the encoder processes only the visible 25% of patches) and achieves strong ImageNet fine-tuning accuracy at scale.
**BYOL/DINO** eliminate negative pairs entirely via momentum targets and knowledge distillation - DINO uniquely produces attention maps that segment objects without any supervision.

Questions for Reflection

A medical imaging startup has 10 million unlabeled chest X-rays and 500 labeled X-rays for pneumonia detection. Which SSL approach would be most suitable as pretraining, and what fine-tuning strategy would follow?
DINO attention maps segment objects without labels, but the model has no mechanism that explicitly targets segmentation. Why might the self-distillation objective lead to this emergent capability?
SimCLR's best results rely on batches of 4096 images or more, spread across many TPU cores. How does MoCo obtain a comparable number of negatives with ordinary batch sizes on a single multi-GPU machine, and what is the tradeoff?

Related Lessons

dl-07 — Vision Transformers are the backbone for masked image modeling
dl-11 — Pretrained representations transfer to downstream fine-tuning
dl-13 — Autoencoders share the reconstruct-to-learn objective with MIM
ml-35-word-embeddings — Word2vec learns representations without labels, like contrastive learning
aie-09-embeddings — Self-supervised encoders produce embeddings used in retrieval
cv-08 — Contrastive pretraining boosts vision tasks with few labels
la-01-vectors-intro