Deep Learning

Vision Transformers (ViT)

In 2020, a 'pure' Transformer architecture trained on images - with no convolutions, no built-in spatial locality, no hand-crafted features - matched the best CNNs on ImageNet when pretrained on enough data. This was a shock: the dominant assumption for a decade was that vision required architectural inductive biases (locality, translation equivariance) that only CNNs could provide. ViT showed that those biases can largely be learned from data at sufficient scale. Today, Google's image search, Tesla's driver assistance cameras, Meta's content understanding, and medical imaging AI are all reported to build on ViT variants. The architecture that changed computer vision shipped in a single 22-page paper.

  • **Google image search** reportedly uses ViT-based models (descendants of the original ViT, pretrained on datasets such as JFT-3B) for image understanding and multimodal retrieval - serving on the order of 1 billion visual searches per day with representations that generalize across 1000+ image categories.
  • **Meta's content understanding** infrastructure reportedly uses ViT-family models such as Swin Transformer and DINOv2 for image classification, object detection, and content moderation across billions of users, processing 100+ million images and videos per day.
  • **Medical imaging AI** (Paige AI, PathAI) applies ViTs to gigapixel pathology slides, using tiled and hierarchical approaches (such as Swin) to handle resolutions around 100,000x100,000 pixels, far too large for any standard model to process in a single pass. Some cancer detection products in this space have received FDA authorization.

Prerequisites

  • Self-attention, multi-head attention, and positional encoding from the Transformer
  • How CNNs build spatial inductive bias through convolution and pooling
  • Tokens, embeddings, and the [CLS] token in sequence models
  • Transformers
  • CNN: Convolutional Networks

An Image Is Worth 16x16 Words

In 2020 a Google Brain team led by Alexey Dosovitskiy published "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale", introducing the Vision Transformer. Their move was to split an image into fixed-size patches of 16x16 pixels, treat each patch as a token, and feed the sequence into a plain Transformer encoder with almost no vision-specific machinery. For a decade the prevailing assumption was that image models needed the built-in locality of convolutions; ViT showed that with enough pretraining data the network learns those biases on its own and matches the best CNNs on ImageNet.

Patch Embedding

ViT (Dosovitskiy et al., Google Brain 2020) applies the standard Transformer architecture directly to images by splitting the image into fixed-size patches (16x16 pixels) and treating each patch as a 'token'. A 224x224 image produces (224/16)^2 = 196 patches. Each patch is flattened and linearly projected to the model dimension D (e.g., 768 for ViT-Base) - this linear projection is the only vision-specific component.
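The patch-embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the random stand-in image, and the untrained projection weights `W_proj` are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # random stand-in for a real RGB image
patches = patchify(image, patch_size=16)     # shape (196, 768): 16 * 16 * 3 values per patch

D = 768                                      # model dimension of ViT-Base
W_proj = rng.normal(0.0, 0.02, size=(patches.shape[1], D))  # untrained, illustration only
b_proj = np.zeros(D)
tokens = patches @ W_proj + b_proj           # shape (196, 768): one embedding per patch
```

For 16x16 RGB patches the flattened patch length (16 * 16 * 3 = 768) happens to equal the ViT-Base model dimension; the projection is still a learned matrix, not an identity.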

The patch size is the key resolution-capacity tradeoff: ViT-B/16 (patch=16) produces 196 tokens per image; ViT-B/32 (patch=32) produces 49 tokens. Smaller patches improve fine-grained recognition, but halving the patch size quadruples the number of tokens, and self-attention cost grows quadratically with the number of tokens.
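The arithmetic behind this tradeoff is easy to check. The sketch below counts patch tokens and query-key pairs per attention layer for a 224x224 image; the helper name `num_patch_tokens` is hypothetical, and patch size 8 is added purely for illustration.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    return (image_size // patch_size) ** 2

for patch_size in (32, 16, 8):
    n = num_patch_tokens(224, patch_size)
    # Full self-attention scores every token against every token: n * n pairs.
    print(f"patch={patch_size:2d}  tokens={n:4d}  attention pairs={n * n}")
```

Each halving of the patch size multiplies the token count by 4 and the number of attention pairs by 16.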

How many patch tokens does ViT produce from a 224x224 image with patch size 16x16?

CLS Token and Classification

ViT prepends a learnable [CLS] token to the patch sequence, bringing the total sequence length to 197. After passing through L Transformer encoder layers, the [CLS] token's representation aggregates information from all patches via self-attention and is used as the image representation for classification. A linear head on the [CLS] token predicts class logits.

Positional embeddings are added to each token embedding (including [CLS]) to inject spatial information - self-attention on its own has no notion of order. ViT uses 1D learned positional embeddings over the flattened patch grid. The original paper also tried 2D-aware positional embeddings (encoding row/column position) and reported no significant gain over the 1D version, so the simpler scheme was kept.
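A small NumPy sketch can make both steps concrete: building the 197-token encoder input, and checking that attention without positional information cannot tell patch orders apart. The variable names and random values are assumptions for illustration, and `self_attention` is a deliberately simplified single-head version with no learned query/key/value projections.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, D = 196, 768

patch_tokens = rng.normal(size=(num_patches, D))             # stand-in for projected patches
cls_token = rng.normal(0.0, 0.02, size=(1, D))               # a learned parameter in a real model
pos_embed = rng.normal(0.0, 0.02, size=(num_patches + 1, D)) # one learned row per position

sequence = np.concatenate([cls_token, patch_tokens], axis=0) # shape (197, 768)
encoder_input = sequence + pos_embed                         # position added element-wise

def self_attention(x):
    """Simplified single-head self-attention (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

# Without positional embeddings, shuffling the patches only shuffles the outputs:
perm = rng.permutation(num_patches)
out = self_attention(patch_tokens)
out_shuffled = self_attention(patch_tokens[perm])
order_is_invisible = np.allclose(out[perm], out_shuffled)
```

Here `order_is_invisible` comes out true: the outputs for a shuffled image are just the shuffled outputs of the original, which is why ViT needs positional embeddings to recover the spatial layout.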

An alternative to the [CLS] token is global average pooling (GAP) over all patch representations. GAP performs comparably to [CLS] on ImageNet, although the original paper notes that it needs a different learning rate, and it is reported to do slightly worse on some downstream tasks. DeiT (Facebook AI, 2021) kept the [CLS] token and added a separate distillation token that learns from a teacher network.
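The two readout options differ only in which features reach the classification head. The following sketch assumes random stand-ins for the encoder output and head weights, and omits the final layer normalization that a real ViT applies before the head.

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_classes = 768, 1000

# Stand-in for the final encoder layer's output; row 0 is the [CLS] token.
encoder_output = rng.normal(size=(197, D))
W_head = rng.normal(0.0, 0.02, size=(D, num_classes))   # untrained linear head
b_head = np.zeros(num_classes)

cls_features = encoder_output[0]                  # [CLS] readout
gap_features = encoder_output[1:].mean(axis=0)    # GAP over the 196 patch tokens

cls_logits = cls_features @ W_head + b_head       # shape (1000,)
gap_logits = gap_features @ W_head + b_head       # shape (1000,)
```

Both variants feed a single D-dimensional vector to the same kind of linear head, so switching between them is a one-line change in most implementations.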

Why does ViT add positional embeddings to patch tokens?

Hybrid CNN-Transformer Architectures

Hybrid ViTs replace the linear patch embedding with a CNN feature extractor (ResNet-50 or similar), which processes the image into a feature map whose spatial positions are then passed to the Transformer as tokens. This combines the CNN's inductive biases (translation equivariance, local feature extraction) with the Transformer's global attention. In the original paper, hybrid models slightly outperform pure ViT at smaller model sizes and compute budgets, and the difference vanishes at scale.
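In a hybrid model, each spatial position of the CNN feature map plays the role of a patch. The sketch below assumes a hypothetical backbone output of 1024 channels on a 14x14 grid (what a ResNet-50 stage with total stride 16 would give for a 224x224 input); the random values and projection weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed backbone output: (channels, height, width) for one image.
C, H, W = 1024, 14, 14
feature_map = rng.normal(size=(C, H, W))

# Each spatial position becomes one token carrying its C channel values.
tokens = feature_map.reshape(C, H * W).T             # shape (196, 1024), row-major order

D = 768
W_proj = rng.normal(0.0, 0.02, size=(C, D))          # untrained projection to model dim
embedded = tokens @ W_proj                           # shape (196, 768)
```

From here on the pipeline is the same as in a pure ViT: prepend [CLS], add positional embeddings, and run the Transformer encoder.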

Swin Transformer (Microsoft, 2021) introduced hierarchical windowed attention: patches are grouped into fixed-size local windows, attention runs within each window, and the window grid shifts between consecutive layers to enable cross-window communication. Patches are also merged between stages, giving a CNN-style feature pyramid. With a fixed window size, attention cost grows as O(n) in the number of tokens n, vs. O(n^2) for full self-attention, and Swin became a widely used backbone for detection and segmentation (COCO, ADE20K).
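The linear-versus-quadratic difference can be checked by counting query-key pairs. The sketch uses a 56x56 token grid with 7x7 windows, which corresponds to Swin's first stage on a 224x224 image with 4x4 patches; the function names are hypothetical, and the count ignores the projection and MLP costs that both designs share.

```python
def full_attention_pairs(grid_h, grid_w):
    """Query-key pairs when every token attends to every token."""
    n = grid_h * grid_w
    return n * n

def windowed_attention_pairs(grid_h, grid_w, window):
    """Query-key pairs when each token attends only within its window."""
    if grid_h % window or grid_w % window:
        raise ValueError("window must divide the token grid")
    n = grid_h * grid_w
    return n * window * window

for side in (56, 112):   # doubling the grid side quadruples the token count
    full = full_attention_pairs(side, side)
    local = windowed_attention_pairs(side, side, window=7)
    print(f"grid={side}x{side}  full={full}  windowed={local}  ratio={full // local}")
```

Quadrupling the token count multiplies the full-attention pairs by 16 but the windowed pairs only by 4, which is the O(n^2) versus O(n) behaviour described above.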

ConvNeXt (Facebook, 2022) showed that modernizing a ResNet with ViT design choices (larger kernels, fewer normalization layers, GELU activation, patchify stem) matches Swin Transformer performance. This took much of the heat out of the 'pure CNN vs. pure Transformer' debate: well-tuned models of both kinds reach similar accuracy at comparable compute budgets.

What is the key computational advantage of Swin Transformer's windowed attention over standard ViT self-attention?

Scaling ViTs

ViTs follow power-law scaling: performance improves predictably with model size (parameters) and pretraining data. Google's ViT-G (1.8B parameters, JFT-3B dataset) achieves 90.45% top-1 on ImageNet. The key finding from the original ViT paper: Transformers need more pretraining data than CNNs to reach competitive performance. Trained on ImageNet-1k alone (1.3M images), ViT models trailed comparable ResNets; they roughly matched them when pretrained on ImageNet-21k (14M images) and pulled ahead only with still larger datasets such as JFT-300M.
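A power law is a straight line in log-log space, which is what makes the trend easy to fit and extrapolate. The sketch below fits one to synthetic points generated from a known law; the numbers are not measurements from any paper, and the function name `fit_power_law` is an assumption. Real scaling curves also tend to saturate, which this simple form ignores.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic "error vs. model size" points from a known law (illustration only).
true_a, true_b = 5.0, 0.3
sizes = [1e7, 1e8, 1e9, 1e10]
errors = [true_a * s ** (-true_b) for s in sizes]

a, b = fit_power_law(sizes, errors)
predicted_error = a * 1e11 ** (-b)   # extrapolate one order of magnitude further
```

The fit recovers the generating parameters here because the data are noise-free; with real measurements the same procedure gives an estimate whose quality depends on how well the power-law assumption holds.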

DINOv2 (Meta, 2023) demonstrates that self-supervised pretraining at scale removes the need for labeled pretraining data: ViT-L trained with DINOv2's self-supervised objectives on 142M curated internet images achieves 86.3% linear probing accuracy on ImageNet, competitive with larger supervised ViT models. Because linear probing keeps the backbone frozen, the result indicates that the representation transfers across tasks without fine-tuning the backbone.

FlashAttention (Dao et al. 2022) made ViT training practical at scale: by fusing the attention computation into a single GPU kernel with SRAM-based tiling, it reduces memory from O(n^2) to O(n) and achieves 2-4x speedup over standard attention. ViT-L training on a cluster of 512 A100s dropped from 3 days to 18 hours.
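The core idea behind the memory saving, processing keys and values in blocks while keeping a running softmax, can be illustrated on the CPU. The NumPy sketch below is a simplified illustration and not the FlashAttention kernel: it tiles only over keys, runs no fused GPU code, and the function names and block size are assumptions. It does show that the result matches standard attention without ever forming the full n x n score matrix.

```python
import numpy as np

def standard_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[1])        # materializes the full (n, n) matrix
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))               # unnormalized running output
    row_max = np.full(n, -np.inf)                 # running max of scores per query
    row_sum = np.zeros(n)                         # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)         # only an (n, block) tile at a time
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)         # rescale what was accumulated so far
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ v_blk
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(197, 64)) for _ in range(3))
same_result = np.allclose(standard_attention(q, k, v), tiled_attention(q, k, v))
```

The rescaling by `scale` is what keeps the running sums consistent when a later block contains a larger score than any seen before; it is the step that lets the softmax be computed exactly in a single pass over the blocks.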

Misconception: ViT replaced CNNs entirely and CNNs are no longer used for vision tasks.

Reality: CNNs (EfficientNet, ConvNeXt) remain competitive with ViTs at similar compute budgets, especially for detection and segmentation; the field uses both architectures depending on task requirements and data scale.

Why: ViT's advantage in the original paper depended on far more pretraining data than its ResNet baselines needed - subsequent CNN redesigns (ConvNeXt) closed the gap, and hierarchical designs that borrow CNN-style locality (Swin) showed the two approaches complement each other.

Why do ViTs require larger pretraining datasets than CNNs to reach competitive performance?

Key Ideas

  • **ViT** splits images into patches of 16x16 pixels (196 tokens for 224x224), applies a standard Transformer encoder, and classifies using the [CLS] token - the main vision-specific component is the patch embedding projection.
  • **Scaling law**: ViT performance follows a predictable power law with model size and data scale - ViT-G (1.8B params, JFT-3B) reaches 90.45% ImageNet top-1, but requires labeled data at scale unlike self-supervised alternatives.
  • **Swin Transformer** is a widely used backbone for detection/segmentation thanks to hierarchical windowed attention (O(n) vs O(n^2) in the number of tokens), making Transformers practical for high-resolution tasks where vanilla ViT's full attention becomes too expensive.

Related Topics

ViT builds on Transformers and connects to self-supervised learning:

  • Transformers — ViT applies the identical encoder architecture from NLP Transformers to visual patch sequences - the same multi-head attention and feed-forward blocks
  • Self-Supervised Learning — DINO and MAE (Masked Autoencoders) are self-supervised objectives designed for ViT that eliminate the need for labeled pretraining data

Questions for Reflection

  • For a real-time object detection system that must run at 30fps on an NVIDIA Jetson (edge device), would ViT-B/16, Swin-T, or EfficientDet be the better architecture choice and why?
  • How would a ViT architecture handle a 4000x4000 satellite image? What modifications would be necessary and what would their tradeoffs be?
  • DINOv2 achieves competitive performance without labeled pretraining. What does this imply about the role of labels vs. data scale in visual representation learning?

Related Lessons

  • dl-06 — ViT applies the Transformer encoder directly to image patches
  • dl-04 — CNN spatial bias that ViT learns from data instead
  • dl-17 — Self-supervised DINO and MAE pretrain ViT without labels
  • ml-31-transformers — Same Transformer architecture studied in the ML course
  • cv-04 — Image classification is the core ViT vision task
  • la-07-matrix-multiply — Self-attention is built from matrix multiplications
  • la-01-vectors-intro