Computer Vision
Modern Architectures: EfficientNet, MobileNet, ConvNeXt, ViT
Prerequisites
- CNN evolution from LeNet to ResNet: residual connections and bottleneck blocks (cv-04)
- BatchNorm, 1x1 convolutions, and the idea of reducing parameter count
- A basic grasp of self-attention to follow the Vision Transformer
Compound Scaling and the ConvNeXt Comeback
By 2019 the field had a recipe for accuracy: stack more layers, as VGG (Simonyan and Zisserman, 2014) and ResNet (He, Zhang, Ren, and Sun at Microsoft, 2015) had shown, while GoogLeNet/Inception (Szegedy et al., 2014) showed that smarter blocks could cut compute. The open question was how to scale a network efficiently. Mingxing Tan and Quoc V. Le at Google answered it with EfficientNet in 2019, introducing compound scaling that balances depth, width, and input resolution with a single coefficient. EfficientNet-B7 reached state-of-the-art ImageNet accuracy with roughly 8x fewer parameters than comparable models. When Vision Transformers arrived in 2020 and looked set to replace convolutional networks, a 2022 paper modernized a plain ResNet with transformer-era training tricks and design choices and named it ConvNeXt. ConvNeXt matched or beat ViT on most benchmarks, showing the convolution still had room to grow.
2019: EfficientNet-B7 hits ImageNet SOTA with 8x fewer parameters than comparable models. 2020: ViT beats all CNNs on ImageNet - but only when trained on 300M images. 2022: ConvNeXt applies the transformer recipe to a CNN and outperforms ViT on most benchmarks. Three years that rewrote the definition of a good visual architecture.
- **Google Lens**: MobileNet backbone for real-time on-device recognition - no round trip to a server, inference runs on the phone's own accelerator
- **Tesla Autopilot**: EfficientDet (EfficientNet backbone) for object detection across 8 cameras with end-to-end latency under 100 ms
- **Meta content moderation**: ConvNeXt-XXL processes 100M images per day - a ViT competitor without quadratic attention cost
- **Apple Face ID**: MobileNet-based architecture on the Neural Engine in under 1 ms - battery impact is negligible
EfficientNet: compound scaling and MBConv
2019. Google Brain. EfficientNet-B7 takes the ImageNet SOTA with 8x fewer parameters than comparable models. The core insight: **compound scaling** - depth, width, and input resolution scaled jointly through a single coefficient phi, rather than one axis at a time. Prior work scaled one dimension at a time: ResNet added layers, WideResNet expanded channels. Scaling only depth while resolution stays fixed gives diminishing returns, because the extra layers keep refining the same low-resolution features. EfficientNet found that joint scaling delivers a better accuracy/FLOPs curve across all model sizes.
**Compound scaling law**: depth = alpha^phi, width = beta^phi, resolution = gamma^phi. Alpha, beta, and gamma are found by a small grid search at phi=1 on the B0 baseline; they are then held fixed while phi is increased to produce B1-B7. The constraint alpha * beta^2 * gamma^2 ≈ 2 means total FLOPs grow by roughly 2^phi. Width and resolution are squared because convolution cost scales with C_in * C_out for width and with height * width for resolution, while depth contributes only linearly.
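The scaling rule can be checked numerically. The sketch below assumes the coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15), which are not stated in this lesson; `compound_scale` is a hypothetical helper name.

```python
# Coefficients reported in the EfficientNet paper for the B0 baseline.
# They are an assumption here: the lesson itself does not list them.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi: float) -> dict:
    """Return depth/width/resolution multipliers and the FLOPs factor for phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Conv FLOPs grow linearly with depth, quadratically with width
    # (C_in * C_out) and quadratically with resolution (H * W).
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width,
            "resolution": resolution, "flops": flops}


if __name__ == "__main__":
    for phi in range(0, 4):
        s = compound_scale(phi)
        print(f"phi={phi}: depth x{s['depth']:.2f}, width x{s['width']:.2f}, "
              f"resolution x{s['resolution']:.2f}, FLOPs x{s['flops']:.2f} "
              f"(2^phi = {2 ** phi})")
```

With these coefficients the per-step FLOPs factor is about 1.92 rather than exactly 2, which is why the constraint is usually written as an approximation.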
**MBConv (Mobile Inverted Bottleneck + Squeeze-and-Excitation)** is EfficientNet's building block. It expands channels with a 1x1 conv (typically by 6x), applies a depthwise conv (3x3 or 5x5), then projects back down with another 1x1 conv. The SE module adds channel-wise attention: global average pooling -> FC(C/r) -> nonlinearity -> FC(C) -> sigmoid -> channel rescaling. SE blocks add very little compute, since they operate on a single pooled value per channel, but boost accuracy by ~1-2%.
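The squeeze-and-excitation pipeline can be sketched in plain Python. This is a minimal illustration, not EfficientNet's actual implementation: biases are omitted, and the ReLU between the two FC layers follows the original SE design (EfficientNet implementations may use a different activation there). The function name and argument layout are assumptions.

```python
import math


def squeeze_excite(x, w1, w2):
    """Rescale each channel of x by a learned gate in (0, 1).

    x  : list of C channels, each an H x W list of lists
    w1 : (C // r) x C weight matrix for the reducing FC layer
    w2 : C x (C // r) weight matrix for the restoring FC layer
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excite, step 1: FC down to C/r units, followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Excite, step 2: FC back to C units, followed by a sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    return scaled, gates


if __name__ == "__main__":
    # Two 2x2 channels, reduction ratio r=2 (one hidden unit); toy weights.
    x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
    w1 = [[1.0, 1.0]]
    w2 = [[0.0], [1.0]]
    scaled, gates = squeeze_excite(x, w1, w2)
    print("gates:", gates)
```

The gate depends only on the pooled statistics, so every spatial position in a channel is scaled by the same factor.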
What is compound scaling in EfficientNet, and why does scaling only depth (as in ResNet) give a worse accuracy/FLOPS trade-off at a fixed compute budget?
MobileNet: depthwise separable convolutions for on-device inference
MobileNetV3-Small: about 2.5M parameters, roughly 10 MB in float32. ResNet-50: about 25M parameters, roughly 100 MB. On an iPhone, MobileNet runs in around 20 ms where ResNet-50 takes around 200 ms. On-device CV requires different math. The key idea is **depthwise separable convolution**: split one standard convolution into two steps. Standard conv: H*W*C_in*C_out*K*K multiply-accumulates. Depthwise: H*W*C_in*K*K (each channel filtered independently). Pointwise 1x1: H*W*C_in*C_out (channels mixed at each position). The cost ratio is 1/C_out + 1/K^2, so for K=3 and typical channel counts the separable version needs roughly 8-9x fewer operations.
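The operation counts above can be turned into a quick calculation. The layer shape used below (56x56 feature map, 256 channels in and out) is a hypothetical example, and the function names are illustrative.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k conv (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus pointwise 1x1."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape chosen for illustration.
    h, w, c_in, c_out, k = 56, 56, 256, 256, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    sep = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard:  {std:,} MACs")
    print(f"separable: {sep:,} MACs")
    print(f"reduction: {std / sep:.2f}x")
```

The reduction factor approaches K^2 = 9 as the number of output channels grows, since the 1/C_out term becomes negligible.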
**MobileNetV3** combines **Neural Architecture Search (NAS)**, automated architecture optimization driven by reinforcement learning, with manual refinements. Key v3 details: **Hard Swish** activation (x * relu6(x+3) / 6, which replaces the sigmoid inside Swish with a piecewise-linear approximation and so avoids the expensive exp), **h-sigmoid** for the gates in SE blocks, and se_ratio=0.25. The search also decided which blocks use SE and which skip it.
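A small scalar sketch makes the Hard Swish formula concrete and compares it against the exact Swish. The function names are illustrative, and the sample points are arbitrary.

```python
import math


def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear stand-in for the sigmoid: relu6(x + 3) / 6."""
    return relu6(x + 3.0) / 6.0


def hard_swish(x: float) -> float:
    """x * hard_sigmoid(x), avoiding the exponential in Swish."""
    return x * hard_sigmoid(x)


def swish(x: float) -> float:
    """Reference Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  "
              f"hard_swish={hard_swish(x):+.4f}")
```

Hard Swish is exactly zero for x <= -3 and exactly x for x >= 3, and needs only comparisons, an add, and a multiply in between, which suits mobile hardware and quantized inference.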
Why is depthwise separable convolution faster than a standard convolution for the same input/output dimensions?
ConvNeXt: a CNN rebuilt with the Transformer recipe
2022. Facebook AI Research (FAIR). ConvNeXt is a CNN that copied everything useful from the Transformer playbook: a 4x4 non-overlapping patch embedding as the stem, depthwise 7x7 conv, inverted bottleneck, LayerNorm instead of BatchNorm, GELU instead of ReLU, and fewer activations and normalizations per block. The result: the CNN matched or beat ViT-style models on most benchmarks. The implication is striking: much of the Transformer's edge in vision came not from attention itself, but from a set of engineering choices that transfer directly to convolutions.
**Why 7x7 depthwise conv?** It is the convolutional analog of the 7x7 local window in Swin Transformer. A larger receptive field per layer, but with linear - not quadratic - complexity in the number of spatial positions. This is one of the two biggest differences from ResNet (3x3 conv); the other is LayerNorm.
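A rough cost count shows why the larger kernel is affordable. The sketch assumes stride 1 with same padding and a hypothetical width of 96 channels; the function names are illustrative.

```python
def depthwise_conv_macs(h, w, channels, k=7):
    """Each output value needs k * k multiply-accumulates, per channel."""
    return h * w * channels * k * k


def dense_conv_macs(h, w, c_in, c_out, k=3):
    """A standard conv mixes all input channels for every output channel."""
    return h * w * c_in * c_out * k * k


if __name__ == "__main__":
    channels = 96  # hypothetical channel width
    for side in (14, 28, 56):
        dw = depthwise_conv_macs(side, side, channels)
        dense = dense_conv_macs(side, side, channels, channels)
        print(f"{side}x{side}: depthwise 7x7 = {dw:,} MACs, "
              f"dense 3x3 = {dense:,} MACs")
```

Doubling the side length quadruples the number of positions and quadruples the depthwise cost, so the growth is linear in positions. At this width the 7x7 depthwise layer covers 49 positions per output for far fewer operations than a dense 3x3 layer covering 9.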
ConvNeXt replaced BatchNorm with LayerNorm. Why does this matter for matching Vision Transformer performance?
Vision Transformer: patches as tokens
2020. Google Brain. ViT beats CNNs on ImageNet, but the paper's title carries the caveat: 'Transformers for Image Recognition at Scale'. The architecture is minimal: a 224x224 image is split into non-overlapping patches of 16x16 pixels (a 14x14 grid, 196 patches in total). Each patch is flattened and linearly projected to an embedding vector. A [CLS] token is prepended to the sequence and learnable position embeddings are added. From there: a standard Transformer Encoder with multi-head self-attention and MLP blocks. No convolutions. No pooling. The [CLS] token's output goes to the classification head.
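The patch-splitting step can be sketched without any framework. The code below covers only the split-and-flatten stage; the learned linear projection, the [CLS] token, and the position embeddings are left out. `patchify` is a hypothetical helper name.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns (H // patch) * (W // patch) vectors, each of length
    patch * patch * C, in row-major patch order.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "size must divide by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [value
                    for row in image[top:top + patch]
                    for pixel in row[left:left + patch]
                    for value in pixel]
            patches.append(flat)
    return patches


if __name__ == "__main__":
    # A dummy 224 x 224 RGB image filled with zeros.
    image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
    tokens = patchify(image, 16)
    print(len(tokens), len(tokens[0]))  # patches, values per patch
    print(len(tokens) + 1)              # sequence length with [CLS]
```

For a 224x224 RGB input this yields 196 vectors of 768 values each, so the encoder sees a sequence of 197 tokens once [CLS] is prepended.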
**The data scale caveat**: ViT-L trained only on ImageNet-1k (1.2M images) underperforms ResNets of comparable size. The same ViT-L pretrained on JFT-300M (300M images) matches or beats the best CNNs of its time. CNNs embed **inductive biases** by design: translation equivariance (weight sharing) and locality (small kernels). A Transformer must learn these properties from data, requiring far more examples to discover that adjacent patches are more related than distant ones.
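The weight-sharing bias can be demonstrated on a toy 1D signal: filtering a shifted input gives the same result as shifting the filtered output. The sketch uses wrap-around padding so the equality is exact; the signal and kernel values are arbitrary.

```python
def conv1d_circular(signal, kernel):
    """1D cross-correlation with wrap-around padding.

    The same kernel weights are reused at every position (weight sharing).
    """
    n, k = len(signal), len(kernel)
    return [sum(kernel[j] * signal[(i + j) % n] for j in range(k))
            for i in range(n)]


def shift(signal, steps):
    """Circularly shift a list to the right by `steps` positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps] if steps else list(signal)


if __name__ == "__main__":
    signal = [0, 0, 1, 3, 1, 0, 0, 0]
    kernel = [1, -1, 2]  # arbitrary weights for illustration
    shifted_then_filtered = conv1d_circular(shift(signal, 2), kernel)
    filtered_then_shifted = shift(conv1d_circular(signal, kernel), 2)
    print(shifted_then_filtered == filtered_then_shifted)
```

A convolution has this property for any kernel, before seeing any data. A Transformer operating on patch tokens with learned position embeddings has no such guarantee and has to pick up the regularity from examples.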
**Key ViT extensions**: DeiT (Facebook, 2021) - knowledge distillation from a CNN teacher, enabling ViT training on ImageNet-1k without JFT. Swin Transformer (Microsoft, 2021) - shifted window attention produces hierarchical features (like ResNet's stages), with linear complexity in image resolution instead of quadratic. Swin became the backbone of choice for detection and segmentation tasks in the transformer era.
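The linear-versus-quadratic claim can be checked with the complexity estimates given in the Swin Transformer paper for global and windowed multi-head self-attention. The 96-channel width and grid sizes below are hypothetical, and the function names are illustrative.

```python
def global_msa_cost(h, w, c):
    """Global self-attention over an h x w token grid.

    4*h*w*c^2 covers the Q, K, V and output projections;
    2*(h*w)^2*c covers the attention matrix and the weighted sum.
    """
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h, w, c, m=7):
    """Attention restricted to non-overlapping m x m windows."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


if __name__ == "__main__":
    c = 96  # hypothetical channel width
    for side in (56, 112, 224):
        g = global_msa_cost(side, side, c)
        win = window_msa_cost(side, side, c)
        print(f"{side}x{side} tokens: global = {g:,}  windowed = {win:,}")
```

Doubling the grid side multiplies the windowed cost by 4 (linear in token count), while the quadratic term in the global cost grows by 16.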
Misconception: Transformers always outperform CNNs for computer vision.
Reality: ViT surpasses CNNs when pretrained on hundreds of millions of images. At dataset scales below 10M images, ConvNeXt and EfficientNet remain competitive or better, because they encode inductive biases that the transformer must learn from scratch.
DeiT (2021) showed that ViT-B without JFT-300M underperforms EfficientNet-B7 on ImageNet-1k. Swin partially addresses this with hierarchical local windows, but for specialized domains (medical imaging, industrial inspection) with fewer than 100K labeled examples, CNNs frequently still win.
Why does ViT require far more training data than ResNet to achieve comparable accuracy?
Key Takeaways
- **EfficientNet**: compound scaling (depth + width + resolution jointly) with MBConv + SE attention delivers the best accuracy/FLOPS curve of its era
- **MobileNet**: depthwise separable conv cuts FLOPs by 8-9x for K=3; MobileNetV3 adds NAS-tuned block configurations, Hard Swish, and SE for on-device hardware
- **ConvNeXt**: a CNN with transformer engineering details (7x7 dw conv, LayerNorm, GELU, inverted bottleneck) beats ViT on most benchmarks without quadratic attention
- **ViT**: patches as tokens + standard transformer encoder wins at 100M+ images; Swin adds hierarchy and linear complexity, enabling detection and segmentation use cases
Related Topics
Topics that feed into or extend modern vision architectures:
- CNN Architectures: LeNet to ResNet — EfficientNet, MobileNet, and ConvNeXt all build on the residual block and bottleneck design from cv-04
- Deep Learning: Transformers — ViT and Swin use the standard transformer encoder from dl-05 - attention, positional embeddings, LayerNorm
- Model Evaluation — Architecture selection depends on the accuracy/latency/memory trade-off from ml-05 - different hardware targets (Neural Engine vs GPU) change which model wins
Questions for Reflection
- ConvNeXt copied transformer engineering details (LayerNorm, GELU, 7x7 window) and outperformed ViT. What does this imply about the source of the transformer's advantage - is it the attention mechanism, or is it the surrounding engineering choices?
- MobileNetV3 was found largely by NAS rather than hand-designed. If NAS discovers better architectures automatically, why study design principles like compound scaling and depthwise separable convolutions?
- ViT needs 300M images to outperform CNNs, but ConvNeXt is competitive at 1M. How does this change architecture choice for medical imaging with 10,000 labeled scans?