Deep Learning
Neural Architecture Search
Most neural network architectures deployed at scale, such as ResNet, MobileNet, and BERT, were designed by human experts over months of iteration. Neural Architecture Search asks: what if a machine could search the architecture space automatically? In 2017, Google's NAS found a CIFAR-10 architecture that rivaled the best human-designed baselines, at the cost of 800 GPUs for 28 days. By 2020, one-shot NAS methods reached comparable quality in hours to a few GPU-days. Today, managed AutoML services (Google Vertex AI, Amazon SageMaker Autopilot) automate model search, in some cases with NAS, and hardware companies like Qualcomm use NAS to find architectures suited to each new chip design.
- **Google AutoML** (now part of Vertex AI) builds on Google's NAS research to find architectures for customer image classification tasks automatically, in hours instead of months of manual engineering.
- **Apple's Neural Engine** is reportedly a target for hardware-aware NAS: the search looks for neural network architectures that map well to the Neural Engine's memory hierarchy, with a claimed 5x better performance per watt than GPU-based inference.
- **Qualcomm's AI Model Efficiency Toolkit (AIMET)** provides quantization and compression tooling that can be paired with architecture search to fit models to Snapdragon's Hexagon processor, enabling on-device AI in 2024 flagship phones at a claimed efficiency of 200 TOPS/watt.
When a controller learned to design networks
Barret Zoph and Quoc Le at Google Brain launched modern NAS in 2016-2017 with 'Neural Architecture Search with Reinforcement Learning': an RNN controller proposed architectures, each candidate was trained, and its validation accuracy was fed back to the controller as the reward signal, all at the cost of hundreds of GPUs running for weeks. The brute-force expense pushed the field toward efficiency. Liu, Simonyan, and Yang introduced DARTS in 2018 (published at ICLR 2019), relaxing the discrete search into a differentiable one and cutting the cost by orders of magnitude. Tan and Le's EfficientNet (2019) then showed that a small NAS-found base plus principled compound scaling beats hand-tuned giants.
Prerequisites
Neural Architecture Search
Neural Architecture Search (NAS) automates the design of neural network architectures. The search space defines what architectures are possible (layer types, connections, widths); the search strategy explores this space (random, evolutionary, RL, gradient-based); the performance estimation strategy evaluates candidates without full training (weight sharing, early stopping, zero-cost proxies). Early NAS (Zoph & Le, Google Brain 2017) used RL on 800 GPUs for 28 days to find a CNN that rivaled the best hand-designed networks on CIFAR-10.
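The cost explosion of early NAS motivated efficient approaches. DARTS (Differentiable ARchiTecture Search, Liu et al. 2019) relaxes the discrete architecture choice to a continuous mixture: instead of choosing one operation (conv3x3, pool, skip) per edge, it maintains learnable weights alpha_i over all candidate operations and computes each edge's output as the softmax(alpha)-weighted sum of every operation's output. The alphas are trained jointly with the network weights via bilevel optimization (network weights on the training loss, alphas on the validation loss), and the final architecture is discretized by keeping the highest-weight operation per edge.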
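The relaxation and the final discretization step can be sketched in a few lines. This is a toy illustration, not the DARTS implementation: the candidate operations act on a single float instead of feature maps, the names `CANDIDATE_OPS`, `mixed_op`, and `discretize` are made up for this example, and the bilevel training of alpha is left out entirely.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy candidate operations on a scalar "feature" (assumption for illustration;
# real DARTS edges choose among convolutions, pooling, skip connections, etc.).
CANDIDATE_OPS = {
    "skip": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alpha):
    """Continuous relaxation: softmax(alpha)-weighted sum of every candidate op."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alpha):
    """After search, keep only the operation with the largest alpha on this edge."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alpha)), key=lambda i: alpha[i])]

alpha = [0.1, 1.5, -0.3]           # hypothetical learned architecture weights
print(mixed_op(3.0, alpha))        # blended output used during search
print(discretize(alpha))           # operation kept in the final architecture
```

Because `mixed_op` is a smooth function of `alpha`, gradients can flow into the architecture weights, which is what lets the search run with ordinary gradient descent.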
Random search is a surprisingly strong baseline in NAS. Li & Talwalkar (2019) showed that random search combined with weight sharing and early stopping is competitive with far more elaborate methods such as DARTS and ENAS, at a small fraction of the compute of RL and evolutionary search. This raises questions about whether NAS actually finds meaningfully better architectures or primarily optimizes hyperparameters within a well-designed search space.
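A random-search baseline is simple enough to sketch directly. The search space, the `proxy_score` function, and its coefficients below are invented for illustration; in a real run the score would come from training each candidate (or evaluating it with shared weights) and measuring validation accuracy.

```python
import random

# Hypothetical toy search space: one categorical choice per decision.
SEARCH_SPACE = {
    "op": ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool"],
    "width": [16, 32, 64],
    "depth": [2, 4, 6],
}

def sample_architecture(rng):
    """Draw one architecture uniformly at random from the search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}

def proxy_score(arch):
    """Stand-in for validation accuracy (made-up formula, not a real measurement)."""
    bonus = {"conv3x3": 0.02, "conv5x5": 0.01, "sep_conv3x3": 0.03, "max_pool": 0.0}
    return 0.80 + bonus[arch["op"]] + 0.0005 * arch["width"] + 0.005 * arch["depth"]

def random_search(num_trials, seed=0):
    """Evaluate num_trials random architectures and return the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_trials)]
    return max(candidates, key=proxy_score)

best = random_search(num_trials=20)
print(best, round(proxy_score(best), 4))
```

Any more sophisticated search strategy should be compared against this loop at the same number of evaluations before its gains are credited to the strategy itself.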
What is DARTS's key innovation over RL-based NAS?
EfficientNet: Compound Scaling
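EfficientNet (Tan & Le, Google 2019) introduced compound scaling: systematically scaling depth (layers), width (channels), and resolution (image size) together with a single compound coefficient phi. Depth, width, and resolution grow as alpha^phi, beta^phi, and gamma^phi, where the constants alpha, beta, and gamma are found by a small grid search on the baseline under the constraint alpha * beta^2 * gamma^2 ≈ 2, so each increment of phi roughly doubles FLOPs. The intuition: scaling all three dimensions in balance is more efficient than scaling any single one, because a higher-resolution input benefits from more layers (a larger receptive field) and more channels (finer patterns). The baseline network, EfficientNet-B0, is found by multi-objective NAS in an MnasNet-style search space built from MBConv blocks, then scaled up with increasing phi to produce EfficientNet-B1 through B7.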
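The scaling rule can be worked through numerically. The sketch below assumes the constants reported in the EfficientNet paper (alpha = 1.2 for depth, beta = 1.1 for width, gamma = 1.15 for resolution) and the usual approximation that convolutional FLOPs grow linearly with depth and quadratically with width and resolution. The function name `compound_scale` is made up for this example, and the released B1 to B7 models round and adjust these multipliers, so the outputs are indicative only.

```python
# Scaling bases reported in the EfficientNet paper (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return depth/width/resolution multipliers and the implied FLOPs multiplier."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale ~ depth * width^2 * resolution^2 for convolutional networks.
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}

for phi in range(4):
    scaled = compound_scale(phi)
    print(phi, {name: round(value, 2) for name, value in scaled.items()})
```

With these constants, alpha * beta^2 * gamma^2 is about 1.92, so each step of phi costs roughly twice the FLOPs of the previous one.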
EfficientNet-B7 achieves 84.3% ImageNet top-1 accuracy with 66M parameters - at the time 8.4x smaller and 6.1x faster than the best competing model (GPipe) at similar accuracy. EfficientNet-V2 (Tan & Le, 2021) further improves by adding Fused-MBConv blocks and progressive learning with increasing image resolution during training.
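EfficientNet training leans heavily on regularization. The original recipe uses AutoAugment, dropout, and stochastic depth; later recipes, including EfficientNet-V2, use RandAugment (14 augmentation types, randomly sampled) and Mixup (linear interpolation of pairs of images and their labels). Without such augmentation, EfficientNet-B7 reportedly drops from 84.3% to 82.1%, so the regularization matters about as much as the architecture for small-to-medium scale models.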
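Mixup's linear interpolation is easy to show on toy data. In this sketch the "images" are short lists of floats and the labels are one-hot lists; the Beta(0.2, 0.2) distribution for the mixing coefficient is an assumed setting for illustration, not a value taken from the EfficientNet recipe.

```python
import random

def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with the same coefficient lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

rng = random.Random(0)
lam = rng.betavariate(0.2, 0.2)   # mixing coefficient; alpha = 0.2 is an assumption

image_a, label_a = [1.0, 0.0, 0.5], [1.0, 0.0]   # toy "image" of class 0
image_b, label_b = [0.0, 1.0, 0.5], [0.0, 1.0]   # toy "image" of class 1

mixed_image, mixed_label = mixup(image_a, label_a, image_b, label_b, lam)
print(mixed_image, mixed_label)
```

The mixed label stays a valid probability distribution, so the usual cross-entropy loss can be applied to it unchanged.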
What is compound scaling in EfficientNet, and why is it better than scaling only depth?
Hardware-Aware NAS
Hardware-aware NAS optimizes for latency on a specific target device, not just parameter count or FLOPs. FLOPs are a poor proxy for actual latency: a depthwise convolution has fewer FLOPs than a pointwise convolution but may be slower on GPU, because it performs very little arithmetic per byte moved and becomes memory-bandwidth bound. MobileNets are fast on CPU but not always on GPU; conversely, standard convolutions are fast on GPU thanks to heavily optimized libraries such as cuDNN but slow on edge devices.
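The gap between operation counts and memory traffic can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes stride 1 and "same" padding (so output and input have the same spatial size), counts multiply-accumulates (MACs) rather than FLOPs, and uses a crude "MACs per tensor element touched" ratio as a stand-in for arithmetic intensity. The layer shape is an arbitrary example, and the ratio ignores caching, so it indicates a trend rather than predicting latency.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def depthwise_conv_macs(h, w, c, k):
    """MACs for a k x k depthwise convolution (one filter per channel)."""
    return h * w * c * k * k

def pointwise_conv_macs(h, w, c_in, c_out):
    """MACs for a 1 x 1 (pointwise) convolution."""
    return h * w * c_in * c_out

def arithmetic_intensity(macs, h, w, c_in, c_out, num_weights):
    """MACs per tensor element read or written (inputs + outputs + weights)."""
    elements_moved = h * w * c_in + h * w * c_out + num_weights
    return macs / elements_moved

H = W = 56   # example feature-map size (assumption)
C = 128      # example channel count (assumption)
K = 3        # kernel size

std = standard_conv_macs(H, W, C, C, K)
dw = depthwise_conv_macs(H, W, C, K)
pw = pointwise_conv_macs(H, W, C, C)

std_ai = arithmetic_intensity(std, H, W, C, C, C * C * K * K)
dw_ai = arithmetic_intensity(dw, H, W, C, C, C * K * K)

print("MACs      standard / pointwise / depthwise:", std, pw, dw)
print("intensity standard / depthwise:", round(std_ai, 1), round(dw_ai, 1))
```

The depthwise layer does over a hundred times fewer MACs than the standard one here, yet it touches almost as many tensor elements, which is why its measured latency on a GPU does not shrink in proportion to its FLOPs.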
MnasNet (Tan et al., Google 2019) searches for architectures on Pixel phones with real latency measurement in the reward signal: each sampled architecture is deployed and timed on the actual device, and the measured latency is combined with accuracy in a multi-objective reward. This finds architectures that are Pareto-optimal for accuracy vs. device latency, rather than just FLOPs.
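One way to fold latency into the reward is the weighted-product form used in the MnasNet paper, reward = accuracy * (latency / target)^w, with the exponent w = -0.07 reported there. The sketch below applies it to two made-up candidates; the accuracies, latencies, and the 80 ms target are illustrative values, not measurements.

```python
def mnasnet_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Weighted-product reward: accuracy scaled by a soft latency penalty."""
    return accuracy * (latency_ms / target_ms) ** w

TARGET_MS = 80.0  # hypothetical latency target on the device

# Hypothetical candidates: (name, accuracy, measured latency in ms).
candidates = [
    ("slightly_more_accurate_but_slow", 0.76, 100.0),
    ("on_target", 0.75, 80.0),
]

for name, acc, lat in candidates:
    print(name, round(mnasnet_reward(acc, lat, TARGET_MS), 4))
```

With a negative exponent, a model that exceeds the latency target is penalized smoothly rather than rejected outright, so the controller can still trade a little latency for a large accuracy gain.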
OFA (Once-for-All Network, Cai et al., MIT 2020) trains a single supernet that contains a very large number of sub-networks, then selects an appropriate sub-network at deployment time for a given latency target, without any retraining. This decouples training cost (one supernet, trained once, at a reported cost on the order of 1,200 GPU-hours) from search cost (selecting a sub-network per device, which is cheap by comparison).
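The deployment-time step amounts to picking the most accurate sub-network that fits a latency budget. The sketch below uses a hand-written table of hypothetical sub-networks with made-up latency and accuracy numbers; the actual OFA pipeline searches a far larger space with accuracy and latency predictors, which this example does not attempt to reproduce.

```python
from typing import NamedTuple, Optional, Sequence

class SubNet(NamedTuple):
    name: str
    latency_ms: float   # predicted or measured latency on the target device
    accuracy: float     # accuracy estimated with weights inherited from the supernet

# Hypothetical candidates; all numbers are invented for illustration.
CANDIDATES = [
    SubNet("d2-w3-k3", 12.0, 0.712),
    SubNet("d3-w4-k5", 21.0, 0.745),
    SubNet("d4-w6-k7", 38.0, 0.768),
]

def pick_subnet(candidates: Sequence[SubNet], budget_ms: float) -> Optional[SubNet]:
    """Return the most accurate sub-network within the latency budget, if any."""
    feasible = [c for c in candidates if c.latency_ms <= budget_ms]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

for budget in (10.0, 25.0, 50.0):
    print(budget, pick_subnet(CANDIDATES, budget))
```

Because no training happens in this step, supporting a new device only requires new latency numbers for the candidates.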
Why is measuring actual device latency a better optimization target than FLOPs for hardware-aware NAS?
One-Shot NAS and Supernets
One-shot NAS trains a single supernet that encompasses all architectures in the search space, sharing weights across sub-architectures. During search, sub-architectures are sampled and evaluated by inheriting weights from the supernet, with no retraining required. SNAS, GDAS, and Single-Path One-Shot (Guo et al., Megvii 2020) show that supernet weight sharing can produce rankings of sub-architectures that are good enough to guide search, making search cheap once the supernet is trained.
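Weight sharing can be pictured as a lookup table keyed by (layer, operation): every sampled path reads its weights from the same table. The toy class below is only a sketch of that idea. Each "operation" is a single scalar multiplier, the class and method names are made up, and no training is performed.

```python
import random

class ToySupernet:
    """Holds one shared weight per (layer, op); sub-architectures are paths through it."""

    def __init__(self, num_layers, ops, seed=0):
        rng = random.Random(seed)
        self.num_layers = num_layers
        self.ops = list(ops)
        # Shared weights: every path that picks op `o` at layer `l` reuses this entry.
        self.weights = {
            (layer, op): rng.uniform(0.5, 1.5)
            for layer in range(num_layers)
            for op in self.ops
        }

    def sample_path(self, rng):
        """Single-path sampling: choose one op per layer uniformly at random."""
        return [rng.choice(self.ops) for _ in range(self.num_layers)]

    def forward(self, x, path):
        """Evaluate one sub-architecture using weights inherited from the supernet."""
        for layer, op in enumerate(path):
            x = x * self.weights[(layer, op)]
        return x

supernet = ToySupernet(num_layers=3, ops=["conv3x3", "conv5x5", "skip"])
rng = random.Random(1)
for _ in range(3):
    path = supernet.sample_path(rng)
    print(path, round(supernet.forward(1.0, path), 4))
```

Evaluating a new candidate costs one forward pass over the shared table, which is the source of the speedup over training each candidate from scratch.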
The weight sharing assumption has limits: weights optimized for the full supernet may not be optimal for any specific sub-architecture, causing ranking inconsistency between supernet-based and stand-alone evaluation. Fairness in training (uniform path sampling, progressive shrinking) improves ranking correlation. OFA's progressive shrinking trains the largest sub-network first, then progressively adds smaller sub-networks to training, with the aim of keeping sub-network rankings consistent as measured by a rank correlation such as Kendall's tau.
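Ranking consistency is usually quantified with Kendall's tau, which compares every pair of architectures and checks whether the two evaluations order them the same way. The sketch below implements the simple tau-a variant (no tie correction) in plain Python; the accuracy lists are made-up numbers for four hypothetical architectures.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        agreement = (a[i] - a[j]) * (b[i] - b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    num_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / num_pairs

# Hypothetical accuracies for the same four architectures.
supernet_acc = [0.61, 0.64, 0.58, 0.70]     # evaluated with inherited weights
standalone_acc = [0.91, 0.93, 0.92, 0.95]   # trained from scratch

print(round(kendall_tau(supernet_acc, standalone_acc), 3))
```

A value of 1 means the supernet ranks every pair the same way as stand-alone training, 0 means no relationship, and -1 means the ranking is fully reversed.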
TuNAS (Bender et al., Google 2020) scales weight-sharing NAS with an RL controller to larger search spaces for image classification and object detection. The authors report that it outperforms random search in those spaces, which suggests that learned search strategies pay off most when the search space is large and difficult.
Common misconception: NAS always finds better architectures than human experts.
Reality: NAS-found architectures are often comparable to carefully hand-designed ones at similar compute budgets, and random search can match complex NAS strategies at equal compute. NAS is most valuable for new hardware targets where human intuition is poor.
Why: human experts have accumulated strong intuitions about which operations work well (depthwise convolutions, residual connections, attention). NAS primarily helps in unexplored search spaces or hardware-specific optimization where those intuitions do not apply.
What is the key advantage of one-shot NAS supernets over training each candidate architecture independently?
Key Ideas
- **DARTS** makes NAS differentiable via continuous relaxation of operation choices, reducing search from 22,000 to 4 GPU-days - but random search with good hyperparameters is a strong baseline.
- **EfficientNet** demonstrates that compound scaling (depth + width + resolution simultaneously) produces Pareto-optimal accuracy/efficiency curves, with B7 achieving 84.3% ImageNet at 66M params.
- **Hardware-aware NAS** measures actual device latency (not FLOPs) during search because FLOPs are a poor proxy for real inference speed on specific hardware targets.
Related Topics
NAS connects to model efficiency and deployment:
- Quantization and Pruning: NAS and quantization are often combined in co-design, searching for the architecture and the quantization policy jointly to minimize latency and maximize accuracy on target hardware
- Transfer Learning and Fine-Tuning: NAS-found architectures (EfficientNet, MobileNetV3) are widely used pretrained backbones for fine-tuning on downstream tasks with limited data
Questions for Reflection
- A team needs to deploy a real-time image classifier on a Raspberry Pi 4 (ARM Cortex-A72, no GPU) at 30fps. Would they use NAS, EfficientNet, or manual architecture design, and what search target metric would be most appropriate?
- DARTS uses continuous relaxation then discretizes at the end - what is the discrepancy between the continuous training objective and the discrete final architecture, and how does it affect the quality of found architectures?
- When would hardware-aware NAS fail to find better architectures than EfficientNet, even when given the same compute budget?
Related Lessons
- dl-19 — NAS and quantization both target efficient deployable models
- dl-11 — Transfer learning seeds search with strong baselines
- dl-20 — Found architectures feed production system design
- ml-43-hyperparameters — NAS generalizes hyperparameter search to architectures
- alg-32-branch-bound — Pruning the search space mirrors branch and bound
- rl-01 — Early NAS used reinforcement learning controllers to sample architectures