Deep Learning

DL at the Interview (FAANG)

The gap between passing a DL interview at Google, OpenAI, or Meta and failing it is usually not knowledge of the latest paper or the exact parameter count of GPT-4. It is the ability to reason explicitly, under time pressure, about ambiguous problems that have more than one correct answer. The candidate who says 'I would use a ResNet' and stops is rated lower than the one who says 'I would start with a ResNet-50 baseline at roughly 76% ImageNet top-1 accuracy because it is well-understood, then diagnose whether the bottleneck is model capacity, data volume, or augmentation strategy before committing to a larger architecture.' This lesson teaches that reasoning framework.

**Google ML Engineer L5 interviews** include one architecture design question where candidates design a real production system (search ranking, content recommendation, image understanding) and one ML optimization question where they debug a realistic training failure scenario - both evaluated on reasoning depth, not specific choices.
**OpenAI Research Engineer interviews** focus heavily on scaling estimation: 'how much compute does it take to train GPT-4?' and 'given this architecture, estimate memory requirements for training at 128 GPUs' - testing whether candidates can reason quantitatively about systems they have not directly worked on.
**Meta's production ML interviews** include a system design session where candidates design the full ML pipeline for a recommendation system serving 3 billion users - covering model choice, feature engineering, training infrastructure, serving, A/B testing, and monitoring - in 45 minutes.

AlexNet and the deep learning hiring boom

In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won ImageNet with AlexNet, reaching about 15% top-5 error against about 26% for the best classical entry, and igniting the deep learning boom. The hiring market followed: Google acquired Hinton's startup, DNNresearch, in 2013, and within a few years 'machine learning engineer' became one of the most contested roles in tech. The interview format evolved with it. Early on, candidates were asked to derive backpropagation by hand; today the bar is system design and quantitative reasoning about scaling, because frameworks made the math routine but production reliability stayed hard.

Prerequisites

Architecture Design Questions

FAANG DL architecture interviews ask candidates to design a neural network for a real problem from scratch. The expectation is not just naming an architecture, but justifying each choice: why this input representation, why this loss function, why this architecture family, and what failure modes exist. Common formats: 'Design a system that identifies inappropriate content in videos', 'Design a face recognition system for 1 billion users', 'Design an ad click-through rate prediction model'.

The canonical framework for architecture questions: (1) clarify requirements and constraints (latency, data scale, label availability); (2) define input/output format; (3) propose and justify baseline model; (4) identify limitations and propose improvements; (5) discuss training data, evaluation, and deployment. Interviewers credit reasoning quality over specific choices.

Architecture design interviews reward candidates who discuss failure modes unprompted: 'This model will struggle with small text under 8px because the stride in early layers loses spatial resolution - I would use FPN (Feature Pyramid Network) to handle multi-scale text.' Proactive failure analysis signals senior-level thinking.
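The stride argument above can be made concrete with a little arithmetic. The sketch below assumes a standard ResNet backbone whose C2 to C5 stages have cumulative strides of 4, 8, 16, and 32; the function names are hypothetical and only illustrate the reasoning, not any particular detector.

```python
# Cumulative strides of the C2..C5 stages in a standard ResNet backbone
# (assumption: no dilation, default strides).
RESNET_STAGE_STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_cells(object_px: float, stride: int) -> float:
    """Approximate width of an object measured in feature-map cells."""
    return object_px / stride


def footprint_by_stage(object_px: float) -> dict[str, float]:
    """Footprint of an object of the given pixel size at every backbone stage."""
    return {
        stage: feature_cells(object_px, stride)
        for stage, stride in RESNET_STAGE_STRIDES.items()
    }


if __name__ == "__main__":
    for stage, cells in footprint_by_stage(8).items():
        print(f"{stage}: {cells:.2f} cells")
```

Text 8px tall covers two cells at C2 but only a quarter of a cell at C5, which is the quantitative reason a single coarse feature map misses it and a multi-scale design such as FPN helps.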

When designing a DL system for a FAANG interview, what is the most common mistake candidates make?

Optimization and Training Questions

DL optimization interview questions test understanding of why training fails and how to fix it. Common question patterns: 'Your training loss is not decreasing - what do you check first?', 'Validation loss increases while training loss decreases - what happened?', 'Training is 5x slower than expected on your GPU cluster - diagnose'. Each requires systematic debugging, not random hyperparameter changes.

The systematic debugging order for training failures: (1) verify the loss can reach zero on a single batch (sanity check); (2) verify data loading and label correctness; (3) check gradient norms (vanishing or exploding); (4) profile GPU utilization (data loading bottleneck?); (5) check learning rate (too high = divergence, too low = no progress). This order prevents spending hours on LR tuning when the issue is label noise.

The 'overfit one batch' test is the single most valuable debugging technique in deep learning: take one small, fixed batch and train on that same batch for about 1000 steps, with regularization and augmentation turned off. If loss reaches ~0, the model, loss function, and optimizer wiring are able to fit data, which rules out the most common implementation bugs. If loss plateaus far from zero, there is a model or implementation bug to find before any other debugging.
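A minimal sketch of the single-batch test follows. To stay dependency-free it uses a tiny logistic-regression model with hand-written gradients as a stand-in for a real network; the data, learning rate, and function names are illustrative assumptions. In a real project the same loop would wrap the actual model and optimizer.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def batch_loss(w, b, xs, ys) -> float:
    """Mean binary cross-entropy of a linear model on one batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        sign = 1.0 if y == 1 else -1.0
        total += math.log1p(math.exp(-sign * z))
    return total / len(xs)


def overfit_one_batch(xs, ys, steps: int = 1000, lr: float = 1.0):
    """Run gradient descent on the SAME batch every step.

    Returns (initial_loss, final_loss). A healthy setup drives the final
    loss close to zero; a plateau points to a bug.
    """
    w = [0.0] * len(xs[0])
    b = 0.0
    initial = batch_loss(w, b, xs, ys)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - y  # d(loss)/dz for cross-entropy
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / len(xs)
            grad_b += err / len(xs)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return initial, batch_loss(w, b, xs, ys)


if __name__ == "__main__":
    # One fixed, linearly separable batch of four examples.
    xs = [[-2.0], [-1.0], [1.0], [2.0]]
    ys = [0, 0, 1, 1]
    initial, final = overfit_one_batch(xs, ys)
    print(f"initial loss {initial:.3f}, final loss {final:.4f}")
```

The point of the design is that the batch never changes: the only thing being tested is whether the model and loss can memorize a handful of examples.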

Training loss decreases normally but validation loss starts increasing after epoch 5. What is the most likely cause?

Scaling Deep Learning Systems

Scaling questions test whether candidates understand the compute, memory, and data requirements of large models. Typical formats: 'Estimate the training compute for a 1B parameter model on 100B tokens', 'How many GPUs do you need to train GPT-3?', 'What changes if you want to scale from 7B to 70B parameters?'. These require back-of-envelope estimation with explicitly stated assumptions.

Chinchilla scaling laws (Hoffmann et al., DeepMind 2022): training compute is approximately C = 6 * N * D, where N is parameters, D is training tokens, and C is total FLOPs. For a fixed compute budget, the compute-optimal allocation scales N and D together at roughly 20 tokens per parameter (N:D of about 1:20). A 7B parameter model should therefore be trained on about 140B tokens for compute-optimal results. GPT-3 (175B params, 300B tokens, under 2 tokens per parameter) was undertrained by Chinchilla standards.
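The estimate can be written as a small calculator. This is a back-of-envelope sketch, not a planning tool: the 312 TFLOP/s figure is the A100's dense BF16 peak, while the 40% utilization default is an assumption that should be stated out loud in an interview and adjusted to the actual cluster. Function names are hypothetical.

```python
A100_PEAK_BF16_FLOPS = 312e12  # dense BF16 peak of one A100, in FLOP/s


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute, C = 6 * N * D."""
    return 6.0 * n_params * n_tokens


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token count at roughly 20 tokens per parameter."""
    return tokens_per_param * n_params


def gpu_hours(
    total_flops: float,
    utilization: float = 0.4,  # ASSUMPTION: fraction of peak actually achieved
    peak_flops: float = A100_PEAK_BF16_FLOPS,
) -> float:
    """GPU-hours needed at the assumed sustained throughput."""
    return total_flops / (peak_flops * utilization) / 3600.0


if __name__ == "__main__":
    flops = training_flops(1e9, 100e9)  # the 1B-parameter, 100B-token question
    print(f"compute: {flops:.2e} FLOPs")
    print(f"A100 GPU-hours at 40% utilization: {gpu_hours(flops):,.0f}")
    print(f"optimal tokens for 175B params: {chinchilla_tokens(175e9):.2e}")
```

For the 1B-parameter, 100B-token question this gives 6e20 FLOPs, on the order of 1,300 A100 GPU-hours under the stated utilization. The same rule puts the compute-optimal token count for a 175B model at about 3.5T, which is why GPT-3's 300B tokens count as undertrained.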

Memory scaling: doubling model parameters doubles memory for parameters, gradients, and optimizer states. Activations scale with batch size and sequence length. For a 70B model, the weights alone take 140GB in BF16, so even a single forward pass must be sharded across at least two A100 80GB GPUs (via tensor or pipeline parallelism); in practice 4+ GPUs are common once activations and the KV cache are included.
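A rough memory calculator makes these numbers easy to reproduce. The sketch assumes BF16 weights (2 bytes per parameter) for inference and the commonly cited figure of about 16 bytes per parameter for mixed-precision Adam training (BF16 weights and gradients plus FP32 master weights and two FP32 optimizer moments). Activations and KV cache are deliberately left out, and all names are hypothetical.

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (default: BF16, 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


def training_state_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state for mixed-precision Adam.

    ASSUMPTION: 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master weights)
    + 4 + 4 (Adam moments) = 16 bytes per parameter. Activations excluded.
    """
    return n_params * bytes_per_param / 1e9


def min_gpus(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count from memory capacity alone."""
    return math.ceil(memory_gb / gpu_memory_gb)


if __name__ == "__main__":
    n = 70e9
    print(f"weights: {weight_memory_gb(n):.0f} GB -> >= {min_gpus(weight_memory_gb(n))} GPUs")
    print(f"training state: {training_state_gb(n):.0f} GB -> >= {min_gpus(training_state_gb(n))} GPUs")
```

These are lower bounds: real deployments need headroom for activations, KV cache, and communication buffers, so the practical GPU count is higher.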

According to Chinchilla scaling laws, what is the compute-optimal tokens-to-parameters ratio for training a large language model?

DL Engineering Tradeoffs

DL tradeoff questions test whether candidates can reason quantitatively about cost-quality-latency. Common formats: 'You have a 70B model and need 100ms inference latency - what do you do?', 'Your team proposes switching from PyTorch to JAX for training - what is the tradeoff?', 'Training a larger model from scratch vs. fine-tuning a smaller pretrained model - when does each win?'. Vague answers fail; quantitative reasoning wins.

The most tested tradeoff: model size vs. inference cost. A 70B model is ~10x more expensive to serve than a 7B model at the same throughput. The question is whether the quality improvement justifies the cost. For a product with 99% easy queries (straightforwardly answerable by 7B) and 1% hard queries needing 70B, a routing layer (classify difficulty, route accordingly) approaches 70B quality at close to 7B average cost: about 1.09x the 7B cost, provided the router is cheap and rarely misclassifies hard queries as easy.
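The routing arithmetic is simple enough to check in a few lines. The sketch below assumes a perfect router and uses the ~10x cost ratio from the text, with costs expressed in units of one 7B query; `router_cost` is a hypothetical parameter for the classifier's own overhead.

```python
def average_cost(
    hard_fraction: float,
    small_cost: float = 1.0,   # cost of one query on the 7B model (unit cost)
    large_cost: float = 10.0,  # ~10x cost ratio for the 70B model, per the text
    router_cost: float = 0.0,  # ASSUMPTION: per-query overhead of the router
) -> float:
    """Average per-query cost when easy queries go to the small model.

    ASSUMPTION: the router classifies every query correctly.
    """
    easy_fraction = 1.0 - hard_fraction
    return router_cost + easy_fraction * small_cost + hard_fraction * large_cost


if __name__ == "__main__":
    for hard in (0.01, 0.10, 0.50):
        print(f"{hard:.0%} hard queries -> {average_cost(hard):.2f}x the 7B cost")
```

With 1% hard traffic the blended cost is about 1.09x the 7B cost; with 10% hard traffic it rises to about 1.9x, still far below the 10x of serving everything on the large model.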

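Speculative decoding is a high-impact serving optimization that few candidates mention: a 7B draft model autoregressively proposes 4-6 candidate tokens, and the 70B target model verifies all of them in a single forward pass. Accepted tokens follow the target model's distribution, so output quality is unchanged. When the acceptance rate is above 70% (typical for fluent text), each target pass yields about 3 tokens or more on average, which usually translates to a 2-3x decoding speedup once the draft model's cost is included.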
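The expected gain can be estimated with the standard analysis of speculative decoding, which treats each draft token as accepted independently with probability `acceptance_rate`. That independence, and the draft-to-target cost ratio used below, are simplifying assumptions; the function names are hypothetical.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target-model forward pass.

    Uses (1 - a**(k + 1)) / (1 - a) for acceptance rate a and k draft tokens,
    ASSUMING each draft token is accepted independently.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


def expected_speedup(acceptance_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain decoding with the target model.

    cost_ratio is the time of one draft step divided by one target step
    (ASSUMPTION: supplied by the caller, e.g. measured on real hardware).
    """
    tokens = expected_tokens_per_target_pass(acceptance_rate, draft_len)
    return tokens / (draft_len * cost_ratio + 1.0)


if __name__ == "__main__":
    print(f"tokens per pass: {expected_tokens_per_target_pass(0.7, 5):.2f}")
    print(f"speedup at cost ratio 0.1: {expected_speedup(0.7, 5, 0.1):.2f}x")
```

At a 70% acceptance rate with five draft tokens, each target pass yields just under three tokens; after paying for a draft model one tenth the cost of the target, the speedup is closer to 2x, which is why the acceptance rate and draft cost both belong in the answer.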

Misconception: The best DL engineer always trains the largest model possible given the compute budget.

Reality: Compute-optimal training (Chinchilla) means matching model size to dataset size at the 20:1 token-to-parameter ratio; training a 7B model on 140B tokens beats training a 70B model on 14B tokens at the same compute budget.

Larger models trained on insufficient data underperform smaller models trained to the Chinchilla-optimal point - the model size and dataset size must scale together, and inference cost must be included in the total cost calculation
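The claim that the two configurations cost the same compute can be verified directly from C = 6 * N * D. The sketch below only checks the arithmetic; it says nothing about final model quality, and the helper names are hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C = 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params


if __name__ == "__main__":
    configs = {
        "7B on 140B tokens": (7e9, 140e9),
        "70B on 14B tokens": (70e9, 14e9),
    }
    for name, (n, d) in configs.items():
        print(
            f"{name}: {training_flops(n, d):.2e} FLOPs, "
            f"{tokens_per_param(n, d):.1f} tokens per parameter"
        )
```

Both runs cost about 5.9e21 FLOPs, but the first sits at the 20 tokens-per-parameter target while the second sees only 0.2 tokens per parameter.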

A product serves both 'easy' queries (90%) and 'hard' queries (10%) that require high model quality. What is the most cost-efficient serving strategy?

Key Ideas

**Architecture design interviews** reward systematic thinking: clarify requirements first, justify each component choice, and proactively identify failure modes before the interviewer asks.
**Optimization debugging** follows a fixed order: overfit one batch (verify model/loss), verify data loading and labels, check gradients, profile GPU utilization, then tune hyperparameters - never start with LR tuning.
**Scaling estimation** using Chinchilla laws: ~6*N*D total FLOPs and 20 tokens per parameter for compute-optimal training. For memory, budget 2 bytes per parameter for BF16 weights at inference and roughly 16 bytes per parameter for mixed-precision Adam training (weights, gradients, FP32 master weights, and two optimizer moments), with activations on top.

Questions for Reflection

A FAANG interviewer asks: 'Design a system to generate personalized video thumbnails for a streaming platform with 50 million daily active users.' Structure a complete 45-minute interview response covering all key components.
You are given a training run where loss is stuck at a high value after 1000 steps. Walk through every diagnostic step in order, explaining what you are checking and why.
Estimate the total GPU-hours required to train a 13B parameter model on 260B tokens (Chinchilla-optimal) on A100 80GB GPUs, and determine how many GPUs are needed to complete training in 7 days.

Related Lessons

dl-20 — System design questions reuse production design knowledge
dl-12 — Scaling questions test distributed training understanding
dl-02 — Backpropagation basics appear in optimization questions
ml-55-ml-system-design — FAANG ML system design interviews share the same format
alg-01-big-o — Complexity analysis frames scaling and tradeoff answers
stat-05-hypothesis — A/B testing reasoning supports evaluation questions
la-01-vectors-intro
