Reinforcement Learning

RL in Interviews (FAANG)

Prerequisites

PPO and the clipped objective: the most-referenced algorithm in RL interviews
Offline RL (CQL, IQL): why production systems rarely allow online exploration
RLHF: reward models, the KL constraint, and DPO for language-model alignment
MDP basics: state, action, reward, transition, discount - the formulation every design question starts from

From research curiosity to a hiring priority

RL was largely an academic niche until AlphaGo beat Lee Sedol 4-1 in March 2016, the first time a program defeated a 9-dan Go professional in an even match. That result moved RL from research labs into industry roadmaps and recruiting pipelines. The second surge came in late 2022, when ChatGPT brought RLHF (Reinforcement Learning from Human Feedback) to prominence as a core technique for aligning language models. Within months, 'RL engineer' and 'RLHF' were among the most sought-after ML specializations, and FAANG-style interviews began testing RL system design directly rather than treating it as exotic.

Google DeepMind. Interview loop. The question: 'Design an RL system for Google Maps ETA prediction.' The trap: it is not an RL problem. Predicting an ETA does not change the trip, so there are no actions and no policy to learn; it is supervised regression. The insight that impresses: recognizing when RL is the wrong tool. FAANG ML interviews test this judgment more than algorithm derivations. The same judgment applies one step up: for a single-step decision such as ranking, the candidate who says 'I'd formulate this as a contextual bandit, because the reward is immediate and there is no multi-step dependency' beats the one who immediately reaches for PPO. This lesson is about that judgment - how to think about RL in the context of real system design.

**Netflix recommendation (2022)** - Netflix engineering blog describes starting with contextual bandits before graduating to sequential RL; their staged approach (offline eval → shadow mode → 1% traffic → full rollout) is standard industry practice
**Google Ads bidding** - production bidding uses offline RL (CQL) trained on billions of logged auction interactions; exploratory policies never run on live traffic due to financial risk; the system retrains daily on fresh logged data
**Meta RLHF infrastructure (2023)** - Meta's open-source RLHF system uses 512 GPU actors generating trajectories and a centralized learner; the engineering paper describes how V-trace importance sampling handles the policy staleness from distributed actors

MDP Design in Interviews

FAANG ML interviews increasingly include RL design questions - not algorithm derivations, but system design: 'Design a recommendation system using RL', 'How would you train an ad bidding agent?', 'What RL approach would you use for ride-sharing pricing?'. The interviewer is evaluating whether the candidate can translate a business problem into an MDP formulation and reason about the engineering tradeoffs. The first step is always the same: define the MDP precisely.

**MDP design checklist for interviews:**
1. **State space S**: What information does the agent observe? Is it fully or partially observable?
2. **Action space A**: Discrete (which ad to show) or continuous (bid price 0-100)? Size?
3. **Reward function r(s,a,s')**: What scalar signal captures business success? How delayed is it?
4. **Transition dynamics**: Deterministic (game) or stochastic (real world)? Modeled or unknown?
5. **Episode structure**: Episodic (game session) or continuing (always-on service)?
6. **Discount factor γ**: Short-horizon (γ=0.9) or long-horizon (γ=0.999)?
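One way to make the checklist concrete is to write the MDP down as a small record before discussing algorithms. The sketch below is illustrative only: the `MDPSpec` class, its field names, and the ad-bidding values are hypothetical and not taken from any real system. It also shows why the discount factor matters: a common rule of thumb puts the effective planning horizon at roughly 1/(1-γ) steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDPSpec:
    """Interview-style MDP summary (hypothetical helper, not a library type)."""
    state: str
    actions: str
    reward: str
    dynamics: str
    episodic: bool
    gamma: float

    def effective_horizon(self) -> float:
        # Rule of thumb: rewards further than ~1/(1-gamma) steps away
        # contribute little to the discounted return.
        return 1.0 / (1.0 - self.gamma)


# Hypothetical example values for an ad-bidding agent.
ad_bidding = MDPSpec(
    state="user context + remaining budget + time of day",
    actions="continuous bid in [0, 100]",
    reward="conversion value minus spend",
    dynamics="stochastic, unknown",
    episodic=True,  # assumption: one campaign day per episode
    gamma=0.999,
)
short_horizon = MDPSpec("s", "a", "r", "deterministic", True, 0.9)

print(round(short_horizon.effective_horizon()))  # about 10 steps
print(round(ad_bidding.effective_horizon()))     # about 1000 steps
```

Moving γ from 0.9 to 0.999 stretches the horizon the agent effectively optimizes over from about ten steps to about a thousand, which is why the choice belongs in the MDP discussion rather than in hyperparameter tuning.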

**Common interview mistake**: jumping to the algorithm before defining the MDP. Interviewers at Google, Meta, and DeepMind consistently report that candidates who say 'I'd use PPO' in the first minute miss the nuances. Define state, action, and reward first - spend 10 minutes on this. The algorithm choice often falls out naturally once the MDP is clear.

In an RL interview, why should reward design be discussed before algorithm selection?

Algorithm Selection Framework

After the MDP is defined, the interview moves to algorithm selection. The interviewer is testing whether the candidate understands the tradeoff space - not whether they memorize algorithm names. A practical framework: four axes determine the right algorithm family.

Axis	Question	Determines
Action space	Discrete or continuous?	Q-learning vs policy gradient
Sample efficiency	Real-world env (expensive) or simulator (cheap)?	Off-policy vs on-policy
Environment access	Can the model be learned? Reset available?	Model-based vs model-free
Reward structure	Dense or sparse? Single or multi-objective?	Algorithm + reward shaping needs
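The table can be read as a rough decision procedure. The function below is a heuristic sketch, not an established rule: `suggest_family` and its parameters are hypothetical names, the example algorithms follow the ones this lesson mentions, and the model-based and reward-structure axes are deliberately left out to keep it short.

```python
def suggest_family(one_step: bool, continuous_actions: bool,
                   cheap_simulator: bool, can_explore_live: bool) -> str:
    """Map a few problem properties to an algorithm family (rule of thumb only)."""
    if one_step:
        # Immediate reward, no temporal credit assignment.
        return "contextual bandit"
    if cheap_simulator:
        # Samples are cheap, so on-policy methods are affordable.
        return "on-policy policy gradient (e.g. PPO)"
    if not can_explore_live:
        # No simulator and no safe exploration: learn from logged data only.
        return "offline RL (e.g. CQL, IQL)"
    if continuous_actions:
        return "off-policy actor-critic (e.g. SAC, TD3)"
    return "off-policy value-based (e.g. DQN)"


print(suggest_family(one_step=True, continuous_actions=False,
                     cheap_simulator=False, can_explore_live=True))
print(suggest_family(one_step=False, continuous_actions=False,
                     cheap_simulator=True, can_explore_live=False))
```

In an interview the value is in the order of the questions rather than the exact return strings: horizon first, then sample cost, then exploration safety, then action type.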

**Contextual bandits** deserve special mention in interviews: they are one-step MDPs where the reward is immediate and there is no temporal credit assignment. As a rough rule of thumb, they cover ~80% of recommendation, ad ranking, and A/B testing problems. Interviewers at Netflix, Spotify, and LinkedIn report that candidates who jump to deep RL when a bandit would suffice signal poor judgment. Start with the simplest model that fits the problem.
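To show how little machinery a bandit needs, here is a minimal sketch of disjoint LinUCB, one standard contextual bandit algorithm (the lesson does not name a specific one, so this choice is an assumption). The toy environment, with two one-hot contexts and a deterministic 0/1 reward, is invented purely for illustration.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm plus an exploration bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = np.stack([np.eye(dim) for _ in range(n_arms)])  # per-arm X^T X + I
        self.b = np.zeros((n_arms, dim))                         # per-arm X^T r

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Predicted reward plus an upper-confidence bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Toy environment (hypothetical): arm k is the right choice for context k.
rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, dim=2)
contexts = np.eye(2)
for _ in range(200):
    x = contexts[rng.integers(2)]
    arm = bandit.select(x)
    reward = 1.0 if arm == int(np.argmax(x)) else 0.0
    bandit.update(arm, x, reward)

print(bandit.select(contexts[0]), bandit.select(contexts[1]))
```

There is no value function, no discounting, and no replay buffer: each decision is scored, taken, and updated on its own, which is exactly what the absence of temporal credit assignment buys.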

A candidate is designing an RL system for real-time ad bidding. The environment is the live auction system - no simulator, expensive interactions, continuous bid prices. Which algorithm family is most appropriate?

Scaling RL Systems

Toy RL problems (CartPole, single-machine DQN) do not prepare engineers for production. Netflix serves 200M+ users, each of whom receives personalized recommendations. YouTube generates billions of impressions per day. The gap between lab RL and production RL is primarily an engineering problem - parallelization, data pipelines, policy deployment, and monitoring.

**Production RL architecture components:**
1. **Experience generation**: actors run policies and write (s,a,r,s') to a distributed buffer (e.g., Reverb by DeepMind)
2. **Centralized training**: learner reads from buffer, updates weights, broadcasts new policy
3. **Policy serving**: trained policy exported and served via TensorFlow Serving or Triton
4. **Monitoring**: reward tracking, KL divergence from last policy, action distribution drift
5. **Shadow mode**: new policy runs in parallel with production, logs rewards without affecting users
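The monitoring component can start very simply. The sketch below computes the mean KL divergence between the action distributions of the deployed policy and a candidate policy on the same batch of states, and blocks the rollout when it exceeds a threshold. The function names and the 0.05 threshold are hypothetical; a real threshold would have to be calibrated per system.

```python
import numpy as np


def mean_kl(old_probs: np.ndarray, new_probs: np.ndarray, eps: float = 1e-12) -> float:
    """Mean KL(old || new) over a batch; each row is a discrete action distribution."""
    old = np.clip(old_probs, eps, 1.0)
    new = np.clip(new_probs, eps, 1.0)
    return float(np.mean(np.sum(old * (np.log(old) - np.log(new)), axis=1)))


def should_block_rollout(old_probs: np.ndarray, new_probs: np.ndarray,
                         kl_threshold: float = 0.05) -> bool:
    # Assumed policy: a large jump in behaviour triggers manual review.
    return mean_kl(old_probs, new_probs) > kl_threshold


deployed = np.array([[0.5, 0.5], [0.7, 0.3]])
candidate_small_change = np.array([[0.52, 0.48], [0.69, 0.31]])
candidate_large_change = np.array([[0.9, 0.1], [0.1, 0.9]])

print(should_block_rollout(deployed, candidate_small_change))  # small shift
print(should_block_rollout(deployed, candidate_large_change))  # large shift
```

The same check works in shadow mode: both policies score identical logged states, so the comparison needs no live traffic.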

**Off-policy correction** is crucial in distributed RL: by the time a trajectory reaches the learner, the policy has changed. IMPALA uses V-trace importance sampling to correct for this stale data. Without correction, the system silently trains on off-policy data as if it were on-policy - causing subtle instability that is hard to debug.
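A minimal NumPy sketch of V-trace value targets, following the recursion in the IMPALA paper, makes the correction concrete. It assumes a single trajectory with no episode boundary inside it, and the function and argument names are chosen here for illustration rather than taken from any library.

```python
import numpy as np


def vtrace_targets(behaviour_logp, target_logp, rewards, values, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s for one trajectory (no terminal states inside it)."""
    behaviour_logp = np.asarray(behaviour_logp, dtype=float)
    target_logp = np.asarray(target_logp, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)

    ratios = np.exp(target_logp - behaviour_logp)  # pi(a|s) / mu(a|s)
    rhos = np.minimum(rho_bar, ratios)             # truncated weights for the TD error
    cs = np.minimum(c_bar, ratios)                 # truncated weights for the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections


logp = np.log([0.5, 0.5, 0.5])
on_policy = vtrace_targets(logp, logp, [1, 1, 1], [0, 0, 0], 0.0, gamma=0.5)
stale = vtrace_targets(logp, logp + np.log(0.5), [1, 1, 1], [0, 0, 0], 0.0, gamma=0.5)
print(on_policy)  # equals the discounted n-step returns when pi == mu
print(stale)      # smaller corrections when the learner's policy has moved away
```

When the actor's and learner's policies agree, the ratios are 1 and the targets reduce to ordinary n-step returns. As the policies drift apart, the truncated ratios shrink the contribution of stale actions instead of letting them pass as on-policy data.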

In a distributed RL system with 1000 actors and 1 GPU learner, what is the most common performance bottleneck?

RL Tradeoffs: What Interviewers Actually Test

The final interview stage tests whether a candidate can reason about tradeoffs and failure modes - not just recite algorithm names. Senior RL roles at DeepMind, Google Brain, and Meta expect candidates to discuss: when NOT to use RL, what can go wrong, and how to monitor for silent failures in production.

Problem	RL Risk	Alternative to Consider
Short-horizon decision (1-3 steps)	Reward attribution trivial; RL adds complexity	Supervised/ranking model
Stable, well-defined problem with lots of labels	RL sample inefficiency	Supervised learning
Non-stationary environment (users change behavior)	Policy becomes stale; reward drift	Frequent retraining + bandit
Hard-to-evaluate reward	Reward hacking	Human evaluation loop
Safety-critical (medical, finance)	Exploration too risky	Offline RL + conservative deployment

**The most impressive interview answer** to 'Design an RL system for X' often includes a plan to NOT use full sequential RL initially: start with logged data + offline RL or contextual bandits, establish reward monitoring infrastructure, prove value on a small slice of traffic (1%), then graduate to online RL. This staged approach - which companies like Netflix, Spotify, and LinkedIn actually use - demonstrates production maturity.
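The "small slice of traffic" step is usually implemented as deterministic bucketing, so that a given user always sees the same policy. The sketch below is one common way to do it with a hash; the function name, the experiment key, and the user id format are all hypothetical.

```python
import hashlib


def in_treatment(user_id: str, experiment: str, fraction: float) -> bool:
    """Deterministically assign a user to the new policy for a given traffic fraction."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map first 32 bits to [0, 1)
    return bucket < fraction


# Hypothetical check: roughly 1% of users land in the new-policy group.
users = [f"user-{i}" for i in range(100_000)]
share = sum(in_treatment(u, "rl-policy-v2", 0.01) for u in users) / len(users)
print(f"{share:.3%}")
```

Keying the hash on the experiment name means a later experiment draws an independent 1%, and raising `fraction` from 0.01 towards 1.0 only adds users to the treatment group without reshuffling those already in it.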

**Misconception**: RL is the best tool whenever a problem involves sequential decisions and delayed reward.

**Reality**: RL is appropriate when supervised learning lacks sufficient labels, the action space is too large for exhaustive evaluation, or the policy needs to adapt to feedback. Many sequential problems are better solved with supervised ranking, bandits, or rule-based systems.

**Why**: RL is sample-inefficient, unstable, and hard to debug. The bar for using it in production is high. Netflix uses contextual bandits for most recommendation tasks, and much production bidding relies on supervised models such as gradient boosted trees with bandit feedback rather than online sequential RL. Deep sequential RL is reserved for problems where the action-reward dependency is long-horizon (game-length) and cannot be broken into independent decisions. In interviews, the sophistication is knowing when NOT to use RL.

An RL system for video recommendations shows increasing RL reward but decreasing long-term user retention. What is the most likely cause?

Key ideas

**MDP formulation first**: define state, action, reward, and episode structure before touching algorithms; poorly specified reward makes any algorithm fail
**Algorithm selection axes**: action space type (discrete → DQN, continuous → SAC/TD3), environment cost (cheap sim → on-policy, expensive real → offline RL), and horizon length determine the algorithm family
**Production architecture**: distributed actors (env throughput) + centralized GPU learner + replay buffer + policy broadcast; environment simulation, not GPU, is typically the bottleneck
**Interview meta-skill**: recognizing when NOT to use RL (short-horizon, abundant labels, safety-critical with no sim) and proposing a staged rollout plan (bandit → offline RL → online RL) signals production maturity

Questions for reflection

A senior interviewer asks: 'When would you choose offline RL over online RL, and what are the risks of each?' Sketch a 3-minute answer covering CQL, distribution shift, and deployment strategy.
Design a reward function for a ride-sharing surge pricing agent. What proxy metrics are tempting but dangerous? How would production monitoring reveal reward hacking within the first week?
An RL system for content moderation is proposed. The action is 'remove post' or 'keep post'. Appeals take 72 hours. What challenges does this create for the MDP formulation - and is RL even the right tool?

Related lessons

rl-02 — MDP formulation is the first interview step
rl-07 — DQN is a canonical algorithm-selection answer
rl-10 — PPO is the most-cited algorithm in interviews
rl-17 — RLHF system design appears in modern interviews
ml-55-ml-system-design — Same structured tradeoff reasoning as ML system design
prob-01-intro

