DevOps Interview Prep (FAANG)
A Senior DevOps Engineer at a FAANG company earns $250-450K/year. What separates a senior candidate from a junior one in an interview is not knowing more commands; it is the ability to reason about systems under uncertainty and to communicate tradeoffs clearly. This is a learnable skill.
- **Google L5 SRE** typical question: 'How would you detect that your service is starting to degrade 5 minutes before a full outage?' This tests understanding of SLIs/SLOs and alerting strategy.
- **Amazon Principal Engineer** system design: 'Design a deploy pipeline for 10,000 microservices with zero-downtime requirements'. The interviewer expects cell-based deployment and a canary strategy.
- **Meta Staff Engineer** outage question: '100M users cannot log in. You have 15 minutes. What do you do?' This evaluates systematic troubleshooting, communication, and prioritization.
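One common way to answer the early-degradation question is multi-window burn-rate alerting on an SLO. The sketch below is illustrative only: the `Window` type, the 99.9% target, and the 14.4 threshold (a value often quoted for a 1-hour/5-minute window pair) are assumptions, not something prescribed by any particular interviewer.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one time window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Ratio of observed error rate to the error budget; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_ratio = window.errors / window.total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(short: Window, long: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast: the long window filters out
    brief blips, the short window confirms the problem is still happening."""
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)


# Assumed example traffic: 2% errors in the last 5 minutes, 1.5% over the last hour.
degrading = should_page(Window(10_000, 200), Window(600_000, 9_000))
# Same short spike, but the hour as a whole is on budget.
blip_only = should_page(Window(10_000, 200), Window(600_000, 600))
```

Requiring both windows to exceed the threshold is the design choice worth explaining in an interview: it trades a little detection latency for far fewer false pages.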
Design CI/CD Pipeline
Typical FAANG question: 'Design a CI/CD pipeline for a microservice with 100 engineers'. Expected structure: clarifying questions first, then pipeline stages with justification, then rollback strategy, database migrations, secrets, and canary deployment.
FAANG interviewers evaluate the process: did the candidate ask clarifying questions before diving into the solution? A candidate who immediately describes a specific stack (GitHub Actions + ArgoCD + Helm) without asking about the context scores lower than one who gathers requirements first.
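Once requirements are gathered, the stage-by-stage structure can be sketched as a small model. This is a minimal illustration, not a real CI system: `Stage`, `run_pipeline`, `canary_healthy`, the stage names, and the 0.5% tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds


def canary_healthy(canary_error_rate: float, baseline_error_rate: float,
                   tolerance: float = 0.005) -> bool:
    """Canary gate: pass only if the canary is not meaningfully worse than baseline."""
    return canary_error_rate <= baseline_error_rate + tolerance


def run_pipeline(stages: List[Stage], rollback: Callable[[], None]) -> List[str]:
    """Run stages in order; on the first failure, roll back and stop."""
    log: List[str] = []
    for stage in stages:
        ok = stage.run()
        log.append(f"{stage.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            rollback()
            log.append("rollback: done")
            break
    return log


# Assumed scenario: the canary shows a 3.1% error rate against a 0.2% baseline.
stages = [
    Stage("build", lambda: True),
    Stage("unit-tests", lambda: True),
    Stage("deploy-canary", lambda: True),
    Stage("canary-analysis", lambda: canary_healthy(0.031, 0.002)),
    Stage("full-rollout", lambda: True),
]
log = run_pipeline(stages, rollback=lambda: None)
```

In this scenario the canary gate fails, so the full rollout never runs and the rollback hook fires. Being able to say why each gate sits where it does is the part interviewers listen for.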
What should be the first step when answering 'Design CI/CD pipeline' in an interview?
Troubleshoot Production Outage
Question: 'Production API returned 500 errors for 5 minutes. What do you do?' Structure: Detect - Triage - Mitigate - Root Cause - Postmortem. Key principle: mitigation is primary, root cause is secondary.
FAANG interviews reward systematic approach over specific tool knowledge. An engineer who says 'first I check Kibana logs, then Jaeger traces, then deployment history' demonstrates the right mental model regardless of which specific tools are named.
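The "mitigation first" principle can be illustrated with a small triage helper that checks deployment history before any deep root-cause work. This is a sketch under assumptions: the `Deploy` record shape, the 30-minute lookback, and the release names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Deploy = Tuple[datetime, str]  # (finished_at, release name); hypothetical record shape


def suspect_deploy(error_onset: datetime, deploys: List[Deploy],
                   lookback: timedelta = timedelta(minutes=30)) -> Optional[str]:
    """Return the most recent release that finished within `lookback` before the errors began."""
    recent = [d for d in deploys if error_onset - lookback <= d[0] <= error_onset]
    if not recent:
        return None
    return max(recent, key=lambda d: d[0])[1]


def first_action(error_onset: datetime, deploys: List[Deploy]) -> str:
    """Pick a mitigation before investigating the root cause."""
    release = suspect_deploy(error_onset, deploys)
    if release is not None:
        return f"roll back {release}"
    return "no recent deploy: shed load or fail over, then continue triage"


# Assumed timeline: errors start at 14:05, a release finished at 13:58.
onset = datetime(2024, 1, 15, 14, 5)
deploys = [
    (datetime(2024, 1, 15, 9, 30), "checkout-v41"),
    (datetime(2024, 1, 15, 13, 58), "checkout-v42"),
]
action = first_action(onset, deploys)
```

The rollback restores service without anyone yet knowing why the release broke; the root cause is established afterwards, in the postmortem.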
During a production outage, which comes first: mitigation or root cause analysis?
Capacity Planning
Capacity planning determines resources needed to handle expected load with headroom. FAANG interviews expect: baseline measurement, peak estimation with safety factor, and cost/performance tradeoff awareness.
Capacity planning interviews test numerical reasoning, not memorized formulas. State assumptions explicitly: 'I am assuming linear scaling - if caching hit rate drops at peak, actual capacity needed may be higher.'
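The arithmetic behind such an estimate can be sketched as follows. The function name, the example load figures, and the 1.5 safety factor are assumptions chosen for illustration; the calculation also assumes linear scaling, which should be stated out loud in the interview.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float,
                   target_utilization: float = 0.70,
                   safety_factor: float = 1.5) -> int:
    """Estimate server count, assuming throughput scales linearly with servers.

    peak_rps:           expected peak load, from baseline measurement
    per_server_rps:     measured capacity of one server at 100% utilization
    target_utilization: fraction of each server we plan to use (headroom above it)
    safety_factor:      multiplier on the peak to cover estimation error
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_server_rps * target_utilization
    return math.ceil(peak_rps * safety_factor / usable_rps)


# Assumed inputs: 20,000 RPS peak, 500 RPS per server.
with_headroom = servers_needed(20_000, 500)                              # 86
no_headroom = servers_needed(20_000, 500, target_utilization=1.0)        # 60
```

The gap between the two results is the cost of headroom: running at 100% leaves nothing to absorb a traffic spike, a failed node, or a drop in cache hit rate.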
Why is target utilization 70% in capacity planning rather than 100%?
Architectural Tradeoffs
FAANG DevOps interviews test the ability to justify tradeoffs between alternatives. There is no single 'correct' choice; a strong answer describes both options, justifies the selection in terms of the specific context, and names what the chosen approach sacrifices.
Tradeoff questions have no correct answer key. Interviewers at Meta, Google, and Stripe evaluate reasoning quality: 'Given these constraints (high write throughput, global distribution, eventual consistency acceptable), DynamoDB fits better than PostgreSQL because...' This framing demonstrates senior-level thinking.
Myth: DevOps interviews require memorizing all commands and configurations.
Reality: FAANG DevOps interviews test system thinking and tradeoff reasoning - the ability to design, debug, and explain decisions under constraints.
A candidate who perfectly recites `kubectl` flags but cannot explain why they would choose Kubernetes over serverless for a given workload will not pass a FAANG Staff engineer interview. Commands can be looked up; reasoning cannot.
How do you correctly answer a tradeoff question in a FAANG DevOps interview?
Summary
- **CI/CD Design** - start with requirements (not tools); describe pipeline stages with justification; key topics: rollback strategy, DB migrations, secrets, canary deployment.
- **Troubleshoot Outage** - mitigation first, root cause second; structure: Detect - Triage - Mitigate - Root Cause - Postmortem; demonstrate systematic approach.
- **Capacity + Tradeoffs** - calculate server count with safety factor and target utilization; tradeoff questions require describing both options with context, not naming a winner.
Related Topics
Interview preparation builds on the full DevOps knowledge stack:
- Reliability Engineering at Scale - Cell architecture, blast radius, and chaos engineering are typical senior DevOps design questions at FAANG.
- On-Call and Incident Management - The troubleshoot outage question tests understanding of incident response process and postmortem culture.
Questions for Reflection
- How do you explain the choice of Kubernetes vs Lambda for image processing at 1,000 uploads/day vs 1M uploads/day?
- Capacity planning: what metrics are needed to answer 'how many servers do we need next year'?
- If the interviewer says 'your answer is wrong' on a tradeoff question, how should you respond?