Computer Architecture

Computer Architecture at FAANG Interviews

The question 'why are two threads slower than one on separate variables?' eliminates 30% of L5 candidates at Google - the answer requires MESI protocol knowledge, not algorithmic intuition.

Meta E6 interview 2023: memory ordering question with ARM weak model - standard for distributed systems engineers because mobile clients run ARM
Google Staff design round: 'estimate max throughput of your cache tier' requires knowing L3 size, memory bandwidth, and NUMA topology - not just Redis documentation
Real production example: one alignas(64) annotation on a hot struct in a trading system's order book - 3x throughput improvement on a path whose latency is measured in microseconds

Cache Coherence on the Whiteboard

A Google L5 phone screen staple: 'Two threads increment different variables. No locks. Why is the code slower with two cores than one?' The answer lives in the hardware, not the algorithm.

**Three cache questions at L5+ interviews:**
1. False sharing: two variables share a cache line and different threads write to them -> MESI invalidation storm. Fix: alignas(64).
2. Cache thrashing: a power-of-two stride maps successive accesses to the same cache sets, so they evict each other. Fix: pad arrays or change the access pattern.
3. Prefetcher blindness: linked-list traversal is pointer chasing, which the hardware prefetcher cannot predict. Fix: array-based structures or prefetch hints.
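The false-sharing case can be sketched in a few lines of C++. This is a minimal illustration, not a benchmark: the struct names, the `hammer` helper, and the 64-byte line size are assumptions made for the example (64 bytes is typical on x86-64, but check the target CPU). Both layouts produce the same counts; only the memory layout differs.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumed cache-line size; typical on x86-64, verify for your target.
constexpr std::size_t kCacheLine = 64;

// Both counters normally land on the same cache line, so writes from
// two cores keep invalidating each other's copy of that line.
struct SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// alignas gives each counter its own cache line.
struct PaddedLines {
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

// Two threads, each incrementing only its own counter (no locks).
// Returns the combined count so the caller can check correctness.
template <typename Counters>
std::uint64_t hammer(Counters& c, std::uint64_t iterations) {
    std::thread t1([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Wrapping each `hammer` call in a `std::chrono::steady_clock` measurement is one way to observe the difference; the size of the gap depends on the machine, so no figure is claimed here.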

An interview asks: 'Why does iterating a linked list of 10,000 nodes run slower than iterating an array of equal size, even though both are O(n) operations?'
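To make the comparison concrete, here is a small sketch with two hypothetical helpers that do the same O(n) work over different memory layouts. The cache-line arithmetic in the comments assumes 64-byte lines and 4-byte elements.

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: with 4-byte elements and an assumed 64-byte line,
// one cache miss brings in 16 elements, and the sequential pattern is
// easy for the hardware prefetcher to follow.
std::uint64_t sum_contiguous(const std::vector<std::uint32_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
}

// Node-based storage: each element lives in a separately allocated node,
// and the address of the next node is only known after the current node
// has been loaded (pointer chasing).
std::uint64_t sum_linked(const std::list<std::uint32_t>& l) {
    return std::accumulate(l.begin(), l.end(), std::uint64_t{0});
}
```

Both functions return the same sum; the difference an interviewer is probing for is in how many cache misses each one takes to get there.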

Memory Ordering and Branch Prediction: Staff-Level Questions

A real Meta E6 interview question from 2023: 'Can the assert on line 8 ever fire on a correct, bug-free machine?' The code has no data race in the C++ memory model sense. On ARM, without barriers, the assert can still fire in production.
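The interview code itself is not reproduced in this lesson, so the following is only a hypothetical sketch of the standard message-passing pattern that such questions usually revolve around. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are invented for the example.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain data being handed over
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;
    // release: the write to payload cannot be reordered after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // acquire: once we observe true, the producer's earlier writes are visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

If both operations on `ready` used memory_order_relaxed instead, nothing would order the payload write before the flag store, and a weakly ordered CPU such as ARM would be free to make the flag visible first. The release/acquire pair is what rules that out.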

Branch misprediction is the second major pipeline question. Intel Raptor Lake runs a pipeline of roughly 20 stages, so a mispredicted branch discards the wrong-path instructions in flight and costs on the order of 20 cycles. TAGE-style predictors track branch history patterns, but data-dependent branches that go either way with 50/50 probability defeat them. Practical interview framing: 'This hot loop has an if-else inside. How do you reduce branch overhead?' The answers range from sorting the input to branchless cmov instructions.

**Showing depth on branch questions:** go beyond naming 'branch misprediction'. Explain what gets flushed (the fetch, decode, and execute work in flight on the wrong path), how __builtin_expect tells the compiler which direction is likely so it can lay out the hot path as fall-through, and why a branchless cmov avoids the misprediction penalty at the cost of a data dependency on both operands. This signals you understand the pipeline, not just the vocabulary.
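A minimal sketch of the branchy versus branchless framing, using two hypothetical functions that sum the elements at or above a threshold. Whether the compiler emits a cmov, a mask, or vectorized code for either version depends on the compiler and flags, so inspecting the generated assembly is the only reliable check.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: on unsorted data the if is taken unpredictably.
std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        if (x >= threshold) sum += x;
    }
    return sum;
}

// Branchless version: the comparison becomes a mask, so there is no
// conditional jump for the predictor to get wrong.
std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        // mask is all ones when x >= threshold, all zeros otherwise.
        const std::int64_t mask = -static_cast<std::int64_t>(x >= threshold);
        sum += static_cast<std::int64_t>(x) & mask;
    }
    return sum;
}
```

Sorting the input first is the other common answer: the branchy loop then sees a long run of not-taken outcomes followed by a long run of taken ones, which any history-based predictor handles well.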

std::atomic<int> with memory_order_relaxed guarantees atomicity. Why is it still incorrect for thread synchronization?

Hardware-Aware System Design: What Staff Engineers Do Differently

At Staff-level system design, the question 'why not one high-throughput server instead of 100 smaller ones?' is answered with hardware numbers, not intuition. Bandwidth limits, NUMA topology, and cache capacity constraints show up in every high-throughput architecture discussion.
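The kind of back-of-envelope arithmetic this refers to can be sketched as below. The function names and every input value are placeholders chosen for illustration, not measurements of any real server.

```cpp
#include <cstdint>

// Upper bound on request rate when every request must move
// bytes_per_request through DRAM: bandwidth divided by bytes per request.
constexpr double max_requests_per_sec(double dram_bytes_per_sec,
                                      double bytes_per_request) {
    return dram_bytes_per_sec / bytes_per_request;
}

// True when the hot working set could stay resident in the last-level cache.
constexpr bool fits_in_llc(std::uint64_t working_set_bytes,
                           std::uint64_t llc_bytes) {
    return working_set_bytes <= llc_bytes;
}

// Placeholder inputs (assumptions, not vendor figures).
constexpr double kAssumedBandwidth = 100e9;        // 100 GB/s of DRAM bandwidth
constexpr double kAssumedBytesPerRequest = 4096.0; // 4 KiB touched per request

// Roughly 24.4 million requests/sec is the ceiling under these assumptions.
constexpr double kCeiling =
    max_requests_per_sec(kAssumedBandwidth, kAssumedBytesPerRequest);
```

The point of the exercise is the shape of the reasoning: a bandwidth ceiling exists per machine regardless of core count, and a working set that no longer fits in cache moves the service onto that ceiling.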

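**NUMA topology** is the question that separates strong candidates from exceptional ones at Staff level. On a 2-socket Intel Xeon server, a load from memory attached to the local socket typically takes on the order of 80-100 ns, while a load from the other socket's memory takes noticeably longer, often 130-150 ns; exact figures vary by CPU generation and should be measured on the target machine. 'numactl --cpunodebind=0 --membind=0' restricts a process to the CPUs of NUMA node 0 and forces its allocations to come from node 0 memory. Without this, a memory-intensive service can show 4x higher p99 latency depending on which NUMA node the OS allocates pages from.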
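The effect of page placement can be reasoned about with a simple weighted average. This is a sketch under stated assumptions: the function name is invented, and the 100 and 200 used in the comment are round placeholder values, not measured latencies.

```cpp
// Average memory latency when a given fraction of accesses go to the
// remote socket. All inputs are supplied by the caller.
constexpr double blended_latency_ns(double local_ns, double remote_ns,
                                    double remote_fraction) {
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}

// Example with placeholder values: if half of all accesses are remote and
// remote costs twice as much as local (100 vs 200), the average is 150.
constexpr double kExample = blended_latency_ns(100.0, 200.0, 0.5);
```

An average hides the tail: individual requests whose pages all sit on the remote node pay the full remote cost, which is why the impact tends to show up in p99 rather than in the mean.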

Myth: System design interviews are about distributed systems patterns - cache, queues, databases. Hardware knowledge is for firmware engineers.

Reality: At L5+ FAANG, hardware-aware reasoning is a differentiating signal. Explaining why ArrayList beats LinkedList requires cache line analysis, not just Big-O. Estimating max throughput requires knowing memory bandwidth, not just server count.

High-traffic systems at Google, Meta, and Amazon are bottlenecked by hardware - cache misses, memory bandwidth, NUMA effects. Senior engineers who understand the hardware beneath the software make better architectural decisions and catch performance bugs that escape conventional analysis.

An interviewer asks: 'Your service has a hot HashMap with 5M lookups/sec. How do you approach optimization?' What is the most important first question?

Architecture knowledge surfaces everywhere

Hardware understanding shows up in SWE, ML engineer, and systems design rounds at all senior levels.

Related topics: cache hierarchy, virtual memory

Summary

False sharing: two variables, one cache line, two threads -> MESI invalidation per write. Fix: alignas(64)
ARM weak memory model: store-store reordering without barriers is valid. memory_order_release/acquire is required for correct cross-thread signaling
Branch misprediction: 20-cycle pipeline flush on modern OOO CPUs. cmov, sorted data, __builtin_expect are tools
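NUMA cross-socket penalty: remote memory is slower than local memory, and unlucky page placement can raise p99 latency by up to 4x for memory-intensive services. numactl pinning and NUMA-aware allocators are not optional for latency-sensitive services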
Staff interview signal: arithmetic intensity -> roofline -> CPU vs GPU decision. That framing is what separates E5 from E6 system design answers

Questions for Reflection

In your next interview, if asked 'why is ArrayList faster than LinkedList for iteration?', try explaining it entirely through cache lines, the hardware prefetcher, and pointer-chasing latency - without mentioning Big-O. What does that explanation reveal that the algorithmic one misses?

Related Lessons

arch-09-cache — MESI protocol and false sharing are the most common cache questions on systems design screens
arch-06-pipelining — Pipeline hazards and branch prediction appear in every L5+ systems phone screen
arch-10-virtual-memory — TLB shootdown and huge pages span the OS-architecture boundary in senior interviews
arch-19-memory-bandwidth — Roofline and bandwidth analysis is the language of hardware-aware system design rounds
arch-18-ai-accelerators