Parallel Computing

Parallel Computing in Interviews

Concurrency questions appear in virtually every senior software engineering interview at FAANG and systems companies. A race condition that passes all unit tests, deploys to production, and causes incorrect counts once per million requests is the hardest class of bug to reproduce and debug. Interviewers test whether candidates can reason about concurrent correctness before the code is written, not debug it after. The patterns in this lesson - race condition analysis, bounded blocking queues, Little's Law sizing, and tradeoff quantification - appear directly in Google, Amazon, Meta, and Microsoft system design and coding rounds.

**Google's senior engineering interviews** consistently include a concurrent data structure problem (implement a thread-safe cache with expiration, design a rate limiter) where candidates must code the correct implementation with locks, discuss the correctness argument, and analyze performance characteristics under contention.
**Amazon's system design interviews** at the L6 (senior) level include capacity planning questions requiring Little's Law and Amdahl's Law calculations: 'How many threads/servers are needed for this throughput target?' with specific numbers that require quantitative reasoning, not vague estimates.
**Uber's production concurrency incidents** drove their internal training to require all engineers to demonstrate understanding of Java Memory Model visibility guarantees and happens-before relationships - after a fleet-wide race condition in their surge pricing calculation caused pricing errors during peak events.

Diagnosing Race Conditions

Race condition interview questions present code with shared mutable state and ask candidates to find the bug. The key diagnostic question: is there a window where two threads can interleave in a way that produces an incorrect result? The answer requires mentally executing all possible thread interleavings, not just the happy path. Classic examples: increment without atomics, lazy initialization (double-checked locking), and event handler registration.

Double-checked locking (DCL) is the most common incorrect attempt at lazy initialization. Without the volatile keyword on the singleton field, a second thread can see a partially-constructed object because the JVM can reorder writes: the object reference can be published before all its fields are initialized. DCL with volatile is correct; DCL without volatile is a well-documented anti-pattern.

Java's Memory Model guarantees: (1) volatile writes happen-before subsequent volatile reads of the same variable; (2) synchronized exits happen-before subsequent synchronized entries to the same monitor; (3) Thread.start() happens-before any action in the started thread. Without these guarantees, all code is theoretically subject to reordering.

Why is count++ not atomic in Java even on a single CPU core?

Designing Concurrent Data Structures

Concurrent data structure design interviews ask candidates to build thread-safe versions of common structures: bounded blocking queue, concurrent hash map, read-write lock, or thread-safe LRU cache. The evaluation criteria: correctness (no races), performance (appropriate locking granularity), and deadlock prevention (consistent lock ordering).

Bounded blocking queue is the canonical concurrent programming interview problem: producers block when full, consumers block when empty, both must notify waiting threads on state change. The implementation uses two conditions (notFull, notEmpty) on a single ReentrantLock - or a simpler single Condition if the interviewer allows non-optimal signaling.

The while loop around condition.await() (not if) is mandatory: spurious wakeups can occur where await() returns without being signaled. Without while, a spurious wakeup causes the consumer to take from an empty queue. Java's Object.wait() documentation explicitly warns about spurious wakeups.

Why must the condition check in a blocked producer/consumer use while rather than if?

Throughput and Latency Analysis

Throughput and latency interview questions require back-of-envelope calculation. Key formulas: Little's Law (L = lambda * W: average items in system = arrival rate * average time in system), Amdahl's Law (speedup = 1 / (s + (1-s)/p): s = sequential fraction), and utilization (U = lambda / mu: throughput/service rate). These quantify the cost of adding concurrency and the limits of scaling.

Utilization is the critical metric for capacity planning: at U < 0.7, systems are comfortable; at U = 0.9, queue depths grow to 9x service time (queueing theory M/M/1: W = 1/(mu-lambda)); at U > 0.95, systems become unstable under burst traffic. Production systems target 60-70% utilization to handle 2x bursts without queue buildup.

Latency percentiles (p50, p95, p99, p999) are more actionable than averages in concurrent systems. A system with p50=10ms and p99=500ms indicates a bimodal distribution - likely a queueing effect where 1% of requests wait behind a long queue. Reducing the tail requires analyzing the source of the bimodality, not the average.

A system processes 100 req/s and each request takes 50ms average. How many requests are in-flight simultaneously (Little's Law)?

Concurrency Tradeoffs in Interviews

Concurrency tradeoff questions test whether candidates can reason about complexity-performance-correctness. Common formats: 'When would you use a read-write lock vs. a mutex?', 'ConcurrentHashMap vs. Collections.synchronizedMap - when does each win?', 'Thread-per-request vs. async event loop - which do you choose and why?'. Weak answers give a rule; strong answers give a quantified condition.

Read-write lock: correct when reads vastly outnumber writes and read operations are non-trivial. The overhead of ReadWriteLock (complex lock protocol) exceeds its benefit when reads are fast (< 1 microsecond) or write frequency is high (> 20% of operations). ReentrantReadWriteLock has a 5-10% overhead vs. synchronized even in read-only workloads.

LMAX Disruptor (Martin Fowler pattern) achieves 25 million messages/second on commodity hardware by eliminating locks entirely: a ring buffer with a single writer per slot, false-sharing padding between slots, and memory barriers instead of locks. Used in financial trading systems where microsecond latency matters.

Using more threads always improves performance in a concurrent system

Thread count beyond the optimal point (CPU cores * I/O ratio) increases context switching overhead, memory pressure (1MB stack per thread), and lock contention - performance decreases

Amdahl's Law shows diminishing returns from adding processors to any workload with a sequential component; beyond the optimal thread count, threads spend more time waiting for locks and context switching than doing useful work

When does LongAdder outperform AtomicLong for a shared counter?

Key Ideas

**Race condition diagnosis** requires enumerating all thread interleavings at the bytecode level; count++ is three operations (read-add-write); DCL requires volatile; while loops are mandatory around condition.await() due to spurious wakeups.
**Little's Law** (L = lambda * W) determines in-flight requests and required thread pool size; Amdahl's Law bounds speedup from parallelism - sequential fraction sets the ceiling regardless of thread count.
**Tradeoff quantification** separates strong from weak interview answers: LongAdder beats AtomicLong under high contention; ReadWriteLock beats mutex only when reads >> writes and are non-trivial; ConcurrentHashMap beats synchronizedMap for concurrent reads.

Вопросы для размышления

Implement a thread-safe LRU cache in Java with O(1) get and put operations. The cache has a maximum capacity; eviction is LRU. Which synchronization primitives would you use and why?
A service runs on 8-core servers and serves requests with 10ms CPU time and 90ms I/O wait at 1000 req/s load. Using Little's Law and the I/O ratio formula, determine the optimal thread pool size and explain the discrepancy between the two methods.
Diagnose the following code: a singleton initialized via double-checked locking without the volatile keyword. Draw the exact instruction reordering that causes another thread to see a partially-initialized object.