Scientific Computing
Scientific Computing in Interviews
In 1996 the Ariane 5 rocket exploded 39 seconds after launch - a $370M loss. The cause: a 64-bit float that did not fit into a 16-bit integer in code inherited from Ariane 4. In 1991 a Patriot missed because of round-off accumulation on 0.1 in binary - 28 soldiers killed. In a scientific computing interview, questions about conditioning, stability, and algorithm choice are not academic exercises but a defensive line against real catastrophes. A strong candidate demonstrates not formula knowledge but the ability to link a numerical detail to a system consequence: 'why we avoid normal equations for regression', 'when to switch from FP32 to FP16', 'how to discuss precision vs latency'.
- **Ariane 5 (1996)**: a $370M rocket exploded from an integer overflow in legacy code - the classic numerical-issue case
- **Patriot Dhahran (1991)**: round-off accumulation on 0.1 in binary drifted the clock 0.34 s over 100 hours, 28 deaths
- **Google TPU**: bfloat16 was designed specifically for ML workloads - a different range/precision trade-off vs IEEE float16
- **OpenAI GPT-4 inference**: mixed FP16/INT8 quantization deployed across 1M+ requests/sec while preserving 99% quality
- **LAPACK/MKL/OpenBLAS**: industrial libraries tuned to specific CPUs - hand-beating them is essentially impossible
Numerical Analysis: conditioning and precision
In 1991 a Patriot missile in Dhahran missed an incoming Iraqi Scud that hit the barracks and killed 28 soldiers. The cause: systematic round-off accumulation from multiplying 0.1 in binary representation. After 100 hours of uptime the internal clock had drifted by 0.34 seconds - enough for the targeting system to miss by 600 meters. In an interview, the question 'why is `0.1 + 0.2 != 0.3`?' is an entry point to a deeper conversation about **conditioning** of the problem and **stability** of the algorithm. Conditioning is a property of the problem (an ill-conditioned matrix inverts poorly under any algorithm). Stability is a property of the algorithm (Gaussian elimination with pivoting is stable, without it not always). A strong candidate keeps the two apart and does not confuse the source of error.
The condition number of a matrix is κ(A) = ‖A‖ * ‖A^-1‖. If κ = 10^6, an input relative error of 10^-7 (the order of machine epsilon in single precision) can blow up to 10^-1 in the output. In double precision (eps ~ 10^-16) one can work up to κ ~ 10^10. Beyond that the problem must be reformulated: SVD with truncation, LSQR instead of normal equations, scaling. Numerical analysis in an interview is not about knowing formulas - it is about reading diagnostics (residual, condition estimate) and proposing a reformulation.
A candidate says: 'Condition number 10^14 - so the algorithm is unstable.' What is wrong with that statement?
Algorithm Design: picking a method to match the problem
Interviewer: 'Solve Ax=b with A a 10^6 x 10^6 matrix.' A weak candidate proposes `np.linalg.solve` (O(n^3)) and the interview ends. A strong candidate asks clarifying questions: sparse? structured? symmetric positive definite? How many right-hand sides with the same A? What accuracy is needed? Each answer redirects the algorithm: dense LU is O(n^3) and 8 TB of memory, inapplicable. Sparse direct (CHOLMOD) is O(n * nnz^2) with structure. Iterative (CG for SPD, GMRES in general) is O(k * nnz), with k iterations ~ sqrt(κ). A preconditioner (ILU, multigrid) speeds things up dramatically. Without dialogue there is no correct answer - that is what is being tested.
An interview decision tree: (1) **Size**: <10^4 - dense LU/QR; <10^7 sparse - direct solver; >10^7 - iterative. (2) **Structure**: symmetric? positive definite? Toeplitz/circulant (FFT)? banded? block structure? (3) **Use pattern**: single solve - direct; same A, many b - factorize once, reuse; varying A - iterative with warm starts. (4) **Hardware**: CPU - BLAS3 for dense; GPU - batched ops; distributed - block algorithms. A candidate who starts with clarifying questions scores higher than one who jumps to code.
Interview anti-patterns: (a) jumping to code without clarifying questions; (b) suggesting `np.linalg.inv(A) @ b` instead of `solve` - inversion is 3x slower and less stable; (c) ignoring structure and applying a dense method; (d) forgetting a preconditioner for iterative methods on ill-conditioned systems.
Interviewer: 'You have a 5000x5000 matrix A and need to solve Ax=b for 1000 different b with the same A. What would you suggest?'
Optimization: BLAS, vectorization, and parallelism
A candidate writes a numerical loop in Python and brags about the result: 100 GFLOPS on a matrix product. The interviewer pulls out a laptop, runs `numpy.dot` on the same machine: 500 GFLOPS. The gap is BLAS. **BLAS Level 1** (vector ops) does 1 operation per 2 memory accesses - memory-bound. **Level 2** (matrix-vector) does 2 ops per 1.5 memory accesses - still memory-bound. **Level 3** (matrix-matrix) does O(n^3) ops per O(n^2) memory - compute-bound, capable of reaching peak FLOPS. A good candidate knows that computations should be reformulated in Level 3 form to amortize memory bandwidth. NumPy/SciPy under the hood call MKL/OpenBLAS, which already did the work.
The roofline model is the analysis tool for interviews: X-axis - arithmetic intensity (FLOP/byte), Y-axis - performance (FLOPS). The roof line = min(peak_FLOPS, bandwidth * intensity). An algorithm with intensity below the break-even point is memory-bound; above it, compute-bound. AXPY (Level 1) has intensity 1/4 - memory-bound. GEMM (Level 3) has intensity n/3 - compute-bound for large n. Vectorization (SIMD AVX-512) provides 8x speedup in double precision, multi-threading adds another 16-64x on modern CPUs. A GPU adds another 10-100x on top for compute-bound code.
Interview question: 'You have a loop over 10^9 elements with independent computations per element. Is `multiprocessing.Pool` enough?'
Tradeoffs: accuracy vs speed, memory vs compute
Interviewer: 'Design an inference system for a 10B-parameter network at 1000 requests/second.' A senior candidate spells out trade-offs: **FP16 vs FP32** - 2x speed and 2x memory, with 0.5% accuracy loss on typical models. **INT8 quantization** - another 4x speedup, 1-2% loss, requires calibration. **Sparsity** - 2:4 structured sparsity on A100/H100 gives 2x matmul speedup at 50% of parameters. **Distillation** - a 10x smaller model with 95% accuracy of the original. Each trade-off is not 'good/bad' but a specific point on the performance-vs-quality curve. A candidate able to walk those points demonstrates architectural thinking.
Universal interview trade-off patterns: (1) **memory vs compute** - recomputation vs caching (FFT recompute, gradient checkpointing); (2) **precision vs speed** - mixed precision training (FP16 forward, FP32 master weights); (3) **accuracy vs latency** - approximate algorithms (Locality Sensitive Hashing for nearest neighbour); (4) **throughput vs latency** - batching (latency rises, throughput rises too); (5) **complexity vs maintainability** - cuBLAS vs hand-tuned CUDA. A senior candidate does not pick 'the best' - they assess the position on the Pareto frontier for the specific use case.
Numerical code is about formulas and the BLAS API
Numerical code in an interview is about reading and discussing trade-offs: conditioning vs stability, dense vs sparse, direct vs iterative, FP32 vs FP16, memory vs compute. A candidate who knows 50 formulas but cannot connect them to a system loses to a candidate who asks clarifying questions and weighs alternatives
Senior level in HPC / numerical work is the ability to synthesize a solution from constraints (memory, time, accuracy, hardware), not to recite a textbook. Modern libraries (NumPy, SciPy, PyTorch, JAX) hide implementation detail - what matters is which API to use when and why. Interviewers look for systems thinking, not numerical recall.
Interviewer: 'The application is real-time inference with a 50 ms budget. The model takes 100 ms in FP32. What is the move?'
Key Ideas
- **Conditioning vs stability** is the interview-grade terminology distinction: conditioning lives in the problem (κ(A)), stability lives in the algorithm
- **Method choice** begins with clarifying questions: size, structure, usage frequency, hardware; the answer follows from the answers
- **BLAS Level 3** amortizes memory bandwidth - reformulating computations into matmul form is the central optimization for scientific code
- **Trade-offs** in interviews: FP32/FP16/INT8/INT4, dense/sparse, direct/iterative, recompute/cache; senior candidates walk the Pareto frontier instead of picking 'the best'
- **Numerical interviews** are not about formulas but systems thinking: linking a numerical detail to system consequences (cost, latency, accuracy, certification)
Related Topics
Scientific computing interviews intersect with HPC, ML deployment, and systems design:
- Scientific Computing at Scale — Cloud HPC, reproducibility, workflows - the production consequences of the trade-offs discussed in interviews
- Numerical Optimization — Gradient descent, Newton, L-BFGS - a family of algorithms with convergence-rate vs memory trade-offs
- Parallel Scientific Computing — BLAS Level 3 + multi-threading + GPU - the optimization stack on which modern numerical code runs
Вопросы для размышления
- Patriot and Ariane show that a numerical bug can cost lives and billions. Which verification processes can prevent such error classes in modern production code?
- Senior engineers discuss trade-offs instead of returning a 'correct method'. What other CS areas demand that mindset, and where is the opposite (a direct answer) expected?
- Mixed precision training (FP16+FP32 master) is an architectural decision hiding complexity from the user. Which other precision pairs may be useful (for example INT8 inference with FP32 calibration)?
Связанные уроки
- sci-13 — HPC interview builds on knowledge of scaling
- alg-01-big-o — Algorithm complexity is a baseline HPC interview question
- par-06 — MPI is a standard question for scientific computing roles
- opt-16 — Both lessons are interview prep formats for adjacent fields
- dl-12 — HPC interviewers often ask about distributed deep learning
- nm-01