Dynamical Systems in ML Interviews

Of 100 candidates for a Research Scientist role at Google DeepMind, 90 know PyTorch, 50 know Transformer architectures, and 10 can explain why those architectures work through mathematics. Those 10 are the ones who get offers. Dynamical systems is one of the primary differentiators.

**Google Brain / DeepMind:** questions on spectral norm regularization, GAN training stability, Lyapunov stability in RL
**Anthropic:** questions on generalization mechanisms, the loss landscape, why certain initializations work better
**Nvidia Research:** Physics-Informed NNs, Neural ODE for simulation, Hamiltonian NN for molecular dynamics

Prerequisites

Dynamical Systems in ML

Stability

**"Explain what stability means in an ML context"** is a question often asked in Research Scientist interviews at labs such as DeepMind and Google Brain. Most candidates answer only about training. But behind it lie three distinct concepts: stability of equilibria (Lyapunov), training stability (gradient stability), and stability under perturbations (robustness).

**Three kinds of stability in ML:** 1) **Training stability:** vanishing and exploding gradients correspond to negative and positive Lyapunov exponents of the product of layer Jacobians; 2) **Stability under perturbations (robustness):** the model behaves predictably under small input changes (adversarial robustness); 3) **Lyapunov stability:** in the context of RNNs or other dynamical systems, an equilibrium is stable if trajectories that start close to it stay close to it.
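
The first item can be made concrete by estimating a Lyapunov exponent numerically. The sketch below is illustrative only: it assumes each layer Jacobian is an independent Gaussian random matrix with entry standard deviation gain/√width (a hypothetical stand-in for a real network), and the function name `lyapunov_exponent` is invented for this example.

```python
import numpy as np

def lyapunov_exponent(gain, width=64, depth=200, seed=0):
    """Estimate the top Lyapunov exponent of a product of random Jacobians.

    Assumption: each layer Jacobian is Gaussian with entry std gain / sqrt(width).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(width)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(depth):
        J = rng.standard_normal((width, width)) * gain / np.sqrt(width)
        v = J @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)  # accumulate the per-layer log stretch factor
        v /= norm                   # renormalize to avoid overflow or underflow
    return log_growth / depth       # average log growth per layer

for gain in (0.5, 1.0, 2.0):
    print(f"gain={gain}: lambda ~ {lyapunov_exponent(gain):+.3f}")
```

Under these assumptions the estimate should come out negative for gain < 1 (the signal shrinks layer by layer), positive for gain > 1 (it grows), and close to zero near gain = 1.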

Interview question	Core concept	Strong answer
Why are deep networks hard to train?	Lyapunov exponents of Jacobians	Vanishing/exploding gradient = instability in the matrix product
What is adversarial robustness?	Sensitivity to initial conditions	Small input perturbations cause large output changes (λ > 0)
Why does BatchNorm work?	Stabilization of dynamics	Normalizes the Jacobian toward the identity, stabilizing gradient propagation
Why use skip connections?	Neutral stability	Adds I to the layer Jacobian (I + ∂F/∂x), so eigenvalues sit near 1 when the residual branch is small and gradients always have an identity path
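
The skip-connection row of the table can be checked with a small experiment. This is a sketch under stated assumptions, not a model of any real network: the residual branch Jacobians are small random Gaussian matrices (scale 0.1 is an arbitrary choice), and `backprop_norm` is a hypothetical helper name.

```python
import numpy as np

def backprop_norm(depth, width, branch_scale, residual, seed=0):
    """Norm of a unit gradient vector after backpropagating through `depth` layers."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(width)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Assumed branch Jacobian: Gaussian with entry std branch_scale / sqrt(width)
        J = rng.standard_normal((width, width)) * branch_scale / np.sqrt(width)
        if residual:
            J = np.eye(width) + J  # the skip connection adds the identity
        g = J.T @ g                # backprop multiplies by the transposed Jacobian
    return np.linalg.norm(g)

plain = backprop_norm(depth=50, width=64, branch_scale=0.1, residual=False)
skip = backprop_norm(depth=50, width=64, branch_scale=0.1, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

With the same branch weights, the plain stack drives the gradient norm toward zero, while the residual stack keeps it of order one because every layer contributes a factor close to the identity.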

**Interview answer strategy:** start with the formal definition (Lyapunov), connect it to a specific ML context (gradient stability), give a formula or code example, and mention a practical solution (BatchNorm, skip connections, gradient clipping). This four-step approach demonstrates depth of understanding.

An interviewer asks: "How does Lyapunov stability relate to the vanishing gradient problem?" The best answer is:

Bifurcation

**"What happens when the learning rate changes?"** This is a bifurcation question! A small LR gives slow but stable convergence. A large one gives divergence. In between: periodic oscillations in the loss. Knowledge of bifurcation theory explains the "magic" of warmup, cosine annealing, and learning rate schedules.

**Gradient descent as a dynamical system:** gradient descent (and SGD, with a noisy gradient) is the iteration θ_{t+1} = θ_t − η∇L(θ_t). For small η: stable dynamics (convergence to a minimum). For large η: instability. Critical threshold: **η_max = 2/L_smooth**, where L_smooth is the Lipschitz constant of the gradient ∇L (for a quadratic loss, the largest Hessian eigenvalue). When η > 2/L_smooth a bifurcation occurs: the minimum stops being a stable fixed point, and the iterates oscillate and then diverge. **Learning rate warmup** raises η gradually so that training stays inside the stable region.
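
The threshold is easiest to see on a one-dimensional quadratic, where each step multiplies θ by (1 − η·curvature). The sketch below assumes the toy loss f(θ) = ½·c·θ² with c = 4 (an arbitrary choice, so the threshold is η = 0.5); `run_gd` is a hypothetical helper.

```python
import numpy as np

def run_gd(eta, curvature=4.0, theta0=1.0, steps=50):
    """Gradient descent on the toy loss f(theta) = 0.5 * curvature * theta**2."""
    theta = theta0
    trajectory = [theta]
    for _ in range(steps):
        grad = curvature * theta      # exact gradient of the quadratic
        theta = theta - eta * grad    # theta is multiplied by (1 - eta * curvature)
        trajectory.append(theta)
    return np.array(trajectory)

curvature = 4.0
eta_max = 2.0 / curvature  # stability threshold for this toy loss
for eta in (0.1, 0.45, 0.5, 0.55):
    traj = run_gd(eta, curvature=curvature)
    print(f"eta={eta:.2f}  multiplier={1 - eta * curvature:+.2f}  "
          f"|theta_final|={abs(traj[-1]):.3e}")
```

Below the threshold the iterates shrink (monotonically for a positive multiplier, with sign flips for a negative one); exactly at η = 2/c they bounce between +θ₀ and −θ₀ forever; above it the oscillation grows without bound.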

Training question	Bifurcation perspective
Why is warmup important?	At initialization the curvature is high (the largest Hessian eigenvalue λ_max is large), so a small η is needed to stay in the stable region η < 2/λ_max
Why does cosine annealing work?	Gradually decreasing η lets the optimizer pass through the bifurcation and settle into a narrow minimum
Loss oscillates with a large batch?	Less noise (large batch) means less randomness, pushing the system closer to a deterministic bifurcation
Catastrophic forgetting?	New data shifts the loss landscape: an attractor bifurcation destroys old attractors

During neural network training with GD, the loss starts oscillating instead of converging. From a dynamical systems perspective this means:

Modeling

**"How can the spread of a virus through a social network be modeled?"** is a typical case question at ML companies. A strong answer starts not with choosing a neural architecture but with choosing the right dynamical model. SIR, SEIR, network epidemic models: each has its own assumptions and domain of applicability.
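
As a reference point for such a case question, the simplest of these models fits in a few lines. The sketch below is the aggregate (well-mixed) SIR model integrated with forward Euler; it ignores network structure entirely, and the parameter values (β, γ, the initial infected fraction) are arbitrary illustrative choices, not fitted to any data.

```python
import numpy as np

def simulate_sir(beta, gamma, i0=1e-3, days=200, dt=0.1):
    """Forward-Euler integration of SIR with a normalized population.

    dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    """
    steps = int(round(days / dt))
    s, i, r = 1.0 - i0, i0, 0.0
    history = np.empty((steps + 1, 3))
    history[0] = (s, i, r)
    for k in range(steps):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history[k + 1] = (s, i, r)
    return history

gamma = 0.1
for beta in (0.05, 0.3):
    h = simulate_sir(beta, gamma)
    print(f"R0={beta / gamma:.1f}  peak infected={h[:, 1].max():.3f}  "
          f"final recovered={h[-1, 2]:.3f}")
```

The basic reproduction number R₀ = β/γ acts as the threshold: for R₀ < 1 the infected fraction only decays, while for R₀ > 1 an outbreak occurs. A network SIR model replaces the single β·S·I term with per-edge transmission.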

**Model selection framework:** 1) **What are we modeling?** Continuous or discrete dynamics? 2) **What level of detail?** Aggregate (ODE) vs agent-based (ABM) vs neural (Neural ODE). 3) **What data do we have?** Time series → model identification. 4) **What are we predicting?** Equilibrium, trajectory, tipping point? 5) **Interpretability?** Parametric model vs black box.

Modeling task	Recommended model	Rationale
Demand forecast for a product	Logistic curve + seasonality	Known growth shape with saturation
Viral content in a social network	Network SIR with R₀	Heterogeneous network structure matters
Financial time series	Stochastic differential equation (SDE)	Deterministic drift + noise; heavy-tailed noise captures fat tails
Robot trajectory	Neural ODE / Hamiltonian NN	Physical conservation laws
Anomalies in logs	Echo State Network	Online learning, real-time

The task is to predict the trajectory of a physical object (a ball, a satellite). Which architecture is the right choice?

Applications

**The final interview round tests depth of understanding.** An interviewer at DeepMind or Anthropic wants to see the ability to connect abstract mathematics to concrete problems. "How can the training stability of GPT be improved?" A strong answer starts with Lyapunov exponents, moves to architectural solutions, and ends with concrete proposals.

**Key connections: Dynamical Systems to ML.** Lyapunov stability → gradient stability (vanishing/exploding). Bifurcations → learning rate schedules, loss landscape transitions. Attractors → minima in the loss landscape, memory in RNNs. Chaotic sensitivity → adversarial examples. KAM theory → generalization: stable tori in parameter space correspond to flat minima. Synchronization → gradient alignment in distributed training.

Typical interview question	Expected answer via DS
Why do Transformers scale better than RNNs?	RNN: sequential dynamics whose Jacobian products vanish or explode over long sequences (λ ≠ 0); Attention: parallel access to all positions, with no long chain of Jacobians through time and so no temporal instability
What is the loss landscape?	A surface f: ℝ^N → ℝ; attractors = minima; minimum width ~ generalization (flat vs sharp minimum)
How does dropout work?	Stochastic perturbation: trains an ensemble of dynamical systems; output = mean field over trajectories
Why does Adam often outperform SGD?	Adapts the LR per parameter using running estimates of gradient moments (a rough proxy for curvature), which evens out stability scales across directions

Current state of the field (2024-2025)

The intersection of dynamical systems and ML is one of the most active areas in 2024-2025. Selective State Space Models (Mamba) explicitly use linear dynamical systems theory. xLSTM rethinks LSTM with explicit stability control. Neural Operators (FNO, DeepONet) learn solution operators of PDEs, an operator-level counterpart to Neural ODEs. Physics-Informed Neural Networks (PINNs) include dynamical equations as constraints. These are not academic toys: product teams at Nvidia, Meta, and Google actively use these approaches.

**Misconception:** "Dynamical systems theory is irrelevant for ML interviews: only practical PyTorch skills matter."

For research positions at top labs (DeepMind, Google Brain, Anthropic, OpenAI), mathematical depth is what separates candidates. Understanding dynamical systems explains why ML techniques work, not just how to apply them.

Average candidate: "BatchNorm stabilizes training." Strong candidate: "BatchNorm normalizes the Jacobian of each layer toward the identity, achieving a condition close to dynamical isometry, which prevents exponential decay or growth of the gradient." The second answer shows understanding of the mechanism, not just knowledge of the fact.
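
Dynamical isometry means that the singular values of the network's input-output Jacobian all stay close to 1. The sketch below shows the idea in its simplest setting: a product of linear-layer Jacobians with orthogonal versus Gaussian weights. It does not model BatchNorm; orthogonal matrices are used here as an assumption because they achieve isometry exactly, and `jacobian_singular_values` is a hypothetical helper.

```python
import numpy as np

def jacobian_singular_values(depth, width, orthogonal, seed=0):
    """Singular values of a product of `depth` random linear-layer Jacobians."""
    rng = np.random.default_rng(seed)
    J = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        if orthogonal:
            W, _ = np.linalg.qr(W)  # Q factor of a Gaussian matrix is orthogonal
        else:
            W = W / np.sqrt(width)  # variance-preserving Gaussian scaling
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for label, flag in (("gaussian", False), ("orthogonal", True)):
    s = jacobian_singular_values(depth=10, width=32, orthogonal=flag)
    print(f"{label:>10}: min={s.min():.2e}  max={s.max():.2e}")
```

The orthogonal product keeps every singular value at 1, so no gradient direction is amplified or suppressed. The Gaussian product preserves the norm only on average and spreads its singular values over many orders of magnitude.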

An interviewer asks: "Why do flat minima generalize better?" The best answer from a dynamical systems perspective:

Key ideas

**Stability = Lyapunov exponents of Jacobians:** vanishing gradient (λ < 0) vs exploding gradient (λ > 0); BatchNorm, skip connections, and orthogonal initialization are solutions
**Bifurcations in training:** critical LR η_crit = 2/L_smooth; oscillating loss = bifurcation threshold exceeded; warmup = entering the stable region gradually
**Model selection:** physics → Hamiltonian NN; epidemic → SIR; time series → ESN/Neural ODE; each choice is justified by the dynamical structure of the problem
**DS → ML connections:** Lyapunov → gradient stability, bifurcations → LR schedules, attractors → minima, KAM → generalization, chaos → adversarial robustness

Questions for reflection

Prepare a 3-minute answer to: "How can the training stability of a very deep network be ensured?" Use concepts from this course.
"Why did Transformers replace RNNs?" Try answering through dynamical systems theory rather than just saying "attention works better".
The task is to model the spread of a meme on Twitter. Propose three models of different complexity and explain the trade-offs of each.

Related lessons

de-01