Dynamical Systems in ML Interviews
Of 100 candidates for a Research Scientist role at Google DeepMind, 90 know PyTorch, 50 know Transformer architectures, and 10 can explain why those architectures work through mathematics. Those 10 are the ones who get offers. Dynamical systems is one of the primary differentiators.
- **Google Brain / DeepMind:** questions on spectral norm regularization, GAN training stability, Lyapunov stability in RL
- **Anthropic:** questions on generalization mechanisms, the loss landscape, why certain initializations work better
- **Nvidia Research:** Physics-Informed NNs, Neural ODE for simulation, Hamiltonian NN for molecular dynamics
Prerequisites
Stability
**"Explain what stability means in an ML context"** is a question often heard at DeepMind, Google Brain, or for a Research Scientist role. Most candidates answer only about training. But behind it lie three distinct concepts: stability of equilibria (Lyapunov), training stability (gradient stability), and stability under perturbations (robustness).
**Three kinds of stability in ML:** 1) **Training stability:** vanishing/exploding gradients correspond to negative/positive Lyapunov exponents of the product of layer Jacobians; 2) **Robustness to perturbations:** the model behaves predictably under small input changes (adversarial robustness); 3) **Lyapunov stability:** in the context of RNNs or other dynamical systems, an equilibrium is stable if trajectories that start close to it stay close to it.
| Interview question | Core concept | Strong answer |
|---|---|---|
| Why are deep networks hard to train? | Lyapunov exponents of Jacobians | Vanishing/exploding gradient = instability in the matrix product |
| What is adversarial robustness? | Sensitivity to initial conditions | Small input perturbations cause large output changes (λ > 0) |
| Why does BatchNorm work? | Stabilization of dynamics | Normalizes the Jacobian toward the identity, stabilizing gradient propagation |
| Why use skip connections? | Neutral stability | The block Jacobian becomes I + ∂F/∂x, so an identity path always carries the gradient and the eigenvalues stay near 1 while ∂F/∂x is small |
**Interview answer strategy:** start with the formal definition (Lyapunov), connect it to a specific ML context (gradient stability), give a formula or code example, and mention a practical solution (BatchNorm, skip connections, gradient clipping). This four-step approach demonstrates depth of understanding.
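For the "formula or code example" step, one possible sketch is a numerical estimate of the top Lyapunov exponent of a product of layer Jacobians. It assumes a toy linear network whose Jacobians are independent Gaussian matrices; the function name `lyapunov_exponent` and the parameter `weight_scale` are illustrative, not taken from any library. A negative estimate corresponds to vanishing gradients, a positive one to exploding gradients.

```python
import numpy as np

def lyapunov_exponent(weight_scale, depth=200, width=64, seed=0):
    """Estimate the top Lyapunov exponent of a product of random layer Jacobians.

    Assumption: each Jacobian is a Gaussian matrix with entry std
    weight_scale / sqrt(width). This is a toy linear network, not a trained model.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(width)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(depth):
        J = rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
        v = J @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)  # accumulate the log of the per-layer stretch
        v /= norm                   # renormalize to avoid overflow or underflow
    return log_growth / depth       # average log growth rate per layer

for scale in (0.5, 1.0, 2.0):
    print(f"weight_scale={scale}: lambda = {lyapunov_exponent(scale):+.3f}")
```

In this toy setting the exponent is close to log(weight_scale), so a scale near 1 sits at the boundary between decay and growth.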
An interviewer asks: "How does Lyapunov stability relate to the vanishing gradient problem?" The best answer is:
Bifurcation
**"What happens when the learning rate changes?"** This is a bifurcation question! A small LR gives slow but stable convergence. A large one gives divergence. In between: periodic oscillations in the loss. Knowledge of bifurcation theory explains the "magic" of warmup, cosine annealing, and learning rate schedules.
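**Gradient descent as a dynamical system:** GD is the iteration θ_{t+1} = θ_t − η∇L(θ_t) (SGD replaces ∇L with a noisy mini-batch estimate). For small η: stable dynamics (convergence to a minimum). For large η: instability. Critical threshold: **η_max = 2/L_smooth**, where L_smooth is the Lipschitz constant of the gradient ∇L (for a quadratic loss, the largest Hessian eigenvalue). When η crosses 2/L_smooth a bifurcation occurs: the minimum loses stability, and the iterates first oscillate around it, then diverge. **Learning rate warmup** keeps η small early on, so training starts inside the stable region.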
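The threshold can be checked on a toy problem. The sketch below assumes a diagonal quadratic loss, for which the smoothness constant is simply the largest curvature; `run_gd` and `curvatures` are hypothetical names chosen for this illustration.

```python
import numpy as np

def run_gd(eta, curvatures, steps=100):
    """Run gradient descent on f(theta) = 0.5 * sum(c_i * theta_i**2)."""
    theta = np.ones_like(curvatures, dtype=float)
    for _ in range(steps):
        grad = curvatures * theta   # gradient of the quadratic
        theta = theta - eta * grad  # theta_{t+1} = theta_t - eta * grad
    return 0.5 * np.sum(curvatures * theta**2)

curvatures = np.array([1.0, 10.0])  # Hessian eigenvalues; the largest one is 10
eta_max = 2.0 / curvatures.max()    # stability threshold for this toy loss

for eta in (0.5 * eta_max, 0.99 * eta_max, 1.01 * eta_max):
    print(f"eta={eta:.3f}: final loss = {run_gd(eta, curvatures):.3e}")
```

Just below the threshold the iterate along the stiff direction flips sign at every step while shrinking slowly; just above it, the same sign-flipping motion grows without bound.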
| Training question | Bifurcation perspective |
|---|---|
| Why is warmup important? | Initial parameters are unstable (λ_max is large); a small η is needed to stay in the stable region |
| Why does cosine annealing work? | Gradually decreasing η lets the optimizer pass through the bifurcation and settle into a narrow minimum |
| Loss oscillates with a large batch? | A large batch reduces gradient noise, so training behaves like deterministic GD and the sharp stability threshold on η shows up as oscillations |
| Catastrophic forgetting? | New data shifts the loss landscape: an attractor bifurcation destroys old attractors |
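
The warmup and cosine annealing rows can be made concrete with a small schedule function. This is a minimal sketch of one common form (linear warmup followed by cosine decay to zero); `lr_schedule` and its parameters are illustrative names, not the API of any particular framework.

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly so early steps use a small, safe learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_schedule(step, total_steps=1000, peak_lr=1e-3, warmup_steps=100))
```

The design choice is that η stays well below the peak while the curvature is still unknown, and shrinks again toward the end of training.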
During neural network training with GD, the loss starts oscillating instead of converging. From a dynamical systems perspective this means:
Modeling
**"How can the spread of a virus through a social network be modeled?"** is a typical case question at ML companies. A strong answer starts not with choosing a neural architecture but with choosing the right dynamical model. SIR, SEIR, network epidemic models: each has its own assumptions and domain of applicability.
**Model selection framework:** 1) **What are we modeling?** Continuous or discrete dynamics? 2) **What level of detail?** Aggregate (ODE) vs agent-based (ABM) vs neural (Neural ODE). 3) **What data do we have?** Time series → model identification. 4) **What are we predicting?** Equilibrium, trajectory, tipping point? 5) **Interpretability?** Parametric model vs black box.
| Modeling task | Recommended model | Rationale |
|---|---|---|
| Demand forecast for a product | Logistic curve + seasonality | Known growth shape with saturation |
| Viral content in a social network | Network SIR with R₀ | Heterogeneous network structure matters |
| Financial time series | Stochastic ODE (SDE) | Determinism + noise + fat tails |
| Robot trajectory | Neural ODE / Hamiltonian NN | Physical conservation laws |
| Anomalies in logs | Echo State Network | Online learning, real-time |
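
As a baseline for the epidemic-style rows, the classic well-mixed SIR model can be simulated in a few lines. The sketch below assumes population fractions and a simple forward Euler integrator; it ignores network structure, so it is a starting point rather than the network SIR model from the table. The name `simulate_sir` and the parameter values are illustrative.

```python
import numpy as np

def simulate_sir(beta, gamma, i0=0.001, days=160, dt=0.1):
    """Integrate the SIR equations with forward Euler on population fractions.

    dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I
    """
    steps = int(round(days / dt))
    s, i, r = 1.0 - i0, i0, 0.0
    history = np.empty((steps + 1, 3))
    history[0] = (s, i, r)
    for k in range(steps):
        new_infections = beta * s * i * dt
        recoveries = gamma * i * dt
        s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        history[k + 1] = (s, i, r)
    return history

gamma = 0.1
for beta in (0.05, 0.3):
    traj = simulate_sir(beta, gamma)
    print(f"R0={beta / gamma:.1f}: peak infected fraction = {traj[:, 1].max():.3f}")
```

With R₀ = β/γ below 1 the infected fraction only decays; above 1 an outbreak occurs. This threshold is the kind of tipping point the model selection framework asks about.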
The task is to predict the trajectory of a physical object (a ball, a satellite). Which architecture is the right choice?
Applications
**The final interview round tests depth of understanding.** An interviewer at DeepMind or Anthropic wants to see the ability to connect abstract mathematics to concrete problems. "How can the training stability of GPT be improved?" A strong answer starts with Lyapunov exponents, moves to architectural solutions, and ends with concrete proposals.
**Key connections: Dynamical Systems to ML.** Lyapunov stability → gradient stability (vanishing/exploding). Bifurcations → learning rate schedules, loss landscape transitions. Attractors → minima in the loss landscape, memory in RNNs. Chaotic sensitivity → adversarial examples. KAM theory → generalization (a heuristic analogy rather than an established result): stable tori in parameter space are likened to flat minima. Synchronization → gradient alignment in distributed training.
| Typical interview question | Expected answer via DS |
|---|---|
| Why do Transformers scale better than RNNs? | RNN: sequential dynamics with λ > 0 for long sequences; Attention: parallel static operation, no temporal instability |
| What is the loss landscape? | A surface f: ℝ^N → ℝ; attractors = minima; minimum width ~ generalization (flat vs sharp minimum) |
| How does dropout work? | Stochastic perturbation: trains an ensemble of dynamical systems; output = mean field over trajectories |
| Why does Adam often outperform SGD? | Rescales each coordinate's step by a running estimate of the gradient magnitude (a rough curvature proxy), giving different stability scales in different directions |
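
The "minimum width ~ generalization" row can be illustrated with a toy calculation: perturb the parameters around a minimum and measure how much the loss rises. The sketch assumes a one-dimensional quadratic minimum and Gaussian parameter noise; `expected_loss_increase` is a hypothetical helper, and the curvature values are arbitrary.

```python
import numpy as np

def expected_loss_increase(curvature, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[f(eps)] for f(theta) = 0.5 * curvature * theta**2.

    The minimum is at theta = 0 with f = 0, so this is the mean loss increase
    under Gaussian parameter noise eps with standard deviation sigma.
    """
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n_samples)
    return np.mean(0.5 * curvature * eps**2)

sigma = 0.1
flat = expected_loss_increase(curvature=0.1, sigma=sigma)
sharp = expected_loss_increase(curvature=10.0, sigma=sigma)
print(f"flat minimum: {flat:.5f}, sharp minimum: {sharp:.5f}")
```

The analytic value is 0.5 · curvature · σ², so the same perturbation costs far more at the sharp minimum. This sensitivity is the intuition behind the flat-versus-sharp argument, not a proof of it.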
Current state of the field (2024-2025)
The intersection of dynamical systems and ML is one of the most active areas in 2024-2025. Selective State Space Models (Mamba) explicitly use linear dynamical systems theory. xLSTM rethinks LSTM with explicit stability control. Neural Operators (FNO, DeepONet) learn maps between function spaces, such as PDE solution operators. Physics-Informed Neural Networks (PINNs) include dynamical equations as constraints in the loss. These are not academic toys: product teams at Nvidia, Meta, and Google actively use these approaches.
**Misconception:** "Dynamical systems theory is irrelevant for ML interviews: only practical PyTorch skills matter."
For research positions at top labs (DeepMind, Google Brain, Anthropic, OpenAI), mathematical depth is what separates candidates. Understanding dynamical systems explains why ML techniques work, not just how to apply them.
Average candidate: "BatchNorm stabilizes training." Strong candidate: "BatchNorm normalizes the Jacobian of each layer toward the identity, achieving a condition close to dynamical isometry, which prevents exponential decay or growth of the gradient." The second answer shows understanding of the mechanism, not just knowledge of the fact.
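Dynamical isometry can be illustrated numerically by comparing the singular values of a deep product of layer matrices. The sketch below is a toy linear model and does not implement BatchNorm: it contrasts orthogonal matrices, whose product keeps every singular value at 1, with variance-scaled Gaussian matrices, whose product becomes badly conditioned. The helper name `singular_value_range` is an assumption made for this example.

```python
import numpy as np

def singular_value_range(depth, width, orthogonal, seed=0):
    """Return (largest, smallest) singular value of a product of `depth` matrices."""
    rng = np.random.default_rng(seed)
    product = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        if orthogonal:
            W, _ = np.linalg.qr(W)   # orthogonal factor: all singular values are 1
        else:
            W = W / np.sqrt(width)   # variance-preserving Gaussian scaling
        product = W @ product
    s = np.linalg.svd(product, compute_uv=False)
    return s.max(), s.min()

for name, flag in (("orthogonal", True), ("gaussian", False)):
    smax, smin = singular_value_range(depth=20, width=32, orthogonal=flag)
    print(f"{name}: s_max = {smax:.3e}, s_min = {smin:.3e}")
```

A wide spread of singular values means some gradient directions explode while others vanish. Keeping them near 1 is the condition the strong answer refers to.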
An interviewer asks: "Why do flat minima generalize better?" The best answer from a dynamical systems perspective:
Key ideas
- **Stability = Lyapunov exponents of Jacobians:** vanishing gradient (λ < 0) vs exploding gradient (λ > 0); BatchNorm, skip connections, and orthogonal initialization are standard remedies
- **Bifurcations in training:** critical LR η_crit = 2/L_smooth; oscillating loss = bifurcation threshold exceeded; warmup = starting inside the stable region
- **Model selection:** physics → Hamiltonian NN; epidemic → SIR; time series → ESN/Neural ODE; each choice is justified by the dynamical structure of the problem
- **DS → ML connections:** Lyapunov → gradient stability, bifurcations → LR schedules, attractors → minima, KAM → generalization, chaos → adversarial robustness
Related topics
This lesson brings the whole course together as practical application:
- Bifurcations — Bifurcations in gradient descent as the LR changes: a direct application of dyn-04
- Neurodynamics — Hopfield networks, Neural ODE, and synchronization from dyn-11 all appear in ML interviews
- Dynamical Systems in ML — Technical details of Neural ODE, ESN, and dynamical isometry
Questions for reflection
- Prepare a 3-minute answer to: "How can the training stability of a very deep network be ensured?" Use concepts from this course.
- "Why did Transformers replace RNNs?" Try answering through dynamical systems theory rather than just saying "attention works better".
- The task is to model the spread of a meme on Twitter. Propose three models of different complexity and explain the trade-offs of each.