Convex Optimization
Differentiable Programming
In 2017 Google trained a neural-network optimizer that beat hand-tuned Adam on 80+ tasks. Airbus optimizes the A350 wing geometry through gradients of the aerodynamics equations, saving 4% fuel. **Differentiable programming** turns any program - simulator, renderer, physics engine - into a function that admits a backward pass and can be optimized.
- **MAML and meta-learning**: Apple and Google use similar approaches for model personalization-fast adaptation to a user without retraining from scratch
- **Differentiable simulators**: Tesla FSD is partially trained through differentiable road environment models-gradients from simulation replace expensive real-world trials
- **Neural ODE**: wraps an ODE solver in a differentiable layer-used in molecular dynamics modeling and time series
Differentiable Programming: Programs as Computation Graphs
**Differentiable Programming (DP)** is a paradigm in which an entire program is treated as a computation graph with differentiable operations. If every operation has a defined derivative, then through the chain rule (autograd) we can compute the gradient of the loss with respect to any program parameter-including the algorithm structure, hyperparameters, or optimizer steps.
In classical ML we differentiate with respect to neural network weights. In DP we go further: we differentiate with respect to **hyperparameters** (learning rate, architecture), **algorithms** (physics simulation steps), **programs** (iterations of an inner optimizer). This opens the possibility of learning not just "what" but "how".
Any operation for which a backpropagation rule is defined (VJP-vector-Jacobian product) can be embedded in a differentiable graph. JAX, PyTorch, and TensorFlow support hundreds of operations out of the box. Custom solvers (LP, QP, manifold optimization) are wrapped via `custom_vjp` / `register_hook` with analytically derived gradients.
What is fundamentally new about differentiable programming compared to ordinary neural network autograd?
Implicit Differentiation and Bi-Level Optimization
Many tasks take the form: **find θ minimizing an outer objective, where an inner optimization over x(θ) is a constraint**. This is **bi-level optimization**. Computing the gradient requires differentiating through argmin-nontrivial because argmin has no closed form.
Implicit differentiation uses the **implicit function theorem**: if the first-order condition holds and the Hessian is non-singular, then x*(θ) is a differentiable function of θ. The key computational advantage: no need to store the graph of all inner optimizer steps-just solve one linear system H⁻¹v at the end.
**higher** (PyTorch): unrolls the optimizer and retains the graph-for exact unrolling. **jaxopt**: implements implicit differentiation via `implicit_diff=True` for iterative solvers. **cvxpylayers**: wraps CVXPY problems (LP, QP, SDP) in a differentiable layer via KKT conditions-the same first-order equations as implicit differentiation.
The main advantage of implicit differentiation over optimizer unrolling:
Optimizer Unrolling vs Implicit Differentiation
The two main approaches to differentiating through inner optimization have different trade-offs in memory, accuracy, and implementation complexity. The choice depends on the number of iterations, model size, and gradient accuracy requirements.
A compromise approach: unroll only the last K iterations instead of all T. This limits memory to O(K) and reduces vanishing gradients, but introduces gradient bias. Used in MAML with a small number of inner steps (usually 5-10) and in training score functions in diffusion models.
In bi-level optimization where the inner problem is an LP (linear program) with thousands of simplex iterations. Which approach is the only viable one?
Applications: MAML, Physics Simulation, NAS
Differentiable programming has become the foundation of several breakthrough ML directions in recent years-from meta-learning to differentiable physics and neural architecture search.
Why does MAML use `create_graph=True` when computing inner loop gradients?
Key Ideas
- **Differentiable programming**-paradigm where the entire pipeline (simulator, optimizer, architecture) is treated as a differentiable computation graph
- **Bi-level optimization**: outer problem optimizes over the inner problem's solution; requires differentiating through argmin
- **Implicit differentiation**: gradient through optimality conditions (KKT/stationarity)-O(1) memory, exact at optimum
- **Unrolling**: T optimizer steps unrolled into a graph-simple, but O(T) memory and vanishing/exploding gradient issues
Related Topics
Differentiable programming bridges classical optimization with modern ML:
- Optimization on Manifolds — Riemannian Autodiff is a special case of DP where the graph includes manifold operations
- Federated Optimization — Personalized FL uses MAML-like bi-level schemes for client adaptation
- Optimization in Reinforcement Learning — Policy gradient is a special case of DP: REINFORCE differentiates through stochastic program execution
Вопросы для размышления
- Which problems in the practitioner's domain can be formulated as bi-level optimization?
- What is the fundamental difference between MAML (unrolling) and meta-learning methods via implicit differentiation?
- How does differentiable programming blur the boundary between 'model' and 'algorithm' in ML?