Convex Optimization

Differentiable Programming

In 2017 Google trained a neural-network optimizer that beat hand-tuned Adam on 80+ tasks. Airbus optimizes the A350 wing geometry through gradients of the aerodynamics equations, saving 4% fuel. **Differentiable programming** turns any program - simulator, renderer, physics engine - into a function that admits a backward pass and can be optimized.

**MAML and meta-learning**: Apple and Google use similar approaches for model personalization-fast adaptation to a user without retraining from scratch
**Differentiable simulators**: Tesla FSD is partially trained through differentiable road environment models-gradients from simulation replace expensive real-world trials
**Neural ODE**: wraps an ODE solver in a differentiable layer-used in molecular dynamics modeling and time series

Differentiable Programming: Programs as Computation Graphs

**Differentiable Programming (DP)** is a paradigm in which an entire program is treated as a computation graph with differentiable operations. If every operation has a defined derivative, then through the chain rule (autograd) we can compute the gradient of the loss with respect to any program parameter-including the algorithm structure, hyperparameters, or optimizer steps.

In classical ML we differentiate with respect to neural network weights. In DP we go further: we differentiate with respect to **hyperparameters** (learning rate, architecture), **algorithms** (physics simulation steps), **programs** (iterations of an inner optimizer). This opens the possibility of learning not just "what" but "how".

Any operation for which a backpropagation rule is defined (VJP-vector-Jacobian product) can be embedded in a differentiable graph. JAX, PyTorch, and TensorFlow support hundreds of operations out of the box. Custom solvers (LP, QP, manifold optimization) are wrapped via `custom_vjp` / `register_hook` with analytically derived gradients.

What is fundamentally new about differentiable programming compared to ordinary neural network autograd?

Implicit Differentiation and Bi-Level Optimization

Many tasks take the form: **find θ minimizing an outer objective, where an inner optimization over x(θ) is a constraint**. This is **bi-level optimization**. Computing the gradient requires differentiating through argmin-nontrivial because argmin has no closed form.

Implicit differentiation uses the **implicit function theorem**: if the first-order condition holds and the Hessian is non-singular, then x*(θ) is a differentiable function of θ. The key computational advantage: no need to store the graph of all inner optimizer steps-just solve one linear system H⁻¹v at the end.

**higher** (PyTorch): unrolls the optimizer and retains the graph-for exact unrolling. **jaxopt**: implements implicit differentiation via `implicit_diff=True` for iterative solvers. **cvxpylayers**: wraps CVXPY problems (LP, QP, SDP) in a differentiable layer via KKT conditions-the same first-order equations as implicit differentiation.

The main advantage of implicit differentiation over optimizer unrolling:

Optimizer Unrolling vs Implicit Differentiation

The two main approaches to differentiating through inner optimization have different trade-offs in memory, accuracy, and implementation complexity. The choice depends on the number of iterations, model size, and gradient accuracy requirements.

A compromise approach: unroll only the last K iterations instead of all T. This limits memory to O(K) and reduces vanishing gradients, but introduces gradient bias. Used in MAML with a small number of inner steps (usually 5-10) and in training score functions in diffusion models.

In bi-level optimization where the inner problem is an LP (linear program) with thousands of simplex iterations. Which approach is the only viable one?

Applications: MAML, Physics Simulation, NAS

Differentiable programming has become the foundation of several breakthrough ML directions in recent years-from meta-learning to differentiable physics and neural architecture search.

Why does MAML use `create_graph=True` when computing inner loop gradients?

Key Ideas

**Differentiable programming**-paradigm where the entire pipeline (simulator, optimizer, architecture) is treated as a differentiable computation graph
**Bi-level optimization**: outer problem optimizes over the inner problem's solution; requires differentiating through argmin
**Implicit differentiation**: gradient through optimality conditions (KKT/stationarity)-O(1) memory, exact at optimum
**Unrolling**: T optimizer steps unrolled into a graph-simple, but O(T) memory and vanishing/exploding gradient issues

Вопросы для размышления

Which problems in the practitioner's domain can be formulated as bi-level optimization?
What is the fundamental difference between MAML (unrolling) and meta-learning methods via implicit differentiation?
How does differentiable programming blur the boundary between 'model' and 'algorithm' in ML?

Связанные уроки

calc-18-partial