Convex Optimization

Large-Scale Optimisation: From Cluster to Supercomputer

GPT-4 was trained on thousands of GPUs via AllReduce SGD. LibSVM (100M+ downloads) uses SMO - coordinate descent. scikit-learn LASSO uses CD internally. PyTorch L-BFGS - for physics simulations. Algorithm choice determines whether the training takes hours or days.

AllReduce SGD (PyTorch DDP, Horovod): distributed training on GPU clusters
FedAvg in Google Keyboard: millions of Android devices without data transfer
LibSVM, scikit-learn: SMO and CD as internal solvers for SVM and LASSO
L-BFGS in physics simulations: molecular dynamics, optics

Distributed Optimisation and Federated Learning

The Adam optimizer (Kingma 2014, 100,000+ citations) is the standard for training GPT-4: β₁=0.9, β₂=0.999, ε=10⁻⁸. **Distributed optimisation** splits the problem min f(x) = 1/N ∑ᵢ fᵢ(x) across cluster nodes. **Federated learning (FedAvg)** is a special case: data never leaves devices (smartphones), and the server averages updated models. Key challenges: communication latency, heterogeneous data, privacy.

**Federated Learning in production:** Google uses FedAvg to train keyboard autocomplete on 500M+ Android devices. Apple applies FL to Siri on iOS. Key challenges: 1. client drift with non-IID data 2. communication efficiency (quantisation, sparsification) 3. privacy (differential privacy + secure aggregation).

What is the main problem with FedAvg when client data is heterogeneous (non-IID)?

Second-Order Methods: L-BFGS and Scalable Newton

**Newton's method** uses the Hessian: x_{k+1} = x_k − H⁻¹∇f(x_k). It has quadratic convergence near the minimum but requires O(d²) memory and O(d³) for Hessian inversion. **L-BFGS** approximates H⁻¹ using the last m pairs (Δx, Δg), requiring O(md) memory. It is the standard algorithm for convex problems of moderate dimension.

**L-BFGS in production:** PyTorch includes torch.optim.LBFGS (but not recommended for DL due to incompatibility with mini-batches). For convex problems (logistic regression, SVM, GP) L-BFGS in scikit-learn is the standard. Stochastic L-BFGS (oLBFGS) and online Hessian estimation are active research directions for DL.

What is the key advantage of L-BFGS over full Newton's method?

Coordinate Descent: Parallelism Without Shared Memory

**Coordinate descent (CD)** minimises over one variable at a time, keeping the rest fixed. For separable structure in f the coordinate update often has a closed form. **Hogwild!** (Niu et al., 2011) - parallel CD without locks: threads write simultaneously, and for sparse data this works correctly!

**When to choose CD vs SGD vs L-BFGS:** 1. Sparse data, LASSO/SVM - CD (SMO). 2. Large neural networks, mini-batches - SGD/Adam. 3. Medium-sized convex problems (d ≤ 10⁶), high accuracy needed - L-BFGS. 4. No gradients, expensive function, d ≤ 20 - BayesOpt. Algorithm selection is itself an optimisation problem!

Why does Hogwild! (parallel CD without locks) work correctly for sparse problems?

Key Ideas

FedAvg: local SGD steps + averaging; client drift with non-IID data → FedProx, SCAFFOLD
Newton's method: quadratic convergence, O(d³) cost; practically infeasible for d > 10⁴
L-BFGS: superlinear convergence, O(md) memory; standard for convex problems d ≤ 10⁶
Coordinate descent: analytic updates for LASSO (soft threshold), SVM (SMO)
Hogwild!: parallel CD without locks is correct for sparse data

Вопросы для размышления

Why does standard FedAvg diverge with strongly heterogeneous data? How does FedProx (adding a proximal term) fix this mathematically?
Under what condition is coordinate descent faster than SGD? Construct a concrete example with a sparse matrix A.
What is the practical difference between distributed SGD (AllReduce) and federated learning? Why can't Google use AllReduce to train the keyboard?

Связанные уроки

alg-05-merge-sort