Convex Optimization
Large-Scale Optimisation: From Cluster to Supercomputer
GPT-4 was trained on thousands of GPUs via AllReduce SGD. LibSVM (100M+ downloads) uses SMO - coordinate descent. scikit-learn LASSO uses CD internally. PyTorch L-BFGS - for physics simulations. Algorithm choice determines whether the training takes hours or days.
- AllReduce SGD (PyTorch DDP, Horovod): distributed training on GPU clusters
- FedAvg in Google Keyboard: millions of Android devices without data transfer
- LibSVM, scikit-learn: SMO and CD as internal solvers for SVM and LASSO
- L-BFGS in physics simulations: molecular dynamics, optics
Distributed Optimisation and Federated Learning
The Adam optimizer (Kingma 2014, 100,000+ citations) is the standard for training GPT-4: β₁=0.9, β₂=0.999, ε=10⁻⁸. **Distributed optimisation** splits the problem min f(x) = 1/N ∑ᵢ fᵢ(x) across cluster nodes. **Federated learning (FedAvg)** is a special case: data never leaves devices (smartphones), and the server averages updated models. Key challenges: communication latency, heterogeneous data, privacy.
**Federated Learning in production:** Google uses FedAvg to train keyboard autocomplete on 500M+ Android devices. Apple applies FL to Siri on iOS. Key challenges: 1. client drift with non-IID data 2. communication efficiency (quantisation, sparsification) 3. privacy (differential privacy + secure aggregation).
What is the main problem with FedAvg when client data is heterogeneous (non-IID)?
Second-Order Methods: L-BFGS and Scalable Newton
**Newton's method** uses the Hessian: x_{k+1} = x_k − H⁻¹∇f(x_k). It has quadratic convergence near the minimum but requires O(d²) memory and O(d³) for Hessian inversion. **L-BFGS** approximates H⁻¹ using the last m pairs (Δx, Δg), requiring O(md) memory. It is the standard algorithm for convex problems of moderate dimension.
**L-BFGS in production:** PyTorch includes torch.optim.LBFGS (but not recommended for DL due to incompatibility with mini-batches). For convex problems (logistic regression, SVM, GP) L-BFGS in scikit-learn is the standard. Stochastic L-BFGS (oLBFGS) and online Hessian estimation are active research directions for DL.
What is the key advantage of L-BFGS over full Newton's method?
Coordinate Descent: Parallelism Without Shared Memory
**Coordinate descent (CD)** minimises over one variable at a time, keeping the rest fixed. For separable structure in f the coordinate update often has a closed form. **Hogwild!** (Niu et al., 2011) - parallel CD without locks: threads write simultaneously, and for sparse data this works correctly!
**When to choose CD vs SGD vs L-BFGS:** 1. Sparse data, LASSO/SVM - CD (SMO). 2. Large neural networks, mini-batches - SGD/Adam. 3. Medium-sized convex problems (d ≤ 10⁶), high accuracy needed - L-BFGS. 4. No gradients, expensive function, d ≤ 20 - BayesOpt. Algorithm selection is itself an optimisation problem!
Why does Hogwild! (parallel CD without locks) work correctly for sparse problems?
Key Ideas
- FedAvg: local SGD steps + averaging; client drift with non-IID data → FedProx, SCAFFOLD
- Newton's method: quadratic convergence, O(d³) cost; practically infeasible for d > 10⁴
- L-BFGS: superlinear convergence, O(md) memory; standard for convex problems d ≤ 10⁶
- Coordinate descent: analytic updates for LASSO (soft threshold), SVM (SMO)
- Hogwild!: parallel CD without locks is correct for sparse data
Related Topics
Large-scale optimisation brings together all methods from the course.
- Proximal Methods — CD for LASSO = proximal step per coordinate; ADMM scales to clusters
- Stochastic Optimisation — AllReduce SGD - distributed mini-batch SGD; variance reduction in FedAvg
- Bayesian Optimisation — BayesOpt selects hyperparameters for all methods in this lesson
Вопросы для размышления
- Why does standard FedAvg diverge with strongly heterogeneous data? How does FedProx (adding a proximal term) fix this mathematically?
- Under what condition is coordinate descent faster than SGD? Construct a concrete example with a sparse matrix A.
- What is the practical difference between distributed SGD (AllReduce) and federated learning? Why can't Google use AllReduce to train the keyboard?