Convex Optimization

Distributed Optimization

How can thousands of machines train a model simultaneously when each sees only a fraction of the data - and still produce the same solution as centralized training?

Google PaLM: training 540 billion parameters across 6144 TPUs via consensus protocol - global AllReduce synchronization every 100 steps
Federated Learning: Google Keyboard trains on 500 million phones without transferring personal data - each device solves a local problem, the server averages weights
Smart Grid: balancing 10,000 nodes via decentralized ADMM achieves optimal power allocation without a central dispatcher
Finance: 300 banks jointly train an anti-fraud model through consensus optimization without exposing transaction data to competitors

Предварительные знания

Convex functions and subgradients
Lagrange multipliers
Gradient descent

ADMM: Alternating Direction Method

ADMM (Alternating Direction Method of Multipliers, Glowinski-Marrocco 1975, Gabay-Mercier 1976) is a general method for convex problems with separable structure. It combines coordinate decomposition with the strong convergence properties of multiplier methods. ADMM became central to large-model training in the 2010s: Google trained PaLM on 6144 TPUs through consensus ADMM with a global sync every 100 steps.

ADMM stopping criterion requires both conditions: primal residual ||Ax + Bz - c|| < eps_prim AND dual rho*||A^T(z^{k+1} - z^k)|| < eps_dual. Boyd et al. recommend adaptive tolerances with absolute and relative components.

Choice of rho is critical: values from 0.1 to 10 work for most problems. Adaptive rule: if the primal residual is much larger than the dual, double rho; otherwise halve it. For quadratic f and g, the optimal rho is the geometric mean of the Lipschitz constants.

Which condition is sufficient for ADMM convergence at rho > 0?

Consensus Algorithms

Consensus optimization is the distributed formulation in which N agents hold local copies x_i of the parameter and converge to agreement via local communication. It arises in federated learning, multi-agent control, and sensor networks - wherever centralization is impossible due to privacy, geographic distribution, or fault tolerance.

EXTRA (Shi et al. 2015) and DIGing (Nedic et al. 2017) correct the systematic bias of DGD at constant step size. The trick is a local correction through the difference of two consecutive mixing steps, achieving exact convergence in a distributed setting.

Federated averaging (FedAvg) is a special case of local ADMM with multiple local steps between communication rounds. Reduced communication comes at the cost of bias under heterogeneous data - the client drift phenomenon. FedProx and SCAFFOLD fix this with proximal terms or direction correction.

What is the key advantage of consensus formulation through ADMM?

Network Topology Effects on Convergence

In decentralized methods, the communication graph between agents fundamentally affects convergence speed. The spectral properties of the mixing matrix W (doubly stochastic, consistent with the graph) determine how quickly local estimates reach global consensus. Dense graphs deliver fast consensus at high per-node communication cost; sparse networks are cheaper per node but need more iterations.

Metropolis weights: W_ij = 1/(1 + max(d_i, d_j)) on edges and W_ii = 1 - sum_{j in N(i)} W_ij. They guarantee doubly stochastic structure without knowing the global graph - each node only needs neighbor degrees.

What sets the speed of consensus in a decentralized network?

Key ideas

ADMM splits min f(x)+g(z) s.t. Ax+Bz=c into three steps: x-minimization, z-minimization, multiplier update
Consensus formulation with x_i = z lets N agents minimize local functions f_i in parallel
Convergence O(1/k) in residuals is guaranteed for convex f, g at any rho > 0; linear rate with strong convexity
Decentralized DGD operates without a central server via a doubly stochastic mixing matrix W over a graph
Stopping criterion requires both primal and dual residuals to fall below threshold simultaneously

Connections to Other Areas

Distributed optimization bridges convex analysis, graph theory, and communication systems.

Federated Learning — Related topic
Duality theory — Related topic
Graph theory — Related topic

Вопросы для размышления

Why does the parameter rho in ADMM affect convergence speed even though any rho > 0 theoretically guarantees convergence?
How are the dual variables lambda in ADMM related to equilibrium prices in economic models?
What is the fundamental difference between DGD and ADMM in terms of communication requirements between agents?

ADMM: Alternating Direction Method

Which condition is sufficient for ADMM convergence at rho > 0?

Consensus Algorithms

What is the key advantage of consensus formulation through ADMM?

Network Topology Effects on Convergence

What sets the speed of consensus in a decentralized network?

Key ideas

ADMM splits min f(x)+g(z) s.t. Ax+Bz=c into three steps: x-minimization, z-minimization, multiplier update

Consensus formulation with x_i = z lets N agents minimize local functions f_i in parallel

Convergence O(1/k) in residuals is guaranteed for convex f, g at any rho > 0; linear rate with strong convexity

Decentralized DGD operates without a central server via a doubly stochastic mixing matrix W over a graph

Stopping criterion requires both primal and dual residuals to fall below threshold simultaneously