Convex Optimization
Bayesian Optimisation: Smart Hyperparameter Search
Training GPT-4 took thousands of GPU-hours. Finding hyperparameters by random search is wasteful. Bayesian optimisation can find the optimal hyperparameters in 10 - 20 attempts, where random search would need 100+. Google's and Facebook's AutoML systems use BayesOpt pervasively.
- Optuna, BoTorch: hyperparameter tuning for ML models
- AlphaFold 2: BayesOpt for selecting training parameters
- Drug discovery: Stokes et al. (2020) found a new antibiotic via BayesOpt
- Materials science: searching for new alloys with target properties
Gaussian Process: a Prior Distribution over Functions
Bayesian optimization guides the hyperparameter search for GPT-4: with 96 search dimensions and only 1,000 evaluations budget, a Gaussian process prior cuts cost by 10x. A **Gaussian process (GP)** is a distribution over functions: any finite collection of points has a joint Gaussian distribution. A GP is specified by a mean function m(x) and a covariance (kernel) function k(x,x'). After observations D = {(xᵢ, yᵢ)} the posterior GP is analytically tractable and serves as a surrogate model for the expensive function f.
**GP hyperparameters:** The length scale l and amplitude σ are tuned by maximising the marginal likelihood: log p(y|X,θ) = −½yᵀ(K+σₙ²I)⁻¹y − ½log|K+σₙ²I| − n/2 log(2π). This is a convex problem in θ! The GPy library (Python) does this automatically.
What does the posterior GP provide beyond the prediction μ*(x*)?
Acquisition Functions: EI, UCB, PI
An **acquisition function** α(x) indicates where to evaluate f next. It balances **exploitation** (high μ*(x)) and **exploration** (high σ*(x)). Three classical choices: Expected Improvement (EI), Upper Confidence Bound (UCB), Probability of Improvement (PI).
**EI in practice:** EI is the de facto standard in Bayesian optimisation libraries (Optuna, scikit-optimize, BoTorch). However, for multiple objectives (multi-objective BO) or batched queries (parallel BO) one uses qEI (quasi-Monte Carlo EI) or EHVI (Expected Hypervolume Improvement).
Why does UCB = μ*(x) + β·σ*(x) balance exploration and exploitation?
The Bayesian Optimisation Algorithm and Applications
The **Bayesian optimisation (BayesOpt) algorithm** iteratively updates the surrogate model (GP) and maximises the acquisition function to select the next evaluation point. It is efficient when the number of queries is small (10 - 100) and each is expensive. Applications: hyperparameter tuning, drug discovery, materials design.
**BayesOpt in real tasks:** Optuna (auto-selects CMA-ES or TPE), BoTorch (Facebook, GPU-accelerated), scikit-optimize - the three main libraries. AutoML systems (Auto-sklearn, NAS at Google Brain) use BayesOpt for architecture search. Application in pharmaceuticals: BayesOpt discovered new antibiotics (Stokes et al., 2020).
In which regime is Bayesian optimisation particularly advantageous over random search or a grid?
Key Ideas
- GP: distribution over functions; posterior GP gives μ*(x) and σ*(x) - prediction and uncertainty
- EI = E[max(f(x)−f*, 0)]: most popular acquisition function, analytically tractable
- UCB = μ*(x) + βσ*(x): simple, with theoretical regret guarantees O(√T log T)
- BayesOpt: GP + acquisition → iterative smart search in O(10 - 100) queries
- Applications: ML hyperparameters, drug discovery, materials, A/B testing
Related Topics
Bayesian optimisation connects statistics, convex optimisation, and ML.
- Stochastic Optimisation — Both deal with noisy problems; BayesOpt for expensive functions, SGD for cheap ones
- Online Learning — BayesOpt with GP-UCB has regret guarantees analogous to online learning
- Large-Scale Optimisation — BayesOpt selects hyperparameters for large-scale optimisation algorithms
Вопросы для размышления
- How do one choose a GP kernel for Bayesian optimisation of neural network hyperparameters? What is ARD (Automatic Relevance Determination) and why is it useful?
- Why does Bayesian optimisation scale poorly to high dimensions (d > 20)? What extensions exist (REMBO, MACE)?
- How is parallel Bayesian optimisation (batch BO) implemented? What is qEI and why is simply applying EI multiple times incorrect?