Convex Optimization

Bayesian Optimisation: Smart Hyperparameter Search

Training GPT-4 took thousands of GPU-hours. Finding hyperparameters by random search is wasteful. Bayesian optimisation can find the optimal hyperparameters in 10 - 20 attempts, where random search would need 100+. Google's and Facebook's AutoML systems use BayesOpt pervasively.

  • Optuna, BoTorch: hyperparameter tuning for ML models
  • AlphaFold 2: BayesOpt for selecting training parameters
  • Drug discovery: Stokes et al. (2020) found a new antibiotic via BayesOpt
  • Materials science: searching for new alloys with target properties

Gaussian Process: a Prior Distribution over Functions

Bayesian optimization guides the hyperparameter search for GPT-4: with 96 search dimensions and only 1,000 evaluations budget, a Gaussian process prior cuts cost by 10x. A **Gaussian process (GP)** is a distribution over functions: any finite collection of points has a joint Gaussian distribution. A GP is specified by a mean function m(x) and a covariance (kernel) function k(x,x'). After observations D = {(xᵢ, yᵢ)} the posterior GP is analytically tractable and serves as a surrogate model for the expensive function f.

**GP hyperparameters:** The length scale l and amplitude σ are tuned by maximising the marginal likelihood: log p(y|X,θ) = −½yᵀ(K+σₙ²I)⁻¹y − ½log|K+σₙ²I| − n/2 log(2π). This is a convex problem in θ! The GPy library (Python) does this automatically.

What does the posterior GP provide beyond the prediction μ*(x*)?

Acquisition Functions: EI, UCB, PI

An **acquisition function** α(x) indicates where to evaluate f next. It balances **exploitation** (high μ*(x)) and **exploration** (high σ*(x)). Three classical choices: Expected Improvement (EI), Upper Confidence Bound (UCB), Probability of Improvement (PI).

**EI in practice:** EI is the de facto standard in Bayesian optimisation libraries (Optuna, scikit-optimize, BoTorch). However, for multiple objectives (multi-objective BO) or batched queries (parallel BO) one uses qEI (quasi-Monte Carlo EI) or EHVI (Expected Hypervolume Improvement).

Why does UCB = μ*(x) + β·σ*(x) balance exploration and exploitation?

The Bayesian Optimisation Algorithm and Applications

The **Bayesian optimisation (BayesOpt) algorithm** iteratively updates the surrogate model (GP) and maximises the acquisition function to select the next evaluation point. It is efficient when the number of queries is small (10 - 100) and each is expensive. Applications: hyperparameter tuning, drug discovery, materials design.

**BayesOpt in real tasks:** Optuna (auto-selects CMA-ES or TPE), BoTorch (Facebook, GPU-accelerated), scikit-optimize - the three main libraries. AutoML systems (Auto-sklearn, NAS at Google Brain) use BayesOpt for architecture search. Application in pharmaceuticals: BayesOpt discovered new antibiotics (Stokes et al., 2020).

In which regime is Bayesian optimisation particularly advantageous over random search or a grid?

Key Ideas

  • GP: distribution over functions; posterior GP gives μ*(x) and σ*(x) - prediction and uncertainty
  • EI = E[max(f(x)−f*, 0)]: most popular acquisition function, analytically tractable
  • UCB = μ*(x) + βσ*(x): simple, with theoretical regret guarantees O(√T log T)
  • BayesOpt: GP + acquisition → iterative smart search in O(10 - 100) queries
  • Applications: ML hyperparameters, drug discovery, materials, A/B testing

Related Topics

Bayesian optimisation connects statistics, convex optimisation, and ML.

  • Stochastic Optimisation — Both deal with noisy problems; BayesOpt for expensive functions, SGD for cheap ones
  • Online Learning — BayesOpt with GP-UCB has regret guarantees analogous to online learning
  • Large-Scale Optimisation — BayesOpt selects hyperparameters for large-scale optimisation algorithms

Вопросы для размышления

  • How do one choose a GP kernel for Bayesian optimisation of neural network hyperparameters? What is ARD (Automatic Relevance Determination) and why is it useful?
  • Why does Bayesian optimisation scale poorly to high dimensions (d > 20)? What extensions exist (REMBO, MACE)?
  • How is parallel Bayesian optimisation (batch BO) implemented? What is qEI and why is simply applying EI multiple times incorrect?

Связанные уроки

  • stat-11-bayesian
Bayesian Optimisation: Smart Hyperparameter Search

0

1

Sign In