Convex Optimization

Bayesian Optimisation: Smart Hyperparameter Search

Training GPT-4 took thousands of GPU-hours. Finding hyperparameters by random search is wasteful. Bayesian optimisation can find the optimal hyperparameters in 10 - 20 attempts, where random search would need 100+. Google's and Facebook's AutoML systems use BayesOpt pervasively.

Optuna, BoTorch: hyperparameter tuning for ML models
AlphaFold 2: BayesOpt for selecting training parameters
Drug discovery: Stokes et al. (2020) found a new antibiotic via BayesOpt
Materials science: searching for new alloys with target properties

Gaussian Process: a Prior Distribution over Functions

Bayesian optimization guides the hyperparameter search for GPT-4: with 96 search dimensions and only 1,000 evaluations budget, a Gaussian process prior cuts cost by 10x. A **Gaussian process (GP)** is a distribution over functions: any finite collection of points has a joint Gaussian distribution. A GP is specified by a mean function m(x) and a covariance (kernel) function k(x,x'). After observations D = {(xᵢ, yᵢ)} the posterior GP is analytically tractable and serves as a surrogate model for the expensive function f.

**GP hyperparameters:** The length scale l and amplitude σ are tuned by maximising the marginal likelihood: log p(y|X,θ) = −½yᵀ(K+σₙ²I)⁻¹y − ½log|K+σₙ²I| − n/2 log(2π). This is a convex problem in θ! The GPy library (Python) does this automatically.

What does the posterior GP provide beyond the prediction μ*(x*)?

Acquisition Functions: EI, UCB, PI

An **acquisition function** α(x) indicates where to evaluate f next. It balances **exploitation** (high μ*(x)) and **exploration** (high σ*(x)). Three classical choices: Expected Improvement (EI), Upper Confidence Bound (UCB), Probability of Improvement (PI).

**EI in practice:** EI is the de facto standard in Bayesian optimisation libraries (Optuna, scikit-optimize, BoTorch). However, for multiple objectives (multi-objective BO) or batched queries (parallel BO) one uses qEI (quasi-Monte Carlo EI) or EHVI (Expected Hypervolume Improvement).

Why does UCB = μ*(x) + β·σ*(x) balance exploration and exploitation?

The Bayesian Optimisation Algorithm and Applications

The **Bayesian optimisation (BayesOpt) algorithm** iteratively updates the surrogate model (GP) and maximises the acquisition function to select the next evaluation point. It is efficient when the number of queries is small (10 - 100) and each is expensive. Applications: hyperparameter tuning, drug discovery, materials design.

**BayesOpt in real tasks:** Optuna (auto-selects CMA-ES or TPE), BoTorch (Facebook, GPU-accelerated), scikit-optimize - the three main libraries. AutoML systems (Auto-sklearn, NAS at Google Brain) use BayesOpt for architecture search. Application in pharmaceuticals: BayesOpt discovered new antibiotics (Stokes et al., 2020).

In which regime is Bayesian optimisation particularly advantageous over random search or a grid?

Key Ideas

GP: distribution over functions; posterior GP gives μ*(x) and σ*(x) - prediction and uncertainty
EI = E[max(f(x)−f*, 0)]: most popular acquisition function, analytically tractable
UCB = μ*(x) + βσ*(x): simple, with theoretical regret guarantees O(√T log T)
BayesOpt: GP + acquisition → iterative smart search in O(10 - 100) queries
Applications: ML hyperparameters, drug discovery, materials, A/B testing

Вопросы для размышления

How do one choose a GP kernel for Bayesian optimisation of neural network hyperparameters? What is ARD (Automatic Relevance Determination) and why is it useful?
Why does Bayesian optimisation scale poorly to high dimensions (d > 20)? What extensions exist (REMBO, MACE)?
How is parallel Bayesian optimisation (batch BO) implemented? What is qEI and why is simply applying EI multiple times incorrect?

Связанные уроки

stat-11-bayesian

Gaussian Process: a Prior Distribution over Functions

What does the posterior GP provide beyond the prediction μ*(x*)?

Acquisition Functions: EI, UCB, PI

Why does UCB = μ*(x) + β·σ*(x) balance exploration and exploitation?

The Bayesian Optimisation Algorithm and Applications

In which regime is Bayesian optimisation particularly advantageous over random search or a grid?

Key Ideas

GP: distribution over functions; posterior GP gives μ*(x) and σ*(x) - prediction and uncertainty

EI = E[max(f(x)−f*, 0)]: most popular acquisition function, analytically tractable

UCB = μ*(x) + βσ*(x): simple, with theoretical regret guarantees O(√T log T)

BayesOpt: GP + acquisition → iterative smart search in O(10 - 100) queries

Applications: ML hyperparameters, drug discovery, materials, A/B testing

Bayesian Optimisation: Smart Hyperparameter Search

Gaussian Process: a Prior Distribution over Functions

Acquisition Functions: EI, UCB, PI

The Bayesian Optimisation Algorithm and Applications

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Bayesian Optimisation: Smart Hyperparameter Search

Gaussian Process: a Prior Distribution over Functions

Acquisition Functions: EI, UCB, PI

The Bayesian Optimisation Algorithm and Applications

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки