Statistical Learning Theory

Gaussian Processes and Bayesian Optimization

Google Vizier (Golovin et al., 2017) is Google's internal black-box optimization service. It uses GP-based Bayesian optimization to tune hyperparameters across Google products and reportedly runs more than 10 million experiments per year. DeepMind AutoML with GP-BO reportedly found an ImageNet architecture surpassing manual design using about 300 experiments instead of about 3000.

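BoTorch (Meta Research, 2020) has reportedly been used in drug discovery at Pfizer to optimize molecular properties, reducing the number of syntheses from thousands to tens.
SMAC3 (AutoML Freiburg) applies Bayesian optimization to hyperparameter tuning and neural architecture search; its default surrogate is a random forest, with a GP surrogate available for low-dimensional continuous problems.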

GP regression: Bayesian inference in RKHS

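A Gaussian process (Rasmussen and Williams, 2006) is a distribution over functions specified by a mean function m(x) and a covariance function (kernel) k(x,x'): any finite set of function values is jointly Gaussian. With a zero prior mean and observation noise variance sigma_n^2, the GP posterior mean coincides with the kernel ridge regression (KRR) predictor at lambda = sigma_n^2, where lambda weights the RKHS norm against the unnormalized sum of squared errors: mu(x*) = k*^T (K + sigma_n^2 I)^{-1} y. The posterior variance sigma^2(x*) = k(x*,x*) - k*^T (K + sigma_n^2 I)^{-1} k* has no KRR counterpart and quantifies the uncertainty of the prediction.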
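To make the correspondence with KRR concrete, here is a minimal pure-Python sketch of the GP posterior for scalar inputs. It assumes a zero prior mean and a squared-exponential kernel with unit signal variance; the names rbf, gp_posterior and noise_var are illustrative and not taken from any library. A practical implementation would rely on a numerical library rather than the hand-written Cholesky solver used here.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel k(a, b) for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def solve_upper_t(L, z):
    """Back substitution: solve L^T x = z."""
    n = len(L)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_star, noise_var=1e-2):
    """Posterior mean and variance of f(x_star) under a zero-mean GP prior."""
    n = len(X)
    # K + sigma_n^2 I: the same regularized Gram matrix as KRR with lambda = sigma_n^2
    K = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))    # (K + sigma_n^2 I)^{-1} y
    k_star = [rbf(x, x_star) for x in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))  # identical to the KRR prediction
    v = solve_lower(L, k_star)
    var = rbf(x_star, x_star) - sum(t * t for t in v)
    return mean, var

# Toy data, chosen only for illustration
X = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 0.0]
mean, var = gp_posterior(X, y, 1.5)
print(mean, var)
```

The vector alpha is exactly the KRR coefficient vector, so the mean is what KRR would predict; the variance is the extra quantity the Bayesian view provides.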

The posterior variance of a GP at x* is zero when...

Bayesian optimization: UCB and EI

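Bayesian optimization (Mockus 1978; Srinivas et al. 2010) uses a GP as a surrogate model of the objective and an acquisition function to balance exploration and exploitation when choosing the next query. Two standard acquisition functions are the upper confidence bound, UCB(x) = mu(x) + beta*sigma(x), and expected improvement, EI(x) = E[max(f(x) - f_best, 0)], where f_best is the best value observed so far. GP-UCB comes with a regret guarantee rather than optimality in general: Srinivas et al. proved sublinear cumulative regret O(sqrt(T * gamma_T * log T)), where gamma_T is the maximum information gain after T rounds; for the squared-exponential kernel gamma_T grows only polylogarithmically in T. AutoML based on GP-BO reportedly found ImageNet hyperparameters 10x faster than manual tuning.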
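The two acquisition functions named in the heading can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the objective is maximized, the posterior at each candidate is summarized by a pair (mu, sigma), EI uses the standard closed form for a Gaussian posterior, and the names ucb, expected_improvement and next_query as well as the candidate values are made up for the example.

```python
import math

def ucb(mu, sigma, beta=2.0):
    """Upper confidence bound for maximization: mu + beta * sigma."""
    return mu + beta * sigma

def expected_improvement(mu, sigma, best):
    """Closed-form EI over the incumbent value `best` for a Gaussian posterior."""
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

def next_query(candidates, beta=2.0):
    """Return the name of the candidate with the largest UCB score."""
    return max(candidates, key=lambda name: ucb(*candidates[name], beta=beta))

# Hypothetical posterior summaries (mu, sigma) at two candidate points
candidates = {"A": (0.9, 0.05), "B": (0.5, 0.4)}

# With beta = 2, A scores about 1.0 and B about 1.3, so the uncertain point wins
print(next_query(candidates, beta=2.0))
# With beta = 0 the rule is purely greedy and picks the highest mean
print(next_query(candidates, beta=0.0))
```

The parameter beta controls the trade-off: larger values favor points the model is unsure about, while beta = 0 reduces UCB to pure exploitation of the posterior mean.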

GP-BO selects the next query where sigma(x) is large even if mu(x) is small because...

Key results

GP = distribution over functions with kernel as covariance function.
Posterior mean = KRR with lambda = sigma_n^2 (zero prior mean).
UCB acquisition = mu + beta*sigma balances exploitation and exploration.
Theoretical regret bound for GP-UCB: cumulative regret O(sqrt(T * gamma_T * log T)), with gamma_T the maximum information gain after T rounds.
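
For reference, the quantities in the regret bound can be written out as follows. This is a sketch of the statement in Srinivas et al. (2010) with constants and the confidence parameter omitted; x^\star (a maximizer of f), x_t (the query at round t) and K_A (the kernel matrix on a set A of T points) are notation introduced here.

```latex
R_T = \sum_{t=1}^{T} \bigl( f(x^\star) - f(x_t) \bigr),
\qquad
\gamma_T = \max_{|A| = T} \tfrac{1}{2} \log\det\bigl( I + \sigma_n^{-2} K_A \bigr),
\qquad
R_T = O\bigl( \sqrt{T \, \gamma_T \, \log T} \bigr) \quad \text{with high probability}.
```

Sublinear R_T means the average regret R_T / T tends to zero, which holds whenever gamma_T grows slowly enough in T.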