Statistical Learning Theory

Gaussian Processes and Bayesian Optimization

Google Vizier (2017) is a black-box optimization service that uses GP-based Bayesian optimization to tune hyperparameters across Google, reportedly running more than 10 million experiments per year. DeepMind AutoML with GP-BO reportedly found an ImageNet architecture that surpassed manual design using 300 experiments instead of 3000.

  • BoTorch (Meta Research, 2020) is used in drug discovery at Pfizer to optimize molecular properties, reducing the number of syntheses from thousands to tens.
  • SMAC3 (AutoML Freiburg) applies Bayesian optimization, with random-forest or GP surrogates, to hyperparameter tuning and neural architecture search.

GP regression: Bayesian inference in RKHS

A Gaussian process (Rasmussen and Williams, 2006) is a distribution over functions specified by a mean function m(x) and a covariance function (kernel) k(x,x'); any finite set of function values is jointly Gaussian. GP regression with noise variance sigma_n^2 coincides with kernel ridge regression (KRR) at lambda = sigma_n^2: the posterior mean is the KRR predictor, and the posterior variance quantifies the uncertainty at each point.
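
The equivalence between the GP posterior mean and KRR can be checked numerically. The following is a minimal sketch, assuming a zero mean function, an RBF kernel with unit prior variance, and the KRR convention in which lambda is added directly to the kernel matrix (no 1/n scaling). The function names, the toy data, and the noise level are illustrative choices, not part of the original text.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=0.01, lengthscale=1.0):
    """Posterior mean and latent-function variance of a zero-mean GP at X_star."""
    K = rbf_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, lengthscale)               # shape (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)                       # k(x*, x*) = 1 for this kernel
    return mean, var

def krr_predict(X, y, X_star, lam, lengthscale=1.0):
    """Kernel ridge regression: f(x*) = k(x*, X) (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, lengthscale)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return rbf_kernel(X_star, X, lengthscale) @ alpha

# Toy data (assumed for illustration): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3.0, 3.0, 50)[:, None]

mean, var = gp_posterior(X, y, X_star, noise_var=0.01)
krr = krr_predict(X, y, X_star, lam=0.01)
print("GP mean matches KRR:", np.allclose(mean, krr))
```

KRR stops at the point prediction; the GP additionally returns `var`, which is what Bayesian optimization uses to decide where to query next.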

The posterior variance of a GP at x* is zero when...

Bayesian optimization: UCB and EI

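Bayesian optimization (Mockus 1978; Srinivas et al. 2010) uses a GP as a surrogate model of the objective and an acquisition function to balance exploration and exploitation. UCB selects the maximizer of mu(x) + beta*sigma(x); expected improvement (EI) selects the point with the largest expected gain over the best value observed so far. UCB comes with a theoretical guarantee: Srinivas et al. proved sublinear cumulative regret O(sqrt(T * gamma_T * log T)), where gamma_T is the maximum information gain after T queries. For smooth kernels such as the squared exponential, gamma_T grows only polylogarithmically in T. AutoML based on GP-BO reportedly found ImageNet hyperparameters 10x faster than manual tuning.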
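
Both acquisition functions depend only on the posterior mean and standard deviation at a candidate point. The sketch below assumes a maximization problem, uses the closed-form EI for a Gaussian posterior, and takes beta = 2.0 as an arbitrary illustrative value; the candidate values are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ucb(mu, sigma, beta=2.0):
    """Upper confidence bound: mu + beta * sigma."""
    return mu + beta * sigma

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for maximization; xi >= 0 is an optional exploration margin."""
    gain = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    return gain * normal_cdf(z) + sigma * normal_pdf(z)

# Hypothetical candidates as (posterior mean, posterior std).
candidates = {
    "confident, good mean": (1.0, 0.1),
    "uncertain, lower mean": (0.5, 1.0),
}
best_so_far = 0.9

for name, (mu, sigma) in candidates.items():
    print(name,
          "UCB = %.3f" % ucb(mu, sigma),
          "EI = %.3f" % expected_improvement(mu, sigma, best_so_far))
```

With these numbers both criteria prefer the uncertain candidate, although its mean is lower: the wide posterior leaves a substantial probability that the true value exceeds the current best.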

GP-BO selects the next query where sigma(x) is large even if mu(x) is small because...

Key results

  • GP = distribution over functions with kernel as covariance function.
  • Posterior mean = KRR with lambda = sigma_n^2.
  • UCB acquisition = mu + beta*sigma balances exploitation and exploration.
  • Theoretical regret bound for GP-UCB: O(sqrt(T * gamma_T * log T)), with gamma_T the maximum information gain after T queries.
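
The results above combine into a short optimization loop: fit the GP, maximize the UCB acquisition over candidates, query the objective, and repeat. This is a minimal sketch under stated assumptions: a one-dimensional toy objective, a fixed candidate grid instead of a continuous optimizer, an RBF kernel with a hand-picked length scale, and a constant beta. None of these choices come from the original text.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """RBF kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior(x_obs, y_obs, x_grid, noise_var=1e-4):
    """Zero-mean GP posterior mean and standard deviation on x_grid."""
    K = rbf(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_grid)                                 # shape (n, m)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))                 # clip round-off negatives

def objective(x):
    """Hypothetical toy objective, standing in for an expensive experiment."""
    return np.sin(3.0 * x) + 0.5 * x

x_grid = np.linspace(0.0, 2.0, 201)      # candidate queries
x_obs = np.array([0.1, 1.9])             # two initial evaluations
y_obs = objective(x_obs)
beta = 2.0                               # exploration weight (assumed constant)

for t in range(15):
    mu, sigma = posterior(x_obs, y_obs, x_grid)
    x_next = x_grid[np.argmax(mu + beta * sigma)]            # UCB acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
print("best x found:", best_x, "objective value:", y_obs.max())
```

The regret analysis of Srinivas et al. uses a beta that grows slowly with the iteration count; the constant used here is a common practical simplification.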