Machine Learning
Model Selection and Hyperparameter Tuning
A single hyperparameter - learning rate - can determine whether a model converges in 10 minutes or never converges at all. Finding the right combination out of millions of possibilities is like searching for a needle in a haystack. But there are smart ways to do it.
- **Netflix and Spotify recommendation systems** - hyperparameter tuning of ensembles of dozens of models determines recommendation quality for millions of users, and a 0.1% metric difference translates to millions of dollars in revenue
- **Autonomous driving (Waymo, Tesla)** - neural networks for object recognition have hundreds of hyperparameters, and Bayesian Optimization allows finding optimal configurations in days instead of months of exhaustive search
- **Kaggle competitions** - top solutions frequently rely on Optuna or similar tools for hyperparameter tuning, and the gap between 1st and 100th place often comes down to tuning quality
Prerequisites
From grid search to learning the search itself
For decades, grid search was the default way to tune models: pick a few values for each knob and try every combination. It was simple and exhaustive, but it scaled terribly. In 2012, James Bergstra and Yoshua Bengio published 'Random Search for Hyper-Parameter Optimization' and showed something counterintuitive: random search usually beats grid search for the same compute budget. The reason is that only a couple of hyperparameters tend to matter, and random sampling tests far more distinct values of those important ones. The same year, Jasper Snoek, Hugo Larochelle, and Ryan Adams brought Bayesian optimization into the mainstream with 'Practical Bayesian Optimization of Machine Learning Algorithms', using Gaussian processes to model the objective and choose each next trial deliberately. By 2017, Lisha Li and colleagues introduced Hyperband, which treats tuning as a resource-allocation problem and kills off weak configurations early to spend the budget where it counts. Each step moved the field from blind enumeration toward search strategies that learn from their own results.
Grid Search
Hyperparameters are model settings specified **before training** that don't change during the process. Examples include learning rate in gradient descent, number of trees in Random Forest, and C and gamma in SVM. Unlike model weights (which are optimized automatically), hyperparameters must be tuned manually or with specialized methods. **Grid Search** is the simplest and most straightforward approach: define a set of possible values for each hyperparameter and check *all* combinations.
The main advantage of Grid Search is **guaranteed coverage**: it will find the best combination among all specified values. You are certain not to miss the optimum *within the grid*. That's why Grid Search remains the standard for tasks with 2–3 hyperparameters and small grids.
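The exhaustive loop behind Grid Search can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: `toy_validation_score`, the parameter names, and the grid values are all hypothetical stand-ins for a real cross-validated score.

```python
import itertools

def toy_validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated score; higher is better."""
    # The peak is placed at learning_rate=0.1, max_depth=5 purely for illustration.
    return -((learning_rate - 0.1) ** 2) * 100 - ((max_depth - 5) ** 2) * 0.01

# Assumed grid: every listed value of every parameter will be combined.
param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [3, 5, 7],
}

def grid_search(objective, grid):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated

best_params, best_score, n_evaluated = grid_search(toy_validation_score, param_grid)
print(best_params, n_evaluated)
```

The guarantee holds only for the listed values: if the true optimum lies between two grid points, this loop cannot find it.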
**The dimensionality problem with Grid Search:** The number of combinations grows **exponentially** with the number of hyperparameters:
- 2 parameters × 5 values = 25 combinations
- 3 parameters × 5 values = 125 combinations
- 5 parameters × 5 values = 3,125 combinations
- 10 parameters × 5 values = 9,765,625 combinations

With 5-fold cross-validation, each combination requires 5 training runs. For 10 parameters that's ~50 million training runs. If one run takes 1 second, the full search would take **one and a half years**.
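The arithmetic above can be checked directly. The helper names below are made up for this sketch; the only assumptions are the ones stated in the text (5 values per parameter, 5 folds, 1 second per training run).

```python
def grid_size(values_per_param, n_params):
    """Number of combinations in a full grid: v ** p."""
    return values_per_param ** n_params

def training_runs(values_per_param, n_params, cv_folds=5):
    """Each combination is trained once per cross-validation fold."""
    return grid_size(values_per_param, n_params) * cv_folds

for n_params in (2, 3, 5, 10):
    print(n_params, grid_size(5, n_params), training_runs(5, n_params))

# Wall-clock estimate for 10 parameters, assuming 1 second per training run.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
years = training_runs(5, 10) / SECONDS_PER_YEAR
print(f"{years:.2f} years")
```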
Grid Search works well for tasks with **2–3 hyperparameters** and when you have a good sense of reasonable value ranges. But for models with many parameters (gradient boosting - 5–8, neural networks - dozens), exhaustive search becomes impractical. Additionally, Grid Search suffers from another issue: it spends equal time on all parameters, even if some have almost no effect on results.
**Common mistake: tuning hyperparameters on the test set.** You must not use the test set for hyperparameter selection - this leads to data leakage. scikit-learn's GridSearchCV runs cross-validation on the training set only, so the test set remains untouched for the final evaluation. If you select parameters based on test-set performance, your quality estimate will be inflated.
A model has 4 hyperparameters, each with 10 possible values. How many combinations will Grid Search check with 5-fold cross-validation?
Random Search
Random Search solves the exponential growth problem of Grid Search elegantly: instead of checking all combinations, we **randomly sample** N combinations from the hyperparameter space. Values aren't restricted to a fixed grid - we can specify continuous distributions (e.g., C from 0.01 to 100 on a log scale) and randomly draw points from them.
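A minimal sketch of this idea using only the standard library is shown below. The objective `toy_validation_score` and the `gamma` range are hypothetical; the `C` range matches the example in the text. Sampling the exponent uniformly is one common way to get a log-scale distribution.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def toy_validation_score(C, gamma):
    """Hypothetical stand-in for a cross-validated score; higher is better."""
    # The peak is placed at C=1, gamma=0.01 purely for illustration.
    return -(math.log10(C) ** 2) - (math.log10(gamma) + 2) ** 2

def random_search(objective, n_trials, seed=0):
    """Sample n_trials independent configurations and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {
            "C": sample_log_uniform(rng, 0.01, 100),      # range from the text
            "gamma": sample_log_uniform(rng, 1e-4, 1.0),  # assumed range
        }
        trials.append((objective(**params), params))
    best_score, best_params = max(trials, key=lambda trial: trial[0])
    return best_params, best_score, trials

best_params, best_score, trials = random_search(toy_validation_score, n_trials=20)
print(best_params, best_score)
```

The budget (`n_trials`) is chosen up front and does not depend on the number of parameters, which is what breaks the exponential growth.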
The key finding from Bergstra and Bengio (2012): in most ML tasks **not all hyperparameters are equally important**. Typically 1–2 parameters have a strong effect on results, while the rest are minor. Grid Search spends the same number of attempts on all parameters. Random Search, for the same number of attempts, explores more unique values of *important* parameters, because each point has a unique coordinate along each axis.
**Why Random Search is more efficient than Grid Search (at the same budget):** Suppose we have 2 parameters, but only the first actually affects quality.
- Grid Search 3×3 (9 points): we test **3 unique values** of the important parameter
- Random Search (9 points): we test **9 unique values** of the important parameter

That is 3 times more information about the important parameter at the same budget. The often-quoted "60 trials" rule follows from a simple argument: if any configuration in the top 5% of the search space counts as a success, then 60 independent random trials hit at least one of them with probability 1 − 0.95^60 ≈ 95%. Grid Search may require hundreds or thousands of points for the same result.
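Both numbers in this comparison are easy to reproduce. The sketch below counts distinct values of the first (important) parameter for a 3×3 grid versus 9 random points, then computes the chance that at least one of 60 random trials lands in the top 5% of configurations. The grid values and the seed are arbitrary choices for illustration.

```python
import itertools
import random

def prob_hit_top_fraction(n_trials, top_fraction=0.05):
    """Probability that at least one of n independent uniform samples
    falls in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

# 3x3 grid: 9 points, but only 3 distinct values per axis.
grid_points = list(itertools.product([0.1, 0.5, 0.9], repeat=2))

# 9 random points: each one brings a new value on every axis.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(9)]

grid_unique = len({point[0] for point in grid_points})
random_unique = len({point[0] for point in random_points})
print(grid_unique, random_unique)

print(round(prob_hit_top_fraction(60), 3))
```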
**When to use which?** Grid Search - when you have 2–3 parameters and know reasonable ranges. Random Search - when you have 4+ parameters, are uncertain about ranges, or have a limited compute budget. In practice, Random Search is the **default choice** for most tasks: it's simpler, faster, and almost always finds a solution as good as Grid Search in a fraction of the time.
Why does Random Search often find better hyperparameters than Grid Search with the same number of attempts?
Bayesian Optimization
Grid Search and Random Search don't learn from their results - each next point is chosen independently of previous ones. **Bayesian Optimization** works fundamentally differently: after each attempt, it builds a **model of the objective function** (surrogate model) and uses it to select the next point. This is Sequential Model-Based Optimization (SMBO) - every new experiment is informed by all previous ones.
The surrogate model is most often a **Gaussian Process (GP)** or **Tree-structured Parzen Estimator (TPE)**. A GP predicts not just the mean value of the function at each point, but also **uncertainty** (confidence interval). This enables balancing between two strategies: **exploitation** (try where the model predicts good results) and **exploration** (try where the model is uncertain, because there's little data).
**Acquisition function - "where to try next?":** An acquisition function takes the surrogate model and returns the next point to evaluate:
- **Expected Improvement (EI)**: how much this point is expected to *improve* on the current best. Accounts for both predicted mean and uncertainty.
- **Upper Confidence Bound (UCB)**: mean + beta * std. The parameter beta controls the exploration/exploitation balance.
- **Probability of Improvement (PI)**: probability that a point will beat the current best. Simpler than EI, but leans toward exploitation.

In practice, EI is used most often - a good balance without manual tuning.
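The three acquisition functions can be written out for a surrogate that predicts a Gaussian mean and standard deviation at each candidate point. This is a sketch under stated assumptions: the objective is maximized, the closed-form EI below is the standard one for a Gaussian prediction, and the two candidates and `best_so_far` are made-up numbers chosen only to show how the criteria can disagree.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a point exceeds the current best (maximization)."""
    if std == 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)

def upper_confidence_bound(mean, std, beta=2.0):
    """mean + beta * std; larger beta favors exploration."""
    return mean + beta * std

def probability_of_improvement(mean, std, best_so_far):
    """Probability that a point beats the current best (maximization)."""
    if std == 0.0:
        return 1.0 if mean > best_so_far else 0.0
    return normal_cdf((mean - best_so_far) / std)

best_so_far = 0.80  # hypothetical best validation score observed so far
candidates = {
    "well_explored": {"mean": 0.805, "std": 0.005},  # slightly better, low uncertainty
    "unexplored": {"mean": 0.780, "std": 0.080},     # slightly worse, high uncertainty
}

for name, c in candidates.items():
    ei = expected_improvement(c["mean"], c["std"], best_so_far)
    ucb = upper_confidence_bound(c["mean"], c["std"])
    pi = probability_of_improvement(c["mean"], c["std"], best_so_far)
    print(f"{name}: EI={ei:.4f} UCB={ucb:.3f} PI={pi:.3f}")
```

With these numbers, PI ranks the well-explored candidate higher, while EI and UCB rank the uncertain one higher, which illustrates why PI is described as leaning toward exploitation.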
**When to use Bayesian Optimization?** When model evaluation is expensive (training a neural network takes hours), there are many parameters (5+), and every attempt counts. Bayesian Optimization in 20–50 attempts often beats Random Search in 200. Downsides: more complex to implement, surrogate model adds overhead (for fast models like Logistic Regression this overhead can exceed the training cost), and it works worse in very high-dimensional parameter spaces (20+).
What is the fundamental difference between Bayesian Optimization and Random Search?
AutoML
Hyperparameter tuning is only part of the problem. A full ML pipeline includes data preprocessing, feature engineering, model selection, hyperparameter tuning, and ensembling. **AutoML** automates the **entire pipeline**: from raw data to a ready model. The idea: if we can automate hyperparameter tuning, why not also automate the model choice itself and the data processing steps?
**Popular AutoML frameworks:**
- **Auto-sklearn** - built on top of scikit-learn. Uses Bayesian Optimization (SMAC) + meta-learning (learns from previous datasets which models usually work best). Builds ensembles from the best models.
- **H2O AutoML** - enterprise-level. Distributed computing, automatic stacking and blending. Supports GBM, XGBoost, Deep Learning, GLM.
- **Google Cloud AutoML / Vertex AI** - cloud service. Supports tables, images, text, video. Uses Neural Architecture Search (NAS) for neural network architecture tuning.
- **FLAML (Microsoft)** - fast and lightweight. Compute-efficient, works well with limited budgets.
**Neural Architecture Search (NAS)** is a separate direction in AutoML for neural networks. Instead of a fixed architecture (number of layers, sizes, connection types), NAS automatically **designs the neural network architecture**. Google used NAS to find the baseline network of EfficientNet, which was then scaled up into a family of models that outperforms hand-designed architectures at lower computational cost. But NAS requires enormous resources: the original NASNet search used 500 GPUs for 4 days.
**AutoML is a powerful tool, but not a silver bullet.** AutoML doesn't replace understanding your data. It doesn't know business context: which features make sense, which metrics matter, what constraints a production system has (latency, memory, interpretability). AutoML might find the best model by accuracy, but pick one with 100ms latency instead of 1ms - and that could be critical for your service. Understanding the problem, data, and constraints remains the engineer's responsibility.
AutoML will replace ML engineers - why learn ML if machines can tune everything automatically?
AutoML automates routine work (model and hyperparameter search), but business understanding, problem framing, data collection, domain-knowledge-driven feature engineering, and interpreting results remain human responsibilities
AutoML optimizes what can be formalized: model selection and hyperparameter tuning. But it doesn't know which metric to optimize (accuracy vs recall vs business KPI), doesn't understand where the data comes from and what biases are hidden in it, can't explain to stakeholders why the model makes mistakes, and doesn't account for production constraints. The ML engineer frames the problem that AutoML solves. Without the right problem framing, even a perfect AutoML gives a useless result.
A company is launching its first ML project. The team consists of backend developers with no ML experience. What approach to model selection and hyperparameter tuning makes the most sense?
Key Takeaways
- **Grid Search:** exhaustive search over all hyperparameter combinations - guarantees the best result within the grid, but the number of combinations grows exponentially (v^p), making it impractical for 4+ parameters
- **Random Search:** random sampling from the parameter space - for the same budget it explores more unique values of important parameters than Grid Search, and works with continuous distributions instead of fixed lists
- **Bayesian Optimization:** builds a surrogate model of the objective function and selects each next point based on previous results (exploration vs exploitation) - in 20–50 attempts it often outperforms Random Search in 200 attempts
- **AutoML:** automates the entire pipeline from preprocessing to ensembling - as promised at the start, instead of searching for a needle in a haystack by hand, we can delegate this work to algorithms that search systematically and learn from each attempt
Related Topics
Hyperparameter tuning connects model evaluation, optimization, and production deployment:
- Cross-Validation — The foundation for hyperparameter evaluation: Grid Search, Random Search, and Bayesian Optimization use cross-validation for an honest estimate of each parameter combination without data leakage
- Optimizers (SGD, Adam) — Learning rate is one of the key optimizer hyperparameters. Tuning lr and other parameters (momentum, weight decay) via Bayesian Optimization can speed up neural network convergence by orders of magnitude
- Feature Engineering — AutoML automates not only model selection but also feature engineering - creating and selecting features. The right features often matter more than the right hyperparameters
- MLOps Pipeline — In production, hyperparameter tuning integrates into the CI/CD pipeline: automatic retuning when data drifts, experiment tracking (MLflow, W&B) for logging results
Questions for Reflection
- If Random Search is more efficient than Grid Search in most cases, why is Grid Search still widely used? In what situations are determinism and exhaustive coverage more important than efficiency?
- Bayesian Optimization balances exploration and exploitation. How does this dilemma appear in other domains - for example, when choosing a restaurant (go to a known one vs try something new) or when hiring employees?
- AutoML automates routine parts of the ML pipeline. What aspects of ML engineering resist automation at their core, and why?
Related Lessons
- ml-42-feature-engineering — Tuning follows feature preparation
- ml-44-cross-validation — Cross-validation scores each configuration
- ml-28-optimizers — Learning rate is a key hyperparameter
- ml-45-mlops-pipeline — Tuning is automated inside pipelines
- prob-04-bayes — Bayesian optimization guides the search
- stat-26-experimental-design