Machine Learning
Cross-Validation and Overfitting Prevention
A model showed 99% accuracy during development, but in production accuracy dropped to 60%. The cause was not a code bug - the problem was in how the model was evaluated. How you measure model quality is no less important than the model itself.
- **Hyperparameter tuning** - Grid Search and Random Search use cross-validation to evaluate each parameter combination. Without CV, hyperparameter tuning becomes fitting to one specific data split, and the best parameters won't actually be best on new data
- **Medical diagnosis** - when training a model on patient scans, Group K-Fold prevents data leakage between scans of the same person. Without grouping, the model "recognizes" the patient rather than the disease, and CV accuracy is inflated by 10–20% compared to real-world performance
- **Neural network training** - Early Stopping is used in a wide range of training pipelines, for models from ResNet to BERT. Without it or another form of regularization, a neural network with millions of parameters tends to overfit the training data, memorizing noise instead of patterns
Prerequisites
How statisticians learned to reuse the same data
The core problem of cross-validation is old: how to judge a model on data when every sample is precious and a separate test set is unaffordable? The idea of holding out part of the data and rotating it had circulated since the 1930s, but it lacked a rigorous footing. That arrived in 1974, when Mervyn Stone published 'Cross-Validatory Choice and Assessment of Statistical Predictions' in the Journal of the Royal Statistical Society. A year later, in 1975, Seymour Geisser introduced his predictive sample reuse method and made the case for using the same data for both fitting and honest assessment. Their work turned an informal trick into a principled procedure. The leave-one-out scheme (LOOCV), where each single observation in turn becomes the test set, is the extreme case of their framework. K-fold cross-validation, splitting the data into k parts and rotating which one is held out, became the practical compromise: enough folds for a stable estimate, few enough to keep the cost manageable. Fifty years on, k-fold is still the default way to estimate how a model will behave on data it has never seen.
K-Fold Cross-Validation
The most common way to evaluate a model is to split data into train and test (holdout). But holdout has a serious problem: the result **strongly depends on which data ends up in the test set**. One random split might give 95% accuracy, another 88%. Which is the true value? We don't know. Holdout on small data is like flipping a coin once to decide whether it's fair. **K-Fold Cross-Validation** solves this: instead of one split, K splits are made and the model is evaluated K times.
The K-Fold algorithm is simple: data is split into K parts (folds) of nearly equal size. On each iteration, one part becomes the validation set and the other K-1 parts form the training set. The model is trained from scratch K times, each time on a different split. The final metric is the **average** across all K iterations. The standard deviation across folds shows **model stability**: a large spread means the model is sensitive to data selection.
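The steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the names `k_fold_indices`, `cross_validate`, and `mean_baseline_mae` are made up for this example, and the "model" is just a baseline that predicts the training mean.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-Fold over n_samples items."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(score_fn, n_samples, k=5):
    """Run score_fn on every fold; return (mean, std) of the K scores."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return statistics.mean(scores), statistics.stdev(scores)


# Toy usage: the "model" predicts the training mean; the score is mean absolute error.
y = [float(i % 7) for i in range(100)]


def mean_baseline_mae(train_idx, val_idx):
    prediction = statistics.mean(y[i] for i in train_idx)
    return statistics.mean(abs(y[i] - prediction) for i in val_idx)


mean_mae, std_mae = cross_validate(mean_baseline_mae, len(y), k=5)
print(f"MAE: {mean_mae:.3f} +/- {std_mae:.3f}")
```

Reporting the mean together with the standard deviation is the point of the exercise: the second number is what a single holdout split cannot give you.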
**Choosing K: 5 or 10?**
- **K=5** - standard choice. Each fold contains 20% of the data. Good balance between compute cost and estimate reliability.
- **K=10** - slightly more reliable, but twice as expensive. Each fold contains 10% of the data, and training on 90% is closer to training on the full dataset.
- **K=3** - economical for very large datasets where even one training run takes hours.
Rule of thumb: K=5 for most tasks, K=10 if you have little data (fewer than 5,000 samples) and training is fast.
Why is K-Fold better than holdout? First, **each data point participates in validation exactly once** - nothing is wasted. Second, we get not a single number but a **distribution of scores**, which lets us assess model stability. Third, the average of K estimates is **less noisy** than a single estimate from one split. In practice, cross-validation is the industry standard for model comparison and hyperparameter tuning.
In 5-Fold Cross-Validation, the model is trained 5 times on different splits. What percentage of data is used for training on each iteration?
Stratified and Specialized Variants
Regular K-Fold splits data without regard to class distribution. If a dataset is 90% class A and 10% class B, a random split may create a fold with very few class B samples, or none at all. Metrics computed on such a fold are unreliable: recall for the rare class, for example, rests on a handful of samples or cannot be computed at all. **Stratified K-Fold** solves this: it guarantees that **class proportions in each fold match the proportions in the full dataset** as closely as the class counts allow.
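One simple way to get stratified folds is to deal the samples of each class to the folds round-robin, like cards. The sketch below assumes this scheme for illustration only; `stratified_fold_assignment` is a hypothetical helper, and library implementations balance fold sizes more carefully.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, k, seed=0):
    """Return fold_of, where fold_of[i] in range(k) is the fold of sample i."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    fold_of = [None] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the samples of this class to the folds one by one.
        for position, i in enumerate(idxs):
            fold_of[i] = position % k
    return fold_of


# The 90% / 10% dataset from the text, with 100 samples.
labels = ["A"] * 90 + ["B"] * 10
fold_of = stratified_fold_assignment(labels, k=5)
for fold in range(5):
    counts = Counter(labels[i] for i in range(len(labels)) if fold_of[i] == fold)
    print(fold, dict(counts))
```

With these numbers every fold receives 18 samples of class A and 2 of class B, so each validation fold reproduces the 90/10 ratio.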
**Group K-Fold** is another important variation. Consider a medical study where one patient may have 10 scans. If some scans from the same patient end up in train and others in test, the model can simply "recognize" the patient rather than learning to diagnose. This is called **data leakage** - information leaking between train and test. Group K-Fold guarantees that **all data from one group (patient) is entirely in one fold**.
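The grouping idea can be sketched by assigning whole groups, rather than individual samples, to folds. The helper `group_fold_assignment` and the patient identifiers below are hypothetical, chosen only to mirror the scan example.

```python
import random


def group_fold_assignment(groups, k, seed=0):
    """Assign every sample to a fold so that a group never spans two folds."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    # Whole groups, not individual samples, are dealt to the folds.
    fold_of_group = {g: position % k for position, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in groups]


# Hypothetical data: 6 patients with 10 scans each.
patient_ids = [f"patient_{p}" for p in range(6) for _ in range(10)]
fold_of = group_fold_assignment(patient_ids, k=3)

# Every patient's scans share a single fold, so no patient is on both sides of a split.
for patient in sorted(set(patient_ids)):
    folds = {f for pid, f in zip(patient_ids, fold_of) if pid == patient}
    print(patient, "-> fold", folds)
```

This simple version balances the number of groups per fold, not the number of samples; when group sizes vary a lot, fold sizes will vary too.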
**Time Series Split - a special case:** Time series data cannot be shuffled! If the model sees data from the future during training, it will show unrealistically good metrics. TimeSeriesSplit preserves chronological order:
- Training always on **past** data
- Validation always on **future** data
- Training set **grows** with each fold
This is the only correct way to cross-validate for forecasting prices, weather, demand, or any time-dependent data.
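An expanding-window split with these three properties can be sketched as follows. This is a simplified illustration: `time_series_splits` is a hypothetical function, and its handling of leftover samples is an assumption that may differ from what a library's TimeSeriesSplit does.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) with an expanding training window.

    Samples are assumed to be sorted by time, oldest first.
    """
    fold_size = n_samples // (n_splits + 1)
    for split in range(1, n_splits + 1):
        train_end = split * fold_size
        # The last validation block absorbs any leftover samples.
        val_end = train_end + fold_size if split < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))


for train_idx, val_idx in time_series_splits(n_samples=12, n_splits=3):
    print("train:", train_idx, "validate:", val_idx)
```

In every split the largest training index is smaller than the smallest validation index, which is exactly the "past predicts future" constraint.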
A common beginner mistake is applying StandardScaler or other normalization **before** cross-validation. If you normalize the entire dataset and then split into folds, the scaler's statistics (mean, variance) are computed partly from the validation fold, so the training data already "knows" something about the data it will be validated on. This is data leakage. The correct approach: a **Pipeline** where the scaler is fit fresh on the training part of each fold and only applied to the validation part. sklearn handles this automatically if you wrap the scaler and model in a Pipeline and pass that Pipeline to the cross-validation routine.
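To see what "fresh inside each fold" means, here is a hand-rolled sketch of a single fold using only the standard library. The helpers `fit_scaler` and `transform` are illustrative stand-ins for what a scaler inside a Pipeline does, and the data is invented, with an outlier placed in the validation part to make the effect visible.

```python
import statistics


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


feature = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last value is an outlier
train, val = feature[:4], feature[4:]     # one fold: the outlier is held out

# Leaky: statistics computed on train + validation together.
leaky_mean, leaky_std = fit_scaler(feature)

# Correct: statistics computed on the training part only, then reused for validation.
clean_mean, clean_std = fit_scaler(train)
train_scaled = transform(train, clean_mean, clean_std)
val_scaled = transform(val, clean_mean, clean_std)

print(f"leaky mean = {leaky_mean}, fold-only mean = {clean_mean}")
```

The leaky mean (22.0) is dominated by a validation sample the model is not supposed to have seen, while the fold-only mean (2.5) reflects the training data alone.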
A medical dataset has 200 patients, each with 5 scans (1,000 scans total). You use regular 5-Fold CV instead of Group K-Fold. What will happen?
Leave-One-Out (LOO)
Leave-One-Out (LOO) is the extreme case of K-Fold where **K = N** (the number of samples). On each iteration, exactly one sample is used for validation, and all other N-1 are used for training. With 500 data points, the model is trained 500 times, each time leaving a different point for testing. LOO gives the **least biased estimate**: the model trains on nearly all data (N-1 out of N), which is maximally close to training on the full dataset.
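As a small sketch, LOO is the same loop as K-Fold with one sample per validation set. The function name `leave_one_out`, the four data points, and the mean-predicting baseline are assumptions made for illustration.

```python
import statistics


def leave_one_out(n_samples):
    """Yield (train_idx, val_idx); every sample is the validation set once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


y = [2.0, 4.0, 6.0, 8.0]
errors = []
for train_idx, val_idx in leave_one_out(len(y)):
    prediction = statistics.mean(y[j] for j in train_idx)  # "model" = training mean
    errors.append(abs(y[val_idx[0]] - prediction))

loo_mae = statistics.mean(errors)
print(f"{len(errors)} training runs, LOO MAE = {loo_mae:.3f}")
```

The number of training runs equals the number of samples, which is why the cost grows quickly with dataset size.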
LOO has a paradoxical property: having the **lowest bias**, it simultaneously has the **highest variance** among CV methods. Why? Because training sets from different iterations overlap on N-2 out of N-1 points - they are nearly identical. This produces N very similar models, and their predictions are highly correlated. The average of correlated estimates has high variance, unlike the average of independent estimates.
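The effect of correlation on the averaged estimate can be made explicit with a standard identity. As a simplifying assumption, suppose the K per-fold estimates $\hat{e}_1, \dots, \hat{e}_K$ all have the same variance $\sigma^2$ and the same pairwise correlation $\rho$ (real folds only approximately satisfy this). Then:

$$
\operatorname{Var}\!\left(\frac{1}{K}\sum_{i=1}^{K}\hat{e}_i\right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2
$$

With independent estimates ($\rho = 0$) the variance shrinks as $\sigma^2 / K$. As $\rho$ approaches 1, the second term dominates and the variance stays close to $\sigma^2$ no matter how many estimates are averaged.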
**Bias-Variance tradeoff in cross-validation:**
- **LOO (K=N):** bias - minimal (training on N-1 points), variance - maximum (estimates are highly correlated)
- **10-Fold:** bias - slightly higher (training on 90% of data), variance - moderate
- **5-Fold:** bias - higher still (training on 80% of data), variance - lower (training sets overlap less)
- **2-Fold:** bias - maximum (training on 50%), variance - low
In practice, 5-Fold and 10-Fold yield better **overall** results (MSE = bias^2 + variance) than LOO. That's why LOO is rarely used.
**When to use LOO?** Only with very small datasets (fewer than 50–100 samples), where every point matters and losing even 20% to validation (as in 5-Fold) is significant. In medicine, where data is scarce and each sample is costly, LOO is more common. For large datasets, LOO is impractical: N training runs when N=100,000 is unjustifiably slow, and the result is no better than 5-Fold or 10-Fold.
Leave-One-Out has the lowest bias among CV methods. Why can its overall estimation error still be higher than 5-Fold?
Early Stopping
Cross-validation helps **evaluate** a model but doesn't prevent overfitting during training. **Early Stopping** solves exactly that: we monitor validation loss at each training epoch and **stop when it stops improving**. The intuition is simple: early in training the model learns useful patterns (both train and val loss decrease), but at some point it starts memorizing noise in the training data (train loss keeps falling, val loss starts rising). Early Stopping catches that moment.
The key parameter of Early Stopping is **patience**: how many epochs without improvement we're willing to wait. A small patience (2–3) can stop training too early when val loss briefly rises before continuing to fall. A large patience (20–50) lets the model "weather" temporary increases, but risks overfitting. In practice, patience = 5–10 is a good starting point. Importantly: after stopping, we **restore the weights from the best epoch** (restore best weights), not the most recent ones.
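The patience rule can be sketched as a loop over per-epoch validation losses. This is a minimal illustration under stated assumptions: the function `train_with_early_stopping` and the loss values are invented, and an improvement is taken to mean any strictly lower loss. A real training loop would compute each loss after an epoch of training and save a checkpoint of the weights whenever the loss improves.

```python
def train_with_early_stopping(val_losses, patience):
    """Replay per-epoch validation losses and apply early stopping.

    Returns (stop_epoch, best_epoch), both 1-based. stop_epoch is None if
    patience was never exhausted and training ran through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: a real loop would checkpoint the weights here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop and restore the checkpoint saved at best_epoch.
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical losses: a brief rise after epoch 3, then a new minimum at epoch 7.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.58]
print(train_with_early_stopping(losses, patience=3))  # (6, 3)
print(train_with_early_stopping(losses, patience=5))  # (None, 7)
```

With patience=3 training stops at epoch 6 and rolls back to epoch 3, missing the better epoch 7. With patience=5 it survives the temporary rise, which is the trade-off described above.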
**Early Stopping = implicit regularization:** Early Stopping limits model complexity by preventing it from memorizing training data. This is analogous to L2 regularization (weight decay): the earlier we stop, the closer the weights stay to their initial (typically small) values, and the simpler the model. Advantage over L2: no need to tune a regularization coefficient lambda. Instead the model "finds" its optimal complexity through the stopping point. Downside: requires holding out a validation set from training data, which reduces the amount of data available for training.
In practice, Cross-Validation and Early Stopping work together: CV is used for **hyperparameter selection** (architecture, learning rate, batch size), while Early Stopping determines the **number of epochs** in each fold. After selecting the best hyperparameters, the model is retrained on all data except the test set, with Early Stopping driven by a small validation split carved out of that data, and the final evaluation is done **once** on the held-out test set. This is the standard pipeline that guards against overfitting at every level.
**Misconception:** cross-validation solves all model evaluation problems
**Reality:** CV can give optimistic (inflated) estimates in the presence of data leakage, temporal dependence, or distribution shift between training and production
If data contains information leakage (e.g., features contain the target variable in disguised form), CV won't detect it - all folds will show high metrics. If data is time-dependent and you use regular K-Fold instead of TimeSeriesSplit, future data will leak into training. If the production data distribution differs from the training distribution (distribution shift) - e.g., the model was trained on users from one country but is used in another - CV on training data won't predict the performance drop. Always check: is there leakage? Is chronological order preserved? Do the training and deployment distributions match?
With Early Stopping and patience=5, validation loss showed the following dynamics by epoch: 0.50, 0.45, 0.42, 0.43, 0.44, 0.41, 0.39. At which epoch will stopping occur?
Key Takeaways
- **K-Fold Cross-Validation:** data is split into K parts, the model is trained K times - each part takes a turn as the validation set. The average of K scores is more reliable than a single holdout, and the standard deviation shows model stability. K=5 or K=10 is the standard choice
- **Specialized CV variants:** Stratified K-Fold preserves class proportions, Group K-Fold prevents data leakage between groups (patients, users), TimeSeriesSplit preserves chronological order - the choice of method is determined by data structure
- **LOO - the extreme case:** K=N gives minimum bias (training on N-1 points) but maximum variance due to correlation between folds. Practically useful only for very small datasets (fewer than 50–100 samples)
- **Early Stopping + CV in production:** CV selects hyperparameters, Early Stopping determines the number of epochs, and the held-out test set provides the final honest evaluation - exactly this pipeline would have protected the model from dropping from 99% to 60%
Related Topics
Cross-validation and Early Stopping are foundational tools linking model evaluation with regularization and hyperparameter tuning:
- Metrics and Model Evaluation — Cross-validation computes metrics (accuracy, F1, AUC) on each fold, giving a reliable estimate instead of a single number. Without the right metrics, CV is useless: accuracy can look high on imbalanced classes while poor recall on the rare class goes unnoticed
- Hyperparameter Tuning — Grid Search, Random Search, and Bayesian Optimization use cross-validation as their inner evaluation loop. Each hyperparameter combination is evaluated via K-Fold CV, and the best combination is selected based on the mean metric
Questions for Reflection
- Why does applying StandardScaler before splitting into folds cause data leakage, even when the leakage seems minimal? How does this affect the difference between CV score and actual production metric?
- If LOO has the lowest bias among all CV methods, why does 5-Fold often give a more accurate overall estimate in practice? How does this relate to the bias-variance tradeoff?
- Early Stopping is called "implicit regularization". If a model already has L2 regularization (weight decay), does it still make sense to add Early Stopping? Can they interfere with each other?
Related Lessons
- ml-43-hyperparameters — CV provides scores for tuning
- ml-05-evaluation — CV aggregates evaluation metrics
- ml-08-regularization — CV detects and curbs overfitting
- stat-18-bootstrap — Both resample to estimate generalization
- stat-01-sampling — Folds rely on representative sampling
- stat-05-hypothesis