Machine Learning

Regularization: L1, L2, ElasticNet

From ill-posed problems to feature selection

Regularization predates machine learning. In 1943 the Soviet mathematician Andrey Tikhonov introduced a stabilizing penalty to solve ill-posed inverse problems, the idea now known as Tikhonov regularization (the L2 penalty). In 1970 Arthur Hoerl and Robert Kennard brought it to statistics under the name ridge regression, taming unstable coefficients caused by multicollinearity. Then in 1996 Robert Tibshirani published the LASSO, swapping the squared penalty for an absolute-value one so that weights could be driven to exactly zero. That single change turned regularization into a feature-selection tool and reshaped how high-dimensional models are built.

A degree-20 polynomial regression looks perfect on training data - error nearly zero. On new data it outputs nonsense: a negative house price, a billion-dollar salary. The model's weights ballooned into the thousands and millions as it tried to pass through every data point. How do we keep the model reasonable without abandoning complex features? What if we add a penalty on weight magnitude directly to the loss function?

  • **Netflix recommendation system** uses ElasticNet regularization when training on a matrix of 200+ million users and thousands of movies - without it the model would memorize noise in ratings instead of real preferences
  • **Genomics:** when analyzing gene-disease associations, Lasso selects 50–100 significant genes from 20,000+, making results interpretable for doctors and saving years of laboratory verification
  • **Financial scoring** in banks uses Ridge regularization so that credit risk models don't overfit historical data from crisis periods and work stably in new economic conditions

Prerequisites

  • Polynomial Regression

L2 Regularization (Ridge)

In the previous lesson we saw how a high-degree polynomial regression perfectly passes through all training points, but gives absurd predictions on new data. The reason is **excessively large weights**: the model adjusts coefficients to huge values (thousands, millions) to pass through every point. Regularization solves this by adding a **penalty for large weights** directly to the loss function.

**Ridge regression** (L2 regularization) is the simplest form of regularization. The idea: add a term to the standard MSE loss that grows as weights increase. The model now optimizes a trade-off - simultaneously minimizing the data error *and* keeping weights small.

Why the squared weight? The L2 penalty punishes large weights **disproportionately harder** than small ones: a weight of 10 contributes a penalty of 100, while a weight of 100 contributes 10,000. This pushes the model to **shrink all weights** toward zero, the largest ones most strongly, but never to exactly zero. Result: instead of one huge weight w=5000 and one tiny w=0.001, Ridge tends to produce two moderate ones, for example w=15 and w=12.
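The shrinking effect can be checked on a toy problem. The sketch below is illustrative only: the synthetic sine data, the degree of 15, and alpha=1.0 are assumptions chosen for the demo, not values from the lesson. It fits the same polynomial features with and without the L2 penalty and compares the size of the weight vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic 1-D data (assumed for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, size=30)).reshape(-1, 1)
y = np.sin(3.0 * x).ravel() + rng.normal(scale=0.2, size=30)


def fit_poly(estimator, degree=15):
    """Fit `estimator` on standardized polynomial features of x."""
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        estimator,
    )
    return model.fit(x, y)


ols = fit_poly(LinearRegression())   # no penalty
ridge = fit_poly(Ridge(alpha=1.0))   # MSE + alpha * sum(w_i^2)

# Compare the Euclidean length of the two weight vectors.
ols_norm = np.linalg.norm(ols[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)
print(f"unregularized ||w|| = {ols_norm:.2f}")
print(f"ridge         ||w|| = {ridge_norm:.2f}")
```

The exact numbers depend on the random data, but the Ridge weight vector should come out much shorter than the unregularized one.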

**Multicollinearity and Ridge.** When features are highly correlated (e.g., house area in sq m and in sq ft), ordinary linear regression produces unstable weights: the slightest data change radically changes coefficients. Ridge solves this - the L2 penalty stabilizes the solution by distributing weights among correlated features.
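A rough way to see this instability is to refit on bootstrap resamples and watch how much the coefficients move. The sketch below assumes a synthetic pair of near-duplicate features (standing in for area in sq m and sq ft after scaling); the variable names, noise levels, and alpha=1.0 are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
area_a = rng.normal(size=n)                    # first (already scaled) feature
area_b = area_a + 0.01 * rng.normal(size=n)    # near-duplicate of the first
X = np.column_stack([area_a, area_b])
y = 3.0 * area_a + rng.normal(size=n)


def coef_spread(make_model, n_resamples=200):
    """Standard deviation of each coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # sample rows with replacement
        coefs.append(make_model().fit(X[idx], y[idx]).coef_)
    return np.std(coefs, axis=0)


ols_spread = coef_spread(LinearRegression)
ridge_spread = coef_spread(lambda: Ridge(alpha=1.0))
print("OLS   coefficient std:", ols_spread)
print("Ridge coefficient std:", ridge_spread)
```

With ordinary least squares the two weights can swing widely between resamples while their sum stays roughly constant; with the L2 penalty both settle near similar, moderate values.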

What happens to model weights when L2 regularization (Ridge) is added?

L1 Regularization (Lasso)

Ridge shrinks all weights but leaves them nonzero. But what if we have 500 features and only 20 are really important? It would be useful if the model itself *dropped* the unnecessary ones by setting their weights to exactly zero. That's exactly what **Lasso** (Least Absolute Shrinkage and Selection Operator) - L1 regularization - does.

Why does the absolute value zero out weights, but the square doesn't? It's about the gradient. The derivative of w^2 is 2w, which approaches zero as the weight approaches zero: the smaller w is, the more weakly the penalty pulls it toward zero. So Ridge slows down and never quite reaches zero. The derivative of |w| is sign(w) for any nonzero w: a *constant* (+1 or -1), independent of w's magnitude. The penalty pulls toward zero with equal force whether the weight is 1000 or 0.001. Once that constant pull outweighs what the data error gains from keeping a weight, the weight lands at exactly zero and stays there.
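The difference is easiest to see in one dimension, where both penalized problems have closed-form solutions. The sketch below assumes the simplified objective 0.5*(w - z)^2 plus a penalty, where z is the weight the data alone would prefer; the function names are made up for this illustration.

```python
import numpy as np


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w|: soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


z = np.array([-2.0, -0.3, 0.1, 0.4, 3.0])   # unpenalized weights
lam = 0.5

ridge_w = ridge_shrink(z, lam)   # every entry is scaled down, none becomes 0
lasso_w = lasso_shrink(z, lam)   # entries with |z| <= lam become exactly 0

print("ridge:", ridge_w)
print("lasso:", lasso_w)
```

Ridge multiplies every weight by the same factor, so a nonzero weight stays nonzero. Lasso subtracts a fixed amount lam from the magnitude, so any weight smaller than lam in absolute value is cut to zero.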

**Lasso limitation:** when features are highly correlated (multicollinearity), Lasso arbitrarily picks one from the group and zeros out the rest. Which one exactly depends on random variation in the data. So Lasso is unstable with correlated features: on different subsets it picks different features from a correlated group.

**Sparse models in production.** Lasso creates *sparse* models - from hundreds or thousands of features only a few dozen remain. This is valuable not just for interpretability but also for speed: a linear model with 10 nonzero weights needs roughly 50x fewer multiplications per prediction than one with 500, and only those 10 features have to be computed and stored. In real-time systems (advertising, fraud detection) every millisecond counts.
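Sparsity can be measured directly by counting nonzero coefficients. The following sketch uses a synthetic dataset from `make_regression` with 500 features of which 20 are informative; the dataset sizes and alpha values are assumptions for the demo, and the features are already on a common scale, so no scaler is used here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 500 features, only 20 of them actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=500, n_informative=20,
    noise=5.0, random_state=0,
)

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count how many weights each model keeps.
n_lasso = np.count_nonzero(lasso.coef_)
n_ridge = np.count_nonzero(ridge.coef_)
print(f"nonzero weights: Lasso={n_lasso}, Ridge={n_ridge}")

# Only features with nonzero weights are needed at prediction time.
kept = np.flatnonzero(lasso.coef_)
```

The indices in `kept` are the features a production system would actually need to compute for the Lasso model.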

Why does Lasso (L1) zero out some weights but Ridge (L2) does not?

ElasticNet: combining L1 and L2

Ridge is stable but doesn't do feature selection. Lasso does feature selection but is unstable with correlated features. **ElasticNet** combines both approaches: adds both L1 *and* L2 penalty simultaneously, getting the best of both worlds.

The main advantage of ElasticNet over pure Lasso appears with **groups of correlated features**. Say three features x1, x2, x3 are highly correlated and all useful. Lasso will arbitrarily pick one (say, x1) and zero out x2 and x3. ElasticNet, thanks to the L2 component, distributes weights among all three: w1=0.4, w2=0.35, w3=0.38 - more stable and interpretable.
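The grouping behavior can be sketched on synthetic data. Below, three features are noisy copies of one hidden signal and all three contribute equally to the target; the noise levels, alpha=0.1, and l1_ratio=0.5 are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# x1, x2, x3: three highly correlated copies of the same signal.
X = np.column_stack([base + 0.05 * rng.normal(size=n) for _ in range(3)])
y = X.sum(axis=1) + 0.5 * rng.normal(size=n)   # all three are useful

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Gap between the largest and smallest weight within the correlated group.
lasso_gap = lasso.coef_.max() - lasso.coef_.min()
enet_gap = enet.coef_.max() - enet.coef_.min()

print("Lasso weights:     ", lasso.coef_)
print("ElasticNet weights:", enet.coef_)
```

A small gap means the weight is shared evenly across the group; a large gap means the model leans on some members of the group and neglects others.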

**When to use which:**

  • **Ridge** - if all features are potentially important and a stable model is needed. The typical default choice.
  • **Lasso** - when most features are likely useless and automatic selection is needed. Good for interpretability.
  • **ElasticNet** - when there are many features, some are correlated, and a balance between selection and stability is needed. Best choice under uncertainty.

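In practice, on data with correlated features, ElasticNet with l1_ratio in the 0.1–0.5 range often outperforms pure Lasso, though the mix is itself a hyperparameter worth tuning. Both glmnet (R) and sklearn (Python) treat ElasticNet as the general case, with Ridge and Lasso as its two endpoints: l1_ratio=0 and l1_ratio=1 in sklearn. Note that glmnet calls this mixing parameter alpha and the penalty strength lambda, whereas sklearn uses alpha for the strength.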
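The Lasso endpoint is easy to confirm in sklearn. The sketch below uses synthetic data and alpha=0.5, both chosen only for the demo. The Ridge endpoint is not compared here because sklearn's `Ridge` class scales its objective differently from `ElasticNet`, so the same alpha value does not mean the same penalty strength in the two classes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)
# l1_ratio=1.0 removes the L2 part, leaving a pure L1 penalty.
enet_as_lasso = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)

same = np.allclose(lasso.coef_, enet_as_lasso.coef_)
print("ElasticNet(l1_ratio=1) matches Lasso:", same)
```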

Three features x1, x2, x3 are highly correlated and all three affect the target. Which regularization method handles this best?

Tuning lambda (alpha)

We've learned three types of regularization, but the key question remains open: **how to choose the value of lambda?** This hyperparameter (called alpha in sklearn) controls the balance between fitting the data and model simplicity. Too small a lambda and regularization has almost no effect: the model overfits. Too large and the penalty dominates: the model ignores the data and underfits.

The gold standard for tuning lambda is **cross-validation**. The data is split into K parts (folds). For each value of lambda, the model trains K times on K-1 parts and is evaluated on the remaining one. The lambda with the lowest average validation error is chosen. In sklearn this is done automatically.
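Written out by hand, the procedure looks roughly like this. The sketch assumes synthetic data, a logarithmic grid of 13 candidate values, and 5 folds; all of these are illustrative choices. Classes such as `RidgeCV` wrap the same idea.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(
    n_samples=200, n_features=30, n_informative=10,
    noise=10.0, random_state=0,
)

alphas = np.logspace(-3, 3, 13)    # candidate lambda values
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_mse = []
for alpha in alphas:
    # Train on K-1 folds, score on the held-out fold, repeat K times.
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y, cv=cv, scoring="neg_mean_squared_error",
    )
    mean_mse.append(-scores.mean())   # sklearn returns negated MSE

best_alpha = alphas[int(np.argmin(mean_mse))]
print(f"best alpha by 5-fold CV: {best_alpha:g}")
```

A logarithmic grid is a common starting point because useful values of lambda can span several orders of magnitude.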

**Standardization is mandatory!** Regularization penalizes weights by absolute magnitude. If feature 'salary' has range 30,000–200,000 and 'age' has range 18–80, the salary weight will be thousands of times smaller just due to scale, so the penalty barely touches it while hitting the age weight much harder. Always apply StandardScaler *before* a regularized model so all features are on the same scale, and fit the scaler on the training data only (for example inside a Pipeline) so that validation folds do not leak into the scaling.
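The scale effect can be reproduced with the salary and age ranges from the text. In the sketch below the target formula is an assumption, constructed so that both features contribute about equally; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
salary = rng.uniform(30_000, 200_000, size=n)
age = rng.uniform(18, 80, size=n)
X = np.column_stack([salary, age])

# Assumed target: both features matter to a similar degree.
y = salary / 10_000 + age / 5 + rng.normal(size=n)

raw = Ridge(alpha=1.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Ratio of |salary weight| to |age weight| before and after scaling.
raw_ratio = abs(raw.coef_[0] / raw.coef_[1])
scaled_ratio = abs(scaled[-1].coef_[0] / scaled[-1].coef_[1])
print(f"raw features:    |w_salary / w_age| = {raw_ratio:.5f}")
print(f"scaled features: |w_salary / w_age| = {scaled_ratio:.2f}")
```

On raw features the salary weight is tiny only because salary values are large. After scaling, the two weights are of comparable size, so the penalty treats them comparably. Wrapping the scaler and the model in one pipeline also means the scaler is refit on the training portion of each cross-validation fold.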

**Misconception:** the stronger the regularization (larger lambda), the better the model, so setting lambda as high as possible always improves generalization.

**Reality:** regularization that is too strong leads to underfitting: the model becomes too simple and fails to capture real patterns in the data. The optimal lambda is found via cross-validation and depends on the specific dataset.

Regularization is not 'the more the better', but a balance. At lambda = 0 the penalty vanishes and a flexible model is free to overfit; as lambda -> infinity all weights go to 0 and the model predicts a constant regardless of the input. The optimum is somewhere in between, and the standard way to find it is cross-validation.
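Both extremes can be observed directly. The sketch below uses synthetic data and an arbitrary set of alpha values chosen for illustration. It tracks the length of the Ridge weight vector as the penalty grows, then fits a Lasso with a very large penalty.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Length of the Ridge weight vector for increasing penalty strength.
alphas = [0.01, 1.0, 100.0, 1e6]
ridge_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas
]
for a, norm in zip(alphas, ridge_norms):
    print(f"alpha={a:>9g}  ||w|| = {norm:.4f}")

# With a huge L1 penalty every weight is zero and only the intercept remains.
huge = Lasso(alpha=1e6).fit(X, y)
all_zero = bool(np.all(huge.coef_ == 0))
predicts_mean = bool(np.allclose(huge.predict(X), y.mean()))
print("all Lasso weights zero:", all_zero)
print("predictions equal mean of y:", predicts_mean)
```

Because sklearn does not penalize the intercept, the over-regularized model falls back to predicting the training mean for every input.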

What happens if lambda (alpha) = 0 in Ridge or Lasso?

Key ideas

  • **Ridge (L2)** adds penalty SUM(w_i^2) - shrinks all weights toward zero, the largest most strongly, without zeroing them; stable with multicollinearity
  • **Lasso (L1)** adds penalty SUM(|w_i|) - zeros out unimportant weights completely, performing automatic feature selection; unstable with correlated features
  • **ElasticNet** combines L1 and L2, l1_ratio controls the balance; best choice under uncertainty and with correlated features
  • **lambda (alpha)** - regularization strength hyperparameter: 0 = overfitting, infinity = underfitting; optimum is found via cross-validation (RidgeCV, LassoCV, ElasticNetCV)
  • Regularization is the answer to 'how to keep the model's coefficients in check': the penalty on weight magnitude prevents polynomial regression from inflating coefficients to absurdity, as we discussed at the beginning

Related topics

Regularization is a central idea in ML, connecting linear models with deep learning and feature engineering:

  • Linear Regression — The base model to which Ridge, Lasso and ElasticNet add penalty terms to fight overfitting
  • Polynomial Regression — Demonstrated the overfitting problem at high polynomial degree - exactly what regularization solves, not by removing features but by constraining weights
  • Gradient Descent — The optimization method by which Ridge, Lasso and ElasticNet are practically trained on large data (instead of the analytical solution)
  • Feature Engineering — Lasso and ElasticNet automatically select features - this is part of feature engineering where manual and automatic selection complement each other

Questions for reflection

  • If Lasso zeroed out 95 of 100 features, does that mean the zeroed-out features are useless? Or could they have been zeroed out due to correlation with the remaining ones?
  • In neural networks with millions of parameters, regularization is also used (L2 = weight decay). Why is dropout considered an alternative to L2 regularization, even though it works completely differently?
  • Standardization of features is mandatory before regularization. But once StandardScaler is applied, weight interpretability in original units is lost. How is this dilemma resolved in practice?

Related lessons

  • ml-06-linear-regression — Ridge and Lasso are extensions of linear regression with a penalty
  • ml-07-polynomial-regression — Overfitting in complex models is the main motivation for regularization
  • ml-09-gradient-descent — Regularized models are optimized with gradient descent
  • stat-03-mle
  • cvx-01