Data Science

Regression and Classification

Linear and logistic regression are among the most widely deployed models in production ML systems - not because they are the most powerful, but because they are interpretable, fast to train and serve, and have well-understood failure modes. FICO credit scores (used by 90% of top US lenders, according to FICO) are widely reported to be built on logistic-regression-style scorecards. Airbnb's dynamic pricing uses regularized linear models as the baseline that gradient-boosted trees must beat to justify their added complexity. Understanding regression models deeply - including their assumptions, failure modes, and regularization - is what allows practitioners to know when to use them and when not to.

  • **FICO scores** (Fair Isaac Corporation) use logistic regression on credit history features to predict default probability, producing the 300-850 score range used in 10 billion credit decisions per year. The model's interpretability is not just a technical preference - it is a regulatory requirement under the Equal Credit Opportunity Act, which mandates that lenders provide specific reasons for credit denial.
  • **Zillow's Zestimate** (home value estimate) uses an ensemble of ridge regression and gradient boosted trees, with ridge regression handling the base price estimation and GBT handling neighborhood-specific adjustments. Zillow's 2021 iBuying shutdown was attributed to model drift - the Zestimate's MAE grew from 2% to 6% when pandemic-era housing dynamics fell outside the training distribution.
  • **Booking.com's ranking model** (2019 KDD paper) deployed a logistic regression model that serves 1.5 million recommendations per second with sub-10ms latency. Despite its simplicity, it outperformed more complex models on A/B tests because its regularization and feature selection pipeline was better tuned to the data distribution than the feature engineering for deep models.

Linear Regression

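Linear regression models the expected value of a continuous target as a linear combination of features: y = Xw + b. The ordinary least squares (OLS) solution minimizes the sum of squared residuals and has a closed-form solution w = (X^T X)^{-1} X^T y, where the intercept b is absorbed into w by appending a column of ones to X and X^T X is assumed to be invertible. Assumptions: linearity, independent errors, homoscedasticity (constant residual variance), and normally distributed residuals (needed for exact small-sample inference, not for the fit itself). Violations have different consequences: non-linearity biases the coefficient estimates; heteroscedasticity leaves them unbiased but makes the usual standard errors unreliable; multicollinearity inflates standard errors and makes individual coefficients unstable. Unreliable standard errors in turn invalidate hypothesis tests and confidence intervals.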
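The closed-form solution can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the true coefficients (3, -2) and intercept 5 are made up for illustration. It solves the normal equations directly and compares the result with a least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x1 - 2*x2 + 5 + small noise (illustrative values).
n = 200
X = rng.normal(size=(n, 2))
true_w = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=n)

# Absorb the intercept b into w by appending a column of ones.
X_aug = np.column_stack([X, np.ones(n)])

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# General least-squares solver; better behaved when X^T X is near-singular.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("normal equations:", w_normal)
print("lstsq:           ", w_lstsq)
```

Both routes should agree closely here. With strongly collinear features the normal-equations route becomes numerically fragile, which is one reason library implementations tend to rely on a decomposition-based solver instead.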

Lasso (L1) performs automatic feature selection by driving the coefficients of irrelevant features to exactly zero. Ridge (L2) shrinks all coefficients toward zero but keeps all features. ElasticNet combines both. The choice depends on expected sparsity: if only 5 of 50 features are truly predictive, Lasso typically outperforms Ridge; if all features contribute, Ridge is usually preferred. Regularization requires feature scaling because the penalty acts on coefficient size, and coefficient size depends on units: a feature measured on a small scale needs a large coefficient to have the same effect, so it is penalized more than a large-scale feature regardless of importance.
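The dependence on feature units can be seen directly. The following is a minimal sketch, assuming NumPy, centered data, and a closed-form ridge without intercept; the helper names `ridge_fit` and `standardize` are illustrative. It expresses one feature in different units and compares predictions for OLS (alpha = 0), ridge on raw features, and ridge on standardized features.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge without intercept: w = (X^T X + alpha*I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

# Same data, but feature 0 is recorded in different units (values 1000x larger).
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

def max_prediction_gap(A, B, alpha):
    """Largest difference between predictions fitted on A and on B."""
    return np.max(np.abs(A @ ridge_fit(A, y, alpha) - B @ ridge_fit(B, y, alpha)))

alpha = 50.0
ols_gap = max_prediction_gap(X, X_rescaled, 0.0)        # OLS: units do not matter
ridge_gap = max_prediction_gap(X, X_rescaled, alpha)    # ridge: units change the fit
scaled_gap = max_prediction_gap(standardize(X), standardize(X_rescaled), alpha)

print(f"OLS gap: {ols_gap:.2e}, ridge gap: {ridge_gap:.2e}, "
      f"ridge gap after scaling: {scaled_gap:.2e}")
```

OLS predictions are unchanged by the unit change because the coefficient simply rescales to compensate. Under ridge, the rescaled feature needs only a tiny coefficient and largely escapes the penalty, so the fit changes. Standardizing first removes the dependence on units.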

Why must features be scaled before applying Ridge or Lasso regularization, but scaling is optional for unregularized OLS?

Logistic Regression

Logistic regression models the log-odds of a binary outcome as linear in features: log(p/(1-p)) = Xw + b. The sigmoid function maps any real value into the open interval (0,1): p = 1/(1 + e^{-(Xw + b)}). Unlike linear regression, no closed-form solution exists; parameters are estimated by maximizing log-likelihood with gradient-based optimization. Despite the name, logistic regression is a classification model - the decision boundary is linear in feature space, making it well-suited to classes that are close to linearly separable, and its coefficients give an interpretable measure of each feature's effect on the log-odds. When classes are perfectly separable, unregularized coefficients grow without bound, so some regularization is needed in practice.
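The relationship between log-odds, probability, and odds can be made concrete with a few lines of standard-library Python. This is a minimal sketch; the one-feature model with w = 0.7 and b = -2.0 is hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical one-feature model: log-odds = w * x + b
w, b = 0.7, -2.0

def odds(x: float) -> float:
    """Odds p / (1 - p) predicted by the model at feature value x."""
    p = sigmoid(w * x + b)
    return p / (1.0 - p)

p_at_3 = sigmoid(w * 3.0 + b)
odds_ratio = odds(4.0) / odds(3.0)   # effect of a one-unit increase in x

print(f"P(y=1 | x=3) = {p_at_3:.3f}")
print(f"odds ratio for +1 in x = {odds_ratio:.4f}, exp(w) = {math.exp(w):.4f}")
```

Because the model is linear in log-odds, a one-unit increase in a feature multiplies the odds by exp(w) wherever it starts. The change in probability, by contrast, depends on the starting point.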

Logistic regression with L1 regularization (penalty='l1', solver='saga') performs both classification and feature selection simultaneously - making it useful for high-dimensional medical or text data where most features are irrelevant. The sklearn default solver 'lbfgs' does not support L1; use 'liblinear' (small datasets) or 'saga' (large datasets, sparse features). Class imbalance (99% negative, 1% positive) is addressed by class_weight='balanced', which weights each class inversely proportional to its frequency.
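A minimal sketch of these settings, assuming scikit-learn and a synthetic imbalanced dataset from `make_classification`. The sample sizes, `C=0.01`, and step names are illustrative choices, and the parameter names follow the text, so they may differ across scikit-learn releases. It fits an L1-penalized model with balanced class weights and counts how many coefficients are exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only 5 informative, about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=0,
)

model = Pipeline([
    ("scale", StandardScaler()),  # scale before applying the L1 penalty
    ("clf", LogisticRegression(
        penalty="l1", solver="liblinear",  # the default 'lbfgs' does not support L1
        C=0.01,                            # smaller C means stronger regularization
        class_weight="balanced",
    )),
])
model.fit(X, y)

coef = model.named_steps["clf"].coef_.ravel()
n_zero = int(np.sum(coef == 0.0))
print(f"{n_zero} of {coef.size} coefficients are exactly zero")

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
counts = np.bincount(y)
balanced_weights = len(y) / (len(counts) * counts)
print("class weights:", balanced_weights)
```

The number of zeroed coefficients depends on `C`: tuning it by cross-validation is the usual way to choose how aggressively features are dropped.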

A logistic regression model has a coefficient of 1.5 for the feature 'num_previous_purchases'. What does this mean for model interpretation?

Evaluation Metrics

Accuracy is misleading for imbalanced datasets: a model that always predicts 'no fraud' on a 1% fraud dataset achieves 99% accuracy while detecting zero fraud cases. Precision (TP / (TP + FP)) measures how many positive predictions are correct; Recall (TP / (TP + FN)) measures how many actual positives are detected. F1 = 2 * (Precision * Recall) / (Precision + Recall) is their harmonic mean. The precision-recall tradeoff is controlled by the decision threshold: lower threshold increases recall at the cost of precision.
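These definitions can be worked through by hand. The following is a minimal sketch in standard-library Python; the labels and scores are made up for illustration. It first reproduces the always-negative baseline on a 1% positive dataset, then shows how moving the threshold trades precision for recall.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1); a ratio with zero denominator is reported as 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def predict_at(scores, threshold):
    """Label an example positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# 1) Always predicting the negative class on a 1% positive dataset.
y_imbalanced = [1] * 10 + [0] * 990
always_negative = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_imbalanced, always_negative)) / 1000
print("accuracy:", accuracy, "P/R/F1:", precision_recall_f1(y_imbalanced, always_negative))

# 2) Threshold sweep on a small hypothetical scored dataset.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.2, 0.1]
for threshold in (0.75, 0.5, 0.3):
    p, r, f1 = precision_recall_f1(y_true, predict_at(scores, threshold))
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy data, lowering the threshold from 0.75 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.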

ROC-AUC (Area Under the Receiver Operating Characteristic Curve) measures ranking quality: the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. It is threshold-independent because it summarizes all thresholds at once, and it is insensitive to class balance because the true positive rate and false positive rate are each computed within a single class. However, for highly imbalanced datasets (1:1000 positive:negative), PR-AUC (Precision-Recall AUC) is more informative: the false positive rate is divided by the large number of negatives, so a ROC curve can look optimistically good even when most positive predictions are wrong.
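The ranking interpretation can be computed directly by comparing every positive-negative pair. This is a minimal sketch in standard-library Python; the function name `roc_auc_pairwise` and the scores are illustrative, and the quadratic loop is meant for understanding rather than for large datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the usual ROC-AUC convention.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.2, 0.1]
auc = roc_auc_pairwise(y_true, scores)
print(auc)  # 13 of 16 pairs are ordered correctly -> 0.8125
```

Because only the ordering of scores matters, any monotonic rescaling of the scores leaves the result unchanged.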

A cancer screening model has precision=0.90 and recall=0.30. What does this indicate, and is it acceptable?

Model Selection and Validation

Model selection answers: which model architecture, which hyperparameters, which feature set. The validation framework: split data into train/validation/test. Train on train, tune hyperparameters using validation, evaluate final model once on test. Test set must never be used during tuning - using it for model selection leads to optimistic performance estimates (test set overfitting). Cross-validation (k-fold) is preferred over a single validation split when data is limited, producing more stable estimates by averaging performance across k held-out folds.
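The k-fold procedure can be sketched without any library. The following assumes the test set has already been set aside and that only the remaining data is passed in; the helper `k_fold_indices`, the toy targets, and the mean-predictor baseline are illustrative.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Toy regression targets; the "model" just predicts the training-fold mean.
targets = [float(i % 7) for i in range(23)]

def mean_baseline_mse(train_idx, val_idx):
    """Fit on the train fold only, then score on the held-out fold."""
    train_mean = sum(targets[i] for i in train_idx) / len(train_idx)
    return sum((targets[i] - train_mean) ** 2 for i in val_idx) / len(val_idx)

fold_scores = [mean_baseline_mse(tr, va) for tr, va in k_fold_indices(len(targets), k=5)]
cv_mse = sum(fold_scores) / len(fold_scores)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean CV MSE:", round(cv_mse, 3))
```

The spread of the per-fold scores is as useful as their mean: a large spread signals that a single validation split would have given an unstable estimate.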

The Pipeline abstraction is essential for correctness: if StandardScaler is fit on all training data before cross-validation, the scaler has seen the validation fold's mean and standard deviation, leaking information from validation to training. Pipeline ensures fit_transform is called only on the train fold within each CV iteration, and transform (not fit_transform) is called on the validation fold - the correct procedure. This subtle bug causes inflated cross-validation scores that do not generalize.
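A minimal sketch of the leak-free setup, assuming scikit-learn and a synthetic dataset; the step names "scale" and "clf" and the dataset sizes are illustrative. The scaler lives inside the pipeline, so cross-validation refits it on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky pattern to avoid: calling StandardScaler().fit_transform(X) here and
# then cross-validating on the result lets validation-fold statistics leak in.

pipeline = Pipeline([
    ("scale", StandardScaler()),               # fit on the train fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline for each fold, so the
# validation fold is only ever transformed, never used for fitting the scaler.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("fold AUCs:", scores.round(3), "mean:", scores.mean().round(3))
```

The same pipeline object can be passed to a hyperparameter search, which keeps every preprocessing step inside the cross-validation loop there as well.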

Misconception: A higher accuracy score always means a better model.

Reality: Accuracy is only informative when classes are roughly balanced. For imbalanced problems (fraud, cancer, churn), ROC-AUC, PR-AUC, or F1 at the operational threshold are appropriate metrics, and the choice depends on the relative costs of false positives and false negatives in the specific application.

Example: A credit card fraud model facing a 0.1% fraud rate achieves 99.9% accuracy by always predicting 'not fraud' - while losing millions to undetected fraud. The business question is not 'what fraction of predictions are correct?' but 'at what rate do we catch fraud, and at what rate do we inconvenience legitimate customers?'

A model achieves 0.95 AUC on the validation set during hyperparameter search but only 0.82 AUC on the test set. What is the most likely cause?

Key Ideas

  • **Regularization** (Ridge/L2, Lasso/L1, ElasticNet) prevents overfitting by penalizing coefficient magnitude; always scale features before regularizing; Lasso performs automatic feature selection by zeroing irrelevant coefficients.
  • **Metric choice** must match business cost structure: recall for cancer/fraud screening, precision for spam filtering, ROC-AUC for ranking quality, PR-AUC for imbalanced classification - accuracy is misleading whenever classes are imbalanced.
  • **Pipeline + cross-validation** prevents data leakage: StandardScaler must fit only on training folds within CV; the test set must never be used during model selection - only for final evaluation after all decisions are made.

Related Topics

Regression and classification connect to the broader data science workflow:

  • Experimentation and A/B Testing — Model evaluation with statistical significance testing uses the same hypothesis testing framework as A/B testing - understanding p-values, confidence intervals, and power applies directly to comparing model performance
  • Working with Messy Data — Missing value imputation, outlier treatment, and category encoding are preprocessing steps that directly affect regression and classification model performance - data quality is an upper bound on model quality

Questions for Reflection

  • A logistic regression model for customer churn has ROC-AUC=0.81 on cross-validation. The business team asks 'how good is the model?' What additional information is needed to give a meaningful answer, and how would precision and recall at the operational threshold be computed and presented?
  • Ridge regression for house price prediction has R^2=0.71. A random forest achieves R^2=0.88 on the same data. What questions should be asked before recommending the random forest for production?
  • A dataset has 95% class 0 and 5% class 1. Describe three different strategies for handling this imbalance (class_weight, resampling, threshold adjustment) and explain when each is most appropriate.

Related Lessons

  • ml-06-linear-regression