Machine Learning
Data Preparation and Cleaning
In 2018, Amazon shut down its ML hiring system after it was found to discriminate against women. The model had been trained on 10 years of historical data that reflected the industry's gender skew: Garbage In - Garbage Out at corporate scale. Engineers at companies like Google, Netflix, and Uber often repeat the same rule of thumb: roughly 80% of ML work is not models and algorithms, but data preparation. If that step is done poorly, no architecture saves the result.
- **Kaggle competitions:** most wins are determined not by model choice but by feature engineering and data preprocessing quality. Participants spend 70-80% of time on cleaning and transformation
- **Amazon:** abandoned its ML hiring system in 2018 because the model had been trained on gender-biased historical data - the bias in the training data, not a flaw in the algorithm, produced a discriminatory model
- **Medicine:** researchers found a pneumonia diagnosis model was actually identifying the type of X-ray machine (hospitals with severe patients used different equipment) - an example of a spurious correlation the model picked up from the data
Historical context
In 1977 statistician John Tukey published 'Exploratory Data Analysis', reshaping how practitioners work with data. Before Tukey, statistics was confirmatory: build a model, test hypotheses. Tukey proposed exploring data visually first - boxplots, stem-and-leaf plots, 5-number summaries. The boxplot he introduced is the direct ancestor of IQR-based outlier detection used in ML pipelines today.
Prerequisites
Data Cleaning: missing values and outliers
**In 2018, Amazon shut down its ML hiring tool** because it discriminated against women. Root cause: 10 years of historical hiring data reflected the industry's gender skew. Garbage In - Garbage Out: the model learned not 'who is a strong candidate' but 'who historically got hired'. A common industry rule of thumb says that **80% of an ML engineer's time goes to data preparation**, and only 20% to building models.
**Three types of missing values:**
- **MCAR** (Missing Completely At Random) - values are missing at random, independent of any variable. Example: network failure during data transfer
- **MAR** (Missing At Random) - missingness depends on other observed features. Example: younger people skip the income field more often
- **MNAR** (Missing Not At Random) - missingness depends on the missing value itself. Example: people with high income hide it

The type of missingness determines the filling strategy: MCAR rows can usually be dropped safely, MAR values are better filled in using the related features, and MNAR is the hardest case because the observed data alone cannot reveal the bias.
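As a rough illustration of the filling strategies above, the sketch below measures the share of missing values, keeps a "was missing" indicator column (useful when the gaps may be MAR or MNAR), and fills the gaps with the median. The toy DataFrame and the column names `income_missing` and `income_filled` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has gaps, 'age' is complete.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 29],
    "income": [np.nan, 32000.0, 48000.0, np.nan, 75000.0, np.nan, 64000.0, 41000.0],
})

# 1. Measure how much is missing before choosing a strategy.
missing_share = df["income"].isna().mean()

# 2. Keep the fact of missingness as its own feature: under MAR or MNAR
#    the absence itself may be informative.
df["income_missing"] = df["income"].isna().astype(int)

# 3. Fill the gaps with the median of the observed values.
income_median = df["income"].median()
df["income_filled"] = df["income"].fillna(income_median)

print(f"missing share: {missing_share:.1%}, median used: {income_median}")
```

In a real pipeline the median would be computed on the training split only and then reused for validation and test data.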
**Do not delete data mindlessly.** Before removing a row or replacing a value, ask:
1. Why is this value missing? The absence itself may carry information
2. Is this really an outlier or a rare but real event? A $50M apartment downtown is not an error
3. How much data will remain after cleaning? Removing all rows with missing values may leave only 10% of the dataset
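One way to act on the second question is to flag suspicious values for review instead of deleting them. The sketch below applies the IQR rule (values more than 1.5 x IQR beyond the quartiles). The price list is made-up illustration data, and the 1.5 multiplier is the conventional default, not something this lesson prescribes.

```python
import numpy as np

# Hypothetical apartment prices in thousands of dollars; the last one is a $50M listing.
prices = np.array([120, 135, 150, 160, 175, 180, 210, 240, 260, 50000], dtype=float)

# Quartiles and interquartile range.
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Conventional IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rather than delete: the mask lets a person decide whether each
# flagged value is an error or a rare but real event.
is_outlier = (prices < lower) | (prices > upper)

print("fences:", lower, upper)
print("flagged for review:", prices[is_outlier])
```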
A dataset contains 10,000 rows. The 'income' feature is missing for 40% of records. Which strategy is most reasonable?
Normalization and standardization
After cleaning the data, a second problem remains: features live on **completely different scales**. Age ranges 0-100, salary 0-1,000,000, height 150-200. Many ML algorithms (k-NN, k-means, SVM) rely on the **distance** between points, and gradient-based models (linear regression with regularization or trained by gradient descent, neural networks) are also sensitive to feature scale. When one feature is 10,000 times larger than another, it dominates the rest: the model treats salary as the most important factor simply because the numbers are bigger.
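A small numeric sketch can make the dominance effect concrete. The three "people" below are invented points of the form (age, salary), and the [0, 1] rescaling uses the ranges quoted above as assumed feature bounds.

```python
import numpy as np

# Hypothetical people described as (age in years, salary in dollars).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])  # very different age, almost the same salary
c = np.array([26.0, 52_000.0])  # almost the same age, somewhat different salary

def euclidean(p, q):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw features: the salary axis decides everything.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Rescale each feature to [0, 1] using assumed bounds: age 0-100, salary 0-1,000,000.
feature_min = np.array([0.0, 0.0])
feature_max = np.array([100.0, 1_000_000.0])

def scale(p):
    return (p - feature_min) / (feature_max - feature_min)

scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.4f}  d(a,c)={scaled_ac:.4f}")
```

On raw features `b` looks closer to `a` despite a 35-year age gap. After rescaling the ordering flips and `c` becomes the nearer neighbor.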
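**When to use which:**
- **MinMaxScaler** (range [0, 1]) - neural networks and other algorithms sensitive to the input range, especially when features have natural bounds. It is sensitive to outliers, because a single extreme value sets the min or max
- **StandardScaler** (z-score) - SVM, linear regression, PCA, k-means. It does not require normally distributed data, but works best when features are roughly symmetric without extreme outliers
- **No normalization needed** - decision trees, Random Forest, XGBoost, LightGBM. They split by thresholds, not distances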
**Common mistake:** calling `fit_transform()` on both training and test data. The scaler must memorize parameters (mean, std, min, max) **only from training data**. Test data is transformed using those same parameters via `transform()`. Otherwise test information leaks into training (data leakage), and metrics get inflated.
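The fit-on-train, transform-on-test pattern might look like the following sketch with scikit-learn's `StandardScaler`. The tiny arrays are invented (age, salary) rows, used only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows, already split into train and test.
X_train = np.array([[20.0, 30_000.0], [30.0, 50_000.0], [40.0, 70_000.0]])
X_test = np.array([[50.0, 90_000.0]])

scaler = StandardScaler()

# fit_transform() on train: learns mean_ and scale_ from training rows only,
# then applies z = (x - mean) / std.
X_train_scaled = scaler.fit_transform(X_train)

# transform() on test: reuses the training mean and std, learns nothing new.
X_test_scaled = scaler.transform(X_test)

print("train mean:", scaler.mean_)
print("scaled test row:", X_test_scaled[0])
```

The scaled test row lands outside the range of the scaled training data. That is expected: the test set is measured against the training distribution, not its own.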
You are training a k-NN model to predict apartment prices. Features: area (30-200 sq m) and distance to subway (100-15000 m). What happens without normalization?
Encoding categorical features
ML algorithms work with numbers, not words. But real data is packed with categorical features: city, color, car type, profession. Feeding the string 'New York' straight into a neural network is impossible - it needs to become a number or a vector of numbers. The choice of encoding method **directly impacts model quality**, because different encodings carry different information.
**Encoding selection rule:**
1. Feature with **natural order** (size, level, rating) - Ordinal Encoding
2. Feature **without order, few categories** (up to 10-15: color, gender, type) - One-Hot Encoding
3. Feature **without order, many categories** (one of 1,000 cities, one of 500 professions) - Target Encoding or Embedding
4. **Tree-based models** (XGBoost, LightGBM, CatBoost) - Label Encoding works fine, CatBoost handles categories natively
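The first two rules can be sketched with pandas alone. The `size` and `color` columns and the `size_order` mapping are hypothetical; the point is that ordinal encoding keeps the order in a single numeric column, while one-hot encoding produces one binary column per category.

```python
import pandas as pd

# Hypothetical data: 'size' has a natural order, 'color' does not.
df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "green", "blue", "green"],
})

# Ordinal encoding: an explicit mapping preserves S < M < L.
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 column per color, with no implied order.
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```

For linear models, `pd.get_dummies(..., drop_first=True)` removes one redundant column and avoids perfectly collinear features.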
You are training a linear regression model. The 'district' feature has 5 values: Center, North, South, East, West. Which encoding should you choose?
Data splitting: train/validation/test
Data is clean, features normalized, categories encoded. Now the last - and most treacherous - step: **split the data into parts**. Think of exam prep: studying from the exact questions that appear on the exam yields 100% but zero learning. An ML model works the same way: evaluate it on the same data it trained on and the metrics look great, but it will fail in production.
**Data leakage: one of the most dangerous mistakes in ML.** Preprocessing parameters (for normalization, encoding, filling missing values) must be fitted **after** splitting into train/test, not before. If the mean for imputation is computed across all data, including the test set, test information leaks into training. Correct order: split -> fit on train -> transform val/test.
**Misconception:** Preprocessing the whole dataset before splitting it into train and test is fine, because the model still does not see the correct answers from the test set.

**Correct:** Preprocess AFTER splitting: fit() only on train, transform() on test. Even normalization and mean imputation cause data leakage when their parameters are computed on the whole dataset.

**Why:** Data leakage happens not only through the target variable (y), but also through feature statistics (X). If the mean and std for normalization are computed with test data included, the model indirectly knows about the test distribution. Result: test metrics inflated by 1-5%, and production performance worse than expected.
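The order split -> fit on train -> transform val/test can be enforced by bundling preprocessing and model into one scikit-learn pipeline. The sketch below uses synthetic data and an assumed 60/20/20 split; the imputer, scaler, and regression model are illustrative choices only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# 1. Split first: 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

# 2. Bundle preprocessing with the model, so fit() computes the imputation
#    means and scaling statistics from the training rows only.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

# 3. Validation and test rows are only transformed with the train statistics.
val_r2 = model.score(X_val, y_val)
test_r2 = model.score(X_test, y_test)
print(f"validation R^2: {val_r2:.3f}, test R^2: {test_r2:.3f}")
```

Because the pipeline is fitted as one object, preprocessing statistics cannot accidentally be computed on validation or test rows.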
The entire dataset was normalized with StandardScaler and only then split into train and test. What is the risk?
Key ideas
- **Data cleaning:** missing values (fillna, KNN Imputer), outliers (IQR, z-score), duplicates - 80% of an ML engineer's work, and cleaning quality matters more than algorithm choice
- **Normalization:** MinMaxScaler [0,1] for neural networks, StandardScaler (z-score) for linear models - without it, large-scale features dominate the rest
- **Category encoding:** One-Hot for orderless features (color, city), Ordinal for ordered ones (size, level), Label Encoding for tree-based models
- **Train/Val/Test split:** preprocessing AFTER splitting, fit only on train - otherwise data leakage inflates metrics, and production performance drops
Related topics
Preprocessing is the step every model depends on, and the choices made here ripple into training, evaluation, and the algorithms that come next:
- ML types — Encoding and scaling depend on the task: supervised models need a clean label column, while clustering is sensitive to feature scale
- Model evaluation — Data leakage during preprocessing inflates metrics; fitting scalers and imputers on the training split only is what keeps evaluation honest
- Linear regression — Coefficient-based models need standardized features so that no single feature dominates because of its units
- Feature engineering — Cleaning and encoding are the entry point; feature engineering builds on clean data to create the signals a model actually learns from
Questions for reflection
- Missing 'income' values were filled with the median. But if missing values are not random (people with high income hide it more often), how does this distort the model? What alternative strategies might help?
- Why is data leakage so hard to detect? The model shows great metrics on the test - everything looks perfect. How can the absence of leakage be verified?
- In real production, data arrives continuously and its distribution can shift over time (data drift). How does this affect normalization parameters computed a year ago?
Related lessons
- ml-01-intro — GIGO principle introduced there - this lesson is where we fight it hands-on
- ml-05-evaluation — Correct train/val/test splitting is the prerequisite for any honest model evaluation
- ml-03-math-foundations — Z-score normalization draws on mean and std deviation from that math foundation
- ml-06-linear-regression — First algorithm where the full pipeline - clean, normalize, encode, split - gets applied
- ml-42-feature-engineering — Feature engineering is the next level: crafting new features from already clean data
- stat-31-eda