Exploratory Data Analysis: Look Before You Model
Florence Nightingale's 1858 polar area diagram showed that British soldiers in the Crimean War died mostly from preventable disease caused by poor sanitation, not from combat wounds - one visualization changed British military medical policy. EDA still works the same way: by a common rule of thumb, 80% of critical ML failures originate before the first line of model code.
- Kaggle winners spend 60-70% of their time on EDA before modeling
- Anscombe's quartet: 4 datasets with identical statistics, completely different patterns
- Netflix: EDA of rating data revealed systematic weekend-viewing bias
- Medical ML: EDA of ECG signals catches sensor artifacts before classifier training
- Finance: EDA of backtest data detects look-ahead bias before it reaches production
- ydata-profiling (formerly pandas-profiling) automates baseline EDA in a single call
**1858. The Crimean War has been over for two years, and Florence Nightingale is analysing British military hospital records from it.** The data shows thousands of soldiers dying. Military command is certain that soldiers die in battle. Nightingale takes the same numbers and builds a **polar area diagram** - a rose chart where each wedge represents one month and its area encodes the death toll. The blue wedges (disease) dwarf the red ones (combat wounds). It was not the battlefield killing soldiers - it was poor sanitation in the hospitals. One visualization changed British military medical policy forever. This is EDA: **look at the data before drawing conclusions**. Before any model, before any formula, before any hypothesis.
**What this lesson really teaches**: not "how to plot a histogram", but why **80% of critical ML project failures originate before the first line of model code**. Wrong data types, hidden missing values, outliers, collection bias - EDA catches all of these in 30 minutes while a model would silently produce nonsense. EDA is detective work: data always lies until proven otherwise.
Tukey 1977: ask questions before computing answers
**John Tukey** coined the term EDA in his 1977 book of the same name. Before him, statistics was largely **confirmatory** - test a hypothesis already formulated. Tukey proposed a different order: first look at the data without a hypothesis, let the data speak. His manifesto: *"It is important to understand what you CAN DO before you learn to measure how well you seem to have done it"*. For ML engineers, this translates simply: **fitting a model on poorly understood data means measuring how well you overfit to garbage**.
Prerequisites
**Professional practice**: before any ML project, run `df.info()`, `df.describe()`, `df.isnull().sum()`, and `df.duplicated().sum()` - four commands, two minutes. This is the minimum EDA screening that prevents the most expensive mistakes. Practitioner surveys are commonly cited as showing that data scientists spend 60-80% of project time on EDA and data cleaning - not modeling.
Anscombe's Quartet: when summary statistics lie
**Summary statistics** (mean, variance, correlation) are the first instinct when looking at data. But in 1973, statistician Francis Anscombe constructed an example that permanently changed how they are read. He created **4 datasets** with nearly identical numerical characteristics - but radically different visual patterns. After this, the argument "our means are equal" stops sounding convincing.
**Datasaurus Dozen (2017)** - the modern successor to Anscombe's Quartet: 13 datasets with identical mean, std, and correlation to two decimal places, including a dinosaur shape. Created to show that summary numbers without visualization can hide almost everything about the data. The `datasauRus` package (R) ships the datasets, so the plots can be reproduced in a couple of lines of code.
Univariate analysis: one variable under the microscope
The first EDA step is studying each variable in isolation. **Shape of the distribution** matters more than the mean. Three tools cover 90% of cases: histogram (distribution shape), box plot (outliers and quartiles), QQ-plot (normality check).
Bivariate and multivariate: how variables relate to each other
Univariate analysis does not reveal relationships. **Scatter plot** is the fundamental tool for two numeric variables: the shape of the cloud reveals linearity, clustering, and heteroscedasticity. **Correlation heatmap** covers all pairs at once. **Pair plot** (`seaborn.pairplot`) - a grid of pairwise scatter plots with per-variable distributions on the diagonal - is the standard opening move in Kaggle competitions.
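A short sketch of why the scatter plot comes before the correlation coefficient. The data here is synthetic and purely illustrative: a perfect parabola on a symmetric range has a Pearson correlation of essentially zero, even though the relationship is fully deterministic.

```python
import numpy as np

# Synthetic, symmetric x range (illustrative assumption, not lesson data).
x = np.linspace(-3.0, 3.0, 201)
y_linear = 2.0 * x + 1.0   # perfectly linear relationship
y_parabola = x ** 2        # perfectly deterministic, but nonlinear

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_parabola = np.corrcoef(x, y_parabola)[0, 1]

print(f"linear:   r = {r_linear:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

A heatmap would show the parabola pair as "uncorrelated"; only the scatter plot reveals the structure.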
**Simpson's Paradox** is a classic example of how aggregated statistics can mislead. In 1973, UC Berkeley was accused of gender bias in graduate admissions: across all departments, about 44% of male applicants were admitted versus 35% of female applicants (in the six largest departments the gap was 45% versus 30%). Apparent discrimination. EDA broken down by department told a completely different story.
**What EDA reveals**: women predominantly applied to departments C-F, which have low acceptance rates, while men applied mostly to A and B, where acceptance rates are higher. Broken down by department, women's admission rate is higher than men's in 4 of 6 departments (A, B, D, F) and within a few points of it in the other two. **The confounding variable (department) reverses the conclusion.** In ML, the same effect regularly surfaces in A/B tests when results are aggregated over heterogeneous segments.
Missing data: MCAR, MAR, MNAR - three mechanisms of absence
Missing values appear in almost every real-world dataset. The standard reaction - drop rows or fill with the mean. This is correct **in only one of three cases**, and incorrect handling creates hidden bias that a model silently absorbs. Rubin (1976) identified three missingness mechanisms that require distinct handling strategies.
**The most expensive mistake**: dropping rows with missing values when the mechanism is MNAR. Example - credit scoring: high-risk borrowers more often leave the 'income' field blank. Dropping those rows systematically removes bad borrowers from the training set. The model trains on biased data and underestimates risk. Similar patterns in mortgage application data went unnoticed in 2007-2008 partly due to absent EDA on missingness mechanisms.
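The effect of the mechanism can be seen in a small simulation. This is a sketch on synthetic data: the lognormal income distribution, the 30% MCAR rate, and the rule that the probability of a blank field grows with income rank are all illustrative assumptions, not figures from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic incomes (assumed lognormal, for illustration only).
income = rng.lognormal(mean=10.8, sigma=0.6, size=100_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(income.size) < 0.30

# MNAR: the chance of being missing grows with the income itself
# (0% for the lowest earner, 60% for the highest).
rank = income.argsort().argsort() / (income.size - 1)
mnar_missing = rng.random(income.size) < 0.60 * rank

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()   # mean after dropping MCAR rows
mnar_mean = income[~mnar_missing].mean()   # mean after dropping MNAR rows

print(f"true mean:          {true_mean:,.0f}")
print(f"after MCAR dropout: {mcar_mean:,.0f}")
print(f"after MNAR dropout: {mnar_mean:,.0f}")
```

Under MCAR, dropping rows costs sample size but leaves the mean close to the truth. Under MNAR, the same operation shifts the estimate downward, and nothing in the remaining rows signals that it happened.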
Outlier detection: when an outlier is not an error
An **outlier** is an observation that differs strongly from the rest. The reflex - delete it. But **outlier does not equal error**. Wald's bomber story (lesson stat-01) makes a related point: the absence of bullet holes where they were expected told the whole story, so the unusual or missing part of the data can be the signal. In purchase data, one customer with a $50,000 order could be a data entry mistake or a key corporate buyer. EDA helps distinguish these cases.
**Outlier investigation protocol**: before deleting, ask three questions. First - could this value physically exist (pH > 14 in drinking water = no, delete). Second - is there context that explains it (a $1M transaction = corporate account, investigate). Third - do conclusions change with and without it (if yes - run analysis both ways and document). In fraud detection, outliers are the most valuable signal: a $50k transaction among thousands of $100 ones is information, not noise.
EDA in a production ML pipeline
**EDA in a modern MLOps workflow.** EDA is not a one-time activity - in production it is an automated layer between data ingestion and feature engineering:

- **Raw Data Ingestion** - Sources: S3, BigQuery, PostgreSQL, Kafka streams. At this stage, data arrives as-is with no quality guarantees. Schema drift, missing files, encoding issues - routine reality in any production system.
- **Automated EDA / Profiling** - ydata-profiling (formerly pandas-profiling): full HTML report in one line. `from ydata_profiling import ProfileReport; ProfileReport(df).to_file('eda.html')` generates a complete EDA: distributions, correlations, missing values, duplicates, and alerts.
- **Data Quality Report** - Great Expectations: data contracts with automated validation. Defines expectations (`expect_column_values_to_be_between`, `expect_column_values_to_not_be_null`) and validates each batch. Pipeline halts if quality falls below threshold.
- **Drift Detection** - Evidently AI: monitoring distribution shift in production. Compares the reference distribution (training) with the current distribution (production). Detects covariate shift before model degradation becomes visible in business metrics.
- **Feature Engineering** - Transformations informed by EDA findings. Log-transform for skewed distributions, outlier capping, missing-value imputation strategy - all determined by EDA, not guesswork.
- **Model Training** - Only now - baseline model and experiments. MLflow/W&B for experiment tracking. EDA artifacts stored alongside the model - for audit and reproducibility.
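The idea behind the drift-detection stage can be sketched without any monitoring framework. The example below is not Evidently AI's API: it swaps in a plain two-sample Kolmogorov-Smirnov test from SciPy as a stand-in for comparing a reference sample with a production sample. The function name `has_drifted`, the `alpha` threshold, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy import stats


def has_drifted(reference, current, alpha=0.01):
    """Flag drift when a two-sample KS test rejects 'same distribution'.

    alpha is an assumed threshold; real pipelines tune it per feature.
    """
    result = stats.ks_2samp(reference, current)
    return bool(result.pvalue < alpha)


rng = np.random.default_rng(7)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # "training" feature
production = rng.normal(loc=0.5, scale=1.0, size=5_000)  # mean shifted by 0.5

print("drift detected:", has_drifted(reference, production))
```

With thousands of rows per batch, even small shifts become statistically significant, so in practice an effect-size threshold is usually paired with the p-value.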
**Netflix experimentation platform**: every A/B test begins with an automated EDA pre-check - group sizes against the planned split (sample ratio mismatch, or SRM, test), covariate balance between groups, distributions of key metrics, presence of outliers. If the pre-check fails, the test is automatically paused. This is one of the clearest examples of EDA embedded into a production decision-making loop.
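A sample ratio mismatch check is commonly implemented as a chi-square goodness-of-fit test on group sizes. The sketch below assumes a planned 50/50 split and a strict `alpha` of 0.001; the function name `srm_check` and the counts are hypothetical, and this is not a description of any particular platform's implementation.

```python
from scipy import stats


def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square test of observed group sizes against the planned split.

    Returns (p_value, mismatch_flag). A flagged test should be paused and
    the assignment pipeline investigated before any metric is read.
    """
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1.0 - expected_ratio)]
    _, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return float(p_value), bool(p_value < alpha)


# Hypothetical counts: a small imbalance and a suspicious one.
p_ok, flag_ok = srm_check(50_000, 50_200)
p_bad, flag_bad = srm_check(50_000, 52_000)
print(f"50,000 vs 50,200: p = {p_ok:.3f}, mismatch = {flag_ok}")
print(f"50,000 vs 52,000: p = {p_bad:.2e}, mismatch = {flag_bad}")
```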
Practice: eda_summary function for any dataframe
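One possible shape for such a function is sketched below. It wraps the screening calls from this lesson (dtypes, missing values, duplicates) and adds skewness and a Tukey-fence outlier count for numeric columns. The column set of the output and the toy dataframe are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def eda_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column screening: dtype, missing %, unique count, skew, IQR outliers."""
    rows = []
    for name in df.columns:
        col = df[name]
        row = {
            "column": name,
            "dtype": str(col.dtype),
            "missing_pct": round(100 * col.isna().mean(), 2),
            "n_unique": col.nunique(dropna=True),
            "skew": np.nan,
            "iqr_outliers": np.nan,
        }
        is_numeric = pd.api.types.is_numeric_dtype(col)
        if is_numeric and not pd.api.types.is_bool_dtype(col):
            values = col.dropna()
            q1, q3 = values.quantile([0.25, 0.75])
            iqr = q3 - q1
            # Tukey fences: anything beyond 1.5 * IQR from the quartiles.
            outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
            row["skew"] = values.skew()
            row["iqr_outliers"] = int(outside.sum())
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


# Toy data (illustrative): one missing age, one missing city,
# one duplicated row, one extreme income.
df = pd.DataFrame({
    "age": [25, 31, 31, 47, np.nan, 38],
    "income": [40_000, 52_000, 52_000, 61_000, 58_000, 1_000_000],
    "city": ["Oslo", "Rome", "Rome", None, "Oslo", "Lima"],
})

summary = eda_summary(df)
n_duplicates = int(df.duplicated().sum())
print(summary)
print("duplicate rows:", n_duplicates)
```

The flagged counts are prompts for investigation, not a deletion list: each outlier still needs the three-question protocol described earlier.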
Interview prep - key EDA questions:

**Q: Four datasets in Anscombe's Quartet share the same mean, std, correlation, and regression equation. Why bother looking at plots at all?**

Key points:
- Summary stats aggregate distribution shape into a few numbers, irreversibly losing patterns: nonlinearity, clusters, influential outliers
- For Dataset II, a linear model produces systematically wrong predictions outside the training range - extrapolation error

**Q: A salary dataset has 30% missing values in the 'income' column. How do you decide whether to drop rows or impute?**

Key points:
- 30% is a large proportion. First step: identify the mechanism (MCAR/MAR/MNAR)
- For income, MNAR is most likely: high earners avoid disclosure, low earners may also. Dropping creates selection bias

**Q: A water pH sensor recorded a value of 14.2. The theoretical maximum pH is 14. Is this a measurement error? How to decide?**

Key points:
- pH 14.2 is an outlier, but not automatically an error. The pH scale has a practical maximum of ~14 at 25 °C, but at different temperatures and concentrations values can slightly exceed this
- Protocol: (1) check sensor specs - accuracy ±0.1 or ±0.01? (2) check timestamp - could be a discharge event, (3) compare with neighboring measurements in time

**Q: Dataset: 1,000 features, 50,000 observations. How to run EDA in high-dimensional space?**

Key points:
- Univariate first: distribution, missing, outliers for each of the 1,000 features - automate via ydata-profiling or a custom script
- Correlation screening: a 1,000x1,000 heatmap is unreadable. Use hierarchical clustering of features, find pairs with |corr| > 0.95 to remove near-duplicates
EDA principles: Tukey and Anscombe
John Tukey (1977) established EDA philosophy: ask questions of data before computing answers. EDA is detective work - discover what data contains, not confirm hypotheses.
Anscombe's Quartet (1973): four datasets with identical mean, variance, correlation, regression line - yet radically different shapes. Nonlinearity, outliers, and degeneracy are invisible in summary statistics.
Why does Anscombe's Quartet demonstrate the necessity of EDA?
Univariate and bivariate analysis
Univariate analysis: distribution shape, central tendency, spread, skewness, kurtosis. The five-number summary and box plots reveal outliers and asymmetry before any model is fit.
Bivariate EDA: scatter plots reveal nonlinear relationships, clusters, and heteroscedasticity that Pearson correlation cannot capture. Correlation matrices summarize multivariate data, but beware spurious correlations.
A variable has mean=50 and median=30. What does this indicate?
Missing data and outlier detection
Missing data mechanisms: MCAR (random), MAR (depends on observed variables), MNAR (depends on the missing value itself). The mechanism determines the imputation strategy.
Outlier detection: IQR rule (below Q1 - 1.5×IQR or above Q3 + 1.5×IQR), z-score |z| > 3, isolation forest for multivariate data. Always investigate before dropping - outliers often carry the most information.
High earners systematically skip the income field in a survey. What missingness mechanism is this?
EDA in production ML pipelines
In production, EDA becomes an automated layer between data ingestion and feature engineering. Tools: ydata-profiling for HTML reports, Great Expectations for data contracts, Evidently AI for drift detection.
80% rule: most ML project failures trace to data quality issues discovered too late. Automated EDA in CI/CD catches schema drift, distribution shift, and missing value spikes before corrupting model training.
Which tool detects covariate shift between training and production data?
| Approach | Order of Operations | Risk |
|---|---|---|
| Confirmatory (traditional) | Hypothesis → Data → Test | Miss unexpected patterns |
| EDA-first (Tukey) | Data → Patterns → Hypotheses → Model | Overfit to visual artifacts |
| ML without EDA (common mistake) | Data → Train immediately | Garbage in, garbage out |
| ML with EDA (correct) | Data → EDA → Cleaning → Features → Model | Slower, but predictable |
All four datasets (I, II, III, IV) share:

- mean(X) = 9.0 for all four
- mean(Y) = 7.5 for all four
- var(X) = 11.0 for all four
- var(Y) ≈ 4.12 for all four
- corr(X,Y) ≈ 0.816 for all four
- reg line: Y = 3.0 + 0.5*X for all four

But visually:

- Dataset I - normal linear scatter cloud
- Dataset II - a clean parabola (nonlinear relationship)
- Dataset III - a straight line with one influential outlier
- Dataset IV - a vertical cluster + one far outlier

Conclusion: summary stats are IDENTICAL, data is FUNDAMENTALLY DIFFERENT. Fitting linear regression is correct only for Dataset I. Dataset II needs a transformation. Dataset III needs outlier investigation. Dataset IV has defective collection by design.
| Dataset | Visual Pattern | Correct Model | Error Without EDA |
|---|---|---|---|
| I | Linear scatter with noise | Linear regression | None - correct here |
| II | Smooth parabola | Polynomial regression | RMSE understated, extrapolation broken |
| III | Line + one influential outlier | Regression without outlier | Outlier shifts slope by 30%+ |
| IV | One X value + vertical cluster | Data needs investigation | Model fitting to a collection artifact |
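The shared statistics can be checked directly. The sketch below embeds the published values of Anscombe's Quartet and computes the mean, sample variance, correlation, and least-squares line for each dataset with NumPy. Plotting is left out so that the block has no dependency beyond NumPy.

```python
import numpy as np

# Anscombe (1973): datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    results[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }
    print(name, {key: round(float(value), 3) for key, value in results[name].items()})
```

The four printed rows agree to about two decimal places; the differences only become visible once each dataset is drawn as a scatter plot.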
Bin width determines what is visible in a histogram. Too narrow - noise. Too wide - structure hidden.

Freedman-Diaconis rule: `bin_width = 2 * IQR * n^(-1/3)`

Where:

- IQR = Q3 - Q1 (interquartile range)
- n = sample size

Example: housing prices, n = 1000, IQR = $150,000

`bin_width = 2 * 150000 * 1000^(-1/3) = 300000 * 0.1 = $30,000`

Number of bins ≈ (max - min) / bin_width

Why not sqrt(n) or rule of thumb = 10? FD adapts to the spread of the data: a large IQR (wide distribution) gives wider bins, a small IQR (tight distribution) gives narrower bins, and a larger n narrows them further. Because it is built on the IQR rather than the standard deviation, it is also robust to outliers. `stats.iqr()` from scipy + `np.cbrt()` do this in one line.
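A minimal sketch of the rule in code, assuming synthetic right-skewed prices (the lognormal parameters are illustrative). The helper name `fd_bin_width` is hypothetical; `np.histogram_bin_edges(..., bins="fd")` is NumPy's built-in implementation of the same estimator and is shown for comparison.

```python
import numpy as np
from scipy import stats


def fd_bin_width(values):
    """Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)."""
    x = np.asarray(values, dtype=float)
    return 2.0 * stats.iqr(x) / np.cbrt(x.size)


# Synthetic "housing prices" (assumed distribution, for illustration).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.5, sigma=0.5, size=1000)

width = fd_bin_width(prices)
n_bins = int(np.ceil((prices.max() - prices.min()) / width))

# NumPy applies the same rule when asked for bins="fd".
edges = np.histogram_bin_edges(prices, bins="fd")

# The worked example from the text: IQR = 150,000 and n = 1000.
example_width = 2.0 * 150_000 / np.cbrt(1000)

print(f"FD width: {width:,.0f} -> {n_bins} bins (numpy: {len(edges) - 1})")
print(f"worked example width: {example_width:,.0f}")
```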
Box plot encodes five numbers, left to right: lower whisker, Q1, median, Q3, upper whisker. The box spans Q1 to Q3 with the median marked inside, and points beyond the whiskers are drawn individually.

- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 - Q1

Tukey fence (outlier rule):

- Lower fence = Q1 - 1.5 * IQR
- Upper fence = Q3 + 1.5 * IQR
- Points beyond fence = outliers (plotted individually)

Example: salaries [30k, 45k, 50k, 52k, 55k, 60k, 250k]

- Q1 = 45k, Q3 = 60k, IQR = 15k
- Upper fence = 60k + 22.5k = 82.5k
- 250k > 82.5k → outlier

Without a box plot, mean = 77k - misleading. With a box plot, median 52k + one outlier is immediately visible.
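The salary example can be reproduced with a few lines. One assumption matters here: the quartiles in the text (Q1 = 45k, Q3 = 60k) follow the median-of-halves convention, so this sketch computes them that way. NumPy's default percentile interpolation gives 47.5k and 57.5k for the same seven values, which moves the fences. The function name `tukey_fences` is hypothetical.

```python
import numpy as np


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences using median-of-halves quartiles.

    For an odd sample size the overall median is excluded from both halves.
    """
    x = np.sort(np.asarray(values, dtype=float))
    if x.size < 2:
        raise ValueError("need at least two observations")
    half = x.size // 2
    q1 = np.median(x[:half])            # median of the lower half
    q3 = np.median(x[x.size - half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


salaries = np.array([30_000, 45_000, 50_000, 52_000, 55_000, 60_000, 250_000])
lower, upper = tukey_fences(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(f"fences: [{lower:,.0f}, {upper:,.0f}]")
print("outliers:", outliers.tolist())
print(f"mean: {salaries.mean():,.0f}, median: {np.median(salaries):,.0f}")
```

For small samples the choice of quartile convention can change which points are flagged, so it is worth recording which one a report uses.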
| Department | Applied (M) | Admitted (M) | Applied (F) | Admitted (F) |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |
| Total (six departments) | 2691 | 45% | 1835 | 30% |
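The reversal can be recomputed from the table. The sketch below copies the per-department applicant counts and admission rates from the rows above, rebuilds approximate admitted counts (the rates are rounded, so the counts are approximate), and compares the aggregate with the per-department view.

```python
import pandas as pd

# Per-department figures copied from the table above.
df = pd.DataFrame({
    "dept": list("ABCDEF"),
    "applied_m": [825, 560, 325, 417, 191, 373],
    "rate_m": [0.62, 0.63, 0.37, 0.33, 0.28, 0.06],
    "applied_f": [108, 25, 593, 375, 393, 341],
    "rate_f": [0.82, 0.68, 0.34, 0.35, 0.24, 0.07],
})

# Aggregate view: pool all six departments.
overall_m = (df["applied_m"] * df["rate_m"]).sum() / df["applied_m"].sum()
overall_f = (df["applied_f"] * df["rate_f"]).sum() / df["applied_f"].sum()

# Segment view: compare rates inside each department.
women_ahead = df.loc[df["rate_f"] > df["rate_m"], "dept"].tolist()

# Where each group applied: share of applications sent to departments A and B.
share_ab_m = df.loc[df["dept"].isin(["A", "B"]), "applied_m"].sum() / df["applied_m"].sum()
share_ab_f = df.loc[df["dept"].isin(["A", "B"]), "applied_f"].sum() / df["applied_f"].sum()

print(f"pooled admission rate: men {overall_m:.1%}, women {overall_f:.1%}")
print(f"departments where women's rate is higher: {women_ahead}")
print(f"share applying to A or B: men {share_ab_m:.1%}, women {share_ab_f:.1%}")
```

The last line is the confounder made visible: the two groups distributed their applications very differently across departments with very different acceptance rates.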
| Mechanism | Definition | Example | Strategy |
|---|---|---|---|
| MCAR | Missing Completely At Random - missingness independent of anything | Sensor failure at random timestamps | Row deletion is safe |
| MAR | Missing At Random - missingness depends on observed variables | Men less likely to report weight (depends on gender) | Imputation conditioned on observed vars |
| MNAR | Missing Not At Random - missingness depends on the missing value itself | High earners omit income | Model the missingness mechanism |
**Method 1: Z-score** (parametric, assumes normality)

- z_i = (x_i - mean) / std
- Outlier if |z_i| > 3
- Problem: mean and std are themselves distorted by outliers!
- Better: modified Z-score: M_i = 0.6745 * (x_i - median) / MAD, where MAD = median(|x_i - median|)
- Outlier if |M_i| > 3.5

**Method 2: IQR** (nonparametric, Tukey fence)

- Lower fence = Q1 - 1.5 * IQR
- Upper fence = Q3 + 1.5 * IQR
- (extreme fence: 3.0 * IQR for "far outliers")
- Advantage: works for any distribution

**Method 3: Isolation Forest** (ML approach)

- Idea: anomalies are easier to isolate - they require fewer random splits in a decision tree to separate
- contamination = expected fraction of outliers (0.01-0.1)
- `sklearn.ensemble.IsolationForest`
- Advantage: works in high-dimensional space
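The weakness of the classic Z-score is easy to demonstrate on a tiny sample. In the sketch below, the transaction amounts are made up for illustration: a single extreme value inflates the mean and standard deviation so much that its own z-score stays under 3, while the modified Z-score based on median and MAD flags it clearly. The function names are hypothetical, and the modified version assumes MAD is non-zero.

```python
import numpy as np


def classic_z(values):
    """Standard z-score using the mean and (population) standard deviation."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()


def modified_z(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (x - median) / mad


# Hypothetical transaction amounts: seven routine ones and one extreme value.
amounts = np.array([98, 101, 99, 102, 100, 97, 103, 50_000])

classic_flags = np.abs(classic_z(amounts)) > 3
modified_flags = np.abs(modified_z(amounts)) > 3.5

print("classic z flags: ", amounts[classic_flags].tolist())
print("modified z flags:", amounts[modified_flags].tolist())
```

This masking effect is strongest in small samples, where one point can dominate the standard deviation.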
Key takeaways
- **EDA is not optional**: 80% of ML failures happen before the model. Nightingale changed military medicine not with a formula but with a diagram - data must be seen, not just computed
- **Anscombe's Quartet**: four datasets with identical mean/std/corr can be fundamentally different. Summary stats NEVER substitute for visualization
- **MCAR / MAR / MNAR** - three missingness mechanisms requiring different strategies. Dropping rows is safe only under MCAR; under MNAR it creates bias
- **Outlier does not equal error**: an outlier is a hypothesis requiring investigation. IQR fence, modified Z-score, and Isolation Forest offer tools for different dimensionalities
- **Simpson's Paradox**: aggregate correlation can be the opposite of within-subgroup correlation. Segment-level EDA is mandatory before any causal claim
- **Production EDA is automated**: ydata-profiling, Great Expectations, Evidently AI - EDA embedded in the pipeline, not a manual yearly exercise
- **Freedman-Diaconis rule** adapts histogram bin width to the data: bin = 2*IQR*n^(-1/3)
Where to go next
EDA is the starting point. After it - understand relationships, build models, verify assumptions.
- Sampling and selection bias — stat-01: an outlier in data and an outlier from sampling are different things. EDA does not compensate for a flawed sampling design
- Correlation and its traps — stat-08: after EDA - formal analysis of correlation, partial correlation, spurious correlation
- Regression diagnostics — stat-09: EDA precedes regression - checking residual normality, heteroscedasticity, influential observations
- PCA for high-dimensional EDA — stat-14: dimensionality reduction as an EDA tool in 1,000-dimensional feature spaces
- Kernel Density Estimation — stat-23: smooth alternative to the histogram - KDE for more accurate density estimation