Machine Learning
Anomaly Detection
Every second Visa processes 1700 transactions, and among them are hidden fraudulent ones - less than 0.1% of the total flow. How do you find a needle in a haystack when you don't even know what the needle looks like? Anomaly detection is the art of finding what doesn't look like everything else, without a prior description of exactly what to look for.
- **Visa and Mastercard** use anomaly detection to block fraudulent transactions in 50ms - systems analyze the amount, geolocation, time, and purchase pattern, preventing $30+ billion in annual losses
- **NASA** applies autoencoders to monitor spacecraft telemetry - detecting anomalous sensor readings before the deviation leads to a failure has saved several missions
- **Amazon and Google Cloud** detect DDoS attacks using Isolation Forest, analyzing network traffic patterns - an abnormal spike in requests from one IP range is identified in seconds
From control charts to the Isolation Forest
Hunting for outliers is older than machine learning. In 1924 Walter Shewhart, an engineer at Bell Labs, sketched the first control chart on a single page of a memo to his boss. His insight was that any process has natural variation, and a point falling beyond three standard deviations is probably a real fault rather than chance. The same three-sigma rule still sits behind the z-score today. Control charts became the backbone of industrial quality control for the rest of the century. The machine-learning era brought a different idea: instead of modeling what is normal, model how easily a point can be separated from everything else. In 2008 Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou published the Isolation Forest, which isolates anomalies with random splits and needs only a handful of cuts to fence off an outlier. It scaled to millions of rows and quickly became a default tool, but its lineage runs straight back to Shewhart's hand-drawn chart on the factory floor.
Prerequisites
Isolation Forest
Consider describing the location of a specific house on a map. A house in a dense residential block requires many qualifiers: "such-and-such street, between houses 12 and 14, third entrance". A house standing alone in an empty field needs only one remark: "that house in the field". **Isolation Forest** works on the same principle: anomalies are *easy to isolate* because they are far from the rest of the data.
The algorithm builds an **ensemble of random trees** (usually 100). Each tree recursively partitions the space: randomly choosing a feature and a random split point within that feature's range. An anomalous point, far from the others, will be cut off closer to the root of the tree - it needs **fewer splits to become isolated**. A normal point surrounded by neighbors will require many splits before it ends up alone in a leaf.
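The isolation idea can be illustrated with a toy sketch in pure Python. This is not the full algorithm (there is no subsampling and no stored tree structure): the hypothetical helper `isolation_depth` simply repeats random feature and threshold cuts, keeps the side that contains the target point, and counts cuts until the point is alone. The synthetic data and the outlier at `(8.0, 8.0)` are assumptions chosen purely for illustration.

```python
import random

def isolation_depth(points, target, rng, max_depth=64):
    """Count random splits needed until `target` is the only point left."""
    current = list(points)
    for depth in range(max_depth):
        if len(current) <= 1:
            return depth
        feature = rng.randrange(len(target))      # pick a random feature
        values = [p[feature] for p in current]
        lo, hi = min(values), max(values)
        if lo == hi:                               # no spread on this feature: skip this round
            continue
        split = rng.uniform(lo, hi)                # pick a random split point in the range
        # Keep only the side of the split that contains the target.
        if target[feature] < split:
            current = [p for p in current if p[feature] < split]
        else:
            current = [p for p in current if p[feature] >= split]
    return max_depth

def average_depth(points, target, n_trees=100, seed=0):
    """Average isolation depth over many independent random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees

data_rng = random.Random(42)
normal_points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)                               # hypothetical far-away point
points = normal_points + [outlier]
central = min(normal_points, key=lambda p: p[0] ** 2 + p[1] ** 2)

avg_outlier = average_depth(points, outlier)
avg_central = average_depth(points, central)
print(f"outlier: {avg_outlier:.1f} splits, central point: {avg_central:.1f} splits")
```

The far-away point should need far fewer cuts on average than the point in the middle of the cloud, which is exactly the signal the ensemble aggregates.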
The key metric is the **anomaly score**, based on a point's average isolation depth (path length) across all trees in the ensemble. The formula normalizes this depth by the expected path length in a random binary tree built on the same number of points. The score approaches 1 for anomalies (short paths) and falls well below 0.5 for normal points (long paths). A score of 0.5 corresponds to a path of exactly average length, so it serves as the natural boundary: above it, the point is treated as an anomaly.
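As a sketch of the normalization, the score from the original Isolation Forest paper is s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length and c(n) = 2H(n-1) - 2(n-1)/n is the expected path length for n points (H is the harmonic number, approximated by ln plus Euler's constant). The function names below are illustrative, and n = 256 is assumed as the per-tree sample size.

```python
import math

EULER_GAMMA = 0.5772156649015329

def expected_path_length(n):
    """c(n): expected path length in a random binary tree built on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s = 2^(-E[h] / c(n)): near 1 for short paths, below 0.5 for long ones."""
    return 2.0 ** (-avg_path_length / expected_path_length(n))

n = 256                                        # assumed sample size per tree
for path in (3.0, expected_path_length(n), 20.0):
    print(f"avg path {path:5.2f} -> score {anomaly_score(path, n):.3f}")
```

A path of exactly c(n) gives a score of 0.5 by construction. Note that scikit-learn's `IsolationForest.score_samples` returns the negative of this score, so lower values there mean more anomalous.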
**Why Isolation Forest is so popular:**
- **Speed:** O(n * log(n)) - log-linear, handles millions of rows
- **Doesn't require distances:** unlike density-based methods, it doesn't compute pairwise distances
- **Robust to high dimensionality:** each split uses a single randomly chosen feature, so nothing is ever computed across all dimensions at once
- **No labels needed:** fully unsupervised, doesn't require labeled anomalies
- **Few hyperparameters:** the main one is `contamination` - the expected fraction of anomalies (other parameters work well by default)
**Tuning `contamination`:** this parameter critically affects results. If it is set to 0.01 (1%) but anomalies actually make up 5%, the model flags only about 1% of points and misses most of the rest. If it is set to 0.1 (10%) but anomalies are only 1%, most flagged points will be false positives. Start with a realistic estimate (for fraud detection usually 0.001-0.01) and tune based on precision/recall.
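A minimal sketch of how `contamination` is used in practice, assuming scikit-learn's `IsolationForest`. The data is synthetic: 990 points from a standard normal cloud plus 10 planted outliers on a distant ring, so the true anomaly fraction (1%) is known and matches the setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): 990 normal points, 10 planted outliers.
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=10)
radii = rng.uniform(8.0, 10.0, size=10)
X_outliers = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X = np.vstack([X_normal, X_outliers])

# contamination=0.01 tells the model to flag roughly the top 1% most anomalous points.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = model.fit_predict(X)                 # -1 = anomaly, 1 = normal

# score_samples returns the negated paper-style score; flip the sign so higher = more anomalous.
paper_style_scores = -model.score_samples(X)

n_flagged = int((labels == -1).sum())
n_planted_found = int((labels[990:] == -1).sum())
print(f"flagged {n_flagged} points, {n_planted_found} of 10 planted outliers among them")
```

Because `contamination` only sets where the cut on the score falls, re-running with a different value changes how many points are flagged, not their ranking.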
Why do anomalous points in Isolation Forest have a short average path in the trees?
One-Class SVM
Isolation Forest isolates anomalies with random splits. But what if we need a stricter approach - to **explicitly draw a boundary** around normal data? **One-Class SVM** does exactly that: training *only on normal data*, it finds a hypersurface that encloses the normal region and cuts off everything beyond it.
The idea comes from classical SVM. Regular SVM separates two classes with a maximum margin. One-Class SVM solves a different task: **separating the data from the origin** in feature space. All training points (normal) must end up on one side of the hyperplane, and the origin on the other. The result is a "bubble" in feature space - inside is normal, outside is anomaly.
The **RBF kernel** (Radial Basis Function) allows One-Class SVM to build non-linear boundaries. Data is projected into a higher-dimensional space where a linear hyperplane can describe the complex shape of the "normal region". The **gamma** parameter controls the influence radius of each point: high gamma - the boundary tightly wraps around the data (risk of overfitting), low gamma - the boundary is smoother and more generalized.
**Parameter `nu` - the key hyperparameter of One-Class SVM:**
- Sets the **upper bound on the fraction of errors** on training data
- Simultaneously the **lower bound on the fraction of support vectors**
- `nu=0.05` means: we allow up to 5% of training data to fall outside the boundary
- Smaller `nu` - wider boundary, fewer false positives, but more missed anomalies
- Larger `nu` - tighter boundary, more detections, but higher risk of false positives
**When One-Class SVM is worse than Isolation Forest:**
- **Large datasets:** training cost grows between O(n^2) and O(n^3) because of the kernel matrix, which also dominates memory use. On 100K+ points it becomes impractical
- **High dimensionality:** the RBF kernel works poorly with hundreds of features without feature selection
- **Preprocessing is mandatory:** StandardScaler is critical; without it, features with large numeric ranges dominate the kernel and the boundary is distorted

Use One-Class SVM when the data is small (up to 50K points), dimensionality is moderate, and a clear decision boundary is important.
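The pieces discussed in this section (`nu`, `gamma`, mandatory scaling) fit together roughly as follows. This is a sketch assuming scikit-learn's `OneClassSVM` and `StandardScaler`; the two features, with deliberately different scales, and the test points are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" training data: feature 1 is around 50 +/- 10, feature 2 around 0.5 +/- 0.1.
X_train = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(500, 2))

# Scaling comes first so both features contribute equally to the RBF kernel.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)                            # trained on normal data only

X_new = np.array([
    [52.0, 0.48],                             # close to the training cloud
    [120.0, 0.50],                            # extreme in the large-scale feature
    [50.0, 1.50],                             # extreme in the small-scale feature
])
pred = model.predict(X_new)                   # 1 = inside the boundary, -1 = outside

# With nu=0.05, roughly 5% of the training points are expected to fall outside the boundary.
train_outlier_fraction = float(np.mean(model.predict(X_train) == -1))
print(pred, round(train_outlier_fraction, 3))
```

The third point is the one that scaling protects: in raw units its deviation of 1.0 is tiny next to the first feature's spread, so an unscaled kernel would barely register it.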
What is the main difference between One-Class SVM and regular (two-class) SVM in the context of anomaly detection?
Autoencoder for Anomaly Detection
Isolation Forest and One-Class SVM handle tabular data well. But what if the data consists of surveillance camera images or time series from thousands of sensors? Here the **autoencoder** takes center stage - a neural network that learns to *compress* data and *reconstruct* it back. The idea is brilliantly simple: if the autoencoder is trained on normal data, it will reconstruct normal data well, but **poorly reconstruct anomalies**, because it has never seen anything like them.
The anomaly detection process has two stages. **Stage 1 (training):** the autoencoder trains *only on normal data*, minimizing reconstruction error. The network memorizes normality patterns and learns to reproduce them. **Stage 2 (detection):** for each new point we compute the reconstruction error. If it exceeds the threshold - the point is anomalous. The threshold is usually chosen as the 95th or 99th percentile of the error on the training set.
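The two stages can be sketched without a deep-learning framework by swapping in a linear autoencoder, which is equivalent to PCA and can be computed with an SVD. This is a stand-in for a neural network, not a replacement: the encoder and decoder share one weight matrix `W`, and the synthetic data (points near a line in 3-D) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normal data: points close to a 1-D line embedded in 3-D space.
t = rng.normal(size=(1000, 1))
direction = np.array([[1.0, 2.0, -1.0]])
X_train = t @ direction + 0.05 * rng.normal(size=(1000, 3))

# Stage 1 (training): the top principal direction acts as tied encoder/decoder weights.
mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = vt[:1]                                    # bottleneck of size 1

def reconstruction_error(X):
    """Mean squared reconstruction error per sample."""
    Z = (X - mean) @ W.T                      # encode: compress to the bottleneck
    X_hat = Z @ W + mean                      # decode: reconstruct the input
    return np.mean((X - X_hat) ** 2, axis=1)

# Threshold: 99th percentile of the error on the (normal) training data.
threshold = np.percentile(reconstruction_error(X_train), 99)

# Stage 2 (detection): compare the error of new points with the threshold.
X_new = np.array([
    [1.0, 2.0, -1.0],                         # follows the learned pattern
    [1.0, -2.0, 1.0],                         # similar magnitudes, wrong structure
])
is_anomaly = reconstruction_error(X_new) > threshold
print(is_anomaly)
```

Both new points have similar per-feature magnitudes; only the reconstruction error separates them, because the second one does not lie along the pattern the model learned.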
**Advantages of the autoencoder over Isolation Forest and One-Class SVM:**
- **Complex data:** works with images, audio, time series - where IF and OC-SVM are ineffective
- **Scalability:** trains on GPU, processes millions of points
- **Architecture flexibility:** convolutional layers for images, LSTM for time series, attention for sequences
- **Interpretability:** the reconstruction error per feature reveals *which exact* features were poorly reconstructed - this points to the type of anomaly
**Choosing the threshold** is a critical step. The most common approach is the **percentile method**: compute reconstruction error on the training set and take the 95th or 99th percentile as the threshold. More advanced methods: fitting the error distribution (often log-normal) and choosing the threshold at a given significance level. In practice the threshold is often set manually by analyzing the error histogram and balancing precision/recall.
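The distribution-fitting variant can be sketched with the standard library alone. The assumption here is that reconstruction errors are roughly log-normal, so their logarithms are roughly normal. The errors are simulated with `lognormvariate` as a stand-in for real model output, and `alpha` is a hypothetical significance level.

```python
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(7)

# Synthetic stand-in for reconstruction errors on normal data (assumed log-normal).
errors = [rng.lognormvariate(-3.0, 0.5) for _ in range(5000)]

# Fit a normal distribution to the log-errors.
log_errors = [math.log(e) for e in errors]
mu, sigma = mean(log_errors), stdev(log_errors)

# Threshold at significance level alpha: the (1 - alpha) quantile of the fitted log-normal.
alpha = 0.01
threshold_fit = math.exp(mu + NormalDist().inv_cdf(1 - alpha) * sigma)

# Plain percentile method for comparison.
threshold_pct = sorted(errors)[int((1 - alpha) * len(errors))]

fraction_above = sum(e > threshold_fit for e in errors) / len(errors)
print(f"fitted: {threshold_fit:.4f}, percentile: {threshold_pct:.4f}, above: {fraction_above:.3%}")
```

When the log-normal assumption holds, the two thresholds land close together. The fitted one is less noisy in the far tail, where the percentile method has only a handful of samples to work with.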
**Autoencoder anomaly detection pitfalls:**
- If the bottleneck is too wide, the autoencoder can learn a near-identity mapping that reconstructs everything well, anomalies included, so reconstruction error is low across the board
- If the bottleneck is too narrow, error will be high even for normal data, making the threshold useless
- Training data must be **clean** - even 1-2% anomalies in the training set can ruin the model
- The autoencoder may learn to reconstruct *typical* anomalies if they appear frequently
How does an autoencoder determine whether a new point is an anomaly?
Statistical Methods
We've looked at three ML methods for anomaly detection. But sometimes you don't need heavy artillery - statistics is enough. **Z-score**, **IQR**, and **Mahalanobis distance** are classic methods that work fast, require no training, and are transparent in interpretation. They are ideal as a first analysis step and a baseline for comparison with ML models.
**Z-score** measures how many standard deviations a point is from the mean. Formula: z = (x - mean) / std. Rule: if |z| > 3, the point is considered anomalous. This works under a **normal distribution** - in that case only 0.27% of values fall beyond 3 sigma. The method is simple and fast, but fragile: one strong anomaly shifts the mean and std, masking other anomalies.
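The masking effect is easy to reproduce with the standard library. The latency values below are hypothetical: twenty readings near 100 ms plus two anomalies, 250 ms and 2000 ms. For comparison the sketch also applies the IQR rule (Tukey's fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), which relies on quartiles instead of the mean and standard deviation.

```python
import statistics

# Hypothetical response times in ms: 20 normal readings and two anomalies.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 104, 96,
                101, 99, 102, 98, 100, 103, 97, 101, 99, 100,
                250, 2000]

# Z-score: the 2000 ms value inflates both the mean and the std.
mean = statistics.mean(latencies_ms)
std = statistics.pstdev(latencies_ms)
z_flagged = [x for x in latencies_ms if abs((x - mean) / std) > 3]

# IQR rule: quartiles barely move when a few extreme values are present.
q1, _, q3 = statistics.quantiles(latencies_ms, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flagged = [x for x in latencies_ms if x < lower or x > upper]

print(z_flagged)    # [2000]
print(iqr_flagged)  # [250, 2000]
```

The 2000 ms reading pulls the mean to about 193 ms and the standard deviation to almost 400 ms, so 250 ms ends up well under one standard deviation from the mean and slips through.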
**Mahalanobis distance** is a multivariate generalization of z-score. It accounts for **correlations between features**, which z-score and IQR cannot do since they work with each feature separately. Formula: D = sqrt((x - mean)^T * S^(-1) * (x - mean)), where S is the covariance matrix. A point with a large Mahalanobis distance is anomalous, even if it is within the normal range for each individual feature.
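A short NumPy sketch of the formula above, on synthetic data with two strongly correlated features (the correlation of 0.9 and the two test points are assumptions for illustration). Both test points sit 1.5 units from the mean on each feature, so per-feature z-scores cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: two features with unit variance and correlation 0.9.
true_cov = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], true_cov, size=2000)

mean = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of the sample covariance matrix S

def mahalanobis(x):
    """D = sqrt((x - mean)^T * S^(-1) * (x - mean))."""
    d = x - mean
    return float(np.sqrt(d @ S_inv @ d))

on_trend = np.array([1.5, 1.5])                  # follows the correlation
off_trend = np.array([1.5, -1.5])                # violates the correlation

d_on, d_off = mahalanobis(on_trend), mahalanobis(off_trend)
print(f"on-trend: {d_on:.2f}, off-trend: {d_off:.2f}")
```

The off-trend point gets a much larger distance even though each coordinate is unremarkable on its own, because the combination contradicts the correlation seen in the data.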
**Practical method selection rule.** Start simple: IQR for initial analysis of 1D features. If data is multivariate and correlated - Mahalanobis. For moderately complex tabular data - Isolation Forest (fast, universal, requires minimal tuning). For data with a clear "normal" pattern and no anomaly examples - One-Class SVM. For images, time series, and complex structures - autoencoder. And remember: a combination of methods (ensemble) often outperforms any single one.
**Common misconception:** For anomaly detection it's enough to use z-score with a threshold of |z| > 3 - this is a universal and reliable method
**In reality:** Z-score assumes a normal distribution and is sensitive to outliers. For real data you need to choose the method for the task: IQR for robustness, Mahalanobis for multivariate correlations, Isolation Forest for tabular data, autoencoder for complex structures
Real data rarely has a normal distribution. Extreme anomalies distort mean and std, masking less pronounced anomalies. Multivariate anomalies (normal on each individual feature but anomalous in combination) aren't detected by z-score at all. The method is a tool, and different tasks need different tools.
Why did IQR detect both anomalies (250 and 2000 ms) while z-score only detected one (2000 ms)?
Key Ideas
- **Isolation Forest** isolates anomalies with random splits - the shorter the path in the tree, the more likely it's an anomaly. Fast, scalable, universal for tabular data
- **One-Class SVM** builds an explicit boundary around normal data in feature space. Trains without anomaly examples, but is limited in scalability (up to ~50K points)
- **Autoencoder** detects anomalies by high reconstruction error - ideal for complex data: images, time series, audio
- **Statistical methods** (z-score, IQR, Mahalanobis) are fast baselines, but z-score is sensitive to outliers while Mahalanobis accounts for correlations

As with Visa blocking fraud in milliseconds, the right method depends on the data type and the scale of the task: sometimes IQR is enough, other times an ensemble of models is needed.
Related Topics
Anomaly detection intersects with clustering, support vectors, and autoencoders. Here are the key connections for deepening understanding:
- K-Means Clustering — K-Means groups data into clusters - points that don't fit any cluster or are far from centroids can be considered anomalies. This is the simplest anomaly detection via clustering
- DBSCAN — DBSCAN explicitly marks noise points that don't belong to any cluster - a direct density-based anomaly detection method, unlike the isolation-based (IF) or boundary-based (OC-SVM) approaches
- Support Vector Machines — One-Class SVM is an adaptation of classical SVM for a one-class task. Understanding the kernel trick and maximum margin helps tune and interpret OC-SVM
- Autoencoders — Full breakdown of autoencoder architectures: variational autoencoder, denoising autoencoder, bottleneck selection - the foundation for deep understanding of anomaly detection through reconstruction
Questions for Reflection
- If anomalies in your system gradually become the "new normal" (e.g., growing service traffic), how do you need to adapt the anomaly detection model? Which of the four methods is easiest to retrain on new data?
- Fraudsters constantly change their behavior patterns to evade detection. Which of the discussed approaches is most robust against adaptive adversaries and why?
- In medical diagnostics, a missed anomaly (false negative) can cost a life, while a false alarm (false positive) only means an additional test. How would you tune the threshold and contamination, given this asymmetry of errors?
Related Lessons
- ml-16-clustering-kmeans — Distance to cluster centers flags anomalies
- ml-18-dbscan — Density noise points are natural outliers
- ml-47-model-monitoring — Drift and outlier detection guard models in production
- prob-11-normal — Gaussian tails define statistical outlier thresholds
- stat-05-hypothesis — Flagging an outlier is rejecting the null of normal
- stat-11-bayesian