Machine Learning

Support Vector Machines

Three steps to the support vector machine

The SVM was built in three steps over three decades. Starting in the 1960s, Vladimir Vapnik and Alexey Chervonenkis developed the maximum-margin idea together with VC theory, the statistical framework explaining why a wide margin between classes leads to better generalization. In 1992 Bernhard Boser, Isabelle Guyon, and Vapnik added the kernel trick: by replacing dot products with kernel functions, a linear separator could carve out nonlinear boundaries without ever computing high-dimensional coordinates. Then in 1995 Corinna Cortes and Vapnik introduced the soft-margin SVM, allowing a few misclassifications so the method could cope with noisy, overlapping data. The result was a dominant method for practical classification until deep learning took over in the 2010s.

Imagine you are a security guard in a museum who needs to stretch a rope between two groups of exhibits so that the distance from the rope to the nearest exhibit is as large as possible. The wider the gap, the more reliable the separation: a wandering visitor won't confuse the zones. That is exactly how an SVM works: it doesn't settle for any boundary between the classes, but looks for the boundary with the maximum safety margin. And if the exhibits are mixed so that no straight rope can separate them? Then the SVM uses the kernel trick, a mathematical device that makes seemingly inseparable data separable.

**Handwritten digit recognition** - SVM was a standard approach to the MNIST task (70,000 images of digits 0-9) before deep learning, achieving about 98.5% accuracy with an RBF kernel, and it is still used as a baseline in image classification tasks.
**Text classification** - spam filters, sentiment analysis of reviews, news categorization: SVM with a linear kernel handles thousands of features (words) and remains one of the strongest methods for text tasks with limited data.

Kernel Trick: nonlinear boundaries

A linear SVM works well when the classes can be separated by a straight line. But what if the data looks like concentric circles, with one class inside and the other outside? Or the XOR task: four points on a plane where diagonally opposite ones belong to the same class? No line will separate them. The idea behind the solution is to **project the data into a higher-dimensional space** where it becomes linearly separable. For example, for points (x1, x2) we can add the feature x3 = x1^2 + x2^2 (the squared distance from the center). In 3D, points of the inner circle sit lower and points of the outer circle higher, so a plane easily separates them.
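The lifting idea can be checked on a toy case. The sketch below assumes two hypothetical concentric circles of radius 1 and 3 (the radii, point count, and helper names are illustrative choices, not part of the lesson) and confirms that a horizontal plane in the lifted space separates them.

```python
import math

def lift(point):
    """Map (x1, x2) to (x1, x2, x3) with x3 = x1^2 + x2^2."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

def circle(radius, n=12):
    """Return n evenly spaced points on a circle around the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = circle(1.0)  # hypothetical inner class
outer = circle(3.0)  # hypothetical outer class

# Keep only the new third coordinate of every lifted point.
inner_x3 = [lift(p)[2] for p in inner]
outer_x3 = [lift(p)[2] for p in outer]

# A horizontal plane x3 = threshold placed halfway between the classes.
threshold = (max(inner_x3) + min(outer_x3)) / 2
separable = (all(v < threshold for v in inner_x3)
             and all(v > threshold for v in outer_x3))

print(f"plane at x3 = {threshold:.2f}")
print("linearly separable after lifting:", separable)
```

In the original 2D plane no line splits these two rings, yet a single threshold on x3 does the job once the extra coordinate is added.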

But projecting to a high-dimensional space is expensive: with d original features, a polynomial projection of degree p creates on the order of d^p new features. For 100 features and degree 5, that is up to 100^5 = 10 billion product terms per point (fewer once duplicate monomials are merged, but still on the order of 10^8). The **kernel trick** solves this problem elegantly: it computes the dot product of two points *in the high-dimensional space* **without ever going there explicitly**.
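To make the blow-up concrete, here is a small sketch that counts features in two ways: the rough d^p estimate of ordered products, and the number of distinct monomials of degree at most p, which is the binomial coefficient C(d + p, p). The function names are illustrative.

```python
import math

def ordered_products(d, p):
    """Rough upper bound: d choices for each of the p factors."""
    return d ** p

def distinct_monomials(d, p):
    """Number of distinct monomials of degree at most p in d variables."""
    return math.comb(d + p, p)

d, p = 100, 5
print(ordered_products(d, p))    # 10000000000
print(distinct_monomials(d, p))  # 96560646
```

Either count is far too large to materialize for every training point, which is the motivation for working with kernels instead.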

**The essence of the kernel trick:** SVM does not need the actual coordinates in the high-dimensional space. It only needs **dot products** between points: phi(x) * phi(z). A kernel function K(x, z) computes this dot product directly:

- K(x, z) = phi(x) * phi(z), without computing phi!

**Example:** for phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)

- Compute phi and the dot product explicitly: about 6 operations for phi(x) and phi(z), plus 3 for the dot product = 9
- Kernel: K(x, z) = (x * z)^2 = (x1*z1 + x2*z2)^2 = **about 3 operations**

Same result, but the kernel is roughly 3x cheaper. In high dimensions the gap grows to many orders of magnitude.
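The identity in the example can be verified numerically. This is a minimal sketch using the same phi and K as above; the two sample points are arbitrary.

```python
import math

def phi(x):
    """Explicit degree-2 feature map from the example."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    """Ordinary dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel(x, z):
    """K(x, z) = (x * z)^2, computed entirely in the original 2D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # go to 3D, then take the dot product
implicit = kernel(x, z)         # never leave 2D

print(explicit, implicit)
print(math.isclose(explicit, implicit))
```

Both routes give the same number (up to floating-point rounding), but only the explicit route ever builds the 3D vectors.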

Popular kernel functions: **linear** K(x,z) = x*z (ordinary dot product, no projection), **polynomial** K(x,z) = (x*z + c)^d (projection into degree-d polynomial space), and **RBF** (radial basis function), covered next. The choice of kernel depends on the task: linear - when data is linearly separable, polynomial - for moderate nonlinearities, RBF - the universal default option.
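As an illustration of how the kernel choice changes what an SVM can fit, the sketch below trains two classifiers on a four-point XOR layout. It assumes scikit-learn is available; its polynomial kernel is parameterized as (gamma * x*z + coef0)^degree, so gamma=1, coef0=1, degree=2 corresponds to the polynomial kernel above with c = 1 and d = 2. The large C value is an arbitrary choice to approximate a hard margin.

```python
from sklearn.svm import SVC

# XOR layout: diagonally opposite points share a class.
X = [[-1, -1], [1, 1], [-1, 1], [1, -1]]
y = [0, 0, 1, 1]

# A linear kernel can only draw a straight line through the plane.
linear = SVC(kernel="linear", C=1000.0).fit(X, y)

# Degree-2 polynomial kernel: (1.0 * x.z + 1.0) ** 2
poly = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1000.0).fit(X, y)

linear_acc = linear.score(X, y)  # training accuracy of the linear model
poly_acc = poly.score(X, y)      # training accuracy of the polynomial model

print("linear kernel accuracy:", linear_acc)
print("polynomial kernel accuracy:", poly_acc)
```

No straight line classifies more than three of the four XOR points correctly, while the degree-2 kernel has access to the x1*x2 interaction and separates all four.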


Hyperparameters and SVM in practice

SVM with an RBF kernel has two key hyperparameters: **C** (the penalty for margin violations) and **gamma** (the inverse of each point's radius of influence: the larger gamma, the shorter the reach of a single training point). They interact: increasing both at once leads to overfitting, while decreasing both leads to underfitting. The right combination of C and gamma is the key to a good model, and it must be found through **grid search with cross-validation**.
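One way such a search might look in scikit-learn is sketched below. The synthetic two-moons dataset, the grid values, and the 5-fold setting are assumptions made for illustration, not recommendations from the lesson.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: two interleaving half-circles with noise.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Scaling lives inside the pipeline, so it is refit on every CV fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the scaler in the pipeline rather than scaling the whole dataset up front keeps information from the validation folds out of the training folds.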

**Feature scaling is mandatory for SVM!** SVM computes distances between points. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the second completely dominates. StandardScaler (z-score) or MinMaxScaler brings all features to the same scale. This distinguishes SVM from decision trees and Random Forest, which are NOT sensitive to feature scale (they work with thresholds, not distances).
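The effect is easy to see on a toy example. The sketch below uses three hypothetical rows with one feature in [0, 1] and another in [0, 1,000,000], and applies a hand-written z-score that mirrors what StandardScaler does (population standard deviation).

```python
import math
import statistics

# Hypothetical rows: (small-scale feature, large-scale feature).
rows = [(0.1, 200_000.0), (0.9, 210_000.0), (0.2, 900_000.0)]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def standardize(data):
    """Z-score every column: subtract the mean, divide by the std."""
    columns = list(zip(*data))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in data]

# Raw distances: the large-scale feature decides everything.
raw_d01 = euclidean(rows[0], rows[1])
raw_d02 = euclidean(rows[0], rows[2])

# Scaled distances: both features contribute comparably.
scaled = standardize(rows)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d01 = {raw_d01:.1f}, d02 = {raw_d02:.1f}")
print(f"scaled: d01 = {scaled_d01:.2f}, d02 = {scaled_d02:.2f}")
```

Before scaling, rows 0 and 1 look like near neighbours even though they sit at opposite ends of the first feature; after scaling, that difference counts as much as the gap in the second feature.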

**When to use SVM?** It is a strong choice when there is relatively little data (hundreds to tens of thousands of samples) but many features: text classification, bioinformatics (thousands of genes, hundreds of patients). In such conditions neural networks tend to overfit, while SVM with the right kernel generalizes well. **When NOT to use it?** With large data volumes (>100K points): kernel SVM training has complexity between O(n^2) and O(n^3), making it impractical on millions of examples. Kernel SVMs are also a poor fit when interpretability is needed, since linear models and trees are much more transparent.

**Practical SVM tips:**

1. **Always scale the data** - StandardScaler before SVM
2. **Start with RBF kernel** - it's the default for a reason
3. **Grid search for C and gamma** - typical ranges: C from 0.01 to 1000, gamma from 0.0001 to 10
4. **Little data, many features** - SVM is in its element
5. **Large data (>100K)** - consider LinearSVC or SGDClassifier with hinge loss
6. **Need probabilities** - use SVC(probability=True), but expect noticeably slower training: the Platt scaling behind it is fitted with an internal cross-validation
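For the large-data tip, a linear SVM pipeline might look like the sketch below. The synthetic dataset from make_classification and all parameter values are assumptions chosen to keep the example small; a real use case would have far more rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in for a large tabular dataset.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LinearSVC skips the kernel matrix, so it scales far better with n.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
model.fit(X_train, y_train)

test_acc = model.score(X_test, y_test)
print(f"held-out accuracy: {test_acc:.3f}")
```

The trade-off is that the decision boundary is strictly linear; if that is too restrictive, explicit feature engineering is the usual workaround at this scale.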

**Myth:** SVM is outdated and unnecessary because neural networks solve all tasks better.

**Reality:** SVM remains a strong choice for small, high-dimensional datasets, where neural networks tend to overfit but SVM generalizes thanks to the maximum margin principle.

Neural networks require large datasets to train millions of parameters. With 500-5000 samples SVM often outperforms them. In bioinformatics, medical diagnostics, and text analysis SVM is still competitive. This is the No Free Lunch principle: no algorithm is universally best.

