Machine Learning

Support Vector Machines

Three steps to the support vector machine

The SVM was built in three steps over three decades. In 1963 Vladimir Vapnik and Alexey Chervonenkis developed VC theory, the statistical framework explaining why a wide margin between classes leads to better generalization. In 1992 Bernhard Boser, Isabelle Guyon, and Vapnik added the kernel trick: by replacing dot products with kernel functions, a linear separator could carve out nonlinear boundaries without ever computing high-dimensional coordinates. Then in 1995 Corinna Cortes and Vapnik introduced the soft-margin SVM, allowing a few misclassifications so the method could cope with noisy, overlapping data. The result dominated practical classification until deep learning took over a decade later.

You are a security guard in a museum and need to stretch a rope between two groups of exhibits so that the distance from the rope to the nearest exhibit is maximized. The larger the gap - the more reliable the separation: a random visitor won't confuse the zones. That is exactly how SVM works - it doesn't just look for any boundary between classes, but for a boundary with the maximum safety margin. And if the exhibits are mixed and a rope cannot separate them? Then SVM uses the kernel trick - a mathematical trick that allows separating the inseparable.

  • **Handwritten digit recognition** - SVM was the standard for the MNIST task (70,000 images of digits 0-9) before deep learning, achieving 98.5% accuracy with RBF kernel, and is still used as a baseline in image classification tasks
  • **Text classification** - spam filters, sentiment analysis of reviews, news categorization: SVM with linear kernel handles thousands of features (words) and remains one of the best methods for text tasks with limited data

0

1

Sign In

  • **Bioinformatics** - SVM classifies proteins by function, predicts gene activity and diagnoses diseases from genetic profiles, where typically 20,000+ features (genes) and only hundreds of patient samples are available
  • Предварительные знания

    • Logistic Regression

    Hyperplane and maximum margin

    Say you need to draw a line between two groups of points on a plane - red and blue. LogisticRegression will find *some* line that separates the classes. But SVM sets a stricter goal: find a line that not only separates them, but does so with the **maximum margin** between the nearest points of different classes. Why does this matter? A large margin means the model is confident in its decision and generalizes better to new data.

    **Support vectors** are the points of each class that are closest to the separating line. They literally "support" the hyperplane and determine its position. A key property of SVM: if you remove any point that is not a support vector, the decision boundary **does not change**. Only support vectors affect the model - all other points can be deleted without consequence.

    Mathematically the SVM task is: **minimize ||w||^2 / 2** subject to all points being correctly classified: y_i * (w * x_i + b) >= 1 for each point. Here w is the normal vector to the hyperplane and b is the bias. This is a **quadratic optimization problem with constraints** - it has a unique global solution (unlike neural networks where you can get stuck in a local minimum).

    **Hard margin vs Soft margin (C-SVM):** Hard margin SVM requires ALL points to be strictly outside the margin. But real data often contains outliers or class overlap. **Soft margin** allows violations at the cost of penalty C: - **Large C** (e.g., 1000) - the model penalizes violations heavily, margin is narrow, possible overfitting - **Small C** (e.g., 0.01) - the model allows more violations, margin is wide, better generalization C is a trade-off between margin width and the number of training set errors.

    What happens to the SVM separating hyperplane if you remove a point that is NOT a support vector?

    Kernel Trick: nonlinear boundaries

    Linear SVM works great when classes can be separated by a straight line. But what if the data looks like concentric circles - one class inside, the other outside? Or the XOR task - four points on a plane where diagonally opposite ones belong to the same class? No line will separate them. The solution idea: **project data into a higher-dimensional space** where it becomes linearly separable. For example, for points (x1, x2) we can add the feature x3 = x1^2 + x2^2 (distance from center). In 3D, points of the inner circle will sit lower and the outer circle higher, and a plane in 3D easily separates them.

    But projecting to a high-dimensional space is expensive: if there are d original features, a polynomial projection of degree p creates on the order of d^p new features. For 100 features and degree 5, that is 10 billion computations per point. **Kernel trick** solves this problem elegantly: it allows computing the dot product of two points *in the high-dimensional space* **without ever going there explicitly**.

    **The essence of the kernel trick:** SVM does not need the actual coordinates in the high-dimensional space - it only needs **dot products** between points: phi(x) * phi(z). A kernel function K(x, z) computes this dot product directly: - K(x, z) = phi(x) * phi(z) - without computing phi! **Example:** for phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2) - Compute phi and dot product: 6 ops for phi, 3 for dot product = 9 - Kernel: K(x, z) = (x * z)^2 = (x1*z1 + x2*z2)^2 = **3 operations** Same result, but kernel is 3x faster. For high dimensions the difference is billions of times.

    Popular kernel functions: **linear** K(x,z) = x*z (ordinary dot product, no projection), **polynomial** K(x,z) = (x*z + c)^d (projection into degree-d polynomial space), and **RBF** (radial basis function), covered next. The choice of kernel depends on the task: linear - when data is linearly separable, polynomial - for moderate nonlinearities, RBF - the universal default option.

    What is the main advantage of the kernel trick compared to explicitly projecting data into a high-dimensional space?

    RBF Kernel: Radial Basis Function

    RBF (Radial Basis Function) kernel is the most popular and powerful kernel for SVM. Its formula: **K(x, z) = exp(-gamma * ||x - z||^2)**, where ||x - z||^2 is the squared Euclidean distance between points. Intuitively: RBF kernel measures **similarity** between two points. If points are close - K is large (near 1), if far - K approaches 0. The parameter gamma determines how quickly "similarity" drops with distance.

    A key property of RBF kernel: it corresponds to a projection into an **infinite-dimensional space**. Through the Taylor series expansion of exp(-gamma * ||x-z||^2) one can show that RBF is equivalent to a dot product in a space with infinitely many coordinates. This means that RBF SVM can theoretically separate *any* data that is in principle separable (with the right parameters).

    **Parameter gamma - the "radius of influence" of each point:** - **Small gamma** (0.001): each point influences a large region. Decision boundary is smooth and simple. May lead to underfitting. - **Medium gamma** (0.1): balance between smoothness and complexity. Usually the best result. - **Large gamma** (100): each point influences only a tiny neighborhood. Decision boundary becomes complex, winding, following every point. Classic overfitting. Rule: default gamma in sklearn = 1 / (n_features * X.var()), which often gives a good starting point.

    RBF kernel is the **default in sklearn.svm.SVC**, and for good reason. It is universal: linear kernel is a special case of RBF (at very small gamma, RBF behaves almost linearly). At the same time, RBF handles complex nonlinear boundaries that linear and polynomial kernels cannot. The only downside is two hyperparameters (C and gamma) that need tuning.

    What happens if you set a very large gamma value in RBF SVM?

    Hyperparameters and SVM in practice

    SVM with RBF kernel has two key hyperparameters: **C** (penalty for margin violation) and **gamma** (radius of influence of points). They interact with each other: increasing both at the same time leads to overfitting, decreasing both leads to underfitting. The right combination of C and gamma is the key to a good model, and it must be found through **grid search with cross-validation**.

    **Feature scaling is mandatory for SVM!** SVM computes distances between points. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the second completely dominates. StandardScaler (z-score) or MinMaxScaler brings all features to the same scale. This distinguishes SVM from decision trees and Random Forest, which are NOT sensitive to feature scale (they work with thresholds, not distances).

    **When to use SVM?** Best choice when there is little data (thousands to tens of thousands) but many features - text classification, bioinformatics (thousands of genes, hundreds of patients). In such conditions neural networks overfit, while SVM with the right kernel generalizes. **When NOT to use?** With large data volumes (>100K points) - SVM training has complexity O(n^2) to O(n^3), making it impractical on millions of examples. Also, SVM is not suitable when interpretability is needed - linear models and trees are much more transparent.

    **Practical SVM tips:** 1. **Always scale the data** - StandardScaler before SVM 2. **Start with RBF kernel** - it's the default for a reason 3. **Grid search for C and gamma** - typical ranges: C from 0.01 to 1000, gamma from 0.0001 to 10 4. **Little data, many features** - SVM is in its element 5. **Large data (>100K)** - consider LinearSVC or SGDClassifier with hinge loss 6. **Need probabilities** - use SVC(probability=True), but this doubles training time via Platt scaling

    SVM is outdated and unnecessary because neural networks solve all tasks better

    SVM remains the best choice for small data volumes with high dimensionality, where neural networks overfit but SVM generalizes thanks to the maximum margin principle

    Neural networks require large datasets to train millions of parameters. With 500-5000 samples SVM often outperforms neural networks. In bioinformatics, medical diagnostics and text analysis SVM is still competitive. No Free Lunch: there is no universally better algorithm.

    For which task would SVM (with RBF kernel) be the most suitable choice?

    Key ideas

    • **Maximum margin:** SVM finds the hyperplane that separates classes with the largest gap - only support vectors (nearest points) determine the boundary, all other data can be deleted
    • **Kernel trick:** allows computing a dot product in high-dimensional space without going there explicitly - polynomial kernel solves nonlinear tasks at a fraction of the cost of explicit projection
    • **RBF kernel:** measures point similarity through distance, projecting into infinite-dimensional space - parameter gamma controls the radius of influence of each point (small gamma = smooth boundary, large = overfitting)
    • **Practice:** feature scaling is mandatory, grid search for C and gamma via cross-validation, SVM is best with small data and high dimensionality - like the rope in the museum from our example, it finds the most reliable separation with the maximum safety margin

    Related topics

    SVM sits at the intersection of linear methods and nonlinear transformations, connecting classical algorithms with modern approaches:

    • Logistic Regression — The linear classifier predecessor of SVM: both look for a separating hyperplane, but SVM maximizes margin while Logistic Regression optimizes likelihood. For linearly separable data SVM gives greater robustness
    • Decision Trees — An alternative approach to nonlinear classification without kernel trick: trees split space into rectangles, require no scaling, but are less robust to noise. Random Forest solves the robustness issue through ensembling
    • PCA (Principal Component Analysis) — Dimensionality reduction before SVM: with thousands of features PCA can speed up training while retaining most of the data variance. Kernel PCA uses the same kernel trick as SVM
    • Neural Networks — A powerful SVM alternative for large datasets: neural networks scale to millions of examples but require more data and compute. With small data SVM often wins thanks to the maximum margin principle

    Вопросы для размышления

    • Why does SVM, which maximizes margin, usually generalize better than a classifier that simply finds any separating boundary? How does this relate to Vapnik's statistical learning theory?
    • The kernel trick allows working in an infinite-dimensional space (RBF). Why doesn't the model automatically overfit, given that in an infinite-dimensional space any data can be separated?
    • In which situations would you choose SVM over Random Forest, and vice versa? What data properties determine this choice?

    Связанные уроки

    • ml-10-logistic-regression — Logistic regression is the starting point for understanding SVM margin
    • ml-09-gradient-descent — SVM dual problem optimization uses gradient methods
    • ml-19-pca — PCA + SVM is a classical pipeline for high dimensions
    • ml-25-neural-networks — SVM kernel trick intuition transfers to neural feature extractors
    • calc-06-derivative-intro — Margin optimization requires understanding derivatives
    • prob-04-bayes — SVM and Naive Bayes are two polar approaches to classification
    • la-25-quadratic-forms
    • cvx-01
    Support Vector Machines