Machine Learning
K-Nearest Neighbors
A 1951 Air Force report and a tight error bound
Nearest-neighbor classification began in an unpublished 1951 technical report written by Evelyn Fix and Joseph Hodges for the US Air Force School of Aviation Medicine. They studied nonparametric discrimination: classifying a point by looking at the labeled examples nearest to it, with no assumptions about the underlying distribution. The idea sat quietly until 1967, when Thomas Cover and Peter Hart proved a striking result: as the amount of data grows, the error rate of the simple 1-nearest-neighbor rule is never worse than twice the best possible (Bayes) error. That guarantee showed that a method this simple could be provably competitive, and it secured KNN a permanent place in the toolkit.
You have just moved to a new neighborhood and want to understand if it is expensive or cheap. The most natural way - look at the neighboring houses. If the three nearest houses cost 15, 18 and 16 million, yours is probably in the same range. You just applied the KNN algorithm - K-Nearest Neighbors. No formulas, no neural networks, no training. Just looked at the neighbors and drew a conclusion. But why does such a naive approach work? How do you choose how many neighbors to look at? And why does it suddenly break down in a 100-dimensional space?
- **Recommendation systems** (Amazon, Spotify) use KNN-like algorithms: if users with similar tastes bought product X, you might like it too - collaborative filtering is based on finding "nearest neighbors" among users
- **Anomaly detection** in bank transactions: if a transaction is "far" from the K nearest normal operations of a client, it is suspicious - KNN-based anomaly detection is used in fraud detection systems
- **Medical diagnostics:** for rare diseases where there is too little data to train complex models, KNN looks for similar patients with a known diagnosis based on blood tests and symptoms - and often turns out to be more accurate than more complex models
Предварительные знания
Distance metrics
KNN works on one simple idea: **similar objects are close to each other**. If you see an unknown fruit and the three nearest ones by size and color are apples, then it is probably an apple too. But to "find the nearest", you need to define what "close" means. That is what **distance metrics** are for - formulas that turn the intuitive notion of "similarity" into a concrete number.
**Euclidean distance** is the most intuitive metric. This is the "straight-line" distance between two points, familiar from school. For two points with coordinates (x1, y1) and (x2, y2) the formula is: sqrt((x1-x2)^2 + (y1-y2)^2). In the general case for N features: sqrt(sum((xi - yi)^2)). Euclidean distance works well when features have the same scale and when the space is not too high-dimensional.
**Manhattan distance** is the distance "along city streets". Say you are in New York: you cannot walk diagonally through a block, only along streets - right and up. Formula: sum(|xi - yi|). Manhattan distance is more robust to outliers than Euclidean: one extreme feature does not inflate the total distance as much, because there is no squaring.
**Minkowski distance** is a generalization of both metrics. Formula: (sum(|xi - yi|^p))^(1/p). With **p=1** we get Manhattan, with **p=2** - Euclidean. As p -> infinity the metric becomes Chebyshev distance - the maximum difference along any one feature. Minkowski allows smoothly tuning the "character" of distance through the parameter p.
**Cosine similarity** measures not distance but the **angle** between vectors. Two documents with the same words but different lengths will be "far" by Euclidean distance but "close" by Cosine, because their directions match. Formula: cos(theta) = (A . B) / (|A| * |B|). Cosine similarity is indispensable for text and high-dimensional sparse data - where absolute values matter less than proportions.
**Normalization is critical for KNN!** Take two features: salary (30,000–150,000) and age (18–65). Without normalization, salary will completely dominate the distance - a difference of 10,000 in salary will "outweigh" a difference of 20 years in age. KNN will effectively ignore age. **Solution:** StandardScaler (z-score) or MinMaxScaler before feeding data to KNN - always.
Why is it necessary to normalize features before applying KNN?
Choosing K: odd numbers and cross-validation
The parameter **K** in KNN is the number of nearest neighbors the algorithm uses to make a decision. It may seem like a minor detail - but the choice of K determines whether the model memorizes noise or ignores real patterns. This is the classic **bias-variance tradeoff**, and KNN illustrates it as directly as any algorithm.
**K=1** means: "predict the class of exactly the one object that is closest." This makes the model **extremely sensitive to noise**. One mislabeled example - and all new points near it will be classified incorrectly. The decision boundary at K=1 becomes jagged and winding, wrapping around every individual point - classic **overfitting**.
**K=N** (equal to the total number of objects in the training set) - the other extreme. Every new point will be classified as the majority class across the entire dataset, regardless of its position. If 60% of the training data is class A, the model will *always* predict A. This is **underfitting** - the model does not consider the local data structure at all.
For **binary classification** (two classes) K is chosen to be odd - 3, 5, 7... The reason is simple: with even K a tie is possible (2 neighbors of class A and 2 of class B), and the algorithm doesn't know what to predict. An odd K guarantees one class is always in the majority. For tasks with 3+ classes a tie is possible even with odd K, but less likely.
**Empirical rule for starting K:** try K = sqrt(N), where N is the number of training examples. For 1000 examples: sqrt(1000) ~ 31. Then check odd values around it: 29, 31, 33. This rule is not optimal, but gives a reasonable starting point for further tuning via cross-validation.
**Cross-validation** is the only reliable way to choose K. The idea: split the data into, say, 5 parts (folds). For each candidate K (say, from 1 to 30): train on 4 parts, test on the 5th, repeat 5 times, average the accuracy. The K with the best average accuracy wins. This is **GridSearchCV** in scikit-learn.
What happens if K is set equal to the total number of training examples?
The curse of dimensionality
KNN works great in 2-3 dimensions, but as the number of features grows something counterintuitive happens: **in high-dimensional spaces all points become equally far from each other**. This phenomenon is called the **curse of dimensionality**, and it calls into question the very idea of "nearest neighbors".
Consider the unit square [0,1]x[0,1]. A sphere (circle) of radius 0.5 occupies pi/4 ~ 78.5% of the square's area. Now take the unit cube [0,1]^3 - a sphere of radius 0.5 occupies only 52.4% of the volume. In 10 dimensions a hypersphere occupies only **0.25%** of the hypercube's volume. In 100 dimensions - a number so small it is effectively zero. Data "spreads" to the corners of the hypercube, and the central region empties out.
More formally: in a d-dimensional space the distance to the nearest neighbor (d_min) and the farthest (d_max) converge. The ratio (d_max - d_min) / d_min approaches zero as d grows. This means the notion of "nearest neighbor" loses meaning - all neighbors are roughly "equally far". KNN starts to degrade at around **20-30 features**, though the exact threshold depends on data volume.
**Solutions to the dimensionality problem for KNN:** 1. **PCA (Principal Component Analysis)** - compress 100 features to 10-20 principal components, retaining 95% of the variance 2. **Feature Selection** - select only the most informative features (mutual information, chi-squared test) 3. **Increase data volume** - but for d dimensions you need exponentially more data: N ~ exp(d) 4. **Use a different algorithm** - decision trees, Random Forest, gradient boosting do not suffer from the curse of dimensionality as severely
Why does KNN degrade in high-dimensional spaces (50+ features)?
Weighted KNN and practice
In standard KNN all K neighbors have equal votes: the nearest at distance 0.1 and the farthest at distance 5.0 influence the prediction equally. This is unfair - a closer neighbor should matter more. **Weighted KNN** solves this: each neighbor votes with a weight inversely proportional to its distance. A neighbor at distance 0.1 gets weight 10, while one at distance 5.0 gets weight 0.2.
KNN works not only for classification, but also for **regression**. Instead of voting for a class, the algorithm takes the average value of the target variable of the K nearest neighbors. For example, to predict apartment price: find the 5 nearest by area, floor, neighborhood - and average their prices. With weights: KNeighborsRegressor(weights='distance') - closer apartments influence the prediction more.
KNN is a **lazy learning** algorithm. Unlike linear regression or neural networks, KNN **has no training phase**. It simply memorizes all training data. All the work happens at prediction time: for every new object, distances to *all* training examples must be computed and K nearest found. This makes **training instant, but prediction slow** - O(N*d) for each query.
**Data structures for speeding up KNN:** - **KD-Tree** - divides space into rectangular regions. Speeds up search to O(d * log N) instead of O(N * d). Works well when d < 20 - **Ball Tree** - divides into hyperspheres. Better than KD-Tree when d > 20, but worse in low dimensions - **Brute Force** - exhaustive search. The only option for very large d or special metrics In scikit-learn: `KNeighborsClassifier(algorithm='kd_tree')` or `algorithm='auto'` (it chooses automatically).
KNN is too simple for real tasks - it is just a teaching algorithm; in practice neural networks or boosting are always better
KNN is a powerful baseline and production-ready algorithm for tasks with a small number of features and data volume; it is widely used in recommendation systems, anomaly detection and as a component of ensembles
Algorithm complexity does not guarantee superiority. On small datasets with nonlinear boundaries KNN can outperform neural networks that overfit. In recommendation systems (collaborative filtering) KNN is one of the standard approaches. And in anomaly detection tasks, the distance to the K-th neighbor is a direct indicator of how unusual an object is.
Why is KNN called a "lazy learning" algorithm?
Key ideas
- **Distance metrics** define what "similar" means: Euclidean (straight line), Manhattan (along streets), Cosine (by angle) - the choice of metric and mandatory feature normalization critically affect the result
- **Choosing K** is the main hyperparameter: K=1 memorizes noise (overfitting), K=N ignores structure (underfitting), optimal K is found via cross-validation, and for binary classification K must be odd
- **Curse of dimensionality** makes KNN useless at 30+ features - all points become equally far, and PCA or feature selection are necessary to reduce dimensionality
- **Weighted KNN** (weights='distance') gives closer neighbors more weight, while the algorithm itself is a lazy learner with no training: instant fit but slow predict. Like evaluating a house by its neighbors - the method is simple and intuitive, which is exactly why it is still used in recommendations, anomaly detection and medical diagnostics
Related topics
KNN is closely connected to data preprocessing, model evaluation and other classification algorithms:
- Data Preprocessing — Normalization (StandardScaler, MinMaxScaler) is critical for KNN - without it, features at different scales distort distances
- Support Vector Machines — SVM also uses the concept of distance, but builds a separating hyperplane - unlike KNN, which decides locally based on neighbors
- PCA - Principal Component Analysis — PCA reduces dimensionality and solves the curse of dimensionality - the key problem of KNN with many features
- Model Evaluation — Cross-validation, accuracy, precision, recall - tools for choosing the optimal K and evaluating KNN quality
Вопросы для размышления
- If KNN has no training phase and simply memorizes data, can it be considered a machine learning algorithm? Where is the boundary between "memorizing" and "learning"?
- In what situations could a simple KNN with K=5 be more accurate than a deep neural network? Think about data size, number of features and complexity of the class boundary.
- How would you apply the idea of "nearest neighbors" to data where ordinary distance metrics don't work - for example, to social network graphs or DNA sequences?
Связанные уроки
- ml-04-data-preprocessing — Data normalization is critical for KNN
- ml-13-svm — SVM and KNN are both instance-based but different principles
- ml-19-pca — PCA reduces dimensionality before KNN for curse of dimensionality
- ir-06 — Semantic search is KNN in embedding space
- aie-09-embeddings — Vector search in RAG is KNN with neural embeddings
- alg-10-binary-search — ANN indices (HNSW) speed up KNN like binary search speeds up linear search