Machine Learning
Naive Bayes
Every day Gmail filters 15 billion spam emails. At the core of one of the earliest and still-working approaches is an algorithm that makes a knowingly incorrect assumption - that all words in an email appear independently of each other. This is mathematically incorrect: the words "free" and "discount" directly correlate. But the algorithm built on this "naive" assumption classifies spam with over 95% accuracy. How does a flawed premise lead to correct results?
- **Gmail spam filter** started with Naive Bayes (Paul Graham's essay "A Plan for Spam", 2002) and processes 15+ billion emails daily - NB is still used as one of the filtering layers
- **Medical diagnostics:** Bayesian classifiers help doctors estimate the probability of a disease based on symptoms - a medical test with 95% accuracy for a rare disease (1% prevalence) gives only a 16% probability of disease, and without Bayes' theorem doctors systematically misinterpret results
- **Real-time sentiment analysis:** companies monitor millions of social media mentions using NB classifiers, determining positive and negative reviews in milliseconds - where BERT would require a GPU, NB runs on a regular CPU
From a posthumous essay to spam filters
Reverend Thomas Bayes never published the theorem that bears his name. He died in 1761, and two years later his friend Richard Price presented the manuscript "An Essay towards solving a Problem in the Doctrine of Chances" to the Royal Society. The work went largely unnoticed until Pierre-Simon Laplace independently rediscovered and generalized the idea around 1774, giving it the modern form and applying it to real problems in astronomy and demographics. For almost two centuries Bayesian reasoning stayed mostly theoretical. The practical revival came with text: in 1998 Sahami and colleagues published a Bayesian approach to filtering junk email, and in 2002 Paul Graham's essay "A Plan for Spam" turned the idea into a movement, showing that a simple word-counting classifier could stop spam better than hand-written rules. A theorem from a clergyman's notebook became the first line of defense in every inbox.
Предварительные знания
Bayes' Theorem and the Naive Assumption
In 1763, two years after the death of Reverend Thomas Bayes, his friend Richard Price published a manuscript that changed mathematics forever. Bayes' theorem answers a fundamental question: **how do we update our beliefs when we receive new data?** The formula is surprisingly simple, but its implications permeate everything - from medical diagnostics to spam filters.
The result of the medical example seems paradoxical: a test with 95% accuracy, yet a positive result gives only a 16% probability of disease! The key is the **prior**: the disease is rare (1%), so even an accurate test produces many false positives. This is the core insight of Bayes - **context (prior) radically affects the interpretation of data**.
Now let's move to classification. Suppose an object has features x1, x2, ..., xn, and we want to determine class c. By Bayes' theorem: P(c|x1,x2,...,xn) is proportional to P(x1,x2,...,xn|c) * P(c). The problem: **computing P(x1,x2,...,xn|c) directly is infeasible** - you'd need to store joint distributions for all feature combinations. With 20 binary features, that's 2^20 = 1,048,576 parameters per class.
**The Naive assumption** solves the dimensionality problem with an elegant hack: All features are **conditionally independent** given the class: P(x1, x2, ..., xn | c) = P(x1|c) * P(x2|c) * ... * P(xn|c) Instead of 2^20 parameters, you only need 20 * K parameters (K = number of classes). This simplification is **mathematically wrong** almost always - features in real data are dependent. The words "free" and "discount" in spam are correlated. Height and weight are correlated. But **Naive Bayes works in practice!** Probability estimates are imprecise, but the *ranking* of classes is often correct - we don't need exact probabilities, we need the correct argmax.
**Zero probability problem:** in the example above P("discount" | not_spam) = 0. This zeroes out *the entire* product for the not_spam class, even if all other features point to not-spam. The solution is **Laplace smoothing**, which we'll discuss in the Multinomial NB concept.
Why is the algorithm called "Naive"?
Gaussian Naive Bayes
We've understood the general idea of Naive Bayes, but a question arises: **how do we compute P(x|class) when x is a continuous number?** For example, a person's height - 175.3 cm. The probability of *exactly* that value in a continuous distribution is zero. Gaussian Naive Bayes solves this by assuming that the values of each feature in each class follow a **normal (Gaussian) distribution**.
Training Gaussian NB is one of the fastest among all ML algorithms. You only need to **compute the mean and standard deviation** of each feature for each class. No iterations, no gradient descent, no hyperparameters to tune. A single pass through the data - and the model is ready. On a dataset of one million samples, training takes fractions of a second.
**When Gaussian NB works well:** - Features are actually approximately normally distributed - Little training data (only need to estimate mean and std) - A quick baseline for comparison with other models - High-dimensional task (many features) **When Gaussian NB works poorly:** - Features are strongly correlated (violating the "naive" assumption) - Feature distribution is far from normal (e.g., bimodal) - High accuracy near the decision boundary is required
Gaussian NB is often used as the **first baseline**: if Gaussian NB achieves 92% accuracy on the data and a complex model achieves 93%, it's worth asking whether the added complexity is justified. If Gaussian NB gives 60% while LogReg gives 90%, then the assumption of normality and independence doesn't hold for this data, and more flexible algorithms are needed.
What does a trained Gaussian Naive Bayes model store for each feature in each class?
Multinomial and Bernoulli Naive Bayes
Gaussian NB works great for continuous features, but what if the data consists of **counts** or **binary flags**? For example, how many times a word appears in a document, or whether a word is present or absent. For such data, two specialized variants exist: **Multinomial NB** (for counts) and **Bernoulli NB** (for binary features).
**Multinomial NB** models data as the result of repeated "dice rolls" - each word in a document is drawn from a multinomial distribution. If the word "discount" appears 5 times in one email, that's more informative than a single mention. Multinomial NB accounts for word **frequency** and is the standard for text classification.
**Bernoulli NB** models each feature as binary: a word is either present or absent. An important detail: Bernoulli NB **explicitly models the absence** of a feature - it computes P(word NOT present | class). For short texts (tweets, SMS) or tasks where the mere presence of a word matters more than its frequency, Bernoulli NB can outperform Multinomial.
**Laplace Smoothing:** The **alpha** parameter (default = 1) solves the critical zero-probability problem. Without smoothing: if the word "blockchain" never appeared in spam, then P("blockchain"|spam) = 0. Due to probability multiplication, **the entire class is zeroed out**, even if all other words scream "spam". With alpha = 1 (Laplace smoothing): 1 is added to every count. No probability will be zero. - alpha = 0: no smoothing (dangerous) - alpha = 1: Laplace smoothing (standard) - alpha < 1: Lidstone smoothing (less aggressive) - alpha = large number: all probabilities approach uniform
In practice, alpha is tuned via **cross-validation**. For texts, optimal values typically lie in the range 0.01--1.0. It's important to remember: Multinomial NB requires **non-negative** feature values (counts), so it can't be applied directly to TF-IDF weights without additional transformations, although sklearn handles this correctly.
Why is the alpha parameter (Laplace smoothing) needed in Multinomial Naive Bayes?
Text Classification
Naive Bayes found its main niche in **text classification** - and has no intention of leaving it. Despite decades of ML advances, from SVM to transformers, Naive Bayes remains a practical choice for many text tasks. The reason: texts create **high-dimensional sparse data** (tens of thousands of unique words, but each document contains only a small fraction of the vocabulary) - and that's exactly where the "naive" assumption works best.
**Bag of Words (BoW)** is the simplest vectorization: each document is represented as a vector of word frequencies. Word order is completely ignored - "the dog bit the man" and "the man bit the dog" get the same vector. Despite this information loss, BoW works surprisingly well for classification tasks.
**TF-IDF** improves BoW by weighting words by their informativeness. The word "and" appears everywhere - its weight is low. The word "interference" appears only in physics texts - its weight is high. TF-IDF automatically highlights words that **distinguish** one class of documents from another. Combined with Multinomial NB, this produces a powerful and fast classifier.
**Why Naive Bayes excels specifically at text:** 1. **High dimensionality:** a vocabulary of 50,000+ words = 50,000 features. Many algorithms suffer from the curse of dimensionality, but NB does not 2. **Sparse data:** each document contains a small fraction of the vocabulary. NB handles sparsity naturally 3. **Little data:** NB only needs to estimate P(word|class) for each word. LogReg and SVM need to estimate weights in a high-dimensional space 4. **Conditional independence is approximately valid:** words in text are less dependent than, say, pixels in an image. "Free" and "discount" are correlated, but not as strongly as neighboring pixels
In real-world tasks, Naive Bayes is used as the first line: **spam filters** (Gmail started with NB), **sentiment analysis** (positive/negative review), **document classification** by topic, **automatic routing** of support requests. For large-scale systems that need to classify millions of documents in real time, NB's speed becomes a critical advantage.
Naive Bayes is useless because the feature independence assumption is almost never satisfied in real data
Naive Bayes delivers competitive results in practice, especially on text, despite violating the independence assumption. Exact probabilities are wrong, but the class ranking (argmax) is often correct
For classification, what matters is not exact P(class|x), but only which class has maximum probability. Errors in probability estimates often "cancel out" when comparing classes. Research (Domingos & Pazzani, 1997) proved that NB is optimal by accuracy far more often than theory predicts
Why does the TF-IDF + MultinomialNB combination work so well on text?
Key Ideas
- **Bayes' theorem** inverts probability: instead of P(symptoms|disease) it computes P(disease|symptoms), updating the prior with new data (evidence)
- **The naive assumption** (feature independence) is mathematically incorrect, but allows training in a single pass through the data and works in practice because classification requires the argmax, not exact probabilities
- **Three NB variants:** Gaussian (continuous data, mean/std), Multinomial (counts, word frequencies), Bernoulli (binary features, word presence/absence) - choice depends on data type
- **Laplace smoothing (alpha)** prevents zero probabilities: one unseen word doesn't zero out the entire class
- **NB + TF-IDF** - the standard pipeline for text classification: loses 3-4% accuracy to LogReg and SVM but is 10-15x faster - which is exactly why Gmail started with Naive Bayes to filter 15 billion emails per day, as mentioned in the intro
Related Topics
Naive Bayes sits at the intersection of probability theory and practical classification. Here are topics that deepen the understanding:
- Mathematical Foundations of ML — Bayes' theorem, conditional probability, and distributions - the mathematical foundation on which Naive Bayes is built
- Logistic Regression — The main competitor of NB for text classification - a discriminative model that directly models P(class|features) without estimating P(features|class)
- Text Preprocessing — Tokenization, stemming, stop word removal - the steps that precede TF-IDF and Naive Bayes in a text classification pipeline
- Word Embeddings — A modern alternative to Bag of Words: dense word vectors (Word2Vec, GloVe) capture semantics that TF-IDF loses
Вопросы для размышления
- If Naive Bayes assumes feature independence, but in reality words are dependent ("New" is almost always next to "York") - why does the algorithm still work? In what situations does the violation of the independence assumption become critical?
- A doctor receives a positive test result (99% accuracy) for a rare disease (0.1% prevalence). Using Bayes' theorem, what is the actual probability that the patient is sick? How does this affect medical decision-making?
- Today BERT and transformers are increasingly used for text classification. In what scenarios is Naive Bayes still preferable, despite lower accuracy?
Связанные уроки
- ml-03-math-foundations — Bayes formula requires understanding conditional probability
- prob-04-bayes — Bayes theorem is the mathematical foundation of the classifier
- ml-10-logistic-regression — Naive Bayes and LR are two approaches to the same classification task
- ml-34-text-preprocessing — Spam filtering is the classic NB application
- it-03 — Naive Bayes loss connects to cross-entropy via log-likelihood
- ml-07-polynomial-regression — Both work with assumptions about data distributions