Natural Language Processing
Bag of Words and TF-IDF
April 2004. Gmail launches in beta with 1 GB storage - while competitors offered 250 MB. One million users in a month. All of it held together by TF-IDF: a spam filter that decides every email's fate in 2 milliseconds, before the heavy neural nets even get a shot. An algorithm invented in the 1970s was protecting a billion inboxes in 2024.
- **Elasticsearch** (BM25 = improved TF-IDF) - the search engine for Wikipedia, GitHub, Netflix, Stack Overflow
- **Gmail spam filters** use TF-IDF + Naive Bayes as a first fast filter before heavy neural networks
- **Amazon recommendation systems** compare TF-IDF of product descriptions for the "Similar items" section
Gerard Salton and the Invention of TF-IDF
Gerard Salton at Cornell University built SMART (System for the Mechanical Analysis and Retrieval of Text) - the first automated information retrieval system. TF-IDF crystallized out of this work in 1972. Salton wanted an answer to: how can a machine tell that a document about quantum mechanics is relevant to a query, while one about "the" is not? The answer turned out to be mathematically compact: penalize frequent words, reward rare ones. Fifty years later the algorithm sits inside Elasticsearch, chewing through trillions of queries per year.
Предварительные знания
Bag of Words
**"The dog bit the man"** and **"The man bit the dog"** - for BoW these are IDENTICAL texts. Surprised? **Bag of Words** is one of the simplest yet effective text representations. The idea: ditch word order and just count how often each word appears.
**BoW's biggest casualty: word order.** "The dog bit the man" (news) and "The man bit the dog" (sensation) end up with identical vectors. For tasks where order is critical (translation, generation), BoW is a non-starter. For topic classification - it works great.
**Bigrams and trigrams** (n-grams) are a simple BoW extension that partially restores order. "not good" as a single token carries more information than "not" + "good" separately. But vocabulary size explodes.
**CountVectorizer** in sklearn handles tokenization, lowercasing, and frequency counting automatically. Knobs: `max_features` (cap vocabulary), `min_df`/`max_df` (filter by frequency), `stop_words` (drop stop words), `ngram_range` (n-grams).
Two documents: "cat eats fish" and "fish eats cat". Which statement is true for BoW (CountVectorizer)?
TF-IDF
The word "the" appears in every English document. The word "quantum" appears in maybe one in a thousand. Which word is more useful for search? **TF-IDF** (Term Frequency - Inverse Document Frequency) fixes BoW's central problem: not all words pull equal weight. Words that show up everywhere get low weight; rare and unique ones get high weight.
**TF-IDF formula:** `TF-IDF(t, d) = TF(t, d) x IDF(t)` - **TF** measures word importance *within* the document (local weight) - **IDF** measures word uniqueness *across the entire collection* (global weight) - Net result: high weight for words that are frequent in THIS document but rare elsewhere
| Word | Appears in documents | IDF weight | Interpretation |
|---|---|---|---|
| "the" | 9999 out of 10000 | ~0.0001 | Useless for search |
| "machine" | 500 out of 10000 | ~3.0 | Medium informativeness |
| "quantum" | 10 out of 10000 | ~6.9 | Very informative |
| "supercalifragilistic" | 1 out of 10000 | ~9.2 | Unique document marker |
**sklearn TfidfVectorizer** defaults to a smoothed IDF: `log((1+N)/(1+df)) + 1`. Prevents division by zero and produces slightly different numbers than the classical formula. Pass `smooth_idf=False` to turn smoothing off.
In a collection of 1000 documents, the word "algorithm" appears in 10 documents, and the word "and" in 999. Which word will have a higher TF-IDF weight (given the same TF)?
Document-Term Matrix
BoW and TF-IDF produce a **Document-Term Matrix** - a table where rows = documents, columns = terms, values = frequencies or TF-IDF weights. This matrix is the foundation for finding similar documents, clustering, and text classification.
**Cosine Similarity** measures the angle between two vectors: `cos(A, B) = (A . B) / (||A|| x ||B||)`. Values run from 0 (nothing in common) to 1 (identical). Unlike Euclidean distance, cosine similarity does not depend on document length - a long document and its summary still come out similar.
**Real systems run on exactly this approach.** Elasticsearch ships BM25 under the hood - an improved TF-IDF. Google started with TF-IDF (PageRank added link analysis, but text search was TF-IDF).
Document-Term Matrix size scales with vocabulary. For 10,000 documents and 50,000 unique words - the matrix is 10,000 x 50,000 = **500 million cells**. The vast majority of them are zeros. How to store this efficiently? The answer comes in the next concept.
Cosine similarity between two documents = 0.92. What does this mean?
Sparse Matrices
Document-Term Matrix for a real corpus: 100,000 documents x 200,000 words = **20 billion** cells. At 8 bytes per float64 - that is 160 GB of RAM. But each document contains only ~200 unique words out of 200,000. **99.9% of the matrix is zeros**. Storing them is madness.
| Format | Full name | Fast operation | When to use |
|---|---|---|---|
| CSR | Compressed Sparse Row | Row slicing, matrix multiplication | After construction - for computations |
| CSC | Compressed Sparse Column | Column slicing | Column operations |
| COO | Coordinate | Building the matrix | During creation - then convert |
| LIL | List of Lists | Incremental filling | Building element by element |
**Never call `.toarray()` on large sparse matrices.** A 100K-document corpus can torch all RAM and kill the process. Every sklearn operation (fit, predict, cosine_similarity) works on sparse matrices directly.
**Rule of thumb:** if matrix density < 10%, sparse format saves memory. NLP matrices typically run at 0.1-2% density - savings of 50-1000x. Always check: `print(f"Density: {matrix.nnz / (matrix.shape[0] * matrix.shape[1]):.2%}")`.
Sparse matrices are not just about memory savings. Operations (multiplication, cosine similarity) on sparse data run faster too - zeros get skipped. On a 100K x 50K matrix with 0.1% density, sparse cosine_similarity runs **~100 times faster** than dense.
BoW and TF-IDF are outdated - in the era of BERT and GPT they are useless
BoW/TF-IDF remain practical tools for many tasks. Elasticsearch (the search engine behind Google, Amazon, Wikipedia) uses BM25 - an evolution of TF-IDF. For classifying short texts, TF-IDF + SVM often matches BERT while running 100-1000x faster and requiring no GPU.
Transformers win at tasks requiring deep context: translation, QA, dialogue. But for keyword search, spam filtering, topic classification - TF-IDF is sufficient and far cheaper. On production systems with millions of requests per second, the inference cost of BERT may be prohibitive.
Document-Term Matrix: 50,000 documents x 100,000 words. Density - 0.1%. How much memory is needed in dense vs sparse (float64, 8 bytes)?
Key Ideas
- **Bag of Words** turns text into a word frequency vector, losing order. "The dog bit the man" = "The man bit the dog" - remember this paradox? N-grams partially solve the problem
- **TF-IDF** weights words: common everywhere ("the") get weight ~0, unique to the document - high weight. Formula: TF x IDF
- **Document-Term Matrix** + **Cosine Similarity** = a simple but capable search engine. Google started with this
- **Sparse matrices** save 100-1000x memory for NLP data, where 99%+ elements are zeros
Related Topics
BoW and TF-IDF are the foundation for more advanced methods:
- Regular Expressions and Text — Preprocessing and text normalization before vectorization
- Word Embeddings — Next level: dense vectors (Word2Vec, GloVe) instead of sparse TF-IDF
Вопросы для размышления
- Why is cosine similarity better than Euclidean distance for comparing documents of different lengths?
- In what situations will BoW with bigrams be better than BoW with unigrams - and when worse?
- If TF-IDF is still used in 2026 (BM25 in Elasticsearch), why is it often called "outdated"?
Связанные уроки
- nlp-02 — Text preprocessing and regex come before vectorization
- nlp-04 — Word2Vec and GloVe are the next level after sparse TF-IDF
- ml-01 — TF-IDF vectors feed directly into classifiers like SVM and Naive Bayes
- it-01 — IDF weighting is a special case of information content per token
- alg-01 — Big-O intuition helps evaluate cost of sparse matrix operations
- prob-01-intro — TF-IDF plus Naive Bayes requires understanding conditional probabilities
- ml-01-intro