Natural Language Processing

Bag of Words and TF-IDF

April 2004. Gmail launches in beta with 1 GB storage - while competitors offered 250 MB. One million users in a month. All of it held together by TF-IDF: a spam filter that decides every email's fate in 2 milliseconds, before the heavy neural nets even get a shot. An algorithm invented in the 1970s was protecting a billion inboxes in 2024.

**Elasticsearch** (BM25 = improved TF-IDF) - the search engine for Wikipedia, GitHub, Netflix, Stack Overflow
**Gmail spam filters** use TF-IDF + Naive Bayes as a first fast filter before heavy neural networks
**Amazon recommendation systems** compare TF-IDF of product descriptions for the "Similar items" section

Gerard Salton and the Invention of TF-IDF

Gerard Salton at Cornell University built SMART (System for the Mechanical Analysis and Retrieval of Text) - the first automated information retrieval system. TF-IDF crystallized out of this work in 1972. Salton wanted an answer to: how can a machine tell that a document about quantum mechanics is relevant to a query, while one about "the" is not? The answer turned out to be mathematically compact: penalize frequent words, reward rare ones. Fifty years later the algorithm sits inside Elasticsearch, chewing through trillions of queries per year.

Предварительные знания

Regular Expressions and Text

Bag of Words

**"The dog bit the man"** and **"The man bit the dog"** - for BoW these are IDENTICAL texts. Surprised? **Bag of Words** is one of the simplest yet effective text representations. The idea: ditch word order and just count how often each word appears.

**BoW's biggest casualty: word order.** "The dog bit the man" (news) and "The man bit the dog" (sensation) end up with identical vectors. For tasks where order is critical (translation, generation), BoW is a non-starter. For topic classification - it works great.

**Bigrams and trigrams** (n-grams) are a simple BoW extension that partially restores order. "not good" as a single token carries more information than "not" + "good" separately. But vocabulary size explodes.

**CountVectorizer** in sklearn handles tokenization, lowercasing, and frequency counting automatically. Knobs: `max_features` (cap vocabulary), `min_df`/`max_df` (filter by frequency), `stop_words` (drop stop words), `ngram_range` (n-grams).

Two documents: "cat eats fish" and "fish eats cat". Which statement is true for BoW (CountVectorizer)?

TF-IDF

The word "the" appears in every English document. The word "quantum" appears in maybe one in a thousand. Which word is more useful for search? **TF-IDF** (Term Frequency - Inverse Document Frequency) fixes BoW's central problem: not all words pull equal weight. Words that show up everywhere get low weight; rare and unique ones get high weight.

**TF-IDF formula:** `TF-IDF(t, d) = TF(t, d) x IDF(t)` - **TF** measures word importance *within* the document (local weight) - **IDF** measures word uniqueness *across the entire collection* (global weight) - Net result: high weight for words that are frequent in THIS document but rare elsewhere

Word	Appears in documents	IDF weight	Interpretation
"the"	9999 out of 10000	~0.0001	Useless for search
"machine"	500 out of 10000	~3.0	Medium informativeness
"quantum"	10 out of 10000	~6.9	Very informative
"supercalifragilistic"	1 out of 10000	~9.2	Unique document marker

**sklearn TfidfVectorizer** defaults to a smoothed IDF: `log((1+N)/(1+df)) + 1`. Prevents division by zero and produces slightly different numbers than the classical formula. Pass `smooth_idf=False` to turn smoothing off.

In a collection of 1000 documents, the word "algorithm" appears in 10 documents, and the word "and" in 999. Which word will have a higher TF-IDF weight (given the same TF)?

Document-Term Matrix

BoW and TF-IDF produce a **Document-Term Matrix** - a table where rows = documents, columns = terms, values = frequencies or TF-IDF weights. This matrix is the foundation for finding similar documents, clustering, and text classification.

**Cosine Similarity** measures the angle between two vectors: `cos(A, B) = (A . B) / (||A|| x ||B||)`. Values run from 0 (nothing in common) to 1 (identical). Unlike Euclidean distance, cosine similarity does not depend on document length - a long document and its summary still come out similar.

**Real systems run on exactly this approach.** Elasticsearch ships BM25 under the hood - an improved TF-IDF. Google started with TF-IDF (PageRank added link analysis, but text search was TF-IDF).

Document-Term Matrix size scales with vocabulary. For 10,000 documents and 50,000 unique words - the matrix is 10,000 x 50,000 = **500 million cells**. The vast majority of them are zeros. How to store this efficiently? The answer comes in the next concept.

Cosine similarity between two documents = 0.92. What does this mean?

Sparse Matrices

Document-Term Matrix for a real corpus: 100,000 documents x 200,000 words = **20 billion** cells. At 8 bytes per float64 - that is 160 GB of RAM. But each document contains only ~200 unique words out of 200,000. **99.9% of the matrix is zeros**. Storing them is madness.

Format	Full name	Fast operation	When to use
CSR	Compressed Sparse Row	Row slicing, matrix multiplication	After construction - for computations
CSC	Compressed Sparse Column	Column slicing	Column operations
COO	Coordinate	Building the matrix	During creation - then convert
LIL	List of Lists	Incremental filling	Building element by element

**Never call `.toarray()` on large sparse matrices.** A 100K-document corpus can torch all RAM and kill the process. Every sklearn operation (fit, predict, cosine_similarity) works on sparse matrices directly.

**Rule of thumb:** if matrix density < 10%, sparse format saves memory. NLP matrices typically run at 0.1-2% density - savings of 50-1000x. Always check: `print(f"Density: {matrix.nnz / (matrix.shape[0] * matrix.shape[1]):.2%}")`.

Sparse matrices are not just about memory savings. Operations (multiplication, cosine similarity) on sparse data run faster too - zeros get skipped. On a 100K x 50K matrix with 0.1% density, sparse cosine_similarity runs **~100 times faster** than dense.

BoW and TF-IDF are outdated - in the era of BERT and GPT they are useless

BoW/TF-IDF remain practical tools for many tasks. Elasticsearch (the search engine behind Google, Amazon, Wikipedia) uses BM25 - an evolution of TF-IDF. For classifying short texts, TF-IDF + SVM often matches BERT while running 100-1000x faster and requiring no GPU.

Transformers win at tasks requiring deep context: translation, QA, dialogue. But for keyword search, spam filtering, topic classification - TF-IDF is sufficient and far cheaper. On production systems with millions of requests per second, the inference cost of BERT may be prohibitive.

Document-Term Matrix: 50,000 documents x 100,000 words. Density - 0.1%. How much memory is needed in dense vs sparse (float64, 8 bytes)?

Key Ideas

**Bag of Words** turns text into a word frequency vector, losing order. "The dog bit the man" = "The man bit the dog" - remember this paradox? N-grams partially solve the problem
**TF-IDF** weights words: common everywhere ("the") get weight ~0, unique to the document - high weight. Formula: TF x IDF
**Document-Term Matrix** + **Cosine Similarity** = a simple but capable search engine. Google started with this
**Sparse matrices** save 100-1000x memory for NLP data, where 99%+ elements are zeros

Вопросы для размышления

Why is cosine similarity better than Euclidean distance for comparing documents of different lengths?
In what situations will BoW with bigrams be better than BoW with unigrams - and when worse?
If TF-IDF is still used in 2026 (BM25 in Elasticsearch), why is it often called "outdated"?

Связанные уроки

nlp-02 — Text preprocessing and regex come before vectorization
nlp-04 — Word2Vec and GloVe are the next level after sparse TF-IDF
ml-01 — TF-IDF vectors feed directly into classifiers like SVM and Naive Bayes
it-01 — IDF weighting is a special case of information content per token
alg-01 — Big-O intuition helps evaluate cost of sparse matrix operations
prob-01-intro — TF-IDF plus Naive Bayes requires understanding conditional probabilities
ml-01-intro

Natural Language Processing

Bag of Words and TF-IDF

**Elasticsearch** (BM25 = improved TF-IDF) - the search engine for Wikipedia, GitHub, Netflix, Stack Overflow
**Gmail spam filters** use TF-IDF + Naive Bayes as a first fast filter before heavy neural networks
**Amazon recommendation systems** compare TF-IDF of product descriptions for the "Similar items" section

Gerard Salton and the Invention of TF-IDF

Предварительные знания

Regular Expressions and Text

Bag of Words

Two documents: "cat eats fish" and "fish eats cat". Which statement is true for BoW (CountVectorizer)?

TF-IDF

Word	Appears in documents	IDF weight	Interpretation
"the"	9999 out of 10000	~0.0001	Useless for search
"machine"	500 out of 10000	~3.0	Medium informativeness
"quantum"	10 out of 10000	~6.9	Very informative
"supercalifragilistic"	1 out of 10000	~9.2	Unique document marker

In a collection of 1000 documents, the word "algorithm" appears in 10 documents, and the word "and" in 999. Which word will have a higher TF-IDF weight (given the same TF)?

Document-Term Matrix

**Real systems run on exactly this approach.** Elasticsearch ships BM25 under the hood - an improved TF-IDF. Google started with TF-IDF (PageRank added link analysis, but text search was TF-IDF).

Cosine similarity between two documents = 0.92. What does this mean?

Sparse Matrices

Format	Full name	Fast operation	When to use
CSR	Compressed Sparse Row	Row slicing, matrix multiplication	After construction - for computations
CSC	Compressed Sparse Column	Column slicing	Column operations
COO	Coordinate	Building the matrix	During creation - then convert
LIL	List of Lists	Incremental filling	Building element by element

BoW and TF-IDF are outdated - in the era of BERT and GPT they are useless

Document-Term Matrix: 50,000 documents x 100,000 words. Density - 0.1%. How much memory is needed in dense vs sparse (float64, 8 bytes)?

Key Ideas

**Bag of Words** turns text into a word frequency vector, losing order. "The dog bit the man" = "The man bit the dog" - remember this paradox? N-grams partially solve the problem
**TF-IDF** weights words: common everywhere ("the") get weight ~0, unique to the document - high weight. Formula: TF x IDF
**Document-Term Matrix** + **Cosine Similarity** = a simple but capable search engine. Google started with this
**Sparse matrices** save 100-1000x memory for NLP data, where 99%+ elements are zeros

Вопросы для размышления

Why is cosine similarity better than Euclidean distance for comparing documents of different lengths?
In what situations will BoW with bigrams be better than BoW with unigrams - and when worse?
If TF-IDF is still used in 2026 (BM25 in Elasticsearch), why is it often called "outdated"?

Связанные уроки

nlp-02 — Text preprocessing and regex come before vectorization
nlp-04 — Word2Vec and GloVe are the next level after sparse TF-IDF
ml-01 — TF-IDF vectors feed directly into classifiers like SVM and Naive Bayes
it-01 — IDF weighting is a special case of information content per token
alg-01 — Big-O intuition helps evaluate cost of sparse matrix operations
prob-01-intro — TF-IDF plus Naive Bayes requires understanding conditional probabilities
ml-01-intro

Bag of Words and TF-IDF

Gerard Salton and the Invention of TF-IDF

Предварительные знания

Bag of Words

TF-IDF

Document-Term Matrix

Sparse Matrices

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Bag of Words and TF-IDF

Gerard Salton and the Invention of TF-IDF

Предварительные знания

Bag of Words

TF-IDF

Document-Term Matrix

Sparse Matrices

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки