Natural Language Processing

Bag of Words and TF-IDF

April 2004. Gmail launches in beta with 1 GB storage - while competitors offered 250 MB. One million users in a month. All of it held together by TF-IDF: a spam filter that decides every email's fate in 2 milliseconds, before the heavy neural nets even get a shot. An algorithm invented in the 1970s was protecting a billion inboxes in 2024.

  • **Elasticsearch** (BM25 = improved TF-IDF) - the search engine for Wikipedia, GitHub, Netflix, Stack Overflow
  • **Gmail spam filters** use TF-IDF + Naive Bayes as a first fast filter before heavy neural networks
  • **Amazon recommendation systems** compare TF-IDF of product descriptions for the "Similar items" section

Gerard Salton and the Invention of TF-IDF

Gerard Salton at Cornell University built SMART (System for the Mechanical Analysis and Retrieval of Text) - the first automated information retrieval system. TF-IDF crystallized out of this work in 1972. Salton wanted an answer to: how can a machine tell that a document about quantum mechanics is relevant to a query, while one about "the" is not? The answer turned out to be mathematically compact: penalize frequent words, reward rare ones. Fifty years later the algorithm sits inside Elasticsearch, chewing through trillions of queries per year.

Предварительные знания

  • Regular Expressions and Text

Bag of Words

**"The dog bit the man"** and **"The man bit the dog"** - for BoW these are IDENTICAL texts. Surprised? **Bag of Words** is one of the simplest yet effective text representations. The idea: ditch word order and just count how often each word appears.

**BoW's biggest casualty: word order.** "The dog bit the man" (news) and "The man bit the dog" (sensation) end up with identical vectors. For tasks where order is critical (translation, generation), BoW is a non-starter. For topic classification - it works great.

**Bigrams and trigrams** (n-grams) are a simple BoW extension that partially restores order. "not good" as a single token carries more information than "not" + "good" separately. But vocabulary size explodes.

**CountVectorizer** in sklearn handles tokenization, lowercasing, and frequency counting automatically. Knobs: `max_features` (cap vocabulary), `min_df`/`max_df` (filter by frequency), `stop_words` (drop stop words), `ngram_range` (n-grams).

Two documents: "cat eats fish" and "fish eats cat". Which statement is true for BoW (CountVectorizer)?

TF-IDF

The word "the" appears in every English document. The word "quantum" appears in maybe one in a thousand. Which word is more useful for search? **TF-IDF** (Term Frequency - Inverse Document Frequency) fixes BoW's central problem: not all words pull equal weight. Words that show up everywhere get low weight; rare and unique ones get high weight.

**TF-IDF formula:** `TF-IDF(t, d) = TF(t, d) x IDF(t)` - **TF** measures word importance *within* the document (local weight) - **IDF** measures word uniqueness *across the entire collection* (global weight) - Net result: high weight for words that are frequent in THIS document but rare elsewhere

WordAppears in documentsIDF weightInterpretation
"the"9999 out of 10000~0.0001Useless for search
"machine"500 out of 10000~3.0Medium informativeness
"quantum"10 out of 10000~6.9Very informative
"supercalifragilistic"1 out of 10000~9.2Unique document marker

**sklearn TfidfVectorizer** defaults to a smoothed IDF: `log((1+N)/(1+df)) + 1`. Prevents division by zero and produces slightly different numbers than the classical formula. Pass `smooth_idf=False` to turn smoothing off.

In a collection of 1000 documents, the word "algorithm" appears in 10 documents, and the word "and" in 999. Which word will have a higher TF-IDF weight (given the same TF)?

Document-Term Matrix

BoW and TF-IDF produce a **Document-Term Matrix** - a table where rows = documents, columns = terms, values = frequencies or TF-IDF weights. This matrix is the foundation for finding similar documents, clustering, and text classification.

**Cosine Similarity** measures the angle between two vectors: `cos(A, B) = (A . B) / (||A|| x ||B||)`. Values run from 0 (nothing in common) to 1 (identical). Unlike Euclidean distance, cosine similarity does not depend on document length - a long document and its summary still come out similar.

**Real systems run on exactly this approach.** Elasticsearch ships BM25 under the hood - an improved TF-IDF. Google started with TF-IDF (PageRank added link analysis, but text search was TF-IDF).

Document-Term Matrix size scales with vocabulary. For 10,000 documents and 50,000 unique words - the matrix is 10,000 x 50,000 = **500 million cells**. The vast majority of them are zeros. How to store this efficiently? The answer comes in the next concept.

Cosine similarity between two documents = 0.92. What does this mean?

Sparse Matrices

Document-Term Matrix for a real corpus: 100,000 documents x 200,000 words = **20 billion** cells. At 8 bytes per float64 - that is 160 GB of RAM. But each document contains only ~200 unique words out of 200,000. **99.9% of the matrix is zeros**. Storing them is madness.

FormatFull nameFast operationWhen to use
CSRCompressed Sparse RowRow slicing, matrix multiplicationAfter construction - for computations
CSCCompressed Sparse ColumnColumn slicingColumn operations
COOCoordinateBuilding the matrixDuring creation - then convert
LILList of ListsIncremental fillingBuilding element by element

**Never call `.toarray()` on large sparse matrices.** A 100K-document corpus can torch all RAM and kill the process. Every sklearn operation (fit, predict, cosine_similarity) works on sparse matrices directly.

**Rule of thumb:** if matrix density < 10%, sparse format saves memory. NLP matrices typically run at 0.1-2% density - savings of 50-1000x. Always check: `print(f"Density: {matrix.nnz / (matrix.shape[0] * matrix.shape[1]):.2%}")`.

Sparse matrices are not just about memory savings. Operations (multiplication, cosine similarity) on sparse data run faster too - zeros get skipped. On a 100K x 50K matrix with 0.1% density, sparse cosine_similarity runs **~100 times faster** than dense.

BoW and TF-IDF are outdated - in the era of BERT and GPT they are useless

BoW/TF-IDF remain practical tools for many tasks. Elasticsearch (the search engine behind Google, Amazon, Wikipedia) uses BM25 - an evolution of TF-IDF. For classifying short texts, TF-IDF + SVM often matches BERT while running 100-1000x faster and requiring no GPU.

Transformers win at tasks requiring deep context: translation, QA, dialogue. But for keyword search, spam filtering, topic classification - TF-IDF is sufficient and far cheaper. On production systems with millions of requests per second, the inference cost of BERT may be prohibitive.

Document-Term Matrix: 50,000 documents x 100,000 words. Density - 0.1%. How much memory is needed in dense vs sparse (float64, 8 bytes)?

Key Ideas

  • **Bag of Words** turns text into a word frequency vector, losing order. "The dog bit the man" = "The man bit the dog" - remember this paradox? N-grams partially solve the problem
  • **TF-IDF** weights words: common everywhere ("the") get weight ~0, unique to the document - high weight. Formula: TF x IDF
  • **Document-Term Matrix** + **Cosine Similarity** = a simple but capable search engine. Google started with this
  • **Sparse matrices** save 100-1000x memory for NLP data, where 99%+ elements are zeros

Related Topics

BoW and TF-IDF are the foundation for more advanced methods:

  • Regular Expressions and Text — Preprocessing and text normalization before vectorization
  • Word Embeddings — Next level: dense vectors (Word2Vec, GloVe) instead of sparse TF-IDF

Вопросы для размышления

  • Why is cosine similarity better than Euclidean distance for comparing documents of different lengths?
  • In what situations will BoW with bigrams be better than BoW with unigrams - and when worse?
  • If TF-IDF is still used in 2026 (BM25 in Elasticsearch), why is it often called "outdated"?

Связанные уроки

  • nlp-02 — Text preprocessing and regex come before vectorization
  • nlp-04 — Word2Vec and GloVe are the next level after sparse TF-IDF
  • ml-01 — TF-IDF vectors feed directly into classifiers like SVM and Naive Bayes
  • it-01 — IDF weighting is a special case of information content per token
  • alg-01 — Big-O intuition helps evaluate cost of sparse matrix operations
  • prob-01-intro — TF-IDF plus Naive Bayes requires understanding conditional probabilities
  • ml-01-intro
Bag of Words and TF-IDF

0

1

Sign In