Natural Language Processing
Text Classification
Every piece of text a company receives - a support ticket, a review, a tweet - needs to be routed, prioritized, or analyzed at scale. Google processes roughly 8.5 billion search queries per day, and each one triggers chains of classifiers that decide relevance, safety, and language. The same kind of pipeline that filters spam for 1.8 billion Gmail inboxes also powers toxic-content detection on YouTube. Mastering text classification means mastering the entry point to almost every NLP product in production.
- **Gmail spam filter** uses a Naive Bayes-rooted ensemble that blocks ~99.9% of spam for 1.8 billion users, processing over 100 billion messages per day - one of the largest text classifiers ever deployed.
- **Airbnb content moderation** uses a fine-tuned BERT model to classify listing descriptions and messages for policy violations, reducing human review load by 60% while improving recall on edge-case violations.
- **Bloomberg Terminal** classifies 400,000+ news articles per day by topic, company, and sentiment using LinearSVC pipelines - latency under 5ms per article is a hard business requirement for financial trading signals.
Prerequisites
- Text vectorization: Bag of Words and TF-IDF
- Classification basics in ML: train/test splits, metrics
- Basic Python and working with scikit-learn
From Naive Bayes to fastText
Text classification was one of the first widely deployed applied NLP tasks. Early spam filters relied on the naive Bayes classifier, a simple probabilistic model that scored how typical each word was of spam. In 1998 Thorsten Joachims showed that support vector machines work very well on text because they handle the high-dimensional sparse features that TF-IDF produces. That made SVMs the standard for document classification for years. In 2016 a Facebook AI team led by Armand Joulin released fastText: the library trains a linear classifier over averaged word and subword embeddings, reaching accuracy close to deep networks while training orders of magnitude faster. fastText made the case that for many tasks a simple, fast model is the better choice than a heavy one.
Naive Bayes
Naive Bayes applies Bayes' theorem with the 'naive' conditional independence assumption: every feature is treated as independent of every other feature given the class label. Words in real text are clearly not independent, so the estimated probabilities are often poorly calibrated. It still stays a strong baseline on text because classification only needs the correct class to score highest, and the errors introduced by the independence assumption often do not change that ranking.
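Multinomial Naive Bayes is the standard variant for text: it models word counts per document. For a new document, the classifier computes P(class | words), which is proportional to P(class) multiplied by the product of P(word | class) over every word in the document. In practice the logarithms of these probabilities are summed instead of multiplying the raw values, which avoids floating-point underflow on long documents.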
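Written out, the scoring rule described above looks roughly like this. The symbols are introduced here for illustration: c is a class, w_1 ... w_n are the words of the document, and the predicted class is written with a hat.

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)

\hat{c} \;=\; \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]
```

Because the logarithm is monotonic, the class that maximizes the sum of logs is the same class that maximizes the original product.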
Laplace smoothing (alpha=1.0) adds a pseudo-count to every word in every class, which prevents zero probabilities for words unseen in a class during training. Without it, a single word that never appeared with a class zeros out the entire document probability for that class.
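A minimal from-scratch sketch of Multinomial Naive Bayes with Laplace smoothing and log-space scoring, assuming pre-tokenized documents. The function names, the toy dataset, and the choice to skip words outside the training vocabulary are illustrative assumptions, not a reference implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = {word for doc in docs for word in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        # alpha > 0 keeps every vocabulary word at a non-zero probability in every class
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest summed log score."""
    scores = {}
    for c in log_prior:
        # Sum log-probabilities instead of multiplying raw probabilities.
        # Words outside the training vocabulary are skipped (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy, made-up training set
docs = [
    "win free prize now".split(),
    "free money win".split(),
    "meeting agenda for monday".split(),
    "project meeting notes".split(),
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels, alpha=1.0)
prediction = predict("free prize meeting".split(), log_prior, log_likelihood)
print(prediction)
```

In this toy data the word "meeting" never occurs in a spam document. With alpha=1.0 it still receives a small non-zero probability under the spam class, so the two spam-typical words in the test document can outweigh it.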
Why does Naive Bayes need Laplace smoothing when applied to text?
SVM for Text
Support Vector Machines find the maximum-margin hyperplane separating classes in feature space. For text classification with TF-IDF features, linear SVMs usually match or outperform kernel SVMs because text feature spaces are already high-dimensional and sparse, so the classes are often close to linearly separable - a kernel transformation rarely helps and increases training and inference cost.
LinearSVC (liblinear) trains orders of magnitude faster than SVC with RBF kernel on text. The regularization parameter C controls the margin-error tradeoff: small C = wide margin, more misclassifications allowed; large C = narrow margin, fits training data closely. C=1.0 is a solid default for most text tasks.
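The role of C can be read off the standard soft-margin objective, sketched here with hinge loss. The notation is introduced for illustration: w and b are the hyperplane parameters, x_i the feature vectors, y_i the labels in {-1, +1}, and N the number of training examples. Note that scikit-learn's LinearSVC uses the squared hinge loss by default, so its exact objective differs slightly from this form.

```latex
\min_{w,\, b} \;\; \frac{1}{2} \lVert w \rVert^{2} \;+\; C \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i \,(w^{\top} x_i + b)\big)
```

The first term rewards a wide margin and the second penalizes margin violations. A small C lets the margin term dominate, while a large C makes training errors expensive.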
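Setting sublinear_tf=True in scikit-learn's TfidfVectorizer replaces raw term frequency tf with 1 + log(tf), compressing the effect of extremely frequent terms: a word that appears 100 times gets a term-frequency weight of about 5.6 instead of 100 (before IDF weighting and normalization). This often improves SVM performance by 1-3% on news/review datasets.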
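A minimal scikit-learn sketch that combines the pieces from this section: TF-IDF features with sublinear_tf=True feeding a LinearSVC with C=1.0. The support-ticket texts and the two label names are invented for illustration, and a dataset this small says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data: two classes of support tickets
texts = [
    "I was charged twice for my subscription",
    "please refund my last payment",
    "invoice shows the wrong amount",
    "refund the duplicate charge on my card",
    "the app crashes when I open settings",
    "cannot log in after the update",
    "error message appears on startup",
    "the app freezes after the latest update",
]
labels = ["billing"] * 4 + ["technical"] * 4

# TF-IDF with log-scaled term frequency, followed by a linear SVM
clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LinearSVC(C=1.0),
)
clf.fit(texts, labels)

predictions = clf.predict(["refund my payment", "app crashes on startup"])
print(list(predictions))
```

Wrapping both steps in one pipeline keeps the vectorizer's vocabulary and IDF weights tied to the training data, so the same transformation is applied at prediction time.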
Why is LinearSVC preferred over SVC(kernel='rbf') for text classification?
CNN for Text
Kim (2014) showed that a simple single-layer CNN improved on the state of the art on 4 of 7 benchmark tasks at the time. The key idea: 1-D convolutions over word embeddings with multiple filter widths (3, 4, and 5 words in the paper) act as n-gram detectors. Max-over-time pooling then keeps the strongest activation per filter, so the output has a fixed size regardless of sentence length.
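The convolution and pooling steps can be sketched in plain Python. This is an illustration of the mechanism only, with assumptions: random values stand in for word embeddings and learned filter weights, there is one filter per width (3, 4, 5), bias terms are omitted, and every sentence is at least as long as the widest filter (real implementations pad shorter inputs).

```python
import random


def conv1d_over_words(embeddings, kernel):
    """Slide a kernel of width h over a sentence; return one ReLU activation per window."""
    h = len(kernel)
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        # Dot product between the kernel and the h stacked word vectors
        score = sum(
            w_val * k_val
            for word_vec, kernel_vec in zip(window, kernel)
            for w_val, k_val in zip(word_vec, kernel_vec)
        )
        activations.append(max(0.0, score))  # ReLU
    return activations


def sentence_features(embeddings, kernels):
    """Max-over-time pooling: keep the single largest activation of each kernel."""
    return [max(conv1d_over_words(embeddings, k)) for k in kernels]


rng = random.Random(0)
dim = 4  # toy embedding size


def random_matrix(rows):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


kernels = [random_matrix(h) for h in (3, 4, 5)]  # one kernel per filter width
short_sentence = random_matrix(6)    # 6 "words"
long_sentence = random_matrix(20)    # 20 "words"

short_vec = sentence_features(short_sentence, kernels)
long_vec = sentence_features(long_sentence, kernels)
print(len(short_vec), len(long_vec))
```

Both sentences end up as vectors with one entry per filter, even though the convolution produced a different number of window activations for each. That fixed-size vector is what gets passed to the final classification layer.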
Pre-training the embedding layer on Word2Vec/GloVe and keeping it frozen (CNN-static) usually outperforms random initialization when training data is small (< 50k examples). Fine-tuning embeddings (CNN-non-static) wins when data is abundant.
What is the role of max-over-time pooling in Kim's TextCNN?
Fine-Tuning for Classification
Fine-tuning a pretrained transformer (BERT, RoBERTa, DeBERTa) for text classification means adding a linear head on the final hidden state of the [CLS] token and updating all weights jointly on labeled data. This typically outperforms feature-based approaches, where the encoder stays frozen, because every layer can adapt to the target domain's linguistic patterns.
Key training recipe: learning rate 2e-5 to 5e-5, warmup over 10% of steps, batch size 16-32, 3-5 epochs. Using a smaller LR than standard supervised learning is critical - the pretrained weights encode general knowledge that should only shift gradually.
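The warmup part of the recipe can be sketched as a small schedule function. The linear decay to zero after warmup is an assumption here (it is a common pairing, but the recipe above only specifies the warmup), and the function name and the 1,000-step run are hypothetical.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup to peak_lr, then linear decay to zero (the decay is an assumption)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up from 0 so early updates do not disturb the pretrained weights
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # illustrative run length
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]

for s in (0, 50, 100, 550, 1000):
    print(s, schedule[s])
```

With these settings the learning rate climbs to 2e-5 over the first 100 steps and then shrinks steadily, so the largest updates happen only after the optimizer statistics have had time to stabilize.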
Misconception: fine-tuning always requires thousands of labeled examples to beat classical ML.
Reality: with only 100-500 labeled examples, few-shot fine-tuning of RoBERTa-large can match LinearSVC trained on roughly 10x more data.
Why: pretrained representations capture rich linguistic structure; the classifier head needs only a small signal to align that structure with the target labels.
Why is the learning rate kept very small (2e-5 to 5e-5) when fine-tuning BERT?
Key Ideas
- **Naive Bayes** is the fastest and most interpretable baseline - it trains in seconds, beats random guessing by a wide margin, and is good for anomaly detection and initial benchmarking.
- **LinearSVC + TF-IDF** is the production workhorse for constrained environments: small data, strict latency budgets, or limited GPU access - it often reaches 90%+ accuracy on clean datasets.
- **Fine-tuned transformers** (RoBERTa, DeBERTa) set the accuracy ceiling - typically 3-5 percentage points better than classical methods on most benchmarks, at the cost of roughly 100-1000x more inference compute.
Related Topics
Text classification connects to multiple areas of NLP and ML:
- TF-IDF and Bag of Words — Provides the sparse feature representations that Naive Bayes and SVM operate on
- BERT and Masked Language Models — The pretrained model that fine-tuning for classification builds on top of
Questions for Reflection
- A company has 500 labeled customer support tickets and needs a classifier deployed this week with no GPU budget - which approach and why?
- How would the choice of classifier change if the labels are highly imbalanced (98% class A, 2% class B)?
- When fine-tuning BERT underperforms LinearSVC on a specific dataset, what are the most likely explanations?
Related Lessons
- nlp-03 — TF-IDF features feed Naive Bayes and SVM classifiers
- nlp-12 — Fine-tuned BERT replaced classic text classifiers
- nlp-09 — Sentiment analysis is a special case of text classification
- ml-15-naive-bayes — Naive Bayes is the classic baseline classifier
- ml-13-svm — Linear SVM dominated text classification before deep learning
- prob-04-bayes — Bayes theorem underlies the Naive Bayes classifier
- ml-05-evaluation