Natural Language Processing
Introduction to NLP
ChatGPT doesn't see the word "don't". It sees tokens, pieces such as ["don", "'t"], and that changes everything: the model processes negation differently than intuition suggests. OpenAI's tiktoken (the cl100k_base encoding) has a vocabulary of 100277 tokens, not 26 letters. Early NLP broke down exactly here - on the assumption that word = unit of meaning. Chomsky introduced his formal theory of grammars in 1956. Modern transformers achieve better translation quality with no hand-written grammatical rules at all - just tokens and statistics on terabytes of text.
- **GPT tokenization:** tiktoken maps text onto a vocabulary of 100277 possible tokens. A word like "unbelievable" is split into subword pieces, for example ["un", "believ", "able"] (the exact split depends on the learned vocabulary). That is why LLMs count tokens, not words - and why API pricing is in $/1M tokens, not $/1M words
- **BERT WordPiece and DeepL:** BERT splits text with WordPiece, and DeepL neural translation with 1B+ parameters likewise runs on subword tokenization. Splitting rare words into parts lets the model generalize to words it never saw during training
- **Sentiment analysis at Bloomberg:** financial NLP analyzes thousands of news items per second. One token "not" before "rose" - and the signal inverts. Preprocessing determines whether the model correctly handles negation
- **Named Entity Recognition:** Bloomberg Terminal extracts company names, tickers, and people from unstructured news - NER in production with latency requirements under 100 ms
- **spaCy and HuggingFace:** spaCy processes one million tokens per second on a single CPU. HuggingFace Transformers gives BERT, GPT-2, RoBERTa in three lines - what used to take months of engineering
Prerequisites
- Basic Python: strings, lists, loops, importing libraries
- A sense of what a machine learning model and a training set are
- Command line: installing packages with pip, running scripts
From the Georgetown experiment to the embeddings era
NLP starts with the Georgetown-IBM experiment of 1954, when IBM and Georgetown University publicly translated more than sixty Russian sentences into English on an IBM 701. The press predicted automatic translation within five years, and the real problem turned out to be far harder. In 1966 Joseph Weizenbaum at MIT wrote ELIZA, a program that mimicked a psychotherapist through simple pattern matching. People read understanding into ELIZA that was never there, and Weizenbaum himself was disturbed by how readily they did so. For decades NLP ran on hand-written rules and grammars. The turning point came in the 1990s with the statistical revolution, which replaced rules with probabilistic models trained on large text corpora. The next jump arrived in 2013 with word2vec and dense embeddings, which opened the deep learning era for language.
Tokenization
ChatGPT doesn't see the word "don't". The model sees tokens - and how exactly "don't" is split determines the quality of negation handling. NLTK gives ["ca", "n't"] for "can't" - negation is isolated. A naive split() gives ["can't"] - negation is fused with the auxiliary verb. **Tokenization** is the first and most important step in NLP: splitting text into minimal units (tokens) that the model actually processes.
**A token** is the atomic unit of text that a model works with. It can be a word, part of a word, a character, or even a byte. The choice of tokenization strategy determines what information the model will be able to extract from the text.
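The difference between the two splits can be reproduced with a few lines of standard-library Python. The sketch below is a minimal illustration, not NLTK's implementation: `contraction_aware_tokenize` and its regular expressions are hypothetical stand-ins that only imitate the Treebank-style handling of "n't".

```python
import re


def naive_tokenize(text):
    """Whitespace split: contractions stay fused ("don't" is one token)."""
    return text.split()


def contraction_aware_tokenize(text):
    """Toy Treebank-style split: detach "n't" so negation is its own token.

    Hypothetical helper for illustration only; real tokenizers handle many
    more cases (quotes, abbreviations, other clitics).
    """
    # Insert a space before "n't" when it ends a word: "don't" -> "do n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Emit "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)


sentence = "They don't know."
print(naive_tokenize(sentence))
print(contraction_aware_tokenize(sentence))
```

With the naive split, the negation is hidden inside "don't" and the final period stays glued to "know.". The contraction-aware version exposes "n't" and "." as separate tokens, which is what a downstream model would need in order to treat negation as its own signal.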
| Strategy | Input: "unbelievable" | Result | Pros / Cons |
|---|---|---|---|
| Word-level | unbelievable | ["unbelievable"] | Preserves meaning / Huge vocabulary |
| Subword (BPE) | unbelievable | ["un", "believ", "able"] | Balanced / Standard in LLMs |
| Character-level | unbelievable | ["u","n","b","e",...] | Small vocabulary / Long sequences |
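The trade-off in the last column can be made concrete with a small experiment: build a vocabulary from a few training sentences, then check how a sentence containing an unseen word is represented at each level. The corpus and helper names below are invented for illustration.

```python
def word_level(text):
    """Word-level tokens: split on whitespace."""
    return text.split()


def char_level(text):
    """Character-level tokens: every character, spaces included."""
    return list(text)


def build_vocab(tokenize, texts):
    """Collect the set of distinct tokens seen in the training texts."""
    return {token for text in texts for token in tokenize(text)}


def coverage(tokenize, vocab, text):
    """Return (known tokens, total tokens) for a new text."""
    tokens = tokenize(text)
    known = sum(token in vocab for token in tokens)
    return known, len(tokens)


# Toy data (hypothetical): "hat" never appears as a word in training.
TRAIN_TEXTS = ["the cat sat on the mat", "the dog sat on the log"]
NEW_TEXT = "the hat"

for name, tokenize in [("word", word_level), ("char", char_level)]:
    vocab = build_vocab(tokenize, TRAIN_TEXTS)
    known, total = coverage(tokenize, vocab, NEW_TEXT)
    print(f"{name}-level: {known}/{total} tokens known")
```

At word level the new sentence is short, but "hat" is out of vocabulary. At character level every symbol is known, at the price of a sequence several times longer. Subword tokenization sits between these two extremes.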
**Not all languages use spaces!** In Chinese (我喜欢猫) and Japanese (私は猫が好き) there are no spaces between words. In German, compound words like "Donaudampfschiffahrtsgesellschaftskapitän" are a single "word". There is no universal tokenization.
**Modern LLMs** (GPT, BERT, LLaMA) use **subword tokenization** (BPE, WordPiece, or a SentencePiece-trained variant). Common words remain whole ("the" → ["the"]), rare ones are split into parts ("tokenization" → ["token", "ization"]). This lets the model handle any word, even a misspelling, because an unknown string falls back to smaller known pieces.
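How a fixed subword vocabulary covers unseen words can be sketched with greedy longest-match segmentation, the matching strategy used at inference time by WordPiece-style tokenizers. This is a simplified sketch, not BPE's merge procedure and not any library's code: `TOY_VOCAB` and `segment` are hypothetical, and real WordPiece additionally marks word-internal pieces with a "##" prefix.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"the", "un", "believ", "able", "token", "ization"}


def segment(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    If no vocabulary entry matches at a position, fall back to a single
    character so that any input can still be represented.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece (or becomes empty).
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            end = start + 1  # character fallback
        pieces.append(word[start:end])
        start = end
    return pieces


for word in ["the", "unbelievable", "tokenization", "tokenizashun"]:
    print(word, "->", segment(word, TOY_VOCAB))
```

The misspelled "tokenizashun" still yields a valid sequence: the known prefix "token" is reused and the unfamiliar remainder degrades into smaller pieces instead of becoming a single unknown token.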
What result will NLTK word_tokenize give for the string "I can't"?
Preprocessing
After tokenization, the text is still "dirty": casing, punctuation, stop words, different Unicode forms. **Preprocessing** is a series of transformations that brings tokens to a uniform form before feeding them to a model. Each step is a deliberate choice, not a mindless ritual.
**Stop words** are high-frequency words that carry little semantic meaning: "the", "is", "at", "on", "a". Typical lists contain roughly 180 words for English and roughly 250 for Russian, though exact counts vary by library. Removing stop words reduces data size and can lower noise.
**Don't remove stop words blindly!** For sentiment analysis, the phrase "not good" without stop words becomes simply "good" - the meaning is inverted! For search engines and BoW it's useful to remove them. For BERT and GPT - unnecessary, the model accounts for context on its own.
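The inversion is easy to reproduce with a toy lexicon-based scorer. Everything in this sketch is invented for illustration: the word lists, the stop word sets, and the scoring rule are hypothetical and far simpler than a trained sentiment model.

```python
# Hypothetical toy resources, invented for this illustration.
STOP_WORDS_WITH_NOT = {"the", "is", "a", "was", "not"}
STOP_WORDS_KEEP_NOT = STOP_WORDS_WITH_NOT - {"not"}
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "terrible"}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


def lexicon_score(tokens):
    """Sum word polarities; a preceding "not" flips the next polar word."""
    score = 0
    negate = False
    for token in tokens:
        if token == "not":
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


tokens = "the movie is not good".split()
print(lexicon_score(remove_stop_words(tokens, STOP_WORDS_WITH_NOT)))
print(lexicon_score(remove_stop_words(tokens, STOP_WORDS_KEEP_NOT)))
```

With "not" treated as a stop word, the scorer sees only ["movie", "good"] and returns a positive score. Keeping "not" out of the stop list preserves the negation and the score turns negative.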
| Step | When to apply | When NOT to apply |
|---|---|---|
| Lowercasing | Search, classification | NER ("Apple" the company vs "apple" the fruit) |
| Remove punctuation | BoW, TF-IDF | Sentiment ("!!!" = strong emotion) |
| Stop words | Search, topic modeling | Translation, QA systems |
| Unicode NFC | Always | Almost no exceptions |
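The "always" in the Unicode NFC row has a concrete reason: the same visible string can be stored as different code point sequences, and without normalization those copies count as different tokens. A minimal sketch using Python's standard `unicodedata` module (the helper name `to_nfc` is an assumption made for this example):

```python
import unicodedata


def to_nfc(text):
    """Normalize text to Unicode NFC (canonical composed form)."""
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"      # "é" stored as one code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The strings render identically but compare as different.
print(composed == decomposed, len(composed), len(decomposed))

# After NFC normalization they are the same string.
print(to_nfc(composed) == to_nfc(decomposed))
```

Applying the normalization once, at the very start of preprocessing, prevents a vocabulary from silently holding two entries for what a reader sees as one word.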
**Order of preprocessing matters.** Removing punctuation before tokenization turns "don't" into "dont" - one meaningless token. Correct order: tokenization → lowercasing → remove punctuation → stop words.
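The effect of step order can be checked directly. The sketch below applies the steps in the order given above and compares the result with stripping punctuation first. The regex tokenizer and the short stop word set are hypothetical stand-ins, chosen only to keep the example self-contained.

```python
import re
import string

# Hypothetical minimal resources for the illustration.
STOP_WORDS = {"the", "is", "a", "at", "on"}
PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Toy tokenizer: words (with an optional 'suffix) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def preprocess(text):
    """Tokenization -> lowercasing -> remove punctuation -> stop words."""
    tokens = tokenize(text)
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if token not in PUNCTUATION]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


def punctuation_first(text):
    """Wrong order: strip punctuation characters before tokenizing."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return stripped.lower().split()


text = "I don't like the ending!"
print(preprocess(text))
print(punctuation_first(text))
```

In the correct order the apostrophe survives inside "don't" because punctuation is removed only as whole tokens. Stripping punctuation from the raw string first produces "dont", which no longer matches any contraction the rest of the pipeline knows about.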
Why can removing stop words hurt a sentiment analysis task?
NLP Pipeline
Individual steps - tokenization, normalization, vectorization - are combined into an **NLP pipeline**: a sequential chain of transformations from raw text to model output. A conveyor where the output of one step is the input of the next. The critical implication: the vectorizer is fit only on training data. Fit on all data and the model "peeks" at test statistics - silent data leakage.
**Why use a pipeline?** Three reasons: 1. **Reproducibility** - all steps are fixed. 2. **No data leakage** - vectorizer is trained only on train, not on test. 3. **Batch processing** - a single `pipeline.predict()` call processes thousands of texts.
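The leakage point can be seen without any ML library. The sketch below uses a hypothetical `ToyTfidf` class with a fit/transform interface and a smoothed IDF. It is a simplified stand-in written for this example, not scikit-learn's `TfidfVectorizer`, and the texts are invented.

```python
import math
from collections import Counter


class ToyTfidf:
    """Hypothetical minimal TF-IDF vectorizer with a fit/transform interface."""

    def fit(self, texts):
        # Learn vocabulary and document frequencies from the given texts only.
        n_docs = len(texts)
        doc_freq = Counter()
        for text in texts:
            doc_freq.update(set(text.lower().split()))
        self.vocabulary_ = sorted(doc_freq)
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, texts):
        # Terms outside the learned vocabulary are silently ignored.
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            rows.append([counts[term] * self.idf_[term] for term in self.vocabulary_])
        return rows


train = ["cheap pills buy now", "meeting at noon", "buy cheap watches"]
test = ["cheap meeting crypto"]

clean = ToyTfidf().fit(train)          # correct: statistics from train only
leaky = ToyTfidf().fit(train + test)   # leakage: test text shapes the features

print("crypto" in clean.vocabulary_, "crypto" in leaky.vocabulary_)
print(clean.idf_["cheap"], leaky.idf_["cheap"])
```

The leaky vectorizer already knows the test-only word "crypto", and its weight for "cheap" has shifted because the test document was counted. Wrapping the vectorizer and the model in one pipeline object that is fit on the training split is the usual way to make this mistake hard to commit.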
**spaCy vs NLTK:** NLTK is a set of individual tools (like Lego bricks). spaCy is a ready-made pipeline (like an assembled model). For production, spaCy or HuggingFace Transformers are usually chosen. For learning and experiments - NLTK.
Modern Transformer-based pipelines (HuggingFace) are even simpler - the model handles tokenization and preprocessing itself, three lines of code. But understanding each step is critical for debugging: when BERT gives strange results, the cause is often a preprocessing mismatch or a tokenizer that doesn't match the model.
What happens when TfidfVectorizer is trained on all data (train + test) instead of just train?
NLP Tasks
NLP is not one task but a whole family. Each frames the question differently: "what emotion does the text convey?", "who is mentioned in the document?", "how to translate this into another language?". HuggingFace Transformers solves each in three lines - but knowing the difference between tasks is essential for choosing the right one.
| Task | Input → Output | Example application |
|---|---|---|
| Text Classification | Text → Category | Spam filter: email → spam/not spam |
| Sentiment Analysis | Text → Emotion | Reviews: "Great product!" → Positive |
| Named Entity Recognition | Text → Entities | "Apple in Cupertino" → Apple=ORG, Cupertino=LOC |
| Machine Translation | Text (lang A) → Text (lang B) | Google Translate |
| Question Answering | Question + Context → Answer | ChatGPT, search engines |
| Summarization | Long text → Short text | Compressing news into 2-3 sentences |
| Text Generation | Prompt → Text | GPT-4, copywriting, code |
**NER (Named Entity Recognition)** is one of the most practical NLP tasks. It allows automatic extraction of people's names (PER), organizations (ORG), locations (LOC), dates (DATE), and other entities from unstructured text. Used in search engines, CRM, legal document analysis.
NLP tasks group by context level: token level - POS-tagging, NER; sentence level - classification, sentiment; document level - summarization, topic modeling; across texts - translation, QA, semantic similarity. The higher the level, the more context the model needs, and the more expensive the inference.
**Start with the HuggingFace pipeline.** For a prototype or MVP, three lines of code are enough. When customization becomes necessary (domain-specific data, another language, latency constraints) - then dig into model training.
**Misconception:** NLP is just a set of regular expressions and rules for text processing.
Modern NLP is based on deep learning (Transformers, BERT, GPT). Regex only solves the simplest tasks like email validation. Understanding meaning, context, irony, and ambiguity requires neural networks trained on billions of texts.
Regular expressions cannot handle context and semantics. The phrases "bank of a river" and "bank issued a loan" cannot be reliably told apart by hand-written rules - a model that takes the surrounding context of each word into account is required.
Which NLP task solves the problem: "extract all mentioned companies and dates from a 10-page contract"?
Key Ideas
- **Tokenization** - splitting text into minimal units. The strategy (word/subword/char) determines what the model actually sees. ChatGPT sees pieces such as ["don", "'t"], not the word "don't"
- **Subword (BPE/WordPiece)** - the standard in LLMs: frequent words stay whole, rare ones are split. tiktoken: 100277 tokens, not 26 letters
- **Preprocessing** - text normalization. Every step is deliberate: removing stop words breaks sentiment ("not good" → "good"), Unicode NFC is always needed
- **Pipeline** combines all steps into a reproducible chain. Fit only on train - otherwise data leakage
- **Chomsky 1956 → tiktoken:** hand-built grammar theory lost to statistics at scale. Transformers translate better without a single hand-written grammatical rule
Related Topics
NLP - at the intersection of linguistics and machine learning:
- Regular Expressions and Text — Tools for working with text at a low level - when a full pipeline is overkill
- Bag of Words and TF-IDF — Vectorization - the key pipeline step, turning tokens into numbers for the model
Questions for Reflection
- tiktoken splits "tokenization" into ["token", "ization"]. Why is subword better than word-level for rare words and misspellings?
- The phrase "not bad" after stop word removal becomes "bad". What result will a sentiment model trained with removed stop words give?
- BERT uses WordPiece, GPT-4 uses BPE (tiktoken). Both are subword algorithms, but they are not identical. What do they share and what is the key distinction?
Related Lessons
- nlp-02 — Next step: word vector representations and embeddings
- ml-01-intro — ML fundamentals for training language models
- fl-01-intro — Formal languages: grammars and parsing
- prob-01-intro — Probabilities for language modeling
- aie-03-llm-fundamentals — NLP pipelines inside modern LLMs
- ir-01 — Information retrieval as an NLP application