Natural Language Processing

Introduction to NLP

ChatGPT doesn't see the word "don't" as a single unit. It sees subword tokens, for example ["don", "'t"], and that changes how the model handles negation compared with what intuition suggests. OpenAI's tiktoken encoding cl100k_base has a vocabulary of 100277 tokens, not 26 letters. Early rule-based NLP broke down on exactly this point: the assumption that a word is the unit of meaning. Chomsky laid the foundations of formal grammar theory in 1956. Modern transformers achieve better translation quality without hand-written grammatical rules, relying only on tokens and statistics learned from terabytes of text.

**GPT tokenization:** tiktoken maps text onto a vocabulary of 100277 possible tokens. A long word such as "unbelievable" may be split into pieces such as ["un", "bel", "ievable"]. That is why LLMs count tokens, not words, and why API pricing is quoted in $/1M tokens, not $/1M words.
**BERT WordPiece and DeepL:** BERT uses WordPiece, and neural translation systems such as DeepL (reportedly 1B+ parameters) also run on subword tokenization. Splitting rare words into parts lets the model generalize to words it never saw during training.
**Sentiment analysis at Bloomberg:** financial NLP analyzes thousands of news items per second. A single "not" before "rose" inverts the signal. Preprocessing determines whether the model handles negation correctly.
**Named Entity Recognition:** Bloomberg Terminal extracts company names, tickers, and people from unstructured news. This is NER in production, with latency requirements under 100 ms.
**spaCy and HuggingFace:** spaCy is built for high-throughput processing on a single CPU. HuggingFace Transformers gives access to BERT, GPT-2, and RoBERTa in about three lines of code, which used to take months of engineering.

Prerequisites

Basic Python: strings, lists, loops, importing libraries
A sense of what a machine learning model and a training set are
Command line: installing packages with pip, running scripts

From the Georgetown experiment to the embeddings era

NLP starts with the Georgetown-IBM experiment of 1954, when IBM and Georgetown University publicly translated more than sixty Russian sentences into English on an IBM 701. The press predicted automatic translation within five years, and the real problem turned out to be far harder. In 1966 Joseph Weizenbaum at MIT wrote ELIZA, a program that mimicked a psychotherapist through simple pattern matching. People read understanding into ELIZA that was never there, and Weizenbaum himself was disturbed by how readily they did so. For decades NLP ran on hand-written rules and grammars. The turning point came in the 1990s with the statistical revolution, which replaced rules with probabilistic models trained on large text corpora. The next jump arrived in 2013 with word2vec and dense embeddings, which opened the deep learning era for language.

Tokenization

A model never sees the word "don't" directly. It sees tokens, and how "don't" is split affects how well negation is handled. NLTK's word_tokenize gives ["ca", "n't"] for "can't", which isolates the negation. A naive split() gives ["can't"], which fuses the negation with the auxiliary verb. **Tokenization** is the first step in NLP and one of the most consequential: splitting text into the minimal units (tokens) that the model actually processes.
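The difference between the two splitting styles can be sketched with the standard library alone. The helper `clitic_tokenize` below is a hypothetical, hand-rolled approximation of Treebank-style contraction splitting, not NLTK's implementation, which covers many more cases.

```python
import re

def naive_tokenize(text):
    """Whitespace split: contractions stay fused with their verb."""
    return text.split()

def clitic_tokenize(text):
    """Approximate Treebank-style splitting of the "n't" clitic.

    Hypothetical helper for illustration only.
    """
    # Insert a space before "n't" so the negation becomes its own token.
    text = re.sub(r"(?i)n't\b", " n't", text)
    return text.split()

sentence = "I don't know"
print(naive_tokenize(sentence))   # ['I', "don't", 'know']
print(clitic_tokenize(sentence))  # ['I', 'do', "n't", 'know']
```

With the second variant, a downstream model or rule can treat "n't" as a negation feature of its own instead of having to learn every fused form separately.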

**A token** is the atomic unit of text that a model works with. It can be a word, part of a word, a character, or even a byte. The choice of tokenization strategy determines what information the model will be able to extract from the text.

| Strategy | Tokens for "unbelievable" | Pros / Cons |
| --- | --- | --- |
| Word-level | ["unbelievable"] | Preserves meaning / Huge vocabulary |
| Subword (BPE) | ["un", "believ", "able"] | Balanced / Standard in LLMs |
| Character-level | ["u","n","b","e",...] | Small vocabulary / Long sequences |

**Not all languages use spaces!** In Chinese (我喜欢猫) and Japanese (私は猫が好き) there are no spaces between words. In German, compound words like "Donaudampfschiffahrtsgesellschaftskapitän" are a single "word". There is no universal tokenization.
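A quick check makes the point concrete. This is a minimal sketch: the character-level fallback shown here is only a stand-in, since real Chinese or Japanese word segmentation needs a dedicated segmenter or a subword model.

```python
text_en = "I like cats"
text_zh = "我喜欢猫"

# Whitespace splitting works for English...
en_tokens = text_en.split()
# ...but returns the whole Chinese sentence as one "token".
zh_split = text_zh.split()
# Character-level splitting is a crude fallback, not true word segmentation.
zh_chars = list(text_zh)

print(en_tokens)  # ['I', 'like', 'cats']
print(zh_split)   # ['我喜欢猫']
print(zh_chars)   # ['我', '喜', '欢', '猫']
```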

**Modern LLMs** (GPT, BERT, LLaMA) use **subword tokenization**: BPE, WordPiece, or a unigram model, often through the SentencePiece library. Common words remain whole ("the" → ["the"]), while rare ones are split into parts ("tokenization" → ["token", "ization"]). This lets the model handle any word, including misspellings, because unknown strings fall back to smaller known pieces.
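The fallback behavior can be illustrated with a toy segmenter. Note the substitution: for brevity this sketch uses greedy longest-match lookup (the idea behind WordPiece inference) rather than BPE merge rules, and `TOY_VOCAB` is a hypothetical six-entry vocabulary, not a real tokenizer's.

```python
def subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary chosen to mirror the examples in the text.
TOY_VOCAB = {"the", "token", "ization", "un", "believ", "able"}

print(subword_tokenize("the", TOY_VOCAB))           # ['the']
print(subword_tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(subword_tokenize("tokenizaton", TOY_VOCAB))   # ['token', 'i', 'z', 'a', 't', 'o', 'n']
```

The misspelled "tokenizaton" still produces a valid token sequence: the known prefix survives and the rest degrades to characters, which is why subword models have no hard out-of-vocabulary failure.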

What result will NLTK word_tokenize give for the string "I can't"?

Preprocessing

After tokenization, the text is still "dirty": casing, punctuation, stop words, different Unicode forms. **Preprocessing** is a series of transformations that brings tokens to a uniform form before feeding them to a model. Each step is a deliberate choice, not a mindless ritual.

**Stop words** are high-frequency words that carry little semantic meaning: "the", "is", "at", "on", "a". The exact count depends on the list used: common English lists hold roughly 180 words, Russian lists roughly 250. Removing stop words reduces data size and lowers noise.

**Don't remove stop words blindly!** For sentiment analysis, the phrase "not good" without stop words becomes simply "good", and the meaning is inverted. For search engines and BoW, removing them is useful. For BERT and GPT it is unnecessary, because the model accounts for context on its own.
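A small sketch of the inversion, assuming a hypothetical stop-word list that includes "not" (as many off-the-shelf lists do). The names `STOP_WORDS` and `NEGATIONS` are illustrative, not taken from any library.

```python
# Hypothetical stop-word list for illustration; note that it contains "not".
STOP_WORDS = {"the", "is", "at", "on", "a", "not"}
# Words we choose to protect because they flip the meaning.
NEGATIONS = {"not", "no", "never", "n't"}

def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in `stop_words`."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["the", "movie", "is", "not", "good"]
blind = remove_stop_words(tokens, STOP_WORDS)
careful = remove_stop_words(tokens, STOP_WORDS - NEGATIONS)

print(blind)    # ['movie', 'good']
print(careful)  # ['movie', 'not', 'good']
```

Subtracting the negations from the list is one simple mitigation; whether it is the right one depends on the task.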

| Step | When to apply | When NOT to apply |
| --- | --- | --- |
| Lowercasing | Search, classification | NER ("Apple" the company vs "apple" the fruit) |
| Remove punctuation | BoW, TF-IDF | Sentiment ("!!!" = strong emotion) |
| Stop words | Search, topic modeling | Translation, QA systems |
| Unicode NFC | Always | Almost no exceptions |
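Why Unicode NFC is listed as "always" is easiest to see on one word. The sketch below uses Python's standard `unicodedata` module; the sample strings are illustrative.

```python
import unicodedata

composed = "caf\u00e9"     # "é" stored as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but compare as different.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# After NFC normalization both share the composed form.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True
```

Without this step, the same word can occupy two vocabulary entries and two sets of counts.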

**Order of preprocessing matters.** Removing punctuation before tokenization turns "don't" into "dont", a single token that no longer exposes the negation. A sensible default order: tokenization → lowercasing → remove punctuation → stop words.
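The effect of ordering can be sketched as two small functions. Everything here is an assumption made for illustration: the regex tokenizer is a hand-rolled stand-in, and `STOP_WORDS` is a hypothetical list that deliberately keeps negations.

```python
import re
import string

STOP_WORDS = {"the", "is", "a"}  # hypothetical list; negations are kept

def tokenize(text):
    """Split off the "n't" clitic, then separate words from punctuation runs."""
    text = re.sub(r"(?i)n't\b", " n't", text)
    return re.findall(r"n't|\w+|[^\w\s]+", text)

def preprocess(text):
    """Recommended order: tokenize, lowercase, drop punctuation, drop stop words."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]
    # Drop tokens made up entirely of punctuation characters.
    tokens = [t for t in tokens if not all(c in string.punctuation for c in t)]
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess_wrong_order(text):
    """Punctuation is stripped first, so the apostrophe is lost before tokenization."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The plot isn't bad!!!"
print(preprocess(sentence))              # ['plot', "n't", 'bad']
print(preprocess_wrong_order(sentence))  # ['plot', 'isnt', 'bad']
```

In the first result the negation survives as its own token; in the second it is buried inside "isnt". Note that both variants discard "!!!", which may itself be a useful signal for sentiment tasks.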

Why can removing stop words hurt a sentiment analysis task?

NLP Pipeline

Individual steps (tokenization, normalization, vectorization) are combined into an **NLP pipeline**: a sequential chain of transformations from raw text to model output, a conveyor where the output of one step is the input of the next. One critical rule follows: the vectorizer is fit only on training data. Fit it on all data and the model "peeks" at test statistics, which is silent data leakage.

**Why use a pipeline?** Three reasons:
1. **Reproducibility**: all steps are fixed.
2. **No data leakage**: the vectorizer is fit only on train, never on test.
3. **Batch processing**: a single `pipeline.predict()` call processes thousands of texts.
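The leakage point can be shown with a tiny bag-of-words vectorizer. `CountVectorizerSketch` is a hypothetical stdlib-only class written for this illustration; it only imitates the fit/transform split used by libraries such as scikit-learn and is not their implementation.

```python
from collections import Counter

class CountVectorizerSketch:
    """Tiny bag-of-words vectorizer with a fit/transform split (illustrative only)."""

    def fit(self, texts):
        # The vocabulary is learned only from the texts passed to fit().
        vocab = sorted({tok for text in texts for tok in text.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, texts):
        # Words outside the learned vocabulary are silently ignored.
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows

train = ["good movie", "bad movie"]
test = ["brilliant movie"]

clean = CountVectorizerSketch().fit(train)         # correct: train only
leaky = CountVectorizerSketch().fit(train + test)  # leakage: test words are seen

print(clean.vocabulary_)             # {'bad': 0, 'good': 1, 'movie': 2}
print(clean.transform(test))         # [[0, 0, 1]]
print("brilliant" in leaky.vocabulary_)  # True
```

The leaky vectorizer already knows "brilliant" before evaluation starts, so test-time scores no longer reflect how the system behaves on genuinely unseen text.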

**spaCy vs NLTK:** NLTK is a set of individual tools (like Lego bricks). spaCy is a ready-made pipeline (like an assembled model). For production, spaCy or HuggingFace Transformers are usually chosen. NLTK is a good fit for learning and experiments.

Modern Transformer-based pipelines (HuggingFace) are even simpler: the model's own tokenizer handles tokenization and preprocessing, and a working example takes about three lines of code. But understanding each step is critical for debugging: when BERT gives strange results, a frequent cause is a preprocessing mismatch or a tokenizer that doesn't correspond to the model.

What happens when TfidfVectorizer is trained on all data (train + test) instead of just train?

NLP Tasks

NLP is not one task but a whole family. Each frames the question differently: "what emotion does the text convey?", "who is mentioned in the document?", "how to translate this into another language?". HuggingFace Transformers solves each in three lines - but knowing the difference between tasks is essential for choosing the right one.

| Task | Input → Output | Example application |
| --- | --- | --- |
| Text Classification | Text → Category | Spam filter: email → spam/not spam |
| Sentiment Analysis | Text → Emotion | Reviews: "Great product!" → Positive |
| Named Entity Recognition | Text → Entities | "Apple in Cupertino" → Apple=ORG, Cupertino=LOC |
| Machine Translation | Text (lang A) → Text (lang B) | Google Translate |
| Question Answering | Question + Context → Answer | ChatGPT, search engines |
| Summarization | Long text → Short text | Compressing news into 2-3 sentences |
| Text Generation | Prompt → Text | GPT-4, copywriting, code |

**NER (Named Entity Recognition)** is one of the most practical NLP tasks. It allows automatic extraction of people's names (PER), organizations (ORG), locations (LOC), dates (DATE), and other entities from unstructured text. Used in search engines, CRM, legal document analysis.

NLP tasks group by context level: token level - POS-tagging, NER; sentence level - classification, sentiment; document level - summarization, topic modeling; across texts - translation, QA, semantic similarity. The higher the level, the more context the model needs, and the more expensive the inference.

**Start with the HuggingFace pipeline.** For a prototype or MVP, three lines of code are enough. When customization becomes necessary (domain-specific data, another language, latency constraints) - then dig into model training.

Common misconception: "NLP is just a set of regular expressions and rules for text processing."

Modern NLP is based on deep learning (Transformers, BERT, GPT). Regex only solves the simplest tasks like email validation. Understanding meaning, context, irony, and ambiguity requires neural networks trained on billions of texts.

Regular expressions cannot handle context and semantics. The phrase "bank of a river" and "bank issued a loan" cannot be distinguished by rules - a model that understands the surrounding context of each word is required.

Which NLP task solves the problem: "extract all mentioned companies and dates from a 10-page contract"?

Key Ideas

**Tokenization**: splitting text into minimal units. The strategy (word/subword/char) determines what the model actually sees. ChatGPT sees subword tokens such as ["don", "'t"], not the word "don't".
**Subword (BPE/WordPiece)**: the standard in LLMs. Frequent words stay whole, rare ones are split. tiktoken (cl100k_base): 100277 tokens, not 26 letters.
**Preprocessing**: text normalization. Every step is deliberate: removing stop words breaks sentiment ("not good" → "good"), while Unicode NFC is almost always needed.
**Pipeline**: combines all steps into a reproducible chain. Fit only on train, otherwise data leaks.
**Chomsky 1956 → tiktoken:** hand-written grammar rules lost to statistics at scale. Transformers translate better without a single hand-written grammatical rule.

Questions for reflection

tiktoken splits "tokenization" into ["token", "ization"]. Why is subword better than word-level for rare words and misspellings?
The phrase "not bad" after stop word removal becomes "bad". What result will a sentiment model trained with removed stop words give?
BERT uses WordPiece, GPT-4 uses BPE (tiktoken). Both are subword algorithms, but they build their vocabularies differently. What do they share and what is the key distinction?

Related lessons

nlp-02 — Next step: word vector representations and embeddings
ml-01-intro — ML fundamentals for training language models
fl-01-intro — Formal languages: grammars and parsing
prob-01-intro — Probabilities for language modeling
aie-03-llm-fundamentals — NLP pipelines inside modern LLMs
ir-01 — Information retrieval as an NLP application
