Generative AI

Tokenization: BPE, SentencePiece

April 2024. Meta releases LLaMA 3 and bumps the vocabulary from 32K to 128K tokens. Not marketing - a fix for a real inefficiency. The previous tokenizer could burn roughly 3x more tokens on Russian or Arabic text than on English. Same request, about 3x more expensive and 3x slower. Tokenization decides what the model actually sees.

**GPT-4 API** charges by tokens: the same question in a non-English language can cost 2-3x more than in English, because non-English text tokenizes less efficiently
**BERT** uses WordPiece with a 30K vocabulary and a ## marker for continuations - this exposes morphology to the model: "play" + "##ing" = "playing"
**LLaMA 3** increased the vocabulary from 32K to 128K specifically for better multilingual tokenization - and quality on non-English languages improved noticeably

Rico Sennrich and BPE Adaptation for NLP

In 1994 Philip Gage invented BPE as a data compression algorithm - replacing the most frequent pair of adjacent bytes with a single unused byte. In 2016 Rico Sennrich and colleagues at the University of Edinburgh adapted the idea for neural machine translation. The problem was simple: how to translate "Bezirksregierungen" (German, 18 characters) when it does not exist in the vocabulary? BPE shatters it into known pieces. The Sennrich et al. paper racked up 10,000 citations in 8 years - one of the most influential contributions to modern NLP. Today BPE or a close relative sits inside nearly every LLM.

Prerequisites

Language Models: from n-gram to GPT

Byte Pair Encoding: From Characters to Subwords

Ask GPT-2: what is 1234 + 5678? Wrong answer. Not only because the model is weak at addition - a number like "5678" can split into the tokens "56" and "78". The model sees **two separate pieces**, not one number. The culprit: **tokenization** - the process that turns text into a sequence of numbers (token IDs) *before* the neural network even starts. Invisible, critically important.

Why not just split text by words? Three reasons: 1. **vocabulary explodes** - English has ~170,000 words, and once forms, typos, and names join in, the count hits millions 2. **OOV (out-of-vocabulary)** - new words, abbreviations, and slang never make it into the vocabulary 3. **morphology** - "play", "plays", "playing" become three separate tokens despite sharing one root. Solution: **subword tokenization** - split words into frequent substrings.

**Byte Pair Encoding (BPE)** - the most popular subword tokenization algorithm. Originally a data compression scheme (Gage, 1994), adapted for NLP (Sennrich et al., 2016). The idea is simple: start with individual characters and **iteratively merge** the most frequent pair into a new token. Repeat until the vocabulary hits the target size.
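The merge loop fits in a few lines. Below is a minimal sketch, assuming a toy corpus given as word counts (the words and counts are made up for illustration) and ignoring end-of-word markers and pre-tokenization that real implementations add. Ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of BPE merges from a {word: count} dict (toy sketch)."""
    # Every word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

# Hypothetical word counts, chosen only to make the merges easy to follow.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
learned_merges = train_bpe(toy_counts, num_merges=4)
print(learned_merges)
```

On this toy corpus the frequent suffix "est" and the root "low" emerge after four merges; each merge adds one token to the vocabulary.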

**Applying trained BPE to new text.** After training, an ordered list of merges sits ready. To tokenize a new word: 1. split into characters 2. apply merges **in training order** - each merge replaces a pair when it appears. Unknown words break apart into known substrings. The word "lowest" -> "low" + "est" - the model *sees* the root and suffix.
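A minimal sketch of that replay step. The merge list here is hypothetical (written by hand, as if it came out of training), and the function handles a single word without any pre-tokenization.

```python
def apply_bpe(word, merges):
    """Tokenize one word by replaying merges in training order."""
    symbols = list(word)  # step 1: split into characters
    for left, right in merges:  # step 2: apply each merge in order
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical ordered merge list, as if learned from a small corpus.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(apply_bpe("lowest", merges))  # -> ['low', 'est']
print(apply_bpe("slow", merges))    # -> ['s', 'low']
```

Order matters: ("es", "t") can only fire after ("e", "s") has produced "es", which is why the list must be replayed exactly as it was learned.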

**GPT-2/3/4 use Byte-level BPE.** Instead of Unicode characters, they operate on **bytes** (256 base tokens). Guarantees *any* text (emoji, hieroglyphs, binary data) tokenizes without OOV errors. Tiktoken - OpenAI's library for fast BPE tokenization.
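To see what "256 base tokens" means, the sketch below maps text to raw UTF-8 byte values using only the standard library. It is a simplification: real byte-level BPE then applies learned merges on top of these bytes (and GPT-2 additionally remaps bytes to printable characters), which is not shown here.

```python
def to_byte_tokens(text):
    """Map text to its UTF-8 byte values (0-255), the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

def from_byte_tokens(byte_ids):
    """Inverse mapping: rebuild the text from byte values."""
    return bytes(byte_ids).decode("utf-8")

for sample in ["cat", "кот", "猫", "🐱"]:
    ids = to_byte_tokens(sample)
    print(f"{sample!r}: {len(sample)} chars -> {len(ids)} bytes {ids}")
```

Nothing can be out of vocabulary, because every string is some sequence of bytes. The price is visible in the counts: before any merges, non-Latin scripts start with 2-4 base tokens per character.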

A BPE tokenizer is trained on an English corpus. How will it handle the new German word "Handschuh" (glove), which is not in the vocabulary?

WordPiece: Maximizing Likelihood

**WordPiece** (Schuster & Nakajima, 2012) - the algorithm running inside **BERT**, **DistilBERT**, and other BERT-family models. Same spirit as BPE, with one key difference in the merge criterion: BPE merges the most **frequent** pair; WordPiece merges the pair that **maximally lifts the likelihood** of the training corpus.

Formally, WordPiece picks the pair (a, b) for merging that maximizes: **score(a, b) = freq(ab) / (freq(a) * freq(b))**. This echoes **pointwise mutual information (PMI)** - how often two units appear together vs how often they would by chance. The pair "qu" scores high because "q" almost always travels with "u", even if "qu" itself is not the most frequent pair.
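The difference from BPE shows up even on a tiny example. This sketch computes both criteria on a hypothetical corpus of pre-split words (the counts are invented, and the `##` continuation prefix is left out to keep the arithmetic readable).

```python
from collections import Counter

def pair_statistics(corpus):
    """Return (pair_freq, scores), where score(a, b) = freq(ab) / (freq(a) * freq(b))."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            unit_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    scores = {
        (a, b): count / (unit_freq[a] * unit_freq[b])
        for (a, b), count in pair_freq.items()
    }
    return pair_freq, scores

# Hypothetical corpus: word (already split into characters) -> count.
corpus = {
    ("q", "u", "i", "t"): 4,
    ("u", "n", "i", "t"): 10,
    ("t", "i", "n"): 10,
}
pair_freq, scores = pair_statistics(corpus)

bpe_choice = max(pair_freq, key=pair_freq.get)   # most frequent pair
wordpiece_choice = max(scores, key=scores.get)   # highest score
print("BPE would merge:", bpe_choice)
print("WordPiece would merge:", wordpiece_choice)
```

Here ("i", "t") is the most frequent pair, so BPE takes it. WordPiece prefers ("q", "u"): it is rarer, but "q" never appears without "u", so the score normalized by the individual frequencies is higher.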

**Marking continuations with ##.** WordPiece tags tokens that are *continuations* of a word with the `##` prefix. The model now distinguishes the start of a word from its middle. Examples: "playing" -> ["play", "##ing"], "unbelievable" -> ["un", "##believ", "##able"]. Without `##`, the sequence ["play", "ing"] would be ambiguous: it could be the single word "playing" or two separate words "play" and "ing". The marker also makes it possible to glue tokens back into the original words.
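At inference time, BERT-style WordPiece splits a word greedily: take the longest vocabulary entry that matches at the current position, then continue with `##`-prefixed candidates. A minimal sketch with a hypothetical six-entry vocabulary (a real one has ~30K entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at word start: continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary.
toy_vocab = {"play", "un", "##ing", "##believ", "##able", "##s"}

print(wordpiece_tokenize("playing", toy_vocab))       # -> ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # -> ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # -> ['[UNK]']
```

Note that "ing" on its own is not in this vocabulary, only "##ing": the same letters at the start of a word and in the middle of a word are different tokens.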

| Aspect | BPE | WordPiece |
| --- | --- | --- |
| Merge criterion | Maximum pair frequency | Maximum likelihood gain (PMI-like score) |
| Marking | No special marker | ## for continuations |
| Used in | GPT-2/3/4, RoBERTa, LLaMA | BERT, DistilBERT, ELECTRA |
| Handling rare words | Split into subwords | Split into subwords with ## |
| Vocabulary size | 32K-128K (GPT-2: 50,257) | ~30K (BERT: 30,522) |

**In practice the gap between BPE and WordPiece is small.** Both produce subword tokenizations of comparable quality. The choice comes down to the ecosystem: OpenAI/Meta models lean on BPE, BERT-family models on WordPiece. New projects more often pick BPE or Unigram.

The BERT tokenizer (WordPiece) splits the word "playing" into ["play", "##ing"]. Why is the ## prefix needed?

SentencePiece: Language-Agnostic Tokenization

BPE and WordPiece assume text has already been **pre-processed**: split into words by spaces, cleaned of stray characters, normalized. This works for English but breaks down for **Japanese, Chinese, Thai** (languages written without spaces), and it is awkward for **German**, where a long compound counts as a single "word". **SentencePiece** (Kudo & Richardson, 2018) cracks this: it works on **raw text** directly, including spaces as part of the alphabet.

Key idea: SentencePiece **needs no pre-tokenization**. Spaces become the special character `▁` (Unicode U+2581) and live inside tokens. The text "I love cats" turns into "▁I▁love▁cats", and tokenization runs on this chain of characters. The algorithm becomes **language-agnostic** - the same procedure applies to any language, and the original text can be restored exactly from the tokens.
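The space trick is easy to reproduce without the library. A minimal sketch, assuming the default behavior of prefixing the text with one marker; the piece list passed to `detokenize` is a hypothetical segmentation, not the output of a trained model.

```python
META = "\u2581"  # "▁", the whitespace marker

def mark_spaces(text):
    """Prefix one marker and replace every space with it (simplified normalization)."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the added prefix."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

print(mark_spaces("I love cats"))  # -> ▁I▁love▁cats

# Hypothetical segmentation of the marked text.
pieces = ["▁I", "▁love", "▁cat", "s"]
print(detokenize(pieces))          # -> I love cats
```

Because the marker is an ordinary symbol inside tokens, detokenization is just string concatenation: no language-specific rules about where spaces go.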

SentencePiece supports two algorithms: **BPE** (above) and **Unigram** (Kudo, 2018). Unigram works the opposite way: instead of iteratively *adding* tokens (BPE), it starts with a **huge** vocabulary and iteratively *removes* the least useful ones. Each step, Unigram drops the tokens whose removal **least raises the corpus loss** (typically a fixed fraction of the vocabulary at a time) and repeats until the target size is reached.
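Under a unigram model, a segmentation's probability is the product of its token probabilities, and the best segmentation can be found with Viterbi-style dynamic programming. The sketch below uses made-up probabilities and a one-word "corpus" to show how the loss increase from removing a token could be measured. It is a simplification: the actual algorithm re-estimates probabilities with EM and scores the whole corpus.

```python
import math

def viterbi_segment(text, log_probs):
    """Most probable split of text into vocabulary tokens, with its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1], best[n]

def loss_increase(text, log_probs, token):
    """How much the best log-probability drops if `token` is removed."""
    _, full_score = viterbi_segment(text, log_probs)
    reduced = {t: lp for t, lp in log_probs.items() if t != token}
    try:
        return full_score - viterbi_segment(text, reduced)[1]
    except ValueError:
        return math.inf  # the text can no longer be segmented at all

# Hypothetical token probabilities.
probs = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02}
log_probs = {token: math.log(p) for token, p in probs.items()}

tokens, score = viterbi_segment("unrelated", log_probs)
print(tokens, score)
for token in probs:
    print(token, loss_increase("unrelated", log_probs, token))
```

In this toy setup "rel" and "ated" can be removed at no cost, because the best segmentation never uses them; removing "related" hurts, and removing "un" makes the word impossible to segment. Pruning candidates are the tokens with the smallest increase.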

**Who uses SentencePiece?** LLaMA 1 and 2 (Meta), T5 (Google), ALBERT, XLNet, mBART - English-centric and multilingual models alike. GPT-2/3/4 run their own byte-level BPE (available through tiktoken), but the goal is the same: working with bytes instead of characters makes the tokenizer language-agnostic.

**Unigram vs BPE in practice.** Unigram has a theoretical edge: it can hand back *multiple* possible segmentations with probabilities - handy for data augmentation. During model training, different tokenizations of the same text get sampled - **subword regularization** (Kudo, 2018) - and it boosts model robustness.
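A minimal sketch of the sampling idea, assuming hypothetical token probabilities. It enumerates every segmentation by brute force, which is exponential and only viable for short strings; practical implementations sample from an n-best list or a lattice instead. The `alpha` exponent follows the smoothing idea from Kudo (2018): smaller values flatten the distribution.

```python
import random

def all_segmentations(text, probs):
    """Yield (tokens, probability) for every split of text into vocabulary tokens."""
    if not text:
        yield [], 1.0
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in probs:
            for rest, rest_prob in all_segmentations(text[end:], probs):
                yield [piece] + rest, probs[piece] * rest_prob

def sample_segmentation(text, probs, rng, alpha=1.0):
    """Sample one segmentation with weight proportional to P(segmentation) ** alpha."""
    candidates = list(all_segmentations(text, probs))
    if not candidates:
        raise ValueError("text cannot be segmented with this vocabulary")
    weights = [p ** alpha for _, p in candidates]
    return rng.choices([tokens for tokens, _ in candidates], weights=weights, k=1)[0]

# Hypothetical token probabilities.
probs = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02}

rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(5):
    print(sample_segmentation("unrelated", probs, rng, alpha=0.2))
```

During training, each epoch can then present the same sentence under a different segmentation, so the model does not overfit to one particular way of cutting words.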

Why does SentencePiece represent spaces with the special character ▁ and include it in tokens, rather than simply splitting text by spaces?

Vocabulary Size and Special Tokens

One of the most important hyperparameters of a tokenizer is **vocabulary size**. GPT-2: 50,257 tokens. LLaMA 1/2: 32,000. GPT-4: about 100K. LLaMA 3: 128K. Not random numbers - they reflect a fundamental **trade-off** that directly drives model quality.

**How does vocabulary size shape sequence length?** The sentence "Tokenization is important for language models" with a 1K vocabulary splits into ~15 tokens (lots of short pieces); with a 100K vocabulary - ~6 tokens (whole words). Critical detail: self-attention in a Transformer has **O(n^2)** complexity in sequence length n. Twice as long = roughly four times more expensive in attention memory and compute.
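The quadratic term can be checked with plain arithmetic. This sketch counts only the entries of the attention score matrix (one per pair of tokens, per head and per layer) and ignores everything else in the model; the token counts 6 and 15 are the illustrative figures from the paragraph above.

```python
def attention_score_entries(seq_len):
    """Pairwise attention scores for one head in one layer: n * n."""
    return seq_len * seq_len

short, long_ = 6, 15  # ~6 tokens with a 100K vocabulary vs ~15 with a 1K vocabulary
print(attention_score_entries(short))   # -> 36
print(attention_score_entries(long_))   # -> 225
print(attention_score_entries(long_) / attention_score_entries(short))  # -> 6.25
```

A sequence 2.5x longer costs 6.25x more attention work, which is why a tokenizer that produces fewer tokens pays off more than linearly.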

**Special tokens** are reserved IDs with special meaning for the model. They never show up in ordinary text - they steer the model's behavior.

| Token | Meaning | Used in |
| --- | --- | --- |
| [CLS] | Beginning of input; its embedding = representation of the whole text | BERT |
| [SEP] | Separator between two text segments | BERT |
| [MASK] | Masked token (for prediction during training) | BERT |
| [PAD] | Padding to equal length in a batch | Most models |
| <\|endoftext\|> | End of document / start of a new one | GPT-2/3 |
| <\|im_start\|> | Start of a message (role: system/user/assistant) | ChatGPT |
| <\|im_end\|> | End of a message | ChatGPT |
| <s>, </s> | Start and end of sequence | LLaMA 1/2 (T5 uses only </s>) |
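As an illustration of how the BERT tokens from the table fit together, here is a minimal sketch that frames a sentence pair and pads it to a fixed length. The helper name `build_bert_input` and the example tokens are made up; a real pipeline would also map tokens to IDs and build segment IDs.

```python
def build_bert_input(tokens_a, tokens_b, max_len):
    """Frame two token lists as [CLS] A [SEP] B [SEP], then pad to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len; truncate first")
    pad_count = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad_count  # 0 marks padding
    return tokens + ["[PAD]"] * pad_count, attention_mask

tokens, mask = build_bert_input(["how", "are", "you"], ["fine"], max_len=10)
print(tokens)
print(mask)
```

The attention mask tells the model which positions are real and which are padding, so `[PAD]` tokens do not influence the result.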

**Multilingual disparity.** Tokenizers trained mostly on English can burn up to 2-5x more tokens on the same text in other languages. Roughly: a 10-word sentence in English = ~10 tokens, in French = ~15, in Japanese = ~30. Consequences: 1. less context fits in the window 2. generation runs slower 3. **costs more** (APIs charge per token). LLaMA 3 expanded the vocabulary to 128K specifically for better multilingual support.

**Byte-fallback** - the safety net. If a piece of text cannot be built from known subwords, the tokenizer falls back to the **individual bytes** (0-255) of its UTF-8 encoding. Guarantees *any* input tokenizes - from UTF-8 text to binary data. SentencePiece offers this as an option; GPT-2+ goes one step further with byte-level BPE: the base 256 tokens are bytes, everything else is merges layered on top.
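A minimal sketch of the fallback step, assuming the text has already been cut into candidate pieces and using the `<0xNN>` naming that SentencePiece-style vocabularies use for byte tokens. The vocabulary and pieces are hypothetical.

```python
def encode_with_byte_fallback(pieces, vocab):
    """Keep known pieces; replace unknown ones with one token per UTF-8 byte."""
    out = []
    for piece in pieces:
        if piece in vocab:
            out.append(piece)
        else:
            out.extend(f"<0x{byte:02X}>" for byte in piece.encode("utf-8"))
    return out

# Hypothetical vocabulary that knows "▁cat" but has never seen the character "猫".
vocab = {"\u2581cat"}
encoded = encode_with_byte_fallback(["\u2581cat", "猫"], vocab)
print(encoded)  # -> ['▁cat', '<0xE7>', '<0x8C>', '<0xAB>']
```

The unknown character costs three tokens instead of one, but it is represented exactly, with no `[UNK]` and no information lost.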

**Why could GPT-2 not do arithmetic?** The number "13579" tokenized as "135" + "79" or "1" + "3579" - unpredictably. The model never saw digits in a consistent layout, so arithmetic was unreliable. Modern fixes: 1. dedicated number tokenization (each digit as its own token) 2. chain-of-thought prompting (break the computation into steps) 3. tool use (calculator).
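The first fix, one token per digit, can be sketched as a pre-tokenization rule. This is an illustration with a regular expression, not the rule of any specific tokenizer: digits are isolated before subword merging, so no merge can glue them together.

```python
import re

def split_digits(text):
    """Split text so every digit is its own piece; runs of non-digits stay whole."""
    return re.findall(r"\d|\D+", text)

print(split_digits("1234 + 5678"))
# -> ['1', '2', '3', '4', ' + ', '5', '6', '7', '8']
```

With this rule "13579" always becomes five tokens in the same order, so the model sees place value consistently, at the cost of longer sequences for long numbers.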

Key Ideas

**BPE** - iterative merging of frequent character pairs. Starts with the alphabet, adds tokens bottom-up. Used in GPT-2/3/4, LLaMA. Remember how "5678" split into "56" + "78"? The reason is now clear
**WordPiece** - like BPE, but selects pairs by a likelihood-based score (PMI-like) rather than raw frequency. Uses the ## marker for word continuations. Used in BERT
**SentencePiece** - language-agnostic: works with raw text, space ▁ is part of the alphabet. Handles English, Japanese, Arabic the same way
**Vocabulary size** (32K-128K) - a critical trade-off: small = long sequences, large = a bigger embedding matrix with many rarely trained tokens. Tokenization determines what the model *sees* - and what it cannot

Questions for Reflection

If a tokenizer is trained on 90% English text, the model will perform worse on other languages. How can a "fair" multilingual tokenizer be designed? What trade-offs arise?
BPE splits numbers unpredictably: "12345" -> "123" + "45" or "1" + "2345". How can the arithmetic problem in LLMs be solved without changing the model architecture?
A Unigram model can give *multiple* possible segmentations of the same text with different probabilities. How can this be used to improve model training? (Hint: data augmentation)

Related Lessons

gai-02 - Language models predict tokens - need to understand what exactly gets predicted
gai-04 - Embedding layer converts token IDs to vectors - vocab size determines the matrix
gai-05 - Tokenization quality affects the quality of learned embeddings
nlp-03 - TF-IDF also works with word frequencies - different approaches to text representation
it-01 - BPE is a compression algorithm - information theory explains why frequent pairs merge
alg-01 - BPE is a greedy algorithm; big-O intuition helps estimate training cost
fl-05-regex