Natural Language Processing

Word Embeddings: Word2Vec, GloVe

2013. Google Brain: Tomas Mikolov notices a strange pattern in a 300-dimensional vector space. `vector('king') - vector('man') + vector('woman') ~= vector('queen')`. The pattern holds across thousands of analogy pairs. This moment flips NLP: semantics is encoded as geometry, and relationships between words become directions in space. Word2Vec has since been downloaded tens of millions of times, and the idea of dense embeddings became the foundation for the input layers of BERT, GPT, and later language models.

**Spotify**: Word2Vec on track sequences ('track2vec') - track embeddings for Discover Weekly recommendations
**Airbnb**: 'listing2vec' on booking sequences - semantic space for finding similar listings
**Pinterest**: 'pin2vec' for visual content recommendation - the analogy from text semantics transferred to images

Prerequisites

Tokenization and text preprocessing
Bag of Words and TF-IDF: sparse text representations
Linear algebra basics: vectors, dot product, cosine distance

Historical context

In 2013 Tomas Mikolov and colleagues at Google published two papers: 'Efficient Estimation of Word Representations in Vector Space' (Word2Vec) in January and 'Distributed Representations of Words and Phrases and their Compositionality' in October. LSA (1990) and NNLM (2003) existed before, but neural language models were too slow to train on very large corpora. Word2Vec removed the non-linear hidden layer, and the second paper added negative sampling: training on billions of words took days instead of months, and the published vectors were trained on about 100B words of Google News. The code was open-sourced the same year. Stanford responded with GloVe in 2014, Facebook with FastText in 2016 (paper published in 2017). Within about 4 years dense embeddings had displaced sparse bag-of-words as the default text representation in much of the industry.

Word2Vec: neural embeddings

Word2Vec (Mikolov, 2013) trains dense vector representations of words by predicting context. Two variants: **Skip-gram** - predict context words from the center word; **CBOW** - predict the center word from context words. Key property: geometric relationships in the space mirror semantic ones. `king - man + woman ~= queen`.
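
To make the two variants concrete, here is a minimal sketch of how training examples could be generated from a token list. The function names `skipgram_pairs` and `cbow_examples`, the `window` parameter, and the toy sentence are illustrative assumptions, not part of the original Word2Vec code.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the reverse task, (list of context words, center word)."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical toy sentence
sentence = "the king rules the land".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
print(pairs[:3])    # first three skip-gram pairs
print(examples[1])  # CBOW example whose center word is 'king'
```

Skip-gram produces several small examples per position, CBOW one example with a pooled context; the sketch leaves out subsampling of frequent words and random window shrinking, which real implementations typically add.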

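**Negative sampling** is critical for efficiency: instead of a softmax over the entire vocabulary (millions of words), the model solves a binary classification task, telling the true context word apart from K random negative words (the paper suggests K=5-20 for small datasets and 2-5 for large ones). Sampling is frequency-weighted: P(w) proportional to freq(w)^(3/4). Compared with sampling by raw frequency, this reduces the dominance of frequent words and gives rare words a somewhat higher chance of being drawn.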
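
The sketch below illustrates both parts of the idea: the smoothed noise distribution and the binary-classification loss for one training pair. The word counts, the helper names, and the choice to skip the true context word when drawing negatives are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, then renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_negatives(dist, positive, k, rng):
    """Draw k noise words from dist, skipping the true context word."""
    words = list(dist)
    probs = [dist[w] for w in words]
    negatives = []
    while len(negatives) < k:
        w = rng.choices(words, weights=probs, k=1)[0]
        if w != positive:
            negatives.append(w)
    return negatives


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center, context, negative_vecs):
    """Push the true pair's score up and each negative pair's score down."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss


# Hypothetical corpus counts
counts = Counter({"the": 1000, "king": 50, "zebra": 1})
total_count = sum(counts.values())
raw = {w: c / total_count for w, c in counts.items()}
smoothed = noise_distribution(counts)
negatives = sample_negatives(smoothed, positive="king", k=5, rng=random.Random(0))
print(raw["zebra"], smoothed["zebra"])  # the rare word gains probability mass
```

Each update touches only K+1 output vectors instead of the whole vocabulary, which is where the speed-up over a full softmax comes from.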

Why does Word2Vec use negative sampling instead of softmax over the entire vocabulary?

GloVe: co-occurrence matrix

GloVe (Pennington et al., Stanford, 2014) - Global Vectors for Word Representation. Instead of the local context of a sliding window (Word2Vec), it uses the **global co-occurrence matrix** X, where X_ij is how many times word i appears in the context of j across the corpus. The objective: the dot product of the vectors for words i and j, plus a bias term for each word, should approximate log(X_ij). The fit is a weighted least squares problem over the non-zero entries of X.
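
A minimal sketch of the two stages, assuming plain (unweighted) window counts and the weighting function from the GloVe paper with x_max=100 and alpha=0.75. The function names and the toy token list are illustrative; the reference implementation additionally down-weights distant context words.

```python
import math
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """X[(i, j)]: how often word j occurs within `window` positions of word i."""
    counts = defaultdict(float)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                counts[(word, tokens[ctx_pos])] += 1.0
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, w, w_ctx, b, b_ctx):
    """Weighted squared error between (w_i . w_ctx_j + b_i + b_ctx_j) and log X_ij."""
    loss = 0.0
    for (i, j), x in counts.items():
        pred = sum(a * c for a, c in zip(w[i], w_ctx[j])) + b[i] + b_ctx[j]
        loss += weight(x) * (pred - math.log(x)) ** 2
    return loss


# Hypothetical toy corpus
tokens = "the king the king the queen".split()
X = cooccurrence_counts(tokens, window=1)
vocab = sorted(set(tokens))
zeros = {word: [0.0, 0.0] for word in vocab}
no_bias = {word: 0.0 for word in vocab}
loss = glove_loss(X, zeros, zeros, no_bias, no_bias)
print(X[("the", "king")], loss)
```

Only pairs that actually co-occur enter the sum, so log(X_ij) is always defined and the cost of one pass depends on the number of non-zero entries, not on the corpus length.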

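GloVe finds analogies about as well as Word2Vec, and once the co-occurrence matrix is computed, training iterates over its non-zero entries instead of re-scanning the corpus. The final embedding is the sum of the word vector w and the context vector w̃ (averaging the two is equivalent up to scale). On the **word analogy** benchmark (Mikolov, 2013) the GloVe paper reports about 75% accuracy for 300-dimensional vectors trained on 42B tokens; at the same 6B-token corpus size the gap is smaller, roughly 72% for GloVe vs 66-69% for Word2Vec (CBOW and Skip-gram).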

What is the key architectural difference between GloVe and Word2Vec?

FastText: subword embeddings

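FastText (Bojanowski et al., Facebook AI, 2017) extends Word2Vec to the **character n-gram** level. Instead of one vector per word, a word is represented as the sum of the vectors of its n-grams (n=3-6). The word is first wrapped in boundary markers, so 'running' becomes `<running>`. For n=3 its n-grams are `<ru`, `run`, `unn`, `nni`, `nin`, `ing`, `ng>`, and the whole word `<running>` is added as one more unit. This addresses the OOV (out-of-vocabulary) problem: a vector for an unseen word can be assembled from n-grams seen during training.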
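
The n-gram decomposition can be sketched in a few lines. The names `char_ngrams` and `word_vector`, the toy n-gram table, and the use of a plain dict are assumptions for illustration; the actual fastText library hashes n-grams into a fixed number of buckets rather than storing them by name.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', plus the whole marked word itself."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    if marked not in grams:
        grams.append(marked)
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams; unknown n-grams contribute nothing."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += value
    return vec


trigrams = char_ngrams("running", n_min=3, n_max=3)
print(trigrams)

# Hypothetical 2-dimensional n-gram table learned from other words
ngram_vectors = {"run": [1.0, 0.0], "ing": [0.0, 1.0]}
seen = word_vector("running", ngram_vectors, dim=2)
unseen = word_vector("gunning", ngram_vectors, dim=2)  # shares 'ing' with known words
print(seen, unseen)
```

The unseen word still gets a non-zero vector because it shares the n-gram `ing` with words from training, which is the mechanism behind FastText's handling of OOV words and inflected forms.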

FastText outperforms Word2Vec and GloVe on morphologically rich languages (Russian, Finnish, Turkish) - many unique word forms. The advantage on English is smaller. Facebook uses FastText for language identification (176 languages) and multilingual text classification tasks.

Why does FastText outperform Word2Vec for morphologically rich languages (Russian, German)?

Embedding geometry and analogies

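The key discovery of Word2Vec: semantic relationships are encoded as geometric directions in the space. `king - man + woman ~= queen`. `Paris - France + Italy ~= Rome`. The match is approximate: the target word is the nearest neighbor of the resulting vector, not an exact equality. This happens because words in similar contexts get close vectors, so a consistent difference in contexts (for example male vs female) shows up as a roughly constant offset. **Cosine similarity** is the standard proximity measure: cos(a, b) = (a · b) / (|a| |b|), where a · b is the dot product.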
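
A small sketch of cosine similarity and an analogy query over hand-made vectors. The three-dimensional toy vectors and the function names are assumptions chosen so the arithmetic is easy to follow; real embeddings have hundreds of dimensions and are learned from data.

```python
import math


def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical toy vectors; axes loosely mean (royalty, male, female)
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}
result = analogy(toy, "king", "man", "woman")
print(result)
```

Excluding the query words from the candidates matters: in real embedding spaces the nearest neighbor of `king - man + woman` is often `king` itself, and standard analogy evaluations filter it out.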

**Bias in embeddings**: Word2Vec trains on human-generated text, so it reproduces and can amplify the bias in that text. `doctor - man + woman ~= nurse`. `programmer - man + woman ~= homemaker`. This is a technical problem with serious consequences for production systems, for example in search ranking or resume screening. Debiasing methods: Hard Debiasing (Bolukbasi, 2016) for word vectors, Sent-Debias for sentence representations.
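
As a rough illustration of the core step in Hard Debiasing, the sketch below removes a vector's component along a gender direction (the "neutralize" step). Estimating the direction from a single man/woman pair, the toy vectors, and the function names are simplifying assumptions; Bolukbasi et al. estimate the direction from several definitional pairs and also apply an "equalize" step that is not shown here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neutralize(vec, direction):
    """Subtract the projection of vec onto direction."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vec, direction)]


# Hypothetical toy vectors
man = [0.1, 0.9, 0.1]
woman = [0.1, 0.1, 0.9]
doctor = [0.5, 0.6, 0.2]  # leans toward 'man' along the gender axis

gender = [m - w for m, w in zip(man, woman)]
neutral_doctor = neutralize(doctor, gender)

print(dot(doctor, gender))          # non-zero: the word carries a gender component
print(dot(neutral_doctor, gender))  # approximately zero after neutralizing
```

After the projection is removed, the word is equally similar to both ends of the chosen direction, while its other components are unchanged.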

Misconception: BERT and GPT embeddings replaced Word2Vec/GloVe, and static embeddings are no longer used.

In practice: static embeddings (Word2Vec, GloVe, FastText) are faster, lighter, and sufficient for many tasks: keyword extraction, similarity matching, multilingual classification.

BERT-style models usually need GPU inference and tens to hundreds of milliseconds per request; a FastText lookup runs on CPU in microseconds, which is critical for real-time systems handling billions of requests.

Why does `king - man + woman ~= queen` work in Word2Vec embedding space?

Key ideas

**Word2Vec** (skip-gram/CBOW): context prediction via negative sampling, semantics as geometric space
**GloVe**: global co-occurrence matrix - weighted least squares instead of local window prediction
**FastText**: subword n-gram embeddings - OOV words, morphologically rich languages, 176 languages
**Analogies**: `king - man + woman ~= queen` - semantic relationships as geometric offsets

Questions for reflection

How does frequency weighting P(w) proportional to freq(w)^0.75 in negative sampling affect embedding quality for rare words?
Why does GloVe combine word and context vectors in the final embedding instead of using only the word vector?
How can `king - man + woman ~= queen` be explained through distributional semantics: which shared contexts make 'queen' the nearest word to that combination?

Related lessons

nlp-03 — Bag-of-words and TF-IDF - the sparse representations that dense embeddings replace
nlp-05 — Sequence models build on top of word embeddings as input representations
ml-35-word-embeddings — Word2Vec/GloVe/FastText are concrete implementations of the word embeddings concept from the ML course
aie-09-embeddings — Word embeddings for text are analogous to embedding vectors in search - both encode semantics in dense vector space
dl-03 — Word2Vec Skip-gram applies self-supervised learning principles - context prediction without explicit labels
la-02-dot-product