Machine Learning
Word Embeddings: Word2Vec, GloVe, FastText
In 2013, Tomas Mikolov and his colleagues at Google reported something astonishing. They trained a simple neural network on billions of words and obtained a numeric vector for each word. It then turned out that you could do math with these vectors: vector(king) - vector(man) + vector(woman) ≈ vector(queen). A computer had learned to perform arithmetic on meaning: subtract man from king, add woman, and you land next to queen. How can numbers capture the meanings of words? And why do subtraction and addition work on meanings just as they do on numbers?
- **Search engines** - Google, Bing, and others use word embeddings to understand queries: searching for 'how to fix a leaky faucet' finds articles about 'plumbing repair' because embeddings know these words are semantically close, even if not a single query word appears in the document
- **Recommendation systems** - Spotify, Netflix, and Amazon represent items and content as vectors (by analogy with word embeddings) and recommend similar items via cosine similarity: if you like one movie, the system finds the nearest ones in embedding space
- **Machine translation** - embeddings for different languages can be aligned into a shared space where 'cat' and 'Katze' are nearby, enabling translation between languages even without parallel corpora (unsupervised translation)
Prerequisites
Teaching words to live in vector space
In 2003 Yoshua Bengio and colleagues proposed a neural language model that learned a dense vector for each word while predicting the next one, the first clear demonstration that meaning could be packed into continuous coordinates. The idea reached the masses in 2013, when Tomas Mikolov's team at Google released word2vec, a fast method whose famous king minus man plus woman equals queen arithmetic showed that vectors capture analogies. A year later Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford released GloVe, which reached similar embeddings by factorizing global word co-occurrence counts. Together they made pretrained word vectors a standard starting point for NLP.
Word2Vec: A Word Is Known by the Company It Keeps
In 2013, Tomas Mikolov from Google published Word2Vec, a model that transformed NLP. The idea is simple and powerful: **the meaning of a word is determined by its context**. If two words appear in similar surroundings ("the cat sat on the ...", "the dog sat on the ..."), they are semantically close. Word2Vec trains a neural network to predict the context from a word (or the word from its context), and the byproduct of training - **word vectors** - turns out to be incredibly useful.
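To make "predict the context from a word" concrete, here is a minimal sketch of how Skip-gram training pairs could be generated from a sentence. The function name `skipgram_pairs` and the fixed window are illustrative assumptions; real implementations add refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair is one training example: given the first word, the network should assign high probability to the second. CBOW uses the same windows but predicts the center word from its neighbours.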
Internally, Word2Vec is a simple neural network with one hidden layer. The input word is encoded as a one-hot vector of size V (the vocabulary size, typically 50–300 thousand words). The hidden layer has dimensionality d (usually 100–300), and its activations form the **embedding vector** for the word. The output layer predicts probabilities of context words. After training, we discard the output layer and keep only the V × d weight matrix between the input and the hidden layer: each of its rows is the embedding of one word.
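The following sketch shows why the hidden layer "is" the embedding: multiplying a one-hot vector by the input-to-hidden weight matrix simply selects one row of that matrix. The tiny vocabulary, the random weights, and the names `W_in` and `hidden_layer` are assumptions made for illustration.

```python
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 3  # toy sizes; real models use far larger V and d

# Input-to-hidden weight matrix: one row of d numbers per vocabulary word.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def hidden_layer(x, W):
    """Multiply a length-V row vector x by the V x d matrix W."""
    return [sum(x[i] * W[i][k] for i in range(len(W))) for k in range(len(W[0]))]


idx = vocab.index("cat")
h = hidden_layer(one_hot(idx, V), W_in)

# The product just selects row idx, so that row is the word's embedding.
assert h == W_in[idx]
```

This is why practical implementations skip the matrix multiplication entirely and treat the matrix as a lookup table indexed by word id.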
**The magic of Word2Vec - arithmetic of meaning:** Trained vectors capture semantic relationships:
- vector("king") - vector("man") + vector("woman") ≈ vector("queen")
- vector("Paris") - vector("France") + vector("Germany") ≈ vector("Berlin")
- vector("walking") - vector("walk") + vector("swim") ≈ vector("swimming")

The equality is approximate: the result of the arithmetic is a point whose nearest word vector is the expected answer. This works because Word2Vec learns **directions** in the space:
- "man" -> "woman" is the direction of gender
- "king" -> "queen" is the same direction!

The vector (king - man) captures the concept of "royalty" without gender. Adding "woman" gives "royalty + female" = queen.
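The analogy arithmetic can be sketched in a few lines. The vectors below are hand-made toy values with three interpretable axes (roughly royalty, gender, "person-ness"), not trained embeddings, and the helper names `cosine` and `analogy` are assumptions. Query words are excluded from the candidates, as is common practice when evaluating analogies.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: [royalty, gender, person-ness]. Purely illustrative.
toy = {
    "man":    [0.1,  0.9,  0.8],
    "woman":  [0.1, -0.9,  0.8],
    "king":   [0.9,  0.9,  0.8],
    "queen":  [0.9, -0.9,  0.8],
    "prince": [0.7,  0.9,  0.6],
    "apple":  [0.0,  0.0, -0.9],
}

print(analogy("man", "king", "woman", toy))  # queen
```

With real embeddings the same procedure runs over the whole vocabulary, and the answer is only the nearest neighbour of the computed point, never an exact match.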
Word2Vec trains quickly: the original authors reported that an optimized single-machine implementation can process more than 100 billion words in one day. The secret to speed is **negative sampling**: instead of running softmax over all 300,000 words (expensive!), the model learns to distinguish genuine context pairs from random "noise" pairs. Each update then touches only the true context word and a handful of sampled noise words, which reduces computational cost by orders of magnitude without significant quality loss.
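As a rough sketch of the negative-sampling objective for a single training pair: the loss rewards a high dot product with the true context word and a low dot product with each sampled noise word. The toy vectors and the function name `negative_sampling_loss` are assumptions, and how the noise words are drawn is outside this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, context, negatives):
    """Loss for one (center, context) pair plus k sampled noise words:

    -log sigma(context . center) - sum_k log sigma(-negative_k . center)
    """
    loss = -math.log(sigmoid(dot(context, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss


# Toy 2-d vectors (illustrative values, not trained).
center = [1.0, 0.5]
true_context = [0.9, 0.6]
noise = [[-0.8, 0.1], [0.0, -1.0]]

good = negative_sampling_loss(center, true_context, noise)
# Swap roles: pretend a noise word is the context and the true one is noise.
bad = negative_sampling_loss(center, noise[0], [true_context, noise[1]])
print(good < bad)  # True
```

The cost per example is k + 1 dot products (here 3) rather than one per vocabulary word, which is where the speed-up comes from.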
Why does the arithmetic vector('king') - vector('man') + vector('woman') yield a result close to vector('queen')?
GloVe: Global Co-occurrence Statistics
Word2Vec trains on local context windows - it sees 5–10 words around each word. But language has global patterns that a local window might miss. In 2014, a team from Stanford (Pennington, Socher, Manning) proposed **GloVe (Global Vectors)** - a model that combines two approaches: counting global word co-occurrence statistics (like classical methods) and learning dense vectors (like Word2Vec).
The GloVe idea: if *ice* often co-occurs with *solid* while *steam* co-occurs with *gas*, then the ratio of co-occurrence probabilities P(solid|ice) / P(solid|steam) will be much greater than 1, and P(gas|ice) / P(gas|steam) will be much less than 1. For a word that is equally related to both, like *water*, the ratio P(water|ice) / P(water|steam) will be close to 1. GloVe trains vectors so that the **dot product of two vectors, plus bias terms, approximates the log of their co-occurrence count**: w_i · w_j + b_i + b_j ≈ log(X_ij).
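The ratio argument is easy to check numerically. The co-occurrence counts below are invented for illustration (they are not taken from any corpus), and `prob` and `ratio` are hypothetical helper names.

```python
# Invented co-occurrence counts X[word][context word]; illustrative only.
counts = {
    "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 1},
    "steam": {"solid": 2,  "gas": 70, "water": 290, "fashion": 1},
}


def prob(context, word):
    """P(context | word) = X[word][context] / sum of the row for word."""
    row = counts[word]
    return row[context] / sum(row.values())


def ratio(context):
    """P(context | ice) / P(context | steam)."""
    return prob(context, "ice") / prob(context, "steam")


for k in ["solid", "gas", "water", "fashion"]:
    print(f"{k:8s} {ratio(k):.3f}")
```

Words that discriminate between *ice* and *steam* give ratios far from 1 in either direction, while words related to both (*water*) or to neither (*fashion*) give ratios near 1.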
**GloVe vs Word2Vec - two approaches to the same goal:**

**Count-based methods** (LSA, HAL): build a co-occurrence matrix and factorize it (SVD). Use global statistics but struggle with analogies.

**Predictive methods** (Word2Vec): train a neural network on local contexts. Capture analogies but don't use global statistics directly.

**GloVe - the combination of both:**
- First builds the global matrix X (counting)
- Then trains vectors through optimization (prediction)
- Loss function: sum_ij f(X_ij) * (w_i · w_j + b_i + b_j - log(X_ij))^2
- f(X_ij) - a weighting function that down-weights rare (noisy) pairs and caps the weight of very frequent pairs, so neither group dominates the loss

Result: analogy quality on par with Word2Vec + better use of corpus statistics.
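A minimal sketch of the GloVe loss written out in code. The weighting function uses the values from the GloVe paper (x_max = 100, alpha = 0.75). The paper also keeps a separate set of context vectors and context biases, which the formula above writes with the same symbols for brevity; here they appear as `w_ctx` and `b_ctx`. The toy matrix and all variable names are assumptions.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, w, w_ctx, b, b_ctx):
    """Weighted squared error over the non-zero entries of X."""
    total = 0.0
    for (i, j), x_ij in X.items():
        pred = sum(a * c for a, c in zip(w[i], w_ctx[j])) + b[i] + b_ctx[j]
        total += weight(x_ij) * (pred - math.log(x_ij)) ** 2
    return total


# Toy sparse co-occurrence matrix: {(word index, context index): count}.
X = {(0, 1): 10.0, (1, 0): 10.0, (0, 0): 200.0}

zeros = [[0.0, 0.0], [0.0, 0.0]]
untrained = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])

# Biases chosen by hand so that every prediction equals log(X_ij).
b = [math.log(200.0), math.log(10.0)]
b_ctx = [0.0, math.log(10.0) - math.log(200.0)]
fitted = glove_loss(X, zeros, zeros, b, b_ctx)

print(untrained > fitted)  # True
```

Training consists of adjusting the vectors and biases by gradient descent to push this loss down; zero entries of X are skipped, which keeps the sum proportional to the number of observed pairs.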
In practice, GloVe and Word2Vec deliver comparable quality. GloVe is popular thanks to pretrained models from Stanford: **GloVe 6B** (6 billion tokens, Wikipedia + Gigaword, dimensions 50/100/200/300), **GloVe 42B** (42 billion tokens, Common Crawl), **GloVe 840B** (840 billion tokens, 2.2 million words). For most tasks GloVe 6B at dimension 100 or 300 is sufficient - they are free, trained on high-quality data, and widely used as a baseline.
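The pretrained GloVe files are plain text: each line holds a word followed by its vector components, separated by spaces. A minimal parser could look like the sketch below. The sample lines contain made-up numbers, and the file name in the comment (`glove.6B.100d.txt`) assumes the naming used in the Stanford archive.

```python
def parse_glove_lines(lines):
    """Parse lines of 'word v1 v2 ... vd' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Made-up sample in the same layout as a GloVe text file (3 dimensions).
sample = [
    "the 0.418 0.24968 -0.41242\n",
    "cat 0.23088 0.28283 0.6318\n",
]
vectors = parse_glove_lines(sample)
print(vectors["cat"])  # [0.23088, 0.28283, 0.6318]

# With a downloaded file, the same function works on the open file handle:
# with open("glove.6B.100d.txt", encoding="utf-8") as f:
#     vectors = parse_glove_lines(f)
```

Loading every vector into Python lists is memory-hungry for the larger files; in practice it is common to keep only the words that occur in your own dataset.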
What is the main difference between GloVe and Word2Vec?
FastText: The Power of Character N-grams
Word2Vec and GloVe operate at the level of whole words: each word is a separate vector. But what if the model encounters a word it has never seen in training? For example, a medical term like "anticoagulation" or a new slang word? Word2Vec and GloVe simply don't know what to do with it - this is the **OOV (Out-Of-Vocabulary)** problem. In 2016, the Facebook AI Research team (Bojanowski, Grave, Joulin, Mikolov - yes, him again!) proposed **FastText**, which solves this problem by working at the level of **character n-grams** (subwords).
The key advantage of FastText is handling **OOV words** (words not seen during training). If the model has never seen the word "anticoagulation", it can still build a vector for it from familiar n-grams: "anti", "coag", "tion", etc. Word2Vec and GloVe are helpless in this situation: the word is simply missing from their vocabulary, so the lookup fails. FastText is especially useful for **morphologically rich languages** (Russian, Turkish, Finnish), where a single word can have dozens of forms. The effect is visible even in English: "play", "playing", "player", "replayed" share n-grams, so all of them get similar vectors.
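The n-gram mechanism can be sketched as follows. FastText wraps each word in boundary markers `<` and `>` and, in the original paper, extracts n-grams of length 3 to 6. The `oov_vector` helper is a simplification: it averages whatever n-gram vectors it finds in a plain dictionary, whereas the real library hashes n-grams into a fixed number of buckets. The two n-gram vectors are invented.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' including the boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Word forms are linked through the n-grams they share.
shared = set(char_ngrams("playing")) & set(char_ngrams("player"))
print("play" in shared)  # True

# Invented n-gram vectors, just to show an unseen word still gets a vector.
ngram_vectors = {"ant": [1.0, 0.0], "ion": [0.0, 1.0]}
print(oov_vector("anticoagulation", ngram_vectors, dim=2))  # [0.5, 0.5]
```

The boundary markers matter: `<pl` marks a prefix and `er>` a suffix, so the model can tell them apart from the same letters in the middle of a word.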
**FastText vs Word2Vec - three key differences:**

1. **Unit of training:**
   - Word2Vec: whole words
   - FastText: character n-grams (subwords)
2. **OOV word handling:**
   - Word2Vec/GloVe: lookup fails (word not in vocabulary)
   - FastText: builds a vector from n-grams --> works for any word
3. **Morphology:**
   - Word2Vec: "run" and "running" are two independent vectors
   - FastText: shared n-grams such as "<ru" and "run" connect word forms

**When FastText is better:**
- Morphologically rich languages (Russian, German, Turkish)
- Domain texts with rare terms (medicine, law)
- Texts with typos or slang

**When Word2Vec/GloVe is sufficient:**
- English and other analytic languages
- Large clean corpus covering the needed vocabulary
FastText from Facebook comes with **pretrained models for 157 languages**, trained on Common Crawl and Wikipedia. The English model contains 2 million words at dimensionality 300. This makes FastText a strong default choice for multilingual projects and languages with rich morphology.
How does FastText handle a word that was not in the training data (OOV)?
Embedding Space: The Geometry of Meaning
Word2Vec, GloVe, and FastText create a space where each word is a point in d-dimensional space (typically d = 100-300). But this is not just a set of points - the space has **geometric structure** that reflects the semantics of language. Semantically similar words form clusters, and semantic relationships are encoded as **directions** (difference vectors). Consider this geometry.
**Analogy arithmetic** is not a magic trick but a consequence of the approximately linear structure of the space. The vector (king - queen) is approximately equal to (man - woman) because both encode the concept of gender. The vector (Paris - France) is approximately equal to (Tokyo - Japan) because both encode the "country → capital" relationship. These parallel directions arise because words are used in systematically similar contexts.
**The bias problem in embeddings:** Word embeddings are trained on human-written text and absorb **social stereotypes** from the data:
- vector("doctor") is closer to vector("man") than to vector("woman")
- vector("nurse") is closer to vector("woman")
- vector("programmer") is closer to vector("man")

This is not a bug in the algorithm - it reflects the bias present in the texts. But using such embeddings in a resume screening system will result in gender discrimination.

**Mitigation methods:**
- Debiasing (Bolukbasi et al., 2016): identify the "gender direction" and project neutral words orthogonally to it
- Curated training data: filter or balance the training corpus
- Post-hoc auditing: test the model for bias before deployment
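The projection step of debiasing can be sketched with basic vector arithmetic. This is a simplified illustration, not the full procedure of Bolukbasi et al.: the gender direction here comes from a single word pair, and the three toy vectors and the helper names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(normalize(a), normalize(b))


def remove_direction(v, direction):
    """Subtract from v its component along `direction` (unit length)."""
    scale = dot(v, direction)
    return [x - scale * d for x, d in zip(v, direction)]


# Hand-made toy vectors; the first axis plays the role of gender.
toy = {
    "man":    [0.9, 0.1, 0.2],
    "woman":  [-0.9, 0.1, 0.2],
    "doctor": [0.3, 0.8, 0.1],
}

gender = normalize([m - w for m, w in zip(toy["man"], toy["woman"])])
doctor_neutral = remove_direction(toy["doctor"], gender)

before = cosine(toy["doctor"], toy["man"]) - cosine(toy["doctor"], toy["woman"])
after = cosine(doctor_neutral, toy["man"]) - cosine(doctor_neutral, toy["woman"])
print(before > 0, abs(after) < 1e-9)  # True True
```

After the projection, "doctor" is equally similar to "man" and "woman" in this toy space. Whether that is sufficient in a real system is exactly what post-hoc auditing is meant to check.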
**Limitation of static embeddings - one word = one vector:** Word2Vec, GloVe, and FastText give each word **one fixed vector**, regardless of context. But many words are polysemous:
- "bank" = a river bank OR a financial institution
- "bat" = a flying mammal OR a baseball bat
- "crane" = a bird OR a construction machine

A static embedding is an **average** of all meanings. For the word "bank" the vector will sit somewhere between "finance" and "geography".

**Solution - contextual embeddings (ELMo, BERT, GPT):**
- The word's vector depends on the sentence
- "I went to the bank to deposit money" --> vector for "bank" is closer to "finance"
- "I sat on the river bank" --> vector for "bank" is closer to "shore"

Word2Vec/GloVe/FastText are the foundation, the bridge between human language and mathematics. The very discovery of king - man + woman = queen showed that language has a regular geometric structure that can be learned from context. Contextual models (BERT, GPT) are the next step in that evolution, built on this foundation.
**Misconception:** Every word has one fixed meaning, so one vector per word is the right approach

**Reality:** Static embeddings (Word2Vec, GloVe, FastText) give one averaged representation for polysemous words - for example, 'bank' (riverbank/financial institution). Contextual models (ELMo, BERT) solve this by creating different vectors depending on the sentence

Many frequent words are polysemous: 'crane' (bird/machine), 'spring' (season/coil/water source). A static embedding averages all meanings into one vector, hurting quality on tasks where context is critical. BERT and similar models generate a word's vector taking its surrounding context into account, allowing disambiguation between 'I went to the bank (financial)' and 'I sat on the river bank (geographic)'.
Why can word embeddings contain gender and racial stereotypes?
Key Ideas
- **Word2Vec** trains a neural network to predict a word's context (Skip-gram) or the word from its context (CBOW), and the byproduct - embedding vectors - captures semantic relationships, including analogy arithmetic
- **GloVe** combines counting and predictive approaches: it builds a global co-occurrence matrix and trains vectors to approximate its logarithm, leveraging statistics of the entire corpus at once
- **FastText** operates at the level of character n-grams rather than whole words, enabling it to build meaningful vectors for words never seen in training (OOV) and to better handle morphologically rich languages
- **Embedding space** has geometric structure: clusters of semantically similar words, parallel directions for analogies, but also social biases from training data - and Mikolov's discovery in 2013 that king - man + woman = queen showed that language has a regular mathematical structure that can be learned from context
Related Topics
Word embeddings are the key bridge between text preprocessing and modern language models:
- Text Preprocessing - Preprocessing (tokenization, lemmatization, cleaning) prepares text for training embeddings. The quality of input data directly affects the quality of the vectors: garbage in, garbage out
- BERT and GPT - Contextual models (BERT, GPT) are the evolution of word embeddings: instead of one static vector per word, they create dynamic representations that depend on the sentence. Transformer-based models have replaced static embeddings such as Word2Vec as the NLP standard
- Seq2Seq - Sequence-to-sequence models (machine translation, summarization) use word embeddings as the input layer: words are converted to vectors that the encoder-decoder architecture then processes
- Neural Networks - Word2Vec is essentially a shallow neural network with a single hidden layer. An embedding layer (a V × d lookup matrix) is the first layer of most neural networks for NLP, transforming discrete words into continuous vectors
Questions for Reflection
- Word2Vec learns to predict context, not word meaning directly. Why does the byproduct (embedding vectors) turn out to be so useful for semantic tasks? What does this say about the relationship between context and meaning in natural language?
- FastText addresses OOV through character n-grams. But are there situations where n-grams would produce a bad vector for an unfamiliar word? Give an example of a word for which n-grams would be misleading.
- Word embeddings absorb biases from training texts. Should we remove bias from embeddings (risking the loss of useful information) or leave it and handle bias at the application level? What trade-offs arise?
Related Lessons
- ml-34-text-preprocessing — Embeddings consume preprocessed tokens
- ml-37-bert-gpt — Contextual embeddings extend static vectors
- ml-36-seq2seq — Seq2seq encoders use word embeddings
- la-13-eigenvectors — Embedding geometry uses vector spaces
- aie-09-embeddings — Modern API embeddings serve the same role
- aie-12-rag-fundamentals