Trigonometry

Trigonometry in Machine Learning

Trigonometry appears throughout modern language models. The sinusoidal positional encoding of the original Transformer is built directly from sin and cos, and RoPE, used in LLaMA, rotates query and key vectors by position-dependent angles. Much of how these models represent token order rests on trigonometry.

LLaMA / GPT-NeoX: RoPE makes attention scores depend on the relative positions of tokens
Original Transformer (Vaswani et al., 2017): sinusoidal positional encoding encodes token order (BERT and GPT-2 learn their position embeddings instead)
Neural audio synthesis (BigVGAN, Descript Audio Codec): Snake activation for periodic audio signals

Prerequisites

Positional Encoding in Transformers

Transformers process all tokens in parallel and have no inherent notion of position. Positional Encoding adds a vector to each token embedding that encodes the token's position using sines and cosines of different frequencies, for example PE(pos, 2i) = sin(pos / 10000^{2i/d_model}). This gives the model information about the order of tokens.

**Positional Encoding** (Vaswani et al., 2017, original transformer):
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
where pos is the token position, i is the index of the dimension pair (0 ≤ i < d_model/2), and d_model is the model dimension.
**Why sines/cosines?**
1. |PE(pos)| is constant: each (sin, cos) pair has unit norm, so |PE(pos)|² = d_model/2 and there is no explosive growth
2. PE(pos+k) is a linear function of PE(pos): each pair is rotated by an angle that depends only on k, so the model can represent relative offsets easily
3. Different frequencies encode 'near' and 'far' relationships
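The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the function names `positional_encoding` and `shift` and the `base` parameter are assumptions made for this example, and the standard library is used instead of an array package to keep the sketch dependency-free. It checks properties 1 and 2 numerically.

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding of one position as a list of length d_model."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def shift(pe, k, d_model, base=10000.0):
    """Map PE(pos) to PE(pos + k) using only k: a rotation of each (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        w = k / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(w) + c * math.sin(w))  # sin(a + w)
        out.append(c * math.cos(w) - s * math.sin(w))  # cos(a + w)
    return out

d_model = 8
pe_5 = positional_encoding(5, d_model)
pe_12 = positional_encoding(12, d_model)

# Property 1: the squared norm is d_model / 2 for every position.
norm_sq = sum(v * v for v in pe_5)

# Property 2: PE(5 + 7) is obtained from PE(5) by a linear map that depends only on 7.
max_err = max(abs(a - b) for a, b in zip(shift(pe_5, 7, d_model), pe_12))
```

Note that `shift` never sees the absolute position: the angle addition formulas for sin and cos are what make the offset a fixed linear map.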

Positional Encoding uses frequencies 1/10000^{2i/d}. Why are DIFFERENT frequencies needed for different dimensions i?

RoPE: Rotary Position Embedding

RoPE (Su et al., 2021) is a more elegant approach to position encoding: instead of adding a PE vector to the embedding, we rotate the query and key vectors by an angle that depends on position. It is used in LLaMA, GPT-NeoX, PaLM, and many other modern LLMs.

**RoPE**: rotate each pair of dimensions (x_{2i}, x_{2i+1}) by the angle m·θᵢ:
f(q, m) = R_m · q
where the rotation matrix for the pair (2i, 2i+1) is:
R_m(i) = | cos(m·θᵢ)  -sin(m·θᵢ) |
         | sin(m·θᵢ)   cos(m·θᵢ) |
θᵢ = 10000^{-2i/d}
**Key property**: the dot product depends only on the difference of positions. Because R_m^T · R_n = R_{n-m} for rotations:
f(q, m)^T · f(k, n) = q^T · R_m^T · R_n · k = q^T · R_{n-m} · k
This suits attention well: the score between a query at position m and a key at position n depends only on their relative position n-m.
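The key property can be checked numerically. The sketch below is a minimal, unoptimised illustration that assumes the pairing (2i, 2i+1) and θᵢ = 10000^{-2i/d} as defined above; the helper names `rope` and `dot` and the sample vectors are made up for this example.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of an even-length vector by the angle m * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.append(x0 * c - x1 * s)
        out.append(x0 * s + x1 * c)
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # arbitrary query vector
k = [1.1, 0.4, -0.6, 0.9]   # arbitrary key vector

score_a = dot(rope(q, 3), rope(k, 10))      # positions 3 and 10, offset 7
score_b = dot(rope(q, 103), rope(k, 110))   # positions 103 and 110, offset 7
score_c = dot(q, rope(k, 7))                # q^T R_7 k, using the offset only
```

All three scores agree up to floating-point error, because shifting both positions by the same amount rotates q and k by the same angles and a shared rotation leaves the dot product unchanged.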

What is the main advantage of RoPE over the original Positional Encoding?

Geometry of Attention: Cosine Similarity

The attention mechanism computes the dot product of query and key vectors. Divided by the norms of the two vectors, that dot product is the cosine of the angle between them. Viewing attention as the geometry of cosine similarity helps explain why scaling by √d_k is necessary.

**Scaled Dot-Product Attention:**
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
**Cosine similarity:**
cos(q, k) = q·k / (||q|| · ||k||)
**Why divide by √d_k?**
If the components of q and k are independent with mean 0 and variance 1 (for example N(0,1)), then q·k has mean 0 and variance d_k, so its standard deviation grows as √d_k. Dividing by √d_k restores unit variance: Var(q·k / √d_k) = 1. Without scaling, at large d_k the softmax saturates and its gradients vanish.
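A small Monte Carlo experiment illustrates the variance argument. It is only a sketch under the stated assumption that components are independent N(0,1) draws; the sample size, the seed, and the helper names are arbitrary choices for this example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_dot_products(d_k, n_samples, rng):
    """Dot products of random q, k whose components are independent N(0, 1)."""
    out = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        out.append(sum(a * b for a, b in zip(q, k)))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
d_k = 64
raw = sample_dot_products(d_k, 4000, rng)
scaled = [s / math.sqrt(d_k) for s in raw]

var_raw = variance(raw)        # close to d_k
var_scaled = variance(scaled)  # close to 1

# Softmax over 8 unscaled scores concentrates more mass on the largest score
# than softmax over the same scores divided by sqrt(d_k).
peak_raw = max(softmax(raw[:8]))
peak_scaled = max(softmax(scaled[:8]))
```

Dividing by d_k instead would shrink the variance to 1/d_k, flattening the softmax towards a uniform distribution, which is why the square root is the natural choice.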

Why does Scaled Dot-Product Attention divide by √d_k rather than d_k?

Trigonometric Activations: Swish, GELU, and Snake

Classic ReLU is simple, but it is not differentiable at zero. Modern activation functions are smooth: GELU is built on erf, Swish on the sigmoid, and Snake on an explicit trigonometric term. Knowing how they differ helps when choosing a neural network architecture.

**GELU** (Gaussian Error Linear Unit):
GELU(x) = x · Φ(x) ≈ x · σ(1.702x)
where Φ is the CDF of the standard normal distribution, Φ(x) = ½ · (1 + erf(x/√2)), and erf(x) = (2/√π)∫₀ˣ e^{-t²}dt
**Swish** (Google Brain, 2017):
Swish(x) = x · σ(x) = x / (1 + e^{-x})
**Snake** (explicit trigonometry, 2020):
Snake_α(x) = x + sin²(αx) / α
Snake is useful for modelling periodic functions: audio, physical processes, time series.
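The three activations are short enough to write out directly. The following is a scalar sketch using only the standard library, not the implementation of any framework; the function names and the sample values of `alpha` and `x0` are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi expressed through erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    """Sigmoid approximation of GELU: x * sigma(1.702 x)."""
    return x * sigmoid(1.702 * x)

def swish(x):
    return x * sigmoid(x)

def snake(x, alpha=1.0):
    """Snake: the identity plus a periodic term sin^2(alpha x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha

# Largest gap between exact GELU and its sigmoid approximation on [-4, 4].
grid = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu(x) - gelu_sigmoid_approx(x)) for x in grid)

# sin^2(alpha x) has period pi / alpha, so Snake advances by exactly
# pi / alpha over one period of its periodic term.
alpha, x0 = 2.0, 0.37
step = snake(x0 + math.pi / alpha, alpha) - snake(x0, alpha)
```

The last check shows the structure that makes Snake attractive for periodic data: a linear trend plus a repeating component whose frequency is set by α.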

Why is the Snake activation x + sin²(αx)/α particularly useful for modelling audio and physical signals?

Key Ideas

PE(pos, 2i) = sin(pos/10000^{2i/d}): different frequencies encode near and far positional relationships
RoPE: rotate the embedding, attention score depends only on the position difference
Scaled attention divides by √d_k: normalises variance when components are N(0,1)
Snake(x, α) = x + sin²(αx)/α: well suited to periodic signals in neural networks

Questions for Reflection

Prove that f(q,m)·f(k,n) = g(q,k,n-m) for RoPE, i.e., the dot product depends only on the position difference.
Why is masking in decoder attention implemented by adding -∞ to scores before softmax, rather than zeroing out after? What happens to the gradients?
Compare Snake(x, α) with a Fourier series: what is their mathematical kinship? How would one generalise Snake to an arbitrary set of frequencies?

Related Lessons

ml-01-intro
