Trigonometry

Trigonometry in Machine Learning

Trigonometry sits at the core of many transformer models. The positional encoding of the original transformer is built directly from sin and cos, and RoPE rotates query and key vectors by position-dependent angles. Both are used to tell a model where each token sits in a sequence.

  • LLaMA / GPT-NeoX: RoPE makes attention scores depend on the relative position of tokens
  • Original transformer (Vaswani et al., 2017): sinusoidal positional encoding encodes token order; BERT and GPT-2 use learned position embeddings instead
  • Neural audio synthesis (e.g. BigVGAN): Snake activation for periodic audio signals

Prerequisites

  • Law of Cosines and Law of Sines: Arbitrary Triangles
  • Trigonometric Substitution and Product-to-Sum Formulas

Positional Encoding in Transformers

Transformers process all tokens in parallel and have no inherent notion of position. Positional encoding adds to each token embedding a vector built from sines and cosines of different frequencies, for example PE(pos, 2i) = sin(pos / 10000^{2i/d_model}). This gives the model information about the order of tokens.

**Positional Encoding** (Vaswani et al., 2017, original transformer):

  PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})
  PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

where pos is the token position, i = 0, ..., d_model/2 - 1 indexes the sine/cosine pair, and d_model is the model dimension.

**Why sines/cosines?**

  1. ||PE(pos)|| = √(d_model/2) for every pos, because each pair contributes sin² + cos² = 1: no explosive growth
  2. PE(pos+k) is a linear function of PE(pos): each pair is rotated by an angle that depends only on k. The original paper hypothesised that this makes it easy for the model to attend by relative offset
  3. Different frequencies encode 'near' and 'far' relationships
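The two formulas and the first two properties can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names `sinusoidal_pe` and `shift` are illustrative, `d_model` is assumed to be even, and a real model would compute the whole table at once with an array library.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the encoding of one position as a list of d_model floats."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def shift(pe, k, d_model, base=10000.0):
    """Map PE(pos) to PE(pos + k) using only PE(pos) and the offset k."""
    out = []
    for i in range(d_model // 2):
        w = k / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(w) + c * math.sin(w))  # sin(a + w)
        out.append(c * math.cos(w) - s * math.sin(w))  # cos(a + w)
    return out

d_model = 8
pe_5 = sinusoidal_pe(5, d_model)

# Property 1: the norm is sqrt(d_model / 2), independent of the position.
norm = math.sqrt(sum(v * v for v in pe_5))

# Property 2: PE(5 + 3) is obtained from PE(5) by a linear map that depends only on k = 3.
pe_8 = sinusoidal_pe(8, d_model)
max_err = max(abs(a - b) for a, b in zip(shift(pe_5, 3, d_model), pe_8))
```

Note that `shift` never looks at the position itself, only at the encoding and the offset, which is the sense in which the relation is linear in PE(pos).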

Positional Encoding uses frequencies 1/10000^{2i/d}. Why are DIFFERENT frequencies needed for different dimensions i?

RoPE: Rotary Position Embedding

RoPE (Su et al., 2021) takes a different approach to position encoding: instead of adding a PE vector to the embedding, it rotates the query and key vectors by an angle that depends on position. It is used in LLaMA, GPT-NeoX, PaLM, and many other modern LLMs.

**RoPE**: rotate each pair of dimensions (x_{2i}, x_{2i+1}) by angle m·θᵢ:

  f(q, m) = R_m · q

where the rotation matrix for pair (2i, 2i+1) is:

  R_m(i) = | cos(m·θᵢ)  -sin(m·θᵢ) |
           | sin(m·θᵢ)   cos(m·θᵢ) |

  θᵢ = 10000^{-2i/d}

**Key property**: the dot product depends only on the DIFFERENCE of positions. Since R_m^T · R_n = R_{n-m} for rotations,

  f(q, m)^T · f(k, n) = q^T · R_m^T · R_n · k = q^T · R_{n-m} · k

This suits attention well: the score between a query at position m and a key at position n depends only on the relative offset n - m.
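The key property can be illustrated with a small numerical check. This is a sketch under simplifying assumptions: plain Python lists stand in for tensors, the helper names `rope` and `dot` are illustrative, the vector length is assumed even, and the example vectors `q` and `k` are arbitrary made-up values.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * c - b * s)
        out.append(a * s + b * c)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 2), rope(k, 7))     # positions (2, 7), offset 5
score_b = dot(rope(q, 10), rope(k, 15))   # positions (10, 15), offset 5
score_c = dot(rope(q, 2), rope(k, 8))     # positions (2, 8), offset 6
```

`score_a` and `score_b` agree up to floating-point error because both pairs are five positions apart, while `score_c` differs because the offset changed.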

What is the main advantage of RoPE over the original Positional Encoding?

Geometry of Attention: Cosine Similarity

The attention mechanism computes the dot product of query and key vectors. Since q·k = ||q|| · ||k|| · cos(q, k), the dot product normalised by the vector lengths is the cosine of the angle between them. Attention does not divide by the lengths, so the typical size of q·k grows with the dimension d_k, which is why the scores are scaled by √d_k.

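**Scaled Dot-Product Attention:**

  Attention(Q, K, V) = softmax(QK^T / √d_k) · V

**Cosine similarity:**

  cos(q, k) = q·k / (||q|| · ||k||)

**Why divide by √d_k?** If the components of q and k are independent with mean 0 and variance 1, then q·k is a sum of d_k terms, each with mean 0 and variance 1, so q·k has mean 0 and variance d_k (approximately N(0, d_k) for large d_k). The standard deviation grows as √d_k. Dividing normalises it: q·k/√d_k has variance 1. Without scaling, at large d_k softmax saturates → gradients vanish.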
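The variance argument can be checked empirically. The sketch below assumes components drawn independently from N(0,1), as in the text; the sample sizes, the seed, and the helper names are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
d_k = 64

def random_vector(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Standard deviation of raw and scaled dot products over many random pairs.
raw = [dot(random_vector(d_k), random_vector(d_k)) for _ in range(2000)]
raw_std = statistics.pstdev(raw)                                   # close to sqrt(d_k) = 8
scaled_std = statistics.pstdev([s / math.sqrt(d_k) for s in raw])  # close to 1

# Effect on softmax: one query against 16 keys.
query = random_vector(d_k)
keys = [random_vector(d_k) for _ in range(16)]
scores = [dot(query, key) for key in keys]
peak_unscaled = max(softmax(scores))
peak_scaled = max(softmax([s / math.sqrt(d_k) for s in scores]))
```

The largest softmax weight for the unscaled scores is never smaller than for the scaled ones, which is the saturation effect: the distribution collapses toward a single key and the gradients through the other keys shrink.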

Why does Scaled Dot-Product Attention divide by √d_k rather than d_k?

Trigonometric Activations: Swish, GELU, and Snake

Classic ReLU is simple, but not differentiable at zero. Modern activation functions are smooth: GELU is built on erf, Swish on the sigmoid, and Snake on an explicit trigonometric term. Knowing these helps when choosing a neural network architecture.

**GELU** (Gaussian Error Linear Unit):

  GELU(x) = x · Φ(x) ≈ x · σ(1.702x)

where Φ is the CDF of the standard normal distribution, Φ(x) = ½ · (1 + erf(x/√2)), and erf(x) = (2/√π)∫₀ˣ e^{-t²}dt

**Swish** (Google Brain, 2017):

  Swish(x) = x · σ(x) = x / (1 + e^{-x})

**Snake** (explicit trigonometry, 2020):

  Snake_α(x) = x + sin²(αx) / α

Snake is useful for modelling periodic functions: audio, physical processes, time series.
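The three activations translate directly into code. This is a minimal scalar sketch using only the standard library; the function names are illustrative, and deep learning frameworks provide their own vectorised versions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi expressed through erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    """Sigmoid approximation of GELU from the formula above."""
    return x * sigmoid(1.702 * x)

def swish(x):
    return x * sigmoid(x)

def snake(x, alpha=1.0):
    """Linear term plus a periodic term with period pi / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha

# Largest gap between exact GELU and its sigmoid approximation on [-4, 4].
xs = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu(x) - gelu_sigmoid_approx(x)) for x in xs)

# Snake advances by exactly one period when x advances by pi / alpha.
period_step = snake(0.3 + math.pi, alpha=1.0) - snake(0.3, alpha=1.0)
```

Because sin² has period π/α, `period_step` equals π for α = 1: Snake keeps a monotone trend from the x term while repeating the same ripple in every period.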

Why is the Snake activation x + sin²(αx)/α particularly useful for modelling audio and physical signals?

Key Ideas

  • PE(pos, 2i) = sin(pos/10000^{2i/d}): different frequencies encode near and far positional relationships
  • RoPE: rotate the embedding, attention score depends only on the position difference
  • Scaled attention divides by √d_k: normalises variance when components are N(0,1)
  • Snake(x, α) = x + sin²(αx)/α: well suited to periodic signals in neural networks

Related Topics

Trigonometry in ML connects pure mathematics with modern AI engineering:

  • Trigonometric Series and Fourier — Positional Encoding is a discrete analogue of Fourier features over positions
  • Trigonometry in Computer Graphics — Rotation matrices from trigonometry appear in both attention and 3D transformations

Questions for Reflection

  • Prove that f(q,m)·f(k,n) = g(q,k,n-m) for RoPE, i.e., the dot product depends only on the position difference.
  • Why is masking in decoder attention implemented by adding -∞ to scores before softmax, rather than zeroing out after? What happens to the gradients?
  • Compare Snake(x, α) with a Fourier series: what is their mathematical kinship? How would one generalise Snake to an arbitrary set of frequencies?

Related Lessons

  • ml-01-intro