Geometry
Geometric Transformations
Lesson Objectives
- Encode rotation, scale, reflection, and shear as two-by-two matrices
- Use homogeneous coordinates so translation joins the matrix product
- Compose maps right to left through matrix products
- Tell affine and projective transforms apart by what they preserve
- Spot the same primitives in NeRF, SLAM, ARKit, and torchvision augmentation
Every Vulkan or Metal frame walks each vertex through a four-by-four matrix. A GPU such as the RTX 4090 can run millions of these multiplications per frame before anything reaches the screen. NeRF and the vision stack behind Tesla Autopilot work the inverse problem: recover a three-D scene from images that were formed through the same matrices.
- **CSS and SVG transforms (W3C, 2012):** matrix(a,b,c,d,tx,ty) is one affine three-by-three on every animated UI element
- **GPU pipeline (OpenGL, Vulkan, Metal):** model, view, projection - three four-by-fours per vertex
- **Tesla Autopilot and ARKit:** SLAM stitches the world from a stream of camera extrinsics, millions of matrix products per second
- **torchvision.transforms.RandomAffine:** a random affine warp applied to each training image in many CV pipelines
- **Spatial Transformer Networks (Jaderberg et al., 2015):** the CNN itself learns the affine matrix to apply
Prerequisites
- Coordinate geometry and plane vectors
- Matrix multiplication and its core properties
Felix Klein's Erlangen Programme
In 1872 Felix Klein, on taking up his professorship at the University of Erlangen, published a programme built on a single organising idea: a geometry is the study of properties left invariant by a group of transformations. Euclidean geometry is the geometry of the rigid-motion group. Affine geometry sits one level up, dropping length and angle but keeping parallelism. Projective geometry sits higher still, keeping only collinearity. The Erlangen Programme is why a graphics engineer in 2026 still works the same hierarchy - isometry, similarity, affine, projective - that Klein laid out a hundred and fifty-four years ago.
Basic Transformations as Matrices
Every frame rendered by Vulkan or Metal walks each vertex through a four-by-four matrix: model, then view, then projection, all before a single fragment hits the screen. The same pipeline sits under every GPU-composited UI, and the math at the core is one operation: matrix times vector.
Rotation, reflection, scaling, and (once homogeneous coordinates are in play) translation collapse into a single primitive: multiply by a matrix. Stack as many transforms as needed, fold them into one product up front, and the per-vertex cost stays constant. That is exactly why the model-view-projection pipeline is three matrix products in strict order, not a chain of ad-hoc calls. CUDA warps handle thirty-two vertices in lockstep, SIMD lanes on the CPU do four or eight, and the formula on every lane is the same.
| Transform | Two-by-two matrix | Parameters |
|---|---|---|
| Scale | [[sx, 0], [0, sy]] | sx, sy = scale factors |
| Rotation by θ | [[cosθ, -sinθ], [sinθ, cosθ]] | θ = angle, counter-clockwise |
| Reflect across X-axis | [[1, 0], [0, -1]] | Flips y; the mirror line is the X-axis |
| Shear | [[1, sh], [0, 1]] | sh = shear factor along X |
OpenGL, Vulkan, and Metal consume exactly these matrices, usually padded to four-by-four: the application uploads a transform as a uniform and the vertex shader multiplies each vertex by it. On NVIDIA hardware a warp runs thirty-two threads in lockstep, so thirty-two vertices pass through the same formula at once, and Apple's Metal Performance Shaders include matrix-multiplication kernels for the same arithmetic.
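The four rows of the table can be tried directly. The sketch below uses plain nested lists rather than a graphics library; the helper names (`apply2`, `scale`, `rotation`, `reflect_x_axis`, `shear_x`) are made up for this illustration.

```python
import math

def apply2(m, p):
    """Multiply a 2x2 matrix (nested lists) by a 2D point (x, y)."""
    x, y = p
    return (m[0][0] * x + m[0][1] * y,
            m[1][0] * x + m[1][1] * y)

def scale(sx, sy):
    return [[sx, 0.0], [0.0, sy]]

def rotation(theta):
    # Counter-clockwise rotation by theta radians.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def reflect_x_axis():
    # Keeps x, negates y: the mirror line is the X-axis.
    return [[1.0, 0.0], [0.0, -1.0]]

def shear_x(sh):
    # Slides each point along X in proportion to its y.
    return [[1.0, sh], [0.0, 1.0]]

p = (2.0, 1.0)
scaled = apply2(scale(3.0, 0.5), p)         # (6.0, 0.5)
rotated = apply2(rotation(math.pi / 2), p)  # approximately (-1.0, 2.0)
mirrored = apply2(reflect_x_axis(), p)      # (2.0, -1.0)
sheared = apply2(shear_x(2.0), p)           # (4.0, 1.0)
```

The rotated result is only approximately (-1, 2) because `math.cos(math.pi / 2)` is a tiny nonzero float, not exactly zero.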
The ninety-degree counter-clockwise rotation matrix sends point (3, 0) to:
Homogeneous Coordinates
Translation refuses to fit a two-by-two matrix. The map f(x) = x + t is not linear: f(0) is t, not zero. Homogeneous coordinates fix this by adding a third slot. Every two-D transform becomes a three-by-three matrix, every three-D transform becomes a four-by-four, and translation joins rotation, scale, and shear inside a single multiplication. That is why ARKit, Tesla Autopilot, and the Chromium GPU compositor all push four-by-four matrices around even on flat input: one shape of operation, zero special cases.
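- **Two-D homogeneous form:** point (x, y) becomes (x, y, 1)
- **Translation by (tx, ty):** [[1, 0, tx], [0, 1, ty], [0, 0, 1]]
- **Scale, then rotate, then translate:** [[sx·cosθ, -sy·sinθ, tx], [sx·sinθ, sy·cosθ, ty], [0, 0, 1]]; a general two-D affine matrix has six free entries, [[a, c, tx], [b, d, ty], [0, 0, 1]], which also covers shear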
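A minimal sketch of translation as a three-by-three product, assuming points are stored as plain tuples (x, y, 1). The helpers `matvec3`, `translation`, and `to_homogeneous` are illustrative names, not part of any library.

```python
def matvec3(m, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def translation(tx, ty):
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty],
            [0.0, 0.0, 1.0]]

def to_homogeneous(p):
    # Append the third slot w = 1 to a Cartesian point.
    return (p[0], p[1], 1.0)

shift = translation(10.0, -2.0)

# A point (w = 1) picks up the translation column.
moved = matvec3(shift, to_homogeneous((3.0, 4.0)))  # (13.0, 2.0, 1.0)

# A direction (w = 0) is left untouched by translation.
direction = matvec3(shift, (3.0, 4.0, 0.0))         # (3.0, 4.0, 0.0)
```

The third slot decides whether the translation column takes part: with w = 1 it is added, with w = 0 it is multiplied away.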
CSS `transform: matrix(a,b,c,d,tx,ty)` is exactly the homogeneous matrix [[a,c,tx],[b,d,ty],[0,0,1]]. When a React component animates its transform, the compositor in Chromium or Safari can apply that matrix on the GPU each frame, alongside the other layers.
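The argument order of `matrix()` is column by column, which is easy to get wrong. A small sketch of the mapping, where `css_matrix` and `apply_css` are hypothetical helper names and the sample values are arbitrary:

```python
def css_matrix(a, b, c, d, tx, ty):
    """Build the 3x3 homogeneous matrix for CSS matrix(a, b, c, d, tx, ty).

    The six arguments fill the first two rows column by column:
    (a, b) is the first column, (c, d) the second, (tx, ty) the third.
    """
    return [[a, c, tx],
            [b, d, ty],
            [0.0, 0.0, 1.0]]

def apply_css(m, x, y):
    """Apply the matrix to the point (x, y, 1) and drop the trailing 1."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# matrix(0, 1, -1, 0, 100, 50): a quarter turn followed by a shift.
m = css_matrix(0.0, 1.0, -1.0, 0.0, 100.0, 50.0)
corner = apply_css(m, 10.0, 0.0)  # (100.0, 60.0)
```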
There is a second prize. Homogeneous coordinates are the native language of projective geometry. Points at infinity get the form (x, y, 0), so parallel lines meet at one ideal point on the horizon. SLAM in computer vision and pose estimation in ARKit lean on this directly: a single four-by-four encodes camera rotation, translation, and perspective at once - no glue code, no special branches.
**Misconception:** Homogeneous coordinates are just a trick: pad the vector with a 1
**Reality:** They are the move into projective space, where affine and projective maps both become ordinary linear operators
The ability to write translation as a matrix is a side effect of deeper structure. Projective points sit modulo scale: (x,y,w) ~ (kx,ky,kw). That equivalence class is what makes perspective, camera homographies, and points at infinity speak one language.
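Both claims, equivalence up to scale and parallel lines meeting at a point with w = 0, can be checked in a few lines. This sketch stores a line a·x + b·y + c·w = 0 as the triple (a, b, c) and uses the standard fact that the cross product of two such triples is their intersection point. The names `cross` and `to_cartesian` are illustrative.

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def to_cartesian(p):
    """Return (x/w, y/w) for a finite point, or None for a point at infinity."""
    x, y, w = p
    if w == 0:
        return None
    return (x / w, y / w)

# (2, 4, 2) and (1, 2, 1) are the same projective point.
same_a = to_cartesian((2.0, 4.0, 2.0))  # (1.0, 2.0)
same_b = to_cartesian((1.0, 2.0, 1.0))  # (1.0, 2.0)

line_1 = (1.0, -1.0, 0.0)   # y = x
line_2 = (1.0, -1.0, 2.0)   # y = x + 2, parallel to line_1
line_3 = (1.0, 1.0, -4.0)   # x + y = 4

finite_meet = cross(line_1, line_3)  # (4.0, 4.0, 2.0), i.e. the point (2, 2)
ideal_meet = cross(line_1, line_2)   # (-2.0, -2.0, 0.0): w = 0, direction (1, 1)
```

The parallel pair still has an intersection. It sits at infinity in the direction both lines share, and no special case is needed to compute it.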
Why bring homogeneous coordinates into two-D geometry?
Composition of Transforms
The payoff of the matrix view is composition. The product of N matrices is itself one matrix: the combined transform. Multiply the N matrices together once up front, ship the result, then apply it to every point. Robotics has run on this since 1955: Denavit-Hartenberg parameters describe each joint of a manipulator as one four-by-four, and the kinematic chain of an arm is the product of four to six of them. The result of that product is the pose of the gripper.
**Order is load-bearing.** Matrix multiplication does not commute: rotate-then-translate is not the same as translate-then-rotate.
- **Combined matrix:** M_total = M_last · ... · M_2 · M_1
- **Applying it:** p' = M_total · p
Read right to left: the point hits M_1 first, then M_2, and so on out to the leftmost factor.
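One way to keep the right-to-left rule straight in code is a helper that takes the steps in the order they are applied and builds the product in the opposite order. The sketch below is one possible shape for such a helper; `compose`, `matmul3`, and `matvec3` are illustrative names.

```python
from functools import reduce

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul3(a, b):
    """Product a·b of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def compose(*steps):
    """Take M_1, M_2, ..., M_last in application order; return M_last·...·M_1."""
    # Each new step multiplies on the LEFT of what has been built so far.
    return reduce(lambda total, m: matmul3(m, total), steps, IDENTITY)

scale2 = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

m_total = compose(scale2, shift)  # scale first, then shift: shift · scale2

p = (1.0, 1.0, 1.0)
one_shot = matvec3(m_total, p)                      # (3.0, 2.0, 1.0)
step_by_step = matvec3(shift, matvec3(scale2, p))   # same point
```

Folding the chain once and reusing `m_total` for every point is what keeps the per-point cost at a single matrix-vector product.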
Computer vision hides the same pattern inside camera extrinsics: the matrix [R|t] carries a point from world frame into camera frame. SLAM systems of the kind behind Tesla Autopilot and ARKit optimise chains of exactly these products, minimising re-projection error across many frames; open-source pipelines typically use solvers such as g2o and Ceres.
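A small sketch of what [R|t] does to one point, and how the map is undone. The rotation, translation, and sample point are made-up values, and the helper names are illustrative. The inverse uses the standard identity for rigid transforms: the inverse rotation is the transpose R^T, and the inverse translation is -R^T·t.

```python
def world_to_camera(R, t, p_world):
    """Apply p_cam = R · p_world + t with R as 3x3 nested lists."""
    return tuple(sum(R[i][k] * p_world[k] for k in range(3)) + t[i]
                 for i in range(3))

def invert_extrinsics(R, t):
    """Return (R_inv, t_inv) such that p_world = R_inv · p_cam + t_inv."""
    R_inv = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    t_inv = tuple(-sum(R_inv[i][k] * t[k] for k in range(3)) for i in range(3))
    return R_inv, t_inv

# Hypothetical camera pose: quarter turn about Z, then a shift of 5 along Z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 5.0)

p_world = (1.0, 2.0, 3.0)
p_cam = world_to_camera(R, t, p_world)         # (-2.0, 1.0, 8.0)

R_inv, t_inv = invert_extrinsics(R, t)
round_trip = world_to_camera(R_inv, t_inv, p_cam)  # back to (1.0, 2.0, 3.0)
```

The transpose shortcut only holds when R is a pure rotation. A general affine matrix needs a full inverse of its linear part.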
**Misconception:** Order in a matrix product is just notation - rearrange as convenient
**Reality:** Order is geometry: M_a · M_b means b first, then a. Swapping changes the result everywhere except in narrow commuting cases
A ninety-degree rotation around the origin and a translation by (10, 0) is the textbook example: the two orderings land on different points. A robot that scrambles the order of its DH chain misses the part on the conveyor.
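The two orderings can be compared on a single point. The sketch below writes the ninety-degree rotation with exact entries to avoid floating-point noise; the test point (1, 0) and the helper names are arbitrary choices for illustration.

```python
def matmul3(a, b):
    """Product a·b of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

# Ninety degrees counter-clockwise about the origin, in homogeneous form.
ROT90 = [[0.0, -1.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]

# Translation by (10, 0).
SHIFT = [[1.0, 0.0, 10.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

p = (1.0, 0.0, 1.0)

# Rightmost factor acts first.
rotate_then_translate = matvec3(matmul3(SHIFT, ROT90), p)  # (10.0, 1.0, 1.0)
translate_then_rotate = matvec3(matmul3(ROT90, SHIFT), p)  # (0.0, 11.0, 1.0)
```

In the first ordering the point turns in place and then slides. In the second it slides out to x = 11 and the rotation then swings it around the origin.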
Scaling a sprite around its centre (cx, cy) needs the order:
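The usual recipe is to move the centre to the origin, scale there, and move back, which as a product reads T(cx, cy) · S · T(-cx, -cy). A minimal sketch, with the centre (5, 5) and the factor 2 chosen only for illustration:

```python
def matmul3(a, b):
    """Product a·b of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def scale_about(cx, cy, sx, sy):
    """Scale around (cx, cy): the rightmost factor, T(-c), acts first."""
    return matmul3(translation(cx, cy),
                   matmul3(scaling(sx, sy), translation(-cx, -cy)))

m = scale_about(5.0, 5.0, 2.0, 2.0)

centre = matvec3(m, (5.0, 5.0, 1.0))     # (5.0, 5.0, 1.0): the centre stays put
neighbour = matvec3(m, (6.0, 5.0, 1.0))  # (7.0, 5.0, 1.0): distance to centre doubles
```

Multiplying by the plain scale matrix alone would also drag the sprite away from where it sits, because a bare scale only fixes the origin.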
Affine vs Projective Transforms
Affine maps keep parallel lines parallel and preserve length ratios along any one direction. Projective maps drop parallelism and keep only collinearity: a straight line stays straight, but parallel rails meet at a vanishing point on the horizon. Projective matrices sit at the heart of NeRF (Neural Radiance Fields, 2020): the network ingests photos of a scene, inverts the camera matrices, and reconstructs a three-D radiance field a ray at a time.
| Class | Matrix | Preserves | Example |
|---|---|---|---|
| Isometry | Rotation + translation | Distances, angles | Rigid-body physics |
| Similarity | + uniform scale | Angles, length ratios | Map zoom |
| Affine | + non-uniform scale, shear | Parallelism, area ratios | CSS transform, 2D sprites |
| Projective | Full 3x3, eight DOF | Collinearity | Camera homography, AR |
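The last row of the table can be seen numerically. The sketch below applies a made-up homography whose bottom row is not (0, 0, 1), divides by w, and checks that two parallel lines stop being parallel while collinear points stay collinear. The matrix and the helper names are illustrative, not taken from any library.

```python
def apply_homography(h, p):
    """Map (x, y) through a 3x3 matrix, then divide by the resulting w."""
    x, y = p
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)

def cross2(u, v):
    """2D cross product; zero exactly when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

# Hypothetical projective map: the 0.5 in the bottom row adds perspective along x.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.5, 0.0, 1.0]]

# Two parallel horizontal lines, y = 0 and y = 1.
a0, a1 = apply_homography(H, (0.0, 0.0)), apply_homography(H, (2.0, 0.0))
b0, b1 = apply_homography(H, (0.0, 1.0)), apply_homography(H, (2.0, 1.0))
parallel_after = cross2(sub(a1, a0), sub(b1, b0))   # nonzero: no longer parallel

# A third point on y = 1 still lands on the image of that line.
b2 = apply_homography(H, (4.0, 1.0))
collinear_after = cross2(sub(b1, b0), sub(b2, b0))  # zero up to rounding
```

Setting the bottom row of `H` back to (0, 0, 1) makes w equal to 1 everywhere, and the map falls back into the affine row of the table.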
Spatial Transformer Networks (Jaderberg et al., 2015) bake the affine matrix into the architecture of a convolutional network: the model itself learns the affine parameters and warps the input before classification. torchvision.transforms.RandomAffine applies the same kind of warp with randomly sampled rather than learned parameters, and it is a common augmentation step in image-classification training.
Which transform is NOT affine?
Key Ideas
- **Rotation, scale, reflection, shear** fit two-by-two matrices; translation needs homogeneous coordinates
- **Homogeneous form:** (x, y) becomes (x, y, 1); the three-by-three matrix carries translation
- **Composition:** M_total = M_n · ... · M_1, applied right to left
- **Affine** preserves parallelism; **projective** preserves only collinearity
Related Topics
Transformation matrices are the bridge from plane geometry to three-D graphics and projective space:
- Solid Geometry — Three-D: four-by-four matrices in homogeneous coordinates
- Projective Geometry — Homography is a three-by-three with eight degrees of freedom over homogeneous coordinates
- Vector Geometry — Transformation matrices are linear operators acting on vectors
Questions for Reflection
- Why does order of transformations matter? Sketch a case where rotate-then-translate lands a point in a different place than translate-then-rotate.
- How is the inverse of an affine transform computed, and what does that inverse mean geometrically?
- Why does WebGL keep four-by-four matrices on the wire even when the scene is purely two-D?
Related Lessons
- la-07-matrix-multiply — Composition is matrix product, right to left
- la-13-linear-maps — Affine maps generalise linear operators
- geo-12 — Homography ships full eight-DOF projective matrix
- ml-29-cnn — Spatial transformer networks plug affine layer in
- cv-05 — Camera extrinsics and SLAM run on these four-by-fours
- la-01-vectors-intro