Compilers
Loop Optimisations
A naive 1024x1024 matrix transpose runs in 50ms. With loop tiling, 5ms. With tiling + SIMD, 1ms. Same algorithm, same O(N^2) complexity, just a different memory access order. Loop optimisations deliver 10x-50x speedups on number crunching, and this is where a compiler or a performance engineer earns their keep.
- **GCC Graphite** and **LLVM Polly** are polyhedral optimisers in production compilers. Polly compiling BLAS-like code gives 5x-20x speedups through automatic tiling and parallelisation.
- **PyTorch and TensorFlow** compile NN operations via TVM or XLA, essentially polyhedral compilers specialised for tensor operations. FlashAttention written in Triton uses manual tiling and beats naive PyTorch attention by 2x-4x.
- **Intel MKL** and **OpenBLAS** are hand-written DGEMM implementations with manual tiling, SIMD, and prefetching. Intel MKL reaches 95%+ of theoretical peak FLOPs on Xeon precisely because the loop transformations are tuned to the specific L1/L2/L3 caches.
Loop Transformations
Loop transformations are a family of optimisations that reshape a loop's structure to improve performance. The key transformations are interchange (swap nested loops), tiling (block decomposition), fusion (merge), and fission (split).
Matrix multiplication (DGEMM) is the canonical benchmark for loop tiling. A naive implementation on 1024x1024 matrices runs at about 2 GFLOPS. With loop tiling + SIMD: roughly 100 GFLOPS (50x). OpenBLAS and Intel MKL ship hand-tuned tiled loops with SIMD. Polly (LLVM) applies polyhedral transformations automatically and delivers 5x-20x speedups on dense computations.
Why does loop interchange speed up matrix code in C?
Loop Unrolling
Loop unrolling duplicates the loop body several times. It reduces the iteration count and the loop-control overhead (induction variable update, branch). It also unlocks instruction-level parallelism.
Full unrolling expands every iteration and removes the loop entirely. It only applies when the iteration count is known at compile time and small. GCC and Clang automatically unroll loops with up to 8-16 iterations. Rust's `#[unroll]` proc macro and `std::hint::unreachable_unchecked()` give the programmer explicit control. Overdoing unrolling hurts icache performance.
What is the main risk of aggressive loop unrolling?
Auto-Vectorization
Auto-vectorisation automatically converts scalar operations into SIMD instructions (SSE, AVX on x86; NEON on ARM). Instead of processing one element per iteration, the loop now processes 4, 8, or 16 elements at once.
GCC `-fopt-info-vec-missed` reports every loop that was NOT vectorised, and why. Typical reasons: unknown loop count, data dependency, non-unit stride. NumPy and PyTorch use SIMD via hand-written intrinsics or via Eigen/xtensor. The LLVM Loop Vectorizer works at the IR level and supports interleaved access patterns, masked vectorisation (for loops with a condition), and SLP (superword-level parallelism) for non-loop operations.
What prevents vectorisation of `for (i..n) c[i] = c[i-1] + a[i]`?
Polyhedral Model
The polyhedral model is a mathematical framework for analysing and transforming nested loops. A loop's iteration space is represented as a polyhedron, and transformations are linear mappings of that space.
TVM (Apache) and Triton (OpenAI) use a polyhedral approach to generate efficient GPU kernels for neural networks. Triton lets you write kernels in a Python-like syntax, and the compiler applies tiling, vectorisation, and memory coalescing automatically. FlashAttention v2 is implemented in Triton and is 2x-4x faster than the standard PyTorch attention thanks to aggressive tiling.
What does the polyhedron represent in the polyhedral loop model?
Key ideas
- **Loop interchange** swaps the order of nested loops for cache-friendly access. Row-major vs column-major can differ by 10x-100x on large matrices.
- **Loop unrolling** expands the loop body to reduce overhead and expose ILP. Risk: code bloat and icache misses.
- **Auto-vectorisation** replaces scalar ops with SIMD: 4-16 elements per iteration. Requires no loop-carried dependency and no aliasing.
- **Polyhedral model** formalises loop transformations as linear mappings of the iteration space. It is the basis of Polly, TVM, and Triton.
Related topics
Loop optimisations are the core of high-performance computing:
- Global optimisations — LICM is the first loop optimisation; dataflow analysis is used to detect loop-carried dependencies
- Instruction selection — Vectorised code needs special instruction selection for SIMD instructions
- Instruction scheduling — After unrolling, the scheduler hides latency by reordering instructions across iterations
Вопросы для размышления
- The optimal block size for loop tiling depends on L1/L2 cache size. How does an automatic optimiser (Polly, GCC Graphite) learn the cache size, and what happens with cross-platform compilation?
- GPUs have a different memory hierarchy from CPUs (shared memory, warp execution). Do the same loop transformations (tiling, interchange) apply to GPU kernels, or are other approaches needed?
- Predicated SIMD (AVX-512 masked operations) lets you vectorise loops with an inner condition. How does the compiler generate predicate masks from an if-condition in the loop body?