Compilers
Instruction Scheduling
A LOAD from L1 cache takes 4 cycles. If the next instruction uses the loaded data, 3 cycles are idle. A smart scheduler slots independent instructions into that gap, and the program speeds up with no algorithmic change. On an in-order CPU (ARM Cortex-M) it is the difference between 6 and 2 cycles per iteration.
- **Apple M1/M2** has a 630-entry ROB vs 224 on Intel Skylake. That hides memory latencies of 100-300 cycles (L3 cache miss) by running other work. Benchmarks show the M1 is particularly effective on workloads with irregular memory access patterns.
- **GCC's `sched.cc`** is a 6000-line scheduler with support for list scheduling, modulo scheduling for loops, and speculative scheduling. It contains latency models for 50+ architectures.
- **Intel Itanium** (the EPIC architecture) was fully in-order VLIW. The compiler had to mark parallel instructions explicitly via bundle slots. ICC (Intel C Compiler) for Itanium reached 95% of peak FLOPS, an unrealistic number for a general-purpose compiler.
Pipeline Hazards
A CPU pipeline executes several instructions simultaneously at different stages. A pipeline hazard is a situation where the next instruction cannot start immediately because it depends on the result of a previous one. That causes a stall (pipeline halt), wasted cycles.
Modern x86 CPUs (Intel Skylake+) implement out-of-order execution with a 200+ entry reorder buffer. The CPU itself reorders instructions during execution. The compiler still schedules: well-scheduled code leaves less work for OOO, which reduces power and improves performance predictability. On in-order CPUs (ARM Cortex-M, embedded) the compiler's scheduling is critical.
What is a RAW hazard?
Instruction Reordering
Instruction scheduling reorders instructions to hide latency. If a LOAD takes 4 cycles, you can run 4 independent instructions while waiting for the result. The scheduler builds a dependency DAG and topologically sorts it with latency in mind.
GCC uses `sched.cc`, a 6000-line scheduler with latency models for every architecture. LLVM uses `MachineScheduler.cpp`. Latencies come from the machine description (`SchedModel` in `.td` files). For ARM Cortex-A72 LLVM ships `AArch64SchedA72.td`, 2000 lines describing the latency of every instruction on every execution port.
Why does the scheduler move LOAD instructions toward the start of a block?
Software Pipelining
Software pipelining is a loop transformation where instructions from different iterations overlap in time. It hides latency by overlapping the loop with its own next pass. It is the software equivalent of a hardware pipeline.
Software pipelining matters especially for VLIW (Very Long Instruction Word) architectures, where there is no hardware OOO. Intel Itanium (IA-64) was a VLIW CPU and depended entirely on the compiler (Intel ICC) for pipelining. Modulo scheduling is the software pipelining algorithm for loops: it finds the minimum initiation interval (II) such that all dependencies are satisfied.
What is the initiation interval (II) in software pipelining?
Out-of-Order and Superscalar
Out-of-Order (OOO) execution lets the CPU reorder instructions at runtime using hardware register renaming and a reorder buffer (ROB). Superscalar means several instruction decode/execute units run in parallel. Both mechanisms hide latency in hardware.
Apple M1/M2 has a 630-entry ROB (vs Skylake's 224). The larger buffer hides more latency. That is one reason for the M1's outstanding performance on L2/L3 cache misses (100+ cycles): the M1 can keep hundreds of instructions in flight while waiting on memory. A compiler that knows the OOO model of the CPU can emit code that exploits the parallelism better. LLVM uses `SchedMachineModel` with port information.
What does register renaming give an OOO CPU?
Summary
- **Pipeline hazards** (RAW, WAR, WAW) cause stalls. RAW is the most common: reading a register before the previous instruction has written it.
- **List scheduling** is a greedy topological walk of the dependency DAG, prioritised by critical path. It hides latency by inserting independent instructions.
- **Software pipelining** overlaps loop iterations: the LOAD for iteration i+1 runs while iteration i is being computed. It gives a 2x-4x speedup on in-order CPUs.
- **OOO + superscalar** hide hazards in hardware. Register renaming eliminates WAR/WAW. The compiler still schedules, for energy efficiency and for in-order CPUs.
Related topics
Instruction scheduling is the final optimisation phase before asm emission:
- Instruction selection — The scheduler runs after instruction selection. It reorders the chosen machine instructions
- Target architectures — Latencies and execution ports depend on the target architecture. The scheduler uses the machine model
- Loop optimisations — Software pipelining is a special case where loop optimisation and instruction scheduling meet
Вопросы для размышления
- Skylake has 8 execution ports. Optimal code should load all ports evenly. How does the compiler know which instructions hit which port, and can code ever reach the theoretical peak?
- Software pipelining needs a prologue and an epilogue to ramp up and wind down the loop. If N (the iteration count) is small, the overhead outweighs the gain. How does the compiler decide whether to apply software pipelining?
- Modern CPUs predict branches with >99% accuracy. How does that interact with instruction scheduling? Should code be scheduled separately in both branches or as a single stream?