Compilers

Intermediate Representation (IR)

Clang, Rust, Swift, Kotlin, Julia: different languages with different syntax and type systems. Yet all of them produce the same LLVM IR. That is possible thanks to the intermediate representation, the abstraction layer between language and hardware. IR is the lingua franca of the compiler world.

**LLVM IR** is used by more than 30 languages: Clang (C/C++), rustc, swiftc, kotlinc/native, flang (Fortran), Julia, LDC (D), Crystal. A single backend supports x86, ARM, RISC-V, and WebAssembly, exactly because of the shared IR.
**GCC GIMPLE** is GCC's own IR, higher-level than LLVM IR. Compiling with `-fdump-tree-all` makes GCC dump every intermediate representation. You can watch the source transform through 20+ phases.
**WebAssembly** is itself an IR: stack-based bytecode that browsers compile to native code via their own JITs (V8 TurboFan, SpiderMonkey IonMonkey, JavaScriptCore B3).

Three-Address Code

Three-address code (TAC) is an intermediate representation in which every instruction has at most three operands: result = operand1 op operand2. This simplifies optimisations drastically. Every computation has a name and side effects are isolated.

LLVM IR is the best-known three-address IR in SSA form. It is typed (each operand carries a type: `i32`, `float`, `ptr`), strictly SSA (every name is defined exactly once), and aimed at machine-independent optimisations. Clang, rustc, swiftc, kotlinc, and Julia all share LLVM IR as a common backend.

Why is three-address code convenient for optimisations?

SSA Form

SSA (Static Single Assignment) is an IR form in which every variable is defined exactly once. Branches merge through phi functions that join values from different paths. SSA dramatically simplifies dataflow analysis and most optimisations.

LLVM IR is always in SSA form. GCC switches to SSA at the GIMPLE level (the `gimple_convert_to_ssa` function). Java bytecode and the JVM JIT (HotSpot C2) also build SSA for optimisations. SSA was invented at IBM in 1988 (Cytron et al.). Before that, optimisers used more complex dataflow analysis. The industry took about ten years to migrate.

What is the phi function in SSA for?

Control Flow Graph

A Control Flow Graph (CFG) is a directed graph where nodes are basic blocks (straight-line sequences of instructions with no branches) and edges are possible transitions between blocks. The CFG makes every possible execution path explicit.

LLVM IR represents the CFG explicitly. Every function is a list of basic blocks, and every block ends with a terminator instruction (br, ret, switch, invoke). Rustc MIR is also an explicit CFG with `BasicBlock` and `Terminator`. The CFG is the basis for most optimisations: dead code elimination looks for unreachable blocks, and loop detection looks for cycles (strongly connected components in the CFG).

What are the nodes of a CFG?

Basic Blocks

A basic block is a maximal straight-line sequence of instructions such that control only enters at the start and only leaves at the end. There are no conditional branches inside a block. It is the unit of local optimisation: local optimisations work within a single block.

LLVM IR checks basic block correctness through `verifyFunction()`. If a function fails verification, LLVM aborts with an assertion. That is invaluable when building a new backend: if the generated IR is malformed (e.g. a phi not at the start of a block, or a block missing a terminator), the verifier points right at the problem.

Can a basic block contain a conditional branch in the middle?

Key ideas

**Three-address code** is an IR where each instruction is `result = op1 operator op2`. Every subexpression has a name. That is the foundation for optimisations.
**SSA form** guarantees that every variable is defined exactly once. Phi functions appear at path joins. It simplifies dataflow analysis and constant propagation.
**CFG** is a graph of basic blocks (nodes) and branches (edges). It makes every execution path of the program explicit.
**Basic block** is a straight-line sequence with no internal branches. It starts at a leader and ends with a terminator.

Вопросы для размышления

LLVM IR is typed (`i32`, `float`, `ptr`), while GCC RTL is not. What optimisations become possible once types are present in the IR?
Is WebAssembly an IR or an ISA? Make a case either way. It has properties of both.
Why was SSA invented so late (1988) when dataflow analysis dated from the 1960s? What was hard about the transition?

Связанные уроки

plt-27-ir-optimization

Three-Address Code

Why is three-address code convenient for optimisations?

SSA Form

What is the phi function in SSA for?

Control Flow Graph

What are the nodes of a CFG?

Basic Blocks

Can a basic block contain a conditional branch in the middle?

Key ideas

**Three-address code** is an IR where each instruction is `result = op1 operator op2`. Every subexpression has a name. That is the foundation for optimisations.

**SSA form** guarantees that every variable is defined exactly once. Phi functions appear at path joins. It simplifies dataflow analysis and constant propagation.

**CFG** is a graph of basic blocks (nodes) and branches (edges). It makes every execution path of the program explicit.

**Basic block** is a straight-line sequence with no internal branches. It starts at a leader and ends with a terminator.

Intermediate Representation (IR)

Three-Address Code

SSA Form

Control Flow Graph

Basic Blocks

Key ideas

Related topics

Вопросы для размышления

Связанные уроки

Intermediate Representation (IR)

Three-Address Code

SSA Form

Control Flow Graph

Basic Blocks

Key ideas

Related topics

Вопросы для размышления

Связанные уроки