Compilers

What is a Compiler

1952. Grace Hopper, a US Navy admiral, writes the first compiler - A-0. Colleagues laugh: "a computer cannot write code." She ignores them. Today V8 compiles JavaScript to machine code on 3 billion Chrome devices. LLVM is the heart of Swift, Rust, and Clang - under the hood of half the new code written in the world. GCC contains 15 million lines. LLVM - 8 million. These programs turn human-readable text into electrical signals that push bits through silicon transistors.

  • **V8 and Node.js** - the same JIT engine powers Chrome and servers; TurboFan compiles hot functions into optimized machine code
  • **LLVM** - under the hood of Swift (Apple), Rust, Clang, Julia; write a frontend for a new language and get all platforms for free
  • **Python bytecode** - `.py` is compiled to `.pyc` on first run; CPython interprets bytecode, PyPy JIT-compiles it
  • **Interviews** - questions about JIT vs AOT, compile time vs runtime, AST - standard at Senior-level interviews in Big Tech

Compilation vs interpretation

1952. Grace Hopper, a US Navy admiral, writes the first compiler - A-0. Colleagues laugh: "a computer cannot write code." She ignores them. Today V8 compiles JavaScript to machine code billions of times a day, running on 3 billion devices with Chrome. LLVM is the heart of Swift, Rust, and Clang - powering roughly half of all new code written in the world. Between the text a programmer writes and the bits a processor moves stands a **translator**. Two fundamentally different approaches exist for that translation.

**Compilation** - translating the entire program *before* it runs. Think of translating a book: the translator processes the whole text, and then the reader gets the finished result. C, C++, Rust, Go are compiled languages. The output is an executable file (`.exe`, ELF binary) that runs without the source code.

**Interpretation** - translating line by line *while* it runs. Like a simultaneous interpreter: hear a phrase, translate it immediately. Python, Ruby, PHP are interpreted. The output requires an interpreter on the machine to run the code.

**JIT (Just-In-Time)** is a hybrid approach. The code is first compiled to intermediate bytecode, and then the JIT compiler translates "hot" sections into machine code at runtime. Java (HotSpot JVM), JavaScript (V8, SpiderMonkey), and C# (.NET CLR) all use JIT. V8 TurboFan compiles hot functions with type assumptions - and deoptimizes when an assumption is violated.

PropertyCompilationInterpretationJIT
Execution speedMaximumLowHigh (after warm-up)
Startup timeInstantInstantSlow (JIT warm-up)
Error detectionBefore runAt runtimeMixed
PortabilityPer-platform binaryWherever interpreter existsWherever VM exists
DebuggingHarder (no source)Easier (source available)Medium
ExamplesC, Rust, GoPython, Ruby, PHPJava, JS V8, C#

The boundary between compilation and interpretation is blurry. Python compiles `.py` to `.pyc` bytecode (compilation!), which is then interpreted by the CPython VM. V8 first interprets JavaScript through Ignition and then JIT-compiles through TurboFan.

**Compile time** - when the compiler analyzes the source code. **Runtime** - when the program is actually running. Many optimizations are a trade-off between the two: spend more time compiling to make the program run faster.

What approach does the V8 JavaScript engine in Chrome use?

Compiler phases

A compiler doesn't translate code in one sweep. It's a **pipeline** of distinct phases, each doing exactly one job and passing its result to the next. Consider a factory: raw material - machining - assembly - painting - packaging. Each station is specialized. The order cannot be violated: painting before assembling is impossible, packaging before painting is impossible.

**Lexical analysis (Lexing/Tokenization)** - splits the character stream into tokens: keywords, identifiers, numbers, operators. Like splitting a sentence into words.

**Syntactic analysis (Parsing)** - checks grammar and builds an AST (Abstract Syntax Tree). Like parsing a sentence: subject, predicate, object.

**Semantic analysis** - checks meaning: do the types match? Are variables declared? Are there name conflicts? Like verifying that a sentence is not only grammatically correct but also makes sense.

**Optimization** - improves the code without changing its behavior: constant folding (`2 + 3` → `5`), dead code elimination, loop unrolling. This is exactly where Rust achieves zero-cost abstractions.

**Code generation** - translates the optimized representation into machine code for the target architecture (x86, ARM, RISC-V).

Phase order matters! Optimization cannot happen before semantic analysis - that would be optimizing code with errors. Code generation cannot happen before optimization - that leaves performance on the table.

At which phase will the compiler catch the error `int x = "hello";` in C?

Compiler frontend and backend

Compiler phases naturally split into two groups. The **frontend** works with the source language: lexer, parser, semantic analysis. The **backend** works with the target platform: optimization, code generation. Between them sits the **intermediate representation (IR)**. This split is what made LLVM a revolution.

Without the split: 5 languages × 4 platforms = 20 compilers. With a shared IR: 5 frontends + 1 optimizer + 4 backends = 10 components. That is the genius of the **LLVM** architecture.

**LLVM** (Low Level Virtual Machine) is not a virtual machine - it's a collection of libraries for building compilers. Created by Chris Lattner in 2003 as a project at the University of Illinois. Today LLVM underpins the compilers for Clang (C/C++), rustc (Rust), Swift, Julia, and dozens of others.

CompilerFrontendBackendLanguage
ClangClangLLVMC / C++ / Objective-C
rustcrustcLLVMRust
swiftcSwift compilerLLVMSwift
GCCGCC frontendGCC backendC / C++ / Fortran
javacjavacJVM bytecodeJava
go buildGo compilerGo SSA backendGo

Creating a new programming language with LLVM requires writing only the frontend (lexer + parser + semantics + LLVM IR emitter). LLVM handles optimization and machine code generation for all platforms.

Why is LLVM used as the backend for so many languages?

Bootstrapping: the chicken-and-egg problem

The C compiler is written in C. The Rust compiler is written in Rust. The Go compiler is written in Go. This raises a paradox: **how is a compiler compiled if the compiler doesn't yet exist?** This is the bootstrapping problem - and it's not philosophical. Every language goes through it.

**Self-hosting** is the key milestone. When a compiler can compile its own source code, it is considered self-sufficient. Rust's bootstrap chain: early versions were compiled by an OCaml compiler. Go before version 1.5 was written in C, then rewritten in Go.

Ken Thompson's "Reflections on Trusting Trust" (1984)

Ken Thompson, creator of Unix and the B language, received the Turing Award in 1983. In his Turing lecture he demonstrated an attack: a backdoor can be embedded in a C compiler such that it reproduces itself on every recompilation - even when the backdoor is absent from the compiler's source code. The compiler "teaches" its next version. The moral: no code can be fully trusted without independent verification - including the compiler.

Thompson's attack is not theoretical. In 2003, an attempt to inject a backdoor into the Linux kernel was discovered via a patch masquerading as an access-rights check. The Reproducible Builds project combats such threats: the same source code must produce an identical binary.

Modern languages and their bootstrap chains: **Rust** - OCaml → Rust. **Go** - C → Go (since version 1.5). **GHC (Haskell)** - C → Haskell. **TypeScript** - JavaScript → TypeScript. Each new self-hosting compiler is a milestone for the language.

Interpreted languages are always slow; compiled languages are always fast

The boundary is blurry: JIT compilation in V8 makes JavaScript competitive with C++ on certain workloads, while naive C code without optimization can run slower than optimized Python with NumPy

V8 TurboFan performs speculative optimization: it guesses variable types, generates optimized machine code, and deoptimizes only when a guess is wrong. LuaJIT reaches 70-80% of C speed. PyPy accelerates Python 5-10x via tracing JIT. Performance depends on the implementation, not the category of language.

What is a self-hosting compiler?

Key ideas

  • **Compilation** translates all code before running (C, Rust); **interpretation** translates line by line at runtime (Python, Ruby); **JIT** is a hybrid (Java, V8): fast startup + hot-path optimization
  • A compiler is a 5-phase pipeline: lexical analysis → syntax analysis → semantic analysis → optimization → code generation; phase order cannot be violated
  • The **frontend** (language-specific) and **backend** (platform-specific) are connected through an intermediate representation (IR). LLVM is a universal backend for dozens of languages: N + M instead of N × M
  • **Bootstrapping** resolves the chicken-and-egg paradox: first version in another language, then self-hosting. Thompson's attack shows that a compiler binary can never be fully trusted

Related topics

Compilers are the foundation of the entire programming ecosystem. Now it's clear what happens between source code and machine instructions:

  • How a computer reads code — A detailed walkthrough of each phase: from characters to machine code
  • Compiler architecture — Passes, pipeline, multi-pass vs single-pass - internal organization
  • Formal languages — Grammars, automata - the mathematics underpinning lexers and parsers

Вопросы для размышления

  • Why do most new languages (Rust, Swift, Zig) choose LLVM as their backend rather than writing their own? When would it make sense to write a custom backend?
  • If JIT can generate code competitive with C++, why are server applications often written in Go or Rust rather than JavaScript?
  • Thompson's attack shows that no compiler binary can be blindly trusted. How does the Reproducible Builds project address this? Is it sufficient?

Связанные уроки

  • fl-01-intro — Lexers and parsers are built on formal languages and grammars
  • alg-01-big-o — Algorithmic complexity matters for understanding compiler optimization passes
  • plt-01-paradigms — Programming paradigms determine what the compiler actually compiles
  • dm-01 — Discrete math is the foundation of formal semantics and type systems
What is a Compiler

0

1

Sign In