Compilers
Semantic Errors and Diagnostics
Elm built a reputation for having some of the most helpful compiler error messages of any language. Its creator, Evan Czaplicki, argued that compiler errors are a user interface. That framing influenced how other language teams think about diagnostics, and Rust, Swift, and Kotlin have since put similar effort into readable errors. A good compiler error can save hours of debugging.
- **rustc** assigns a unique code to several hundred distinct errors (for example `E0502`). Each code has a long-form explanation: `rustc --explain E0502` prints a description of the error with code examples, in this case a conflict between a mutable and an immutable borrow.
- **The TypeScript language server** (`tsserver`) sends structured diagnostics to editors such as VS Code, which draw them as red squiggles with hover tooltips. They come from the same compiler `Diagnostic` objects that `tsc` prints as text on the command line.
- **Clang** has accepted `-fdiagnostics-format=sarif` since version 16, although the output is still marked as unstable. SARIF files can be uploaded to GitHub code scanning, which then shows compiler warnings as annotations on a pull request.
Error Messages
The quality of error messages is one of the most important user-facing characteristics of a compiler. Early C compilers often reported little more than a bare `syntax error`. Modern compilers such as rustc, the Elm compiler, and Clang explain the cause, show the surrounding code, and often suggest a fix.
The Rust team invested heavily in error messages. Every error kind has a unique code (such as E0502), and `rustc --explain E0502` prints an extended explanation. In 2016 the team redesigned the format of the compiler's error output so that the message is built around the user's own code, a change that touched much of the diagnostics layer.
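The three ingredients named above (cause, context, suggested fix) can be sketched as a small renderer. This is a minimal Python illustration, not how any particular compiler does it: the `Diagnostic` fields, the `main.toy` file name, and the `E0001` code are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnostic:
    code: str                 # hypothetical error code
    message: str              # the cause, in one sentence
    line: int                 # 1-based line number
    col_start: int            # 1-based column of the first offending character
    col_end: int              # 1-based column one past the last offending character
    label: str                # short note printed next to the underline
    suggestion: Optional[str] = None


def render(diag: Diagnostic, filename: str, source: str) -> str:
    """Render the cause, the offending source line, and an optional fix."""
    src_line = source.splitlines()[diag.line - 1]
    gutter = " " * len(str(diag.line))
    width = max(1, diag.col_end - diag.col_start)
    underline = " " * (diag.col_start - 1) + "^" * width
    out = [
        f"error[{diag.code}]: {diag.message}",
        f"{gutter}--> {filename}:{diag.line}:{diag.col_start}",
        f"{gutter} |",
        f"{diag.line} | {src_line}",
        f"{gutter} | {underline} {diag.label}",
    ]
    if diag.suggestion:
        out.append(f"{gutter} = help: {diag.suggestion}")
    return "\n".join(out)


source = 'let total = count + "1";\n'
diag = Diagnostic(
    code="E0001",
    message="mismatched types",
    line=1,
    col_start=21,
    col_end=24,
    label="expected number, found string",
    suggestion="remove the quotes: `1`",
)
rendered = render(diag, "main.toy", source)
print(rendered)
```

The underline is computed from the columns stored in the diagnostic, so the carets land under the string literal without the renderer knowing anything about the language being compiled.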
Which of the following matters most for a high-quality compiler error message?
Error Recovery
Error recovery is the compiler's ability to keep analysing after the first error so that it can report as many problems as possible in one run. Without recovery the compiler stops at the first error and the user has to re-run it over and over.
By default (`noEmitOnError` set to `false`), the TypeScript compiler still emits JavaScript when type errors are present. Recovery matters even more in IDEs: a language server has to offer completions inside code that is only partially valid. Roslyn (the C# compiler) is built around the principle of always producing a full syntax tree, even for malformed input, by recording missing tokens and skipped text in the tree instead of aborting. Many other parsers get a similar effect by inserting dedicated error nodes into the AST.
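A minimal sketch of parser-level recovery, assuming a toy language whose only statement is `let NAME = NUMBER;`. The `Let`, `ErrorNode`, and `ParseError` names are hypothetical, and the strategy shown (skip to the next `;`, often called panic-mode recovery) is only one of several in use; it is not a description of how Roslyn or TypeScript work internally.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Let:
    name: str
    value: int


@dataclass
class ErrorNode:
    skipped: List[str]   # tokens discarded while recovering
    message: str


class ParseError(Exception):
    pass


def tokenize(src: str) -> List[str]:
    # Identifiers, integers, or any single non-space character.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", src)


def parse_let(toks: List[str], pos: int) -> Tuple[int, Let]:
    def expect(pred, what: str) -> str:
        nonlocal pos
        if pos >= len(toks):
            raise ParseError(f"expected {what}, found end of input")
        if not pred(toks[pos]):
            raise ParseError(f"expected {what}, found '{toks[pos]}'")
        tok = toks[pos]
        pos += 1
        return tok

    expect(lambda t: t == "let", "'let'")
    name = expect(str.isidentifier, "a name")
    expect(lambda t: t == "=", "'='")
    value = expect(str.isdigit, "a number")
    expect(lambda t: t == ";", "';'")
    return pos, Let(name, int(value))


def parse(src: str) -> List[Union[Let, ErrorNode]]:
    toks = tokenize(src)
    pos = 0
    tree: List[Union[Let, ErrorNode]] = []
    while pos < len(toks):
        start = pos
        try:
            pos, node = parse_let(toks, pos)
            tree.append(node)
        except ParseError as exc:
            # Recovery: skip to just past the next ';' and keep going.
            pos = start
            while pos < len(toks) and toks[pos] != ";":
                pos += 1
            pos = min(pos + 1, len(toks))
            tree.append(ErrorNode(toks[start:pos], str(exc)))
    return tree


tree = parse("let a = 1; let b = ; let c = 3;")
for node in tree:
    print(node)
```

The broken middle statement becomes an `ErrorNode`, and the statement after it still parses, so later phases and editor features see a complete tree.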
Why does a compiler use a special `ErrorType` during type checking?
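One way to see the idea behind this question is a toy type checker. The sketch below assumes a tiny expression language with variables, literals, and `+`; the node classes and the `ERROR` sentinel are hypothetical names chosen for the example. Once a subexpression has been reported, it is given the sentinel type, and any enclosing expression that sees the sentinel stays silent.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

INT, STR = "int", "str"
ERROR = "<error>"   # sentinel type: "already reported, do not complain again"


@dataclass
class Var:
    name: str


@dataclass
class Lit:
    value: Union[int, str]


@dataclass
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Var, Lit, Add]


def check(expr: Expr, env: Dict[str, str], errors: List[str]) -> str:
    """Return the type of expr, appending any new problems to errors."""
    if isinstance(expr, Lit):
        return INT if isinstance(expr.value, int) else STR
    if isinstance(expr, Var):
        if expr.name not in env:
            errors.append(f"undefined variable '{expr.name}'")
            return ERROR
        return env[expr.name]
    left = check(expr.left, env, errors)
    right = check(expr.right, env, errors)
    if ERROR in (left, right):
        return ERROR   # an operand already failed; report nothing new
    if left != right:
        errors.append(f"cannot add {left} and {right}")
        return ERROR
    return left


errors: List[str] = []
# (x + 1) + 2, where x was never declared
result = check(Add(Add(Var("x"), Lit(1)), Lit(2)), {}, errors)
print(result, errors)
```

Without the sentinel, each enclosing `+` would have to guess a type for `x + 1` and would likely report a second and third error that all stem from the same undefined variable.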
Source Spans
A source span is a range of positions in the source code (file, start line, start column, end line, end column). Every AST node carries its own span. That lets the compiler point at the exact fragment of code that caused the error.
The Rust compiler does not store a file path or a line and column in each node. A span is a pair of byte positions (`lo`, `hi`) in a single global position space that covers every loaded source file, packed into a compact 8-byte value. That saves memory on every AST node. When printing an error, the compiler asks `SourceMap` to translate those positions back into a file, line, and column.
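The byte-offset scheme can be sketched in a few lines. This is a simplified Python model written for illustration, not rustc's implementation: the `Span` and `SourceMap` classes here are stand-ins, columns are counted in bytes, and the file names are invented.

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    lo: int   # global byte offset of the first byte
    hi: int   # global byte offset one past the last byte


class SourceMap:
    """Maps global byte offsets back to (file, line, column)."""

    def __init__(self) -> None:
        self._bases: List[int] = []              # global start offset per file
        self._files: List[Tuple[str, List[int]]] = []
        self._next = 0

    def add_file(self, name: str, text: str) -> int:
        data = text.encode("utf-8")
        # Offsets within the file at which each line begins.
        line_starts = [0] + [i + 1 for i, b in enumerate(data) if b == 0x0A]
        base = self._next
        self._bases.append(base)
        self._files.append((name, line_starts))
        self._next += len(data) + 1   # gap so files never share an offset
        return base

    def lookup(self, pos: int) -> Tuple[str, int, int]:
        idx = bisect.bisect_right(self._bases, pos) - 1
        name, line_starts = self._files[idx]
        local = pos - self._bases[idx]
        line = bisect.bisect_right(line_starts, local) - 1
        col = local - line_starts[line]
        return name, line + 1, col + 1   # 1-based line and byte column


sm = SourceMap()
base_a = sm.add_file("a.toy", "let x = 1;\nlet y = z;\n")
base_b = sm.add_file("b.toy", "print(q)\n")

z_span = Span(base_a + 19, base_a + 20)   # the `z` on line 2 of a.toy
q_span = Span(base_b + 6, base_b + 7)     # the `q` in b.toy

print(sm.lookup(z_span.lo))
print(sm.lookup(q_span.lo))
```

Each node only carries two integers. The file name and line table live once in the map, and the binary searches run only when a diagnostic is actually printed.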
Why does Rustc store spans as (BytePos, BytePos) rather than (file, line, col)?
Diagnostics Infrastructure
Diagnostics infrastructure is the system that builds and emits error messages. Modern compilers represent each diagnostic as a structured object, not a string, and then render it into different formats: human-readable text, JSON for IDEs, SARIF for CI.
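A minimal sketch of that separation, assuming a hypothetical `Diagnostic` record with a handful of fields. Real compilers carry far more (secondary spans, notes, fix-its), but the shape is the same: the checker builds a value, and independent renderers decide how it looks.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Diagnostic:
    severity: str   # "error", "warning", ...
    code: str       # hypothetical error code
    message: str
    file: str
    line: int       # 1-based
    column: int     # 1-based


def render_text(d: Diagnostic) -> str:
    """Human-readable form for a terminal."""
    return f"{d.file}:{d.line}:{d.column}: {d.severity}[{d.code}]: {d.message}"


def render_json(d: Diagnostic) -> str:
    """Machine-readable form for an editor or other tool."""
    return json.dumps(asdict(d), sort_keys=True)


diag = Diagnostic("error", "E0001", "undefined variable 'z'", "main.toy", 3, 9)
print(render_text(diag))
print(render_json(diag))
```

Adding a new output format means adding one more function over `Diagnostic`; none of the code that detects errors has to change.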
SARIF (Static Analysis Results Interchange Format) is an OASIS standard JSON format for the results of static analysis tools such as compilers and linters. GitHub code scanning, Azure DevOps, and other services consume SARIF, so a compiler or linter can annotate a PR once its output is uploaded. Clang, cppcheck, and ESLint (through a formatter package) can all emit SARIF. That makes it possible to wire static analysis into code review with little glue code.
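For a sense of what a SARIF log looks like, the sketch below builds a minimal SARIF 2.1.0 document from a list of plain dictionaries. The input dictionary keys (`code`, `severity`, `message`, `file`, `line`, `column`) and the tool name `toyc` are assumptions made for the example; the output property names follow the SARIF 2.1.0 schema, but only a small subset of it is used.

```python
import json
from typing import Dict, List


def to_sarif(tool_name: str, diagnostics: List[Dict]) -> Dict:
    """Wrap simple diagnostics in a minimal SARIF 2.1.0 log."""
    results = []
    for d in diagnostics:
        results.append({
            "ruleId": d["code"],
            # SARIF levels are "none", "note", "warning", or "error".
            "level": d["severity"],
            "message": {"text": d["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": d["file"]},
                    "region": {
                        "startLine": d["line"],
                        "startColumn": d["column"],
                    },
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name}},
            "results": results,
        }],
    }


log = to_sarif("toyc", [{
    "code": "E0001",
    "severity": "error",
    "message": "undefined variable 'z'",
    "file": "src/main.toy",
    "line": 3,
    "column": 9,
}])
print(json.dumps(log, indent=2))
```

A consuming service reads the file path and region from each result to decide where in the diff to place the annotation.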
Why does a compiler build diagnostics as structured objects instead of formatting strings directly?
Key ideas
- **Error message quality** is a user-facing property of a compiler. Rustc, Elm, and Kotlin invested in it heavily, and it is often cited as a factor in language adoption.
- **Error recovery** lets the compiler find multiple problems in one run. The `ErrorType` sentinel blocks cascade errors during type checking.
- **Source spans** are stored compactly (byte offsets) on every AST node and are used to pinpoint the location of an error.
- **Structured diagnostics** decouple building from emitting. The same compiler can render to text, JSON (for IDEs), and SARIF (for CI).
Related topics
Diagnostics run through every phase of the compiler, from parsing to code generation:
- Type checking — Type mismatches are the most common source of semantic errors that require high-quality diagnostics
- Symbol table — Undefined variable and redeclaration errors are detected while building the symbol table
- Semantic passes — Diagnostics are collected across all semantic passes and emitted at the end of the phase
Questions for reflection
- Why did C compilers historically produce poor error messages? Was it a technical limit or a cultural norm of the era?
- How does error recovery in an IDE (real-time re-parse on every keystroke) differ from batch compilation, and what extra requirements does it impose?
- SARIF lets a static analyser annotate a GitHub PR directly. What new automation possibilities does this open for code review?