Compilers
JVM: Architecture and Bytecode
In 1995 Sun shipped Java with the slogan 'Write Once, Run Anywhere'. The JVM delivered on it: the same .class file runs on any conforming JVM, whether on Windows, Linux, macOS, or Solaris. In 2024 the JVM runs not only Java but also Kotlin, Scala, Groovy, Clojure, and JRuby, all of which compile to the same .class format.
- **HotSpot JVM**, developed in OpenJDK and shipped by Oracle and other vendors, powers most production Java applications. It uses adaptive, tiered JIT compilation (C1 + C2): C1 compiles hot code quickly with light optimisation, and C2 recompiles the hottest code with aggressive optimisation. Large distributed systems such as Kafka, Elasticsearch, and Cassandra typically run on HotSpot.
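- **GraalVM** is an alternative JDK distribution whose Native Image tool compiles Java ahead of time (AOT) into a native binary that runs on the SubstrateVM runtime. A Spring Boot application built as a native image can start in about 50 ms instead of 3-5 seconds and use several times less memory, although the exact figures depend on the application. Native images are used for serverless workloads (for example AWS Lambda) and by frameworks such as Quarkus.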
- **Kotlin** compiles to the same JVM bytecode as Java, with full interoperability. A `suspend fun` is compiled in continuation-passing style: the compiler adds a hidden `Continuation` parameter and turns the function body into a state machine. No invokedynamic is involved. Kotlin/Native compiles through LLVM to native code, and Kotlin/JS compiles to JavaScript. One language, three platforms.
Class File Format
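A Java .class file is a binary format that describes a single class or interface. The structure is strictly standardised in the JVM Specification. Any conforming JVM (HotSpot, OpenJ9, GraalVM) must read .class files the same way. That is the basis of 'write once, run anywhere'. Android's Dalvik and ART are not JVMs in this sense: they execute the dex format, which is produced by converting .class files.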
The constant pool is the key structure of a .class file: every string, class name, and method signature is stored there once, and the code references it by index. That is a form of compression: the name `java/lang/String` may be referenced hundreds of times in the code but is stored once. The magic number 0xCAFEBABE was chosen by James Gosling as an inside joke. By his account, the team often had lunch at a cafe they nicknamed 'Cafe Dead' after the Grateful Dead; looking for magic numbers that spell something in hex, he used 0xCAFEDEAD for a persistent object format and 0xCAFEBABE for class files.
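The fixed header that precedes the constant pool can be read with a few lines of Java. The following is a minimal sketch, not a full parser: the class name `ClassHeader` and its fields are made up for illustration, and it assumes the class file of `java.lang.String` can be obtained as a resource, which holds on common JDKs. It reads the magic number, the minor and major version, and `constant_pool_count`, in the order the JVM Specification lays them out.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: holds the first four fields of a .class file.
class ClassHeader {
    final int magic;             // u4, always 0xCAFEBABE
    final int minorVersion;      // u2
    final int majorVersion;      // u2, e.g. 52 = Java 8, 61 = Java 17
    final int constantPoolCount; // u2, number of constant pool entries plus one

    private ClassHeader(int magic, int minor, int major, int cpCount) {
        this.magic = magic;
        this.minorVersion = minor;
        this.majorVersion = major;
        this.constantPoolCount = cpCount;
    }

    // Reads the header of the .class file that defines the given class.
    static ClassHeader read(Class<?> type) {
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        try (InputStream raw = type.getResourceAsStream(resource)) {
            if (raw == null) {
                throw new IllegalArgumentException("no class file found for " + type);
            }
            // Class files are big-endian, which is what DataInputStream reads.
            DataInputStream in = new DataInputStream(raw);
            int magic = in.readInt();
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort();
            int cpCount = in.readUnsignedShort();
            return new ClassHeader(magic, minor, major, cpCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassHeader h = read(String.class);
        System.out.printf("magic=0x%08X major=%d minor=%d constant_pool_count=%d%n",
                h.magic, h.majorVersion, h.minorVersion, h.constantPoolCount);
    }
}
```

The major version printed depends on the JDK that runs the example, because `String.class` comes from that JDK.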
What does the magic number `0xCAFEBABE` at the start of a .class file mean?
JVM Bytecode
JVM bytecode is a stack-based instruction set. Each opcode is one byte, so there is room for 256 values (0x00-0xFF), of which about 200 are assigned. Each method has a Code attribute with the bytecode, max_stack (peak operand stack depth), and max_locals (number of local variable slots; a long or double takes two). Types are baked into the opcodes: `i` = int, `l` = long, `f` = float, `d` = double, `a` = reference.
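To make the typed, stack-based model concrete, here is a small sketch with two trivial methods. The comments show the bytecode that `javac` generates for them, as displayed by `javap -c -v`. The class and method names are illustrative only.

```java
class BytecodeDemo {
    // Bytecode:
    //   iload_0    // push local slot 0 (a) as an int
    //   iload_1    // push local slot 1 (b) as an int
    //   iadd       // pop two ints, push their sum
    //   ireturn    // pop an int and return it
    // Code attribute: max_stack=2, max_locals=2
    static int addInts(int a, int b) {
        return a + b;
    }

    // Same source shape, different opcodes, because the operand type changed.
    // A long occupies two slots, so b lives in slot 2, not slot 1.
    //   lload_0
    //   lload_2
    //   ladd
    //   lreturn
    // Code attribute: max_stack=4, max_locals=4
    static long addLongs(long a, long b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(addInts(2, 3));
        System.out.println(addLongs(2L, 3L));
    }
}
```

Compiling this file and running `javap -c -v BytecodeDemo` is a quick way to check the listing against a real compiler.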
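invokedynamic (Java 7, 2011) was a revolution in JVM bytecode. It lets language implementers (JRuby, Groovy, Scala, Kotlin) define calls with arbitrary dispatch semantics: a bootstrap method links each call site to a MethodHandle the first time it runs. Java itself has used it for lambdas since Java 8. Before that, the equivalent code was an anonymous inner class, compiled to a separate .class file (on the order of a kilobyte even for something as small as `() -> x + 1`). With invokedynamic, javac emits the lambda body as a private synthetic method plus a single invokedynamic instruction whose bootstrap method (LambdaMetafactory) creates the function object at run time, so no extra .class file is produced at compile time.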
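The two building blocks can be tried directly from Java. The sketch below is illustrative only (the class `IndyDemo` and its methods are invented for the example): `viaMethodHandle` looks up a method and calls it through a `MethodHandle`, which is the kind of target an invokedynamic call site gets linked to, and `viaLambda` uses a lambda, which javac compiles to an invokedynamic instruction on Java 8 and later.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.IntUnaryOperator;

class IndyDemo {
    static int addOne(int x) {
        return x + 1;
    }

    // Resolves addOne by name and type at run time, then invokes it.
    static int viaMethodHandle(int x) {
        try {
            MethodHandle handle = MethodHandles.lookup().findStatic(
                    IndyDemo.class, "addOne",
                    MethodType.methodType(int.class, int.class));
            // invokeExact requires the call-site types to match (int)int exactly.
            return (int) handle.invokeExact(x);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    // javac emits the body as a private synthetic method and an
    // invokedynamic instruction that produces the IntUnaryOperator.
    static int viaLambda(int x) {
        IntUnaryOperator f = v -> v + 1;
        return f.applyAsInt(x);
    }

    public static void main(String[] args) {
        System.out.println(viaMethodHandle(41));
        System.out.println(viaLambda(41));
    }
}
```

Running `javap -c -p IndyDemo` on the compiled class shows the `invokedynamic` instruction in `viaLambda` and the synthetic method that holds the lambda body.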
Why does the JVM have `iload` (int) and `aload` (reference) instead of one `load`?
Class Loading
Class loading is dynamic: the JVM loads a .class file when the class is first used. It does not load every class at start-up, only the ones actually needed. This is lazy loading. An import is a compile-time directive only, so `import java.util.*` does not load LinkedList if it is never used.
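The effect is easy to observe through a static initialiser. Strictly speaking, this sketch shows lazy initialisation rather than loading itself (the JVM Specification treats loading, linking, and initialisation as separate steps), but it demonstrates the same on-first-use behaviour. The names `LazyDemo` and `Heavy` are made up for the example.

```java
class LazyDemo {
    // Records the order in which things happen.
    static final StringBuilder LOG = new StringBuilder();

    static class Heavy {
        // Runs once, when Heavy is first actively used.
        static {
            LOG.append("Heavy initialised;");
        }

        static int value() {
            return 42;
        }
    }

    static String run() {
        LOG.append("start;");
        int v = Heavy.value(); // first use of Heavy triggers its initialiser
        LOG.append("used ").append(v).append(';');
        return LOG.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

On the first call, "start;" is logged before "Heavy initialised;", even though `Heavy` is declared above `run` in the source. Running with `java -verbose:class` shows the actual load events.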
The HotSpot JVM has Class Data Sharing (CDS): frequently used JDK classes (java.*, javax.*) are parsed and stored in a shared archive (classes.jsa), which is generated when the JDK is built or installed. When a program starts, the archive is memory-mapped, so these classes are not parsed from their class files again and not re-verified. That is commonly reported to speed up start-up by 20-40%, depending on the workload. AppCDS extends this to application classes; a Spring Boot app can start up to about 2x faster with AppCDS.
Why does the JVM use lazy class loading (on first use) instead of eager (at start-up)?
Bytecode Verification
The bytecode verifier is the JVM component that validates bytecode before execution: type safety, no operand stack overflow or underflow, and jumps that land on valid instruction boundaries. Without verification, the JVM could not safely run untrusted code (applets) or third-party code loaded into a shared process (OSGi bundles, web applications in Tomcat).
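Stack Map Frames (the StackMapTable attribute, introduced in Java SE 6 by JSR 202) are annotations in the class file that explicitly record the types of the locals and the operand stack at each jump target. For class files of version 51 (Java SE 7) and later, the verifier only checks the code against these frames; it does not infer them itself. That replaces an iterative dataflow analysis with a single linear pass, cutting verification time from roughly O(N^2) in the worst case to O(N). Android (Dalvik/ART) tackles the same cost differently: dex code is verified ahead of execution, typically at install or compile time, and the result is cached. GraalVM Native Image moves the work out of start-up altogether by loading and verifying classes when the image is built.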
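The kind of check the verifier performs can be illustrated with a toy. The sketch below is not the JVM's algorithm: it handles only straight-line code with four invented instructions and no jumps, so it needs no stack map frames. It tracks the type of every operand stack entry and reports underflow, overflow beyond `maxStack`, and type mismatches. All names (`ToyVerifier`, `Op`, `verify`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class ToyVerifier {
    // A tiny instruction set modelled on the JVM's typed opcodes.
    enum Op { ILOAD, ALOAD, IADD, IRETURN }

    // Returns "ok" or a description of the first violation found.
    static String verify(List<Op> code, int maxStack) {
        Deque<Character> stack = new ArrayDeque<>(); // 'I' = int, 'A' = reference
        for (int pc = 0; pc < code.size(); pc++) {
            switch (code.get(pc)) {
                case ILOAD:
                    stack.push('I');
                    break;
                case ALOAD:
                    stack.push('A');
                    break;
                case IADD:
                    if (stack.size() < 2) return "underflow at " + pc;
                    if (stack.pop() != 'I' || stack.pop() != 'I') return "type error at " + pc;
                    stack.push('I');
                    break;
                case IRETURN:
                    if (stack.isEmpty()) return "underflow at " + pc;
                    if (stack.pop() != 'I') return "type error at " + pc;
                    return "ok";
            }
            if (stack.size() > maxStack) return "overflow at " + pc;
        }
        return "falls off the end of the code";
    }

    public static void main(String[] args) {
        System.out.println(verify(Arrays.asList(Op.ILOAD, Op.ILOAD, Op.IADD, Op.IRETURN), 2));
        System.out.println(verify(Arrays.asList(Op.ILOAD, Op.ALOAD, Op.IADD, Op.IRETURN), 2));
    }
}
```

The first program in `main` is accepted; the second is rejected at the `IADD`, because one of its operands is a reference. The checks run without executing the code, which is what lets the JVM reject a bad class before any of its methods run.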
Why does the JVM verify bytecode even when the code was written in Java (a trusted compiler)?
Summary
- **.class format** contains the constant pool, methods with bytecode, and attributes. Magic `0xCAFEBABE`. Strictly standardised. Every JVM reads it the same way.
- **JVM bytecode** is stack-based and typed (iadd vs fadd vs ladd). invokedynamic (Java 7) opened the JVM to dynamic languages and lambdas.
- **Class loading** is lazy: only on first use. A hierarchy of ClassLoaders. CDS speeds start-up via a shared archive.
- **Verification** is dataflow analysis of stack types before execution. It guarantees safety regardless of the .class source. Stack Map Frames: O(N).
Related topics
The JVM puts most managed-runtime concepts into practice, so it connects to several other topics:
- Bytecode and virtual machines — The JVM is a concrete implementation of a stack-based VM with a verifier and a JIT
- V8: JavaScript — V8 solves similar problems (JIT, deoptimisation) for a dynamic language. An interesting comparison of approaches
- Linking and loading — Class loading is the managed-environment counterpart of dynamic linking, with verification
Questions for reflection
- GraalVM Native Image compiles Java AOT into a native binary. That conflicts with 'Write Once, Run Anywhere': a Linux x86-64 binary will not run on macOS ARM64. What trade-offs does the team make when choosing between AOT (Native Image) and traditional JIT?
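- invokedynamic lets you implement arbitrary dispatch semantics, yet Kotlin compiles coroutines to a state machine with an explicit `Continuation` parameter rather than to invokedynamic. Why is that a better fit for suspension? Where does Kotlin use invokedynamic instead (for example for lambdas), and what overhead remains in the coroutine state machine?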
- The JVM verifier checks stack types statically, once, when a class is linked. JIT-compiled code is generated by the JVM itself and is not verified again. How does the JVM handle deoptimisation (going back to interpreted code after JIT)? Does re-verification have to happen?