Computer Architecture
Superscalar: Multiple Instructions Per Cycle
Lesson Objectives
- Understand the principle of superscalar execution and IPC > 1
- Know the role of multiple execution units
- Understand Out-of-Order execution and the Reorder Buffer
- Know Register Renaming for eliminating false dependencies
- Understand Speculative Execution
Prerequisites
- Pipelining
- Hazards
- Branch Prediction
Modern processors do not execute instructions strictly in the order in which they appear in the program. They reorder, speculate, and parallelize, all in pursuit of speed. Knowing how this works helps with:
- Understanding multithreaded code performance
- Memory barriers in concurrent programming
- Spectre/Meltdown vulnerabilities
- Optimizing for specific microarchitectures
From IPC=1 to IPC>1
A **superscalar processor** executes multiple instructions per clock cycle using multiple execution units.
| Processor | Issue Width | Year |
|---|---|---|
| Intel Pentium | 2 | 1993 |
| PowerPC 970 | 4 | 2002 |
| Intel Core | 4 | 2006 |
| Apple M1 (P-core) | 8 | 2020 |
| Apple M2 (P-core) | 8 | 2022 |
**Issue Width:** the maximum number of instructions the CPU can dispatch per cycle. The M1 has an issue width of 8, but its real-world IPC is around 3-4 because dependencies between instructions keep it from filling every slot.
What does superscalar mean?
Execution Units
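To execute 4 instructions at once, the CPU needs at least 4 execution units, and their types must match the instruction mix (for example, two integer operations, one load, and one branch). Typical units and their approximate counts per core: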
| Execution Unit | Operations | Count |
|---|---|---|
| Integer ALU | ADD, SUB, AND, OR, XOR | 2-4 |
| Load Unit | Memory reads | 2 |
| Store Unit | Memory writes | 1-2 |
| FPU | Floating-point operations | 2 |
| Branch Unit | Branches | 1-2 |
| SIMD/Vector | AVX/SSE operations | 2 |
**Limitation:** Even with 8 ALUs, if every instruction depends on the result of the previous one, IPC cannot exceed 1. The parallelism must exist in the code itself!
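The limitation above can be illustrated with a toy scheduler. This is a minimal sketch under simplifying assumptions, not a model of any real core: every instruction takes exactly one cycle, any unit can run any instruction, and the names `schedule_ipc`, `deps`, and `width` are hypothetical, introduced only for this example.

```python
def schedule_ipc(deps, width):
    """Return the IPC achieved by a greedy scheduler.

    deps[i] lists the indices of earlier instructions whose results
    instruction i needs. Each cycle, up to `width` instructions whose
    producers finished in an earlier cycle are issued.
    Assumption: every instruction has a latency of one cycle.
    """
    n = len(deps)
    done_cycle = [None] * n          # cycle in which each instruction ran
    remaining = set(range(n))
    cycle = 0
    while remaining:
        cycle += 1
        # Readiness is decided before anything is issued in this cycle,
        # so a consumer can never run in the same cycle as its producer.
        ready = [i for i in sorted(remaining)
                 if all(done_cycle[d] is not None for d in deps[i])]
        for i in ready[:width]:
            done_cycle[i] = cycle
            remaining.discard(i)
    return n / cycle


# Eight instructions, each depending on the previous one.
chain = [[]] + [[i] for i in range(7)]
# Eight instructions with no dependencies at all.
independent = [[] for _ in range(8)]
# Two interleaved dependency chains of four instructions each.
two_chains = [[], [], [0], [1], [2], [3], [4], [5]]

print(schedule_ipc(chain, 8))        # 1.0
print(schedule_ipc(independent, 8))  # 8.0
print(schedule_ipc(independent, 4))  # 4.0
print(schedule_ipc(two_chains, 8))   # 2.0
```

In this model the dependency chain runs at IPC = 1 no matter how wide the machine is, while two independent chains allow IPC = 2 even though 8 slots are available.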
What limits the real-world IPC of a superscalar processor?
Out-of-Order Execution
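**Problem:** Consecutive instructions often depend on each other, so an in-order CPU stalls while it waits for a result. Further along in the instruction stream, however, there may be independent instructions that are ready to run.
**Out-of-Order (OoO):** The CPU reorders execution so that an independent instruction (for example, a MUL) runs while an earlier instruction (for example, an ADD) is still computing!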
**Reorder Buffer (ROB):** Stores instructions in program order. Results are committed to architectural registers in the correct order, even if execution was out of order.
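The interplay between out-of-order completion and in-order commit can be sketched with a small model. The three-instruction program, its latencies, and the function name `run_with_rob` are assumptions made for illustration. The model also assumes unlimited execution units and unlimited commit bandwidth, which no real core has.

```python
from collections import deque


def run_with_rob(program):
    """Simulate out-of-order completion with in-order commit.

    program: list of (name, latency, deps) in program order, where deps
    holds indices of earlier instructions this one must wait for.
    Returns (completion_order, commit_order, commit_cycle).
    """
    n = len(program)
    finish = [0] * n
    # An instruction starts as soon as all of its producers have finished.
    for i, (_, latency, deps) in enumerate(program):
        start = max((finish[d] for d in deps), default=0)
        finish[i] = start + latency

    # Order in which results actually become available.
    by_finish = sorted(range(n), key=lambda i: (finish[i], i))
    completion_order = [program[i][0] for i in by_finish]

    # The ROB is a FIFO in program order: only the head may commit,
    # and only once its result is available.
    rob = deque(range(n))
    commit_order, commit_cycle = [], {}
    cycle = 0
    while rob:
        cycle += 1
        while rob and finish[rob[0]] <= cycle:
            name = program[rob.popleft()][0]
            commit_order.append(name)
            commit_cycle[name] = cycle
    return completion_order, commit_order, commit_cycle


program = [
    ("LOAD", 4, []),    # slow memory read
    ("ADD", 1, [0]),    # needs the loaded value
    ("MUL", 3, []),     # independent of both
]
completion, commit, cycles = run_with_rob(program)
print(completion)  # ['MUL', 'LOAD', 'ADD']
print(commit)      # ['LOAD', 'ADD', 'MUL']
print(cycles)      # {'LOAD': 4, 'ADD': 5, 'MUL': 5}
```

Here MUL finishes first, in cycle 3, but its result becomes architecturally visible only in cycle 5, after LOAD and ADD have committed ahead of it.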
What is the Reorder Buffer (ROB) for?
Register Renaming
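**False dependencies:** Sometimes a dependency exists only in the register name, not in the actual data: two instructions conflict merely because they reuse the same register.
**Register Renaming:** Each write to an architectural register (say, R1) is mapped to a fresh physical register, so instructions that only share the name R1 no longer conflict!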
| Dependency | Type | Solution |
|---|---|---|
| RAW (Read After Write) | True | Forwarding, OoO |
| WAW (Write After Write) | False | Register Renaming |
| WAR (Write After Read) | False | Register Renaming |
**Physical registers:** x86-64 has 16 general-purpose architectural registers, but modern cores have roughly 200 physical registers available for renaming!
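Renaming can be sketched as a lookup table from architectural to physical register names. This is a simplified illustration, not a hardware description: it assumes an unlimited supply of physical registers (real hardware recycles them through a free list), and the `rename` function and the four-instruction program are hypothetical.

```python
from itertools import count


def rename(program):
    """Rename architectural registers to physical ones.

    program: list of (dest, sources) using architectural names.
    Returns the same instructions using physical names P0, P1, ...
    """
    phys = count()   # assumption: physical registers never run out
    table = {}       # rename table: architectural -> current physical
    renamed = []
    for dest, sources in program:
        new_sources = []
        for src in sources:
            # A register read before any write gets an initial mapping.
            if src not in table:
                table[src] = f"P{next(phys)}"
            # Reads use the latest mapping, preserving RAW dependencies.
            new_sources.append(table[src])
        # Every write gets a fresh physical register, so WAW and WAR
        # conflicts on the architectural name disappear.
        table[dest] = f"P{next(phys)}"
        renamed.append((table[dest], tuple(new_sources)))
    return renamed


program = [
    ("R1", ("R2", "R3")),   # R1 = R2 + R3
    ("R4", ("R1", "R5")),   # R4 = R1 * R5  (RAW on R1)
    ("R1", ("R6", "R7")),   # R1 = R6 + R7  (WAW with 1st, WAR with 2nd)
    ("R8", ("R1", "R4")),   # R8 = R1 + R4  (RAW on the new R1)
]
renamed = rename(program)
for instr in renamed:
    print(instr)
# ('P2', ('P0', 'P1'))
# ('P4', ('P2', 'P3'))
# ('P7', ('P5', 'P6'))
# ('P8', ('P7', 'P4'))
```

After renaming, the two writes to R1 target different physical registers (P2 and P7), so the third instruction no longer has to wait for the first two. The true dependencies remain: the second instruction still reads P2 and the fourth still reads P7.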
Which dependencies does Register Renaming eliminate?
Speculative Execution
**Speculation:** Execute instructions ahead of time, typically along the path chosen by the branch predictor, without knowing whether they will be needed.
**If the prediction is correct:** Results are committed, everything is fine.
**If the prediction is wrong:** Speculative results are flushed and rolled back.
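The commit-or-flush behavior can be sketched as follows. This is a deliberately minimal model: speculative results are kept in a separate buffer and reach the architectural registers only if the prediction turns out to be correct. The function `run_speculative` and the register values are hypothetical. The model covers architectural state only, so microarchitectural state such as cache contents is not represented.

```python
def run_speculative(arch_regs, predicted_taken, actual_taken,
                    taken_path, fallthrough_path):
    """Execute one branch speculatively and resolve it.

    Each path is a list of (dest_register, value) writes.
    arch_regs is updated in place; returns "committed" or "flushed".
    """
    # Run the predicted path, holding results in a speculative buffer.
    spec = {}
    predicted = taken_path if predicted_taken else fallthrough_path
    for dest, value in predicted:
        spec[dest] = value

    if predicted_taken == actual_taken:
        arch_regs.update(spec)   # prediction correct: results commit
        return "committed"

    # Prediction wrong: discard speculative results, run the real path.
    spec.clear()
    correct = taken_path if actual_taken else fallthrough_path
    for dest, value in correct:
        arch_regs[dest] = value
    return "flushed"


taken = [("R2", 111)]
fallthrough = [("R2", 222)]

regs_ok = {"R1": 10, "R2": 0}
result_ok = run_speculative(regs_ok, True, True, taken, fallthrough)
print(result_ok, regs_ok)    # committed {'R1': 10, 'R2': 111}

regs_bad = {"R1": 10, "R2": 0}
result_bad = run_speculative(regs_bad, True, False, taken, fallthrough)
print(result_bad, regs_bad)  # flushed {'R1': 10, 'R2': 222}
```

In the mispredicted case the value 111 never reaches the architectural registers. The cost is the work that was thrown away plus the time needed to restart on the correct path.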
**Spectre/Meltdown:** These vulnerabilities exploit the fact that speculative execution leaves traces in the cache even after the architectural results are rolled back! By timing later memory accesses, an attacker can infer data from protected memory.
Common misconception: a superscalar CPU is always N times faster than a scalar one.
In reality, the speedup is limited by dependencies in the code. Typical IPC is 2-4, not 8.
Even with 8 execution units, strictly sequential code offers no parallelism to exploit.
What happens on a wrong speculation?
Key Ideas
- Superscalar: multiple instructions per cycle (IPC > 1)
- Execution Units: multiple ALUs, Load, Store, FPU
- Out-of-Order: reordering for maximum parallelism
- Register Renaming: eliminates WAW/WAR dependencies
- Speculation: execution along the predicted branch path
- Real IPC is limited by data dependencies in the code
Related Topics
Superscalar, out-of-order execution is among the most sophisticated techniques in mainstream CPU design, and it connects to several other topics:
- RISC vs CISC — RISC is easier to superscalarize
- Cache memory — Data must be in cache for high IPC
Questions for Reflection
- Why does superscalar execution yield smaller gains on code with many data dependencies?
- How does out-of-order execution help hide memory latency in superscalar processors?
- What is the instruction-level parallelism wall, and why does it limit superscalar scaling?