Computer Architecture

Superscalar: Multiple Instructions Per Cycle

Lesson Objectives

  • Understand the principle of superscalar execution and IPC > 1
  • Know the role of multiple execution units
  • Understand Out-of-Order execution and the Reorder Buffer
  • Know Register Renaming for eliminating false dependencies
  • Understand Speculative Execution

Prerequisites

  • Pipelining
  • Hazards
  • Branch Prediction

Modern processors do not necessarily execute instructions in the order in which the program lists them. They reorder, speculate, and parallelize, all in pursuit of speed.

Knowing how this works helps with:

  • Understanding multithreaded code performance
  • Memory barriers in concurrent programming
  • Spectre/Meltdown vulnerabilities
  • Optimizing for specific microarchitectures

From IPC=1 to IPC>1

A **superscalar processor** executes multiple instructions per clock cycle using multiple execution units.

| Processor | Issue Width | Year |
|---|---|---|
| Intel Pentium | 2 | 1993 |
| PowerPC 970 | 4 | 2002 |
| Intel Core | 4 | 2006 |
| Apple M1 (P-core) | 8 | 2020 |
| Apple M2 (P-core) | 8 | 2022 |

**Issue Width:** how many instructions the CPU can dispatch per cycle. The M1 P-core has an issue width of 8, but real-world IPC is typically around 3-4 because of dependencies between instructions.

What does superscalar mean?

Execution Units

To execute 4 instructions at once, the CPU needs at least 4 execution units of the right types for those instructions:

| Execution Unit | Operations | Count |
|---|---|---|
| Integer ALU | ADD, SUB, AND, OR, XOR | 2-4 |
| Load Unit | Memory reads | 2 |
| Store Unit | Memory writes | 1-2 |
| FPU | Floating-point operations | 2 |
| Branch Unit | Branches | 1-2 |
| SIMD/Vector | AVX/SSE operations | 2 |

**Limitation:** Even with 8 ALUs, if every instruction depends on the result of the previous one, IPC cannot exceed 1. Parallelism must exist in the code itself!
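The limitation can be illustrated with a minimal scheduling sketch. It assumes every instruction takes exactly one cycle and that any unit can run any instruction, which real CPUs do not guarantee; the function name `schedule_cycles` and the two example programs are hypothetical.

```python
from typing import Dict, List


def schedule_cycles(deps: Dict[int, List[int]], width: int) -> int:
    """Greedy cycle-by-cycle schedule, assuming unit latency.

    deps[i] lists the instructions whose results instruction i needs.
    Instructions are numbered in program order and depend only on
    earlier instructions.
    """
    done_cycle: Dict[int, int] = {}  # instruction -> cycle in which it executed
    pending = sorted(deps)
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for i in pending:
            if len(issued) == width:
                break  # all issue slots for this cycle are used
            # Ready only if every producer executed in an earlier cycle.
            if all(done_cycle.get(d, cycle) < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        pending = [i for i in pending if i not in done_cycle]
    return cycle


# Hypothetical programs of 8 instructions each.
chain = {i: ([i - 1] if i > 0 else []) for i in range(8)}  # each needs the previous one
independent = {i: [] for i in range(8)}                    # no dependencies at all

for name, prog in (("chain", chain), ("independent", independent)):
    cycles = schedule_cycles(prog, width=8)
    print(f"{name}: {cycles} cycles, IPC = {len(prog) / cycles}")
# chain: 8 cycles, IPC = 1.0
# independent: 1 cycles, IPC = 8.0
```

Both programs run on the same 8-wide machine; only the dependency structure differs.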

What limits the real-world IPC of a superscalar processor?

Out-of-Order Execution

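**Problem:** An instruction often depends on the result of an earlier one and has to wait for it. But further along in the queue there may be independent instructions!

**Out-of-Order (OoO):** Reorder the work: execute a later, independent instruction (for example, a MUL) while an earlier ADD has not finished yet!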
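A rough way to see the benefit is to simulate issue with and without reordering. The sketch below is a simplification under stated assumptions: a 2-wide machine, made-up latencies, no limit on execution units, and a hypothetical four-instruction program. The names `Instr` and `run` are illustrative only.

```python
from typing import Dict, List, NamedTuple


class Instr(NamedTuple):
    name: str
    deps: List[int]  # indices of earlier instructions whose results are read
    latency: int     # cycles from issue until the result is available (assumed)


def run(program: List[Instr], width: int, out_of_order: bool) -> int:
    """Return the cycle in which the last result becomes available."""
    ready_at: Dict[int, int] = {}  # instruction index -> cycle its result is ready
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        cycle += 1
        issued = []
        for i in waiting:
            if len(issued) == width:
                break
            operands_ready = all(
                d in ready_at and ready_at[d] <= cycle for d in program[i].deps
            )
            if operands_ready:
                issued.append(i)
            elif not out_of_order:
                break  # in-order issue: a stalled instruction blocks younger ones
        for i in issued:
            ready_at[i] = cycle + program[i].latency
        waiting = [i for i in waiting if i not in ready_at]
    return max(ready_at.values())


program = [
    Instr("LOAD r1", deps=[], latency=4),
    Instr("ADD r2 = r1 + r1", deps=[0], latency=1),  # must wait for the load
    Instr("MUL r3 = r4 * r5", deps=[], latency=3),   # independent of the load
    Instr("SUB r6 = r3 - r4", deps=[2], latency=1),
]

print(run(program, width=2, out_of_order=False))  # 9
print(run(program, width=2, out_of_order=True))   # 6
```

In the in-order run the MUL sits behind the stalled ADD, while the out-of-order run starts it in the first cycle alongside the load.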

**Reorder Buffer (ROB):** Stores instructions in program order. Results are committed to architectural registers in the correct order, even if execution was out of order.
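The in-order commit rule can be sketched as a queue that only retires from its head. This is a toy model, not a description of any particular CPU: the class `ReorderBuffer`, its method names, and the register names are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ReorderBuffer:
    def __init__(self) -> None:
        self.entries: Deque[dict] = deque()  # oldest instruction on the left
        self.arch_regs: Dict[str, int] = {}  # committed (architectural) state

    def allocate(self, dest: str) -> dict:
        """Reserve an entry at dispatch time, in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry: dict, value: int) -> None:
        """Record a result; completions may arrive in any order."""
        entry["value"] = value
        entry["done"] = True

    def commit(self) -> List[Tuple[str, int]]:
        """Retire finished entries from the head, stopping at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append((entry["dest"], entry["value"]))
        return retired


rob = ReorderBuffer()
add = rob.allocate("r1")   # older instruction
mul = rob.allocate("r2")   # younger instruction

rob.complete(mul, 42)      # the younger one finishes first
first = rob.commit()       # nothing retires: the older ADD is still unfinished
rob.complete(add, 7)
second = rob.commit()      # both retire, in program order

print(first)   # []
print(second)  # [('r1', 7), ('r2', 42)]
```

Until the head entry is done, younger results stay in the buffer and the architectural registers are left unchanged.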

What is the Reorder Buffer (ROB) for?

Register Renaming

**False dependencies:** Sometimes a dependency exists only in the register name, not in the actual data.

**Register Renaming:** Map each new write to an architectural register (such as R1) to a different physical register!

| Dependency | Type | Solution |
|---|---|---|
| RAW (Read After Write) | True | Forwarding, OoO |
| WAW (Write After Write) | False | Register Renaming |
| WAR (Write After Read) | False | Register Renaming |

**Physical registers:** x86-64 has 16 general-purpose architectural registers, but modern cores have on the order of 200 physical registers available for renaming!
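Renaming can be sketched as a mapping table from architectural to physical registers. The version below is deliberately simplified: it assumes an unlimited supply of physical registers and never frees them, and the function `rename`, the `p`-prefixed names, and the three-instruction program are hypothetical.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, List[str]]  # (destination register, source registers)


def rename(program: List[Instr], arch_regs: List[str]) -> List[Instr]:
    # Each architectural register starts out mapped to its own physical register.
    mapping: Dict[str, str] = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    next_free = len(arch_regs)
    renamed = []
    for dest, srcs in program:
        # Sources read the current mapping, which preserves true (RAW) dependencies.
        new_srcs = [mapping[s] for s in srcs]
        # Every write gets a fresh physical register, which removes WAW and WAR.
        mapping[dest] = f"p{next_free}"
        next_free += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed


regs = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]
program = [
    ("R1", ["R2", "R3"]),  # R1 = R2 + R3
    ("R4", ["R1", "R5"]),  # R4 = R1 * R5  (RAW on R1: a true dependency)
    ("R1", ["R6", "R7"]),  # R1 = R6 - R7  (WAW with the 1st, WAR with the 2nd)
]

for dest, srcs in rename(program, regs):
    print(dest, "<-", srcs)
# p7 <- ['p1', 'p2']
# p8 <- ['p7', 'p4']
# p9 <- ['p5', 'p6']
```

After renaming, the third instruction writes `p9` instead of reusing the register the second instruction reads, so it no longer has to wait for either earlier instruction.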

Speculative Execution

**Speculation:** Execute instructions ahead of time, along the predicted branch path, without yet knowing whether they will be needed.

**If the prediction is correct:** Results are committed, everything is fine.

**If the prediction is wrong:** Speculative results are flushed and rolled back.
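The commit-or-rollback behavior can be sketched with a checkpoint of the register state. This is a toy model under loud assumptions: it tracks only register values (not caches or any other microarchitectural state), and `run_branch`, `taken`, and `fallthrough` are hypothetical names.

```python
from typing import Callable, Dict

Regs = Dict[str, int]
Path = Callable[[Regs], None]  # a code path that updates registers in place


def run_branch(regs: Regs, predicted_taken: bool, actually_taken: bool,
               taken_path: Path, fallthrough_path: Path) -> Regs:
    checkpoint = dict(regs)   # snapshot taken before speculating
    speculative = dict(regs)  # speculative results live in a separate copy
    predicted = taken_path if predicted_taken else fallthrough_path
    predicted(speculative)    # run ahead before the branch outcome is known

    if predicted_taken == actually_taken:
        return speculative    # correct prediction: commit the results

    # Wrong prediction: discard speculative results, restart from the checkpoint.
    recovered = dict(checkpoint)
    correct = taken_path if actually_taken else fallthrough_path
    correct(recovered)
    return recovered


def taken(r: Regs) -> None:
    r["r1"] = r["r2"] + 1


def fallthrough(r: Regs) -> None:
    r["r1"] = 0


start = {"r1": 5, "r2": 10}
print(run_branch(start, True, True, taken, fallthrough))   # {'r1': 11, 'r2': 10}
print(run_branch(start, True, False, taken, fallthrough))  # {'r1': 0, 'r2': 10}
```

In the mispredicted case the value computed on the wrong path never reaches the returned register state.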

**Spectre/Meltdown:** These vulnerabilities exploit the fact that speculative execution leaves traces in the cache even after rollback! By timing cache accesses, an attacker can use those traces to infer the contents of protected memory.

**Common misconception:** "A superscalar CPU is always N times faster than a scalar one."

The actual speedup is limited by dependencies in the code. Typical IPC is 2-4, not 8.

Even with 8 execution units, if the code is sequential, there is no parallelism to exploit.

What happens on a wrong speculation?

Key Ideas

  • Superscalar: multiple instructions per cycle (IPC > 1)
  • Execution Units: multiple ALUs, Load, Store, FPU
  • Out-of-Order: reordering for maximum parallelism
  • Register Renaming: eliminates WAW/WAR dependencies
  • Speculation: execution along the predicted branch path
  • Real IPC is limited by data dependencies in the code

Related Topics

Superscalar out-of-order execution is central to modern high-performance CPU design and connects to several other topics.

  • RISC vs CISC: RISC instruction sets are easier to make superscalar
  • Cache memory: data must be in cache for high IPC

Questions for Reflection

  • Why does superscalar execution yield smaller gains on code with many data dependencies?
  • How does out-of-order execution help hide memory latency in superscalar processors?
  • What is the instruction-level parallelism wall, and why does it limit superscalar scaling?

Related Lessons

  • os-01-intro