Computer Architecture

Superscalar: Multiple Instructions Per Cycle

Lesson Objectives

Understand the principle of superscalar execution and IPC > 1
Know the role of multiple execution units
Understand Out-of-Order execution and the Reorder Buffer
Know Register Renaming for eliminating false dependencies
Understand Speculative Execution

Prerequisites

Pipelining
Hazards
Branch Prediction

Modern processors do not necessarily execute instructions in the order in which they are written in the program. They reorder, speculate, and parallelize, all in pursuit of speed.

Knowing how this works helps with:

Understanding multithreaded code performance
Memory barriers in concurrent programming
Spectre/Meltdown vulnerabilities
Optimizing for specific microarchitectures

From IPC=1 to IPC>1

A **superscalar processor** executes multiple instructions per clock cycle using multiple execution units.

Processor	Issue Width	Year
Intel Pentium	2	1993
PowerPC 970	4	2002
Intel Core	4	2006
Apple M1 (P-core)	8	2020
Apple M2 (P-core)	8	2022

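**Issue Width:** the maximum number of instructions the CPU can dispatch per cycle. The M1 P-core has an issue width of 8, but real-world IPC is typically around 3-4 because dependencies between instructions keep the core from filling every slot.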

What does superscalar mean?

Execution Units

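To execute 4 instructions at once, the processor needs at least 4 execution units, and the instruction mix must match the units that are available. A typical core provides several kinds: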

Execution Unit	Operations	Count
Integer ALU	ADD, SUB, AND, OR, XOR	2-4
Load Unit	Memory reads	2
Store Unit	Memory writes	1-2
FPU	Floating-point operations	2
Branch Unit	Branches	1-2
SIMD/Vector	AVX/SSE operations	2

**Limitation:** Even with 8 ALUs, if every instruction depends on the result of the previous one, IPC cannot exceed 1. The parallelism must exist in the code itself!
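The limitation above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `ipc_upper_bound` and the dependency-list encoding are hypothetical, and it assumes every instruction takes exactly one cycle. It bounds IPC by two things at once, the longest dependency chain and the issue width.

```python
def ipc_upper_bound(deps, issue_width):
    """Upper bound on IPC for a straight-line instruction sequence.

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads. Simplifying assumption: every instruction
    has a latency of exactly one cycle.
    """
    depth = []  # depth[i] = length of the longest dependency chain ending at i
    for producers in deps:
        depth.append(1 + max((depth[p] for p in producers), default=0))
    critical_path = max(depth, default=0)
    if critical_path == 0:
        return 0.0
    # Cycles needed is at least the critical path, and at least
    # ceil(instruction count / issue width).
    min_cycles = max(critical_path, -(-len(deps) // issue_width))
    return len(deps) / min_cycles


# Eight instructions, each one reading the result of the previous one.
chain = [[], [0], [1], [2], [3], [4], [5], [6]]
# Eight instructions with no dependencies at all.
independent = [[] for _ in range(8)]

print(ipc_upper_bound(chain, 8))        # 1.0
print(ipc_upper_bound(independent, 8))  # 8.0
print(ipc_upper_bound(independent, 4))  # 4.0
```

With a fully dependent chain, widening the machine from 1 to 8 units changes nothing; with fully independent instructions, the issue width becomes the only limit.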

What limits the real-world IPC of a superscalar processor?

Out-of-Order Execution

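**Problem:** Many instructions depend on the results of earlier ones, so an in-order pipeline stalls whenever the next instruction is not ready. Further down the instruction queue, however, there are often instructions that do not depend on the stalled one.

**Out-of-Order (OoO):** The processor executes instructions as soon as their operands are ready rather than strictly in program order. For example, an independent MUL can execute while an earlier ADD is still computing!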

**Reorder Buffer (ROB):** Stores instructions in program order. Results are committed to architectural registers in the correct order, even if execution was out of order.
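A toy model may help show how out-of-order execution and in-order commit fit together. This is a minimal sketch under strong assumptions, not a model of any real core: unlimited execution units, registers already renamed, and a hypothetical three-instruction program with made-up latencies. The names `Instr` and `simulate` are invented for this example.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers,
# and execution latency in cycles.
Instr = namedtuple("Instr", "name dest srcs latency")


def simulate(program):
    """Return (finish, commit) cycle lists for a straight-line program.

    Execution is out of order: an instruction starts as soon as its
    source registers are ready. Commit is in program order, as a
    reorder buffer would enforce.
    """
    ready_at = {}  # register -> cycle at which its newest value is available
    finish = []
    for ins in program:
        start = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        done = start + ins.latency
        finish.append(done)
        ready_at[ins.dest] = done

    commit = []
    last = 0
    for done in finish:
        # An instruction cannot commit before all earlier ones have committed.
        last = max(last, done)
        commit.append(last)
    return finish, commit


program = [
    Instr("LOAD", "R1", ("R2",), 4),       # slow memory read
    Instr("ADD", "R3", ("R1", "R4"), 1),   # must wait for the LOAD
    Instr("MUL", "R5", ("R6", "R7"), 3),   # independent of both
]

finish, commit = simulate(program)
for ins, f, c in zip(program, finish, commit):
    print(f"{ins.name}: result ready at cycle {f}, commits at cycle {c}")
# LOAD: result ready at cycle 4, commits at cycle 4
# ADD: result ready at cycle 5, commits at cycle 5
# MUL: result ready at cycle 3, commits at cycle 5
```

The MUL finishes first but commits last, which is the behavior the ROB exists to guarantee: the architectural state always changes in program order.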

What is the Reorder Buffer (ROB) for?

Register Renaming

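**False dependencies:** Sometimes a dependency exists only in the register name, not in the actual data. Two instructions that reuse the same register for unrelated values do not really need to wait for each other.

**Register Renaming:** Each write to an architectural register such as R1 is assigned a fresh physical register, so unrelated uses of R1 no longer conflict!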

Dependency	Type	Solution
RAW (Read After Write)	True	Forwarding, OoO
WAW (Write After Write)	False	Register Renaming
WAR (Write After Read)	False	Register Renaming

**Physical registers:** x86-64 has 16 architectural general-purpose registers, but modern implementations provide on the order of 200 physical registers for renaming!
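The renaming idea can be sketched in a few lines. This is a simplified illustration, not how any particular CPU implements it: the function `rename`, the `P0`, `P1`, ... naming, and the three-instruction program are all hypothetical, and the free list is treated as unlimited (real hardware has a finite pool and recycles registers at commit).

```python
import itertools


def rename(program):
    """Map architectural registers to fresh physical registers.

    program is a list of (dest, srcs) pairs using architectural names.
    Returns the same instructions expressed in physical register names.
    """
    fresh = (f"P{i}" for i in itertools.count())  # unlimited free list (assumption)
    mapping = {}  # architectural register -> current physical register

    def lookup(reg):
        # A register read before any write gets a physical register
        # holding its initial value.
        if reg not in mapping:
            mapping[reg] = next(fresh)
        return mapping[reg]

    renamed = []
    for dest, srcs in program:
        # Sources are looked up BEFORE the destination is remapped,
        # so an instruction like R1 = R1 + 1 reads the old R1.
        phys_srcs = tuple(lookup(r) for r in srcs)
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], phys_srcs))
    return renamed


program = [
    ("R1", ("R2", "R3")),  # R1 = R2 + R3
    ("R4", ("R1", "R5")),  # R4 = R1 * R5   RAW on R1: true dependency
    ("R1", ("R6", "R7")),  # R1 = R6 - R7   WAW with #0, WAR with #1
]

for dest, srcs in rename(program):
    print(dest, "<-", srcs)
# P2 <- ('P0', 'P1')
# P4 <- ('P2', 'P3')
# P7 <- ('P5', 'P6')
```

After renaming, the third instruction writes P7 instead of P2, so it no longer conflicts with the first two. The true RAW dependency survives: the second instruction still reads P2, the register the first one writes.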

Speculative Execution

**Speculation:** Execute instructions ahead of time, typically along the path chosen by the branch predictor, before knowing whether they will be needed.

**If the prediction is correct:** Results are committed, everything is fine.

**If the prediction is wrong:** Speculative results are flushed and rolled back.

**Spectre/Meltdown:** These vulnerabilities exploit the fact that speculative execution leaves traces in the cache even after the architectural results are rolled back! By timing later memory accesses, an attacker can use those traces to infer the contents of protected memory.
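The difference between architectural rollback and leftover cache state can be shown with a deliberately tiny model. Everything here is an assumption made for illustration: the function `speculate`, the dictionary-based registers and memory, and the set standing in for the cache. It is not an exploit and does not model timing.

```python
def speculate(regs, memory, cache, path, prediction_correct):
    """Run a list of (dest_reg, address) loads speculatively.

    regs   : architectural registers (only updated on commit)
    memory : address -> value
    cache  : set of addresses currently cached (microarchitectural state)
    """
    pending = {}  # speculative results, not yet architecturally visible
    for dest, addr in path:
        pending[dest] = memory[addr]
        cache.add(addr)  # side effect happens even if we later roll back

    if prediction_correct:
        regs.update(pending)  # commit: results become architectural
    # On a misprediction, `pending` is simply discarded.
    # Note that nothing restores `cache`.
    return regs


regs = {"R1": 0}
memory = {0x40: 7}
cache = set()

speculate(regs, memory, cache, [("R1", 0x40)], prediction_correct=False)
print(regs)           # {'R1': 0}   architectural state was rolled back
print(0x40 in cache)  # True        but the cache still remembers the access

speculate(regs, memory, cache, [("R1", 0x40)], prediction_correct=True)
print(regs)           # {'R1': 7}   correct prediction: result committed
```

In this toy model the register file looks untouched after the misprediction, yet the set of cached addresses has changed. That gap between the two kinds of state is what the attacks rely on.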

**Common misconception:** "A superscalar CPU is always N times faster than a scalar one."

**In reality:** The actual speedup is limited by dependencies in the code. Typical IPC is 2-4, not 8.

Even with 8 execution units, if the code is strictly sequential, there is no parallelism to exploit.

What happens on a wrong speculation?

Key Ideas

Superscalar: multiple instructions per cycle (IPC > 1)
Execution Units: multiple ALUs, Load, Store, FPU
Out-of-Order: reordering for maximum parallelism
Register Renaming: eliminates WAW/WAR dependencies
Speculation: execution along the predicted branch path
Real IPC is limited by data dependencies in the code

Questions for Reflection

Why does superscalar execution yield smaller gains on code with many data dependencies?
How does out-of-order execution help hide memory latency in superscalar processors?
What is the instruction-level parallelism wall, and why does it limit superscalar scaling?

Related Lessons

os-01-intro
