## **CMSC 611: Advanced Computer Architecture**

**Pipelining** 

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science



- Washer takes 30 min, Dryer takes 40 min, folding takes 20 min
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?



- Pipelining means start work as soon as possible
- Pipelined laundry takes 3.5 hours for 4 loads

## **Pipelining Lessons**



*a* 

*s* 

*r* 

*e* 

- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Pipeline rate limited by slowest pipeline stage
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Unbalanced lengths of pipe stages reduces speedup
- Time to "fill" pipeline and time to "drain" it reduce speedup
- **Stall for Dependencies**

## **MIPS Instruction Set**

- RISC characterized by the following features that simplify implementation:
	- All ALU operations apply only on registers
	- Memory is affected only by load and store
	- Instructions follow very few formats and typically are of the same size



## Single-cycle Execution



## **Multi-Cycle Implementation of MIPS**

**10 Instruction fetch cycle (IF)** 

 $IR \leftarrow$  Mem[PC]; NPC  $\leftarrow$  PC + 4

**2** Instruction decode/register fetch cycle (ID)

A  $\leftarrow$  Regs[IR<sub>6..10</sub>]; B  $\leftarrow$  Regs[IR<sub>11.15</sub>]; Imm  $\leftarrow$  ((IR<sub>16</sub>)<sup>16</sup> ##IR<sub>16..31</sub>)

**8** Execution/effective address cycle (EX)

Memory ref:  $ALUOutput \leftarrow A + 1mm;$ 

<u>Reg-Reg ALU:</u> ALUOutput ← A *func* B;

 $Reg-lmm ALU:$  ALUOutput  $\leftarrow$  A *op* Imm; Branch:  $ALUOutput \leftarrow NPC + Imm$ ;  $Cond \leftarrow (A op 0)$ 

**4** Memory access/branch completion cycle (MEM)

Memory ref: LMD  $\leftarrow$  Mem[ALUOutput] or Mem(ALUOutput]  $\leftarrow$  B;  $B$ ranch: if (cond) PC  $\leftarrow$  ALUOutput;

% **Write-back cycle (WB)** 

 $Reg-Reg ALU: RegI[R_{16..20}] \leftarrow ALUOutput;$  $Reg-lmm ALU:$  Regs[IR<sub>11.15</sub>]  $\leftarrow$  ALUOutput;  $\frac{Load}{11.15}$   $\leftarrow$  LMD;

## Multi-cycle Execution



## **Stages of Instruction Execution**



- The load instruction is the longest
- All instructions follows at most the following five steps:
	- Ifetch: Instruction Fetch
		- Fetch the instruction from the Instruction Memory and update PC
	- Reg/Dec: Registers Fetch and Instruction Decode
	- Exec: Calculate the memory address
	- Mem: Read the data from the Data Memory
	- WB: Write the data back to the register file

# **Instruction Pipelining**

- Start handling next instruction while the current instruction is in progress
- Feasible when different devices at different stages



## **Example of Instruction Pipelining**



*Ideal and upper bound for speedup is number of stages in the pipeline* 

#### **Single Cycle**



- Cycle time long enough for longest instruction
- Shorter instructions waste time
- No overlap

## **Multiple Cycle**



- Cycle time long enough for longest stage
- **Shorter stages waste time**
- Shorter instructions can take fewer cycles
- No overlap

## Pipeline



- Cycle time long enough for longest stage
- **Shorter stages waste time**
- No additional benefit from shorter instructions
- Overlap instruction execution

## **Pipeline Performance**

- Pipeline increases the instruction throughput
	- not execution time of an individual instruction
- An individual instruction can be **slower**:
	- Additional pipeline control
	- Imbalance among pipeline stages
- Suppose we execute 100 instructions:
	- Single Cycle Machine
		- $\cdot$  45 ns/cycle x 1 CPI x 100 inst = 4500 ns
	- Multi-cycle Machine
		- $\cdot$  10 ns/cycle x 4.2 CPI (due to inst mix) x 100 inst = 4200 ns
	- Ideal 5 stages pipelined machine
		- $\cdot$  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
- Lose performance due to fill and drain

## **Pipeline Datapath**

- Every stage must be completed in one clock cycle to avoid stalls
- Values must be latched to ensure correct execution of **instructions**
- The PC multiplexer has moved to the IF stage to prevent two instructions from updating the PC simultaneously (in case of branch instruction)



## **Pipeline Stage Interface**



## **Pipeline Hazards**

- Cases that affect instruction execution semantics and thus need to be detected and corrected
- Hazards types
	- Structural hazard: attempt to use a resource two different ways at same time
		- Single memory for instruction and data
	- Data hazard: attempt to use item before it is ready
		- Instruction depends on result of prior instruction still in the pipeline
	- Control hazard: attempt to make a decision before condition is evaluated
		- branch instructions
- Hazards can always be resolved by waiting

#### **Visualizing Pipelining**

Time (clock cycles)



**I n s t r. O r d e r** 

## **Example: One Memory Port/ Structural Hazard**

**Time (clock cycles)** 



Slide: David Culler

## **Resolving Structural Hazards**

- 1. Wait
	- Must detect the hazard
		- Easier with uniform ISA
	- Must have mechanism to stall
		- Easier with uniform pipeline organization
- 2. Throw more hardware at the problem
	- Use instruction & data cache rather than direct access to memory

## **Detecting and Resolving Structural Hazard**





## **Stalls & Pipeline Performance**

Pipelining Speedup <sup>=</sup> Average instruction time unpipelined Average instruction time pipelined  $=\frac{CPI}{{2P1}}$  unpipelined CPI pipelined x Clock cycle unpipelined Clock cycle pipelined

```
Ideal CPI pipelined = 1
```
CPI pipelined = Ideal CPI+Pipeline stall cycles per instruction

 $=1+Pipeline$  stall cycles per instruction

 $Speedup = \frac{CPI \text{ unpipelined}}{1 - \sum_{i=1}^{N} m_i}$ 1 + Pipeline stall cycles per instruction x Clock cycle unpipelined Clock cycle pipelined

Assuming all pipeline stages are balanced

 $\overline{a}$  $Speedup = \frac{Pipeline depth}{Pipeline}$ 1 + Pipeline stall cycles per instruction

#### Data Hazards

**Time (clock cycles)** 



#### Three Generic Data Hazards

• Read After Write (RAW) Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

> **I: add r1,r2,r3 J: sub r4,r1,r3**

• Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

#### Three Generic Data Hazards

• Write After Read (WAR)  ${\sf Instr}_{\sf J}$  writes operand before  ${\sf Instr}_{\sf I}$  reads it

> **I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7**

- Called an "anti-dependence" in compilers.
	- This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
	- All instructions take 5 stages, and
	- Reads are always in stage 2, and
	- Writes are always in stage 5

#### Three Generic Data Hazards

• Write After Write (WAW)  $Instr_J$  writes operand before Instr<sub>i</sub> writes it.

> **I: mul r1,r4,r3 J: add r1,r2,r3 K: sub r6,r1,r7**

- Called an "output dependence" in compilers
	- This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline:
	- All instructions take 5 stages, and
	- Writes are always in stage 5
- Do see WAR and WAW in more complicated pipes

#### **Forwarding to Avoid Data** Hazard

**Time (clock cycles)** 

**I** 

**n s t** 

**r.** 

**O** 

**r d** 

**e** 

**r** 



#### HW Change for Forwarding



#### Data Hazard Eyen with Forwarding

**Time (clock cycles)** 



## **Resolving Load Hazards**

- Adding hardware? How? Where?
- Detection?
- Compilation techniques?

• What is the cost of load delays?

## Resolving the Load Data Hazard

Time (clock cycles)



How is this different from the instruction issue stall?

Slide: David Culler

## **Software Scheduling to Avoid Load Hazards**

**Try producing fast code for a = b + c;** 

 $d = e - f$ ;

#### **assuming a, b, c, d ,e, and f in memory.**



## **Instruction Set Connection**

- What is exposed about this organizational hazard in the instruction set?
- k cycle delay?
	- bad, CPI is not part of ISA
- k instruction slot delay
	- load should not be followed by use of the value in the next k instructions
- Nothing, but code can reduce run-time delays
- MIPS did the transformation in the assembler

## **Pipeline Hazards**

- Cases that affect instruction execution semantics and thus need to be detected and corrected
- Hazards types
	- Structural hazard: attempt to use a resource two different ways at same time
		- Single memory for instruction and data
	- Data hazard: attempt to use item before it is ready
		- Instruction depends on result of prior instruction still in the pipeline
	- Control hazard: attempt to make a decision before condition is evaluated
		- branch instructions
- Hazards can always be resolved by waiting

## **Control Hazard on Branches Three Stage Stall**



## **Example: Branch Stall Impact**

- If 30% branch, 3-cycle stall significant!
- Two part solution:
	- Determine branch taken or not sooner, AND
	- Compute taken branch address earlier
- MIPS branch tests if register  $= 0$  or  $\neq 0$
- MIPS Solution:
	- Move Zero test to ID/RF stage
	- Adder to calculate new PC in ID/RF stage
	- 1 clock cycle penalty for branch versus 3

#### **Pipelined MIPS Datapath**



# Four Branch Hazard **Alternatives**

- 1. Stall until branch direction is clear
- 2. Predict Branch Not Taken
	- Execute successor instructions in sequence
	- "Squash" instructions in pipeline if branch taken
	- Advantage of late pipeline state update
	- 47% MIPS branches not taken on average
	- PC+4 already calculated, so use it to get next instruction
- 3. Predict Branch Taken
	- 53% MIPS branches taken on average
	- But haven't calculated branch target address in MIPS
		- MIPS still incurs 1 cycle branch penalty
		- Other machines: branch target known before outcome

# **Four Branch Hazard Alternatives**

- 4. Delayed Branch
	- Define branch to take place AFTER a following instruction
		- branch instruction

........

sequential successor<sub>1</sub> sequential successor<sub>2</sub>

sequential successor<sub>n</sub>

**Branch delay of length** *n* 

branch target if taken

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this

........

## **Delayed Branch**

- Where to get branch delay slot instructions?
	- Before branch instruction
	- From the target address
		- only valuable when branch taken
	- From fall through
		- only valuable when branch not taken
	- Canceling branches allow more slots to be filled
- Compiler effectiveness for single delay slot:
	- Fills about 60% of branch delay slots
	- About 80% of instructions executed in branch delay slots useful in computation
	- 48% (60% x 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)



# **Branch-Delay Scheduling** Requirements



- Limitation on delayed-branch scheduling arise from:
	- Restrictions on instructions scheduled into the delay slots
	- Ability to predict at compile-time whether a branch is likely to be taken
- May have to fill with a no-op instruction
	- Average 30% wasted
- Additional PC is needed to allow safe operation in case of interrupts (more on this later)

# **Example: Evaluating Branch Alternatives**

Pipeline speedup =  $\frac{\text{Pipeline depth}}{\text{A} + \text{Diseling atell}}$ 

1 + Pipeline stall CPI

 $=$   $\frac{Pipeline depth}{4 + Rreneh frecurence (Rr)$ 

1 + Branch frequency x Branch penalty

Assume:

14% Conditional & Unconditional

65% Taken; 52% Delay slots not usefully filled



Slide: David Culler

#### **Static Branch Prediction**

- Examination of program behavior
	- Assume branch is usually taken based on statistics but misprediction rate still 9%-59%
- Predict on branch direction forward/backward based on statistics and code generation convention
	- Profile information from earlier program runs

