# **CMSC 611: Advanced Computer Architecture**

Pipelining

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science



- Washer takes 30 min, Dryer takes 40 min, folding takes 20 min
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?



- Pipelining means start work as soon as possible
- Pipelined laundry takes 3.5 hours for 4 loads

# **Pipelining Lessons**



- Pipelining doesn't help latency of single task, it helps throughput of entire workload
- Pipeline rate limited by slowest pipeline stage
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Unbalanced lengths of pipe stages
  reduces speedup
- Time to "fill" pipeline and time to "drain" it reduce speedup
- Stall for Dependencies

## **MIPS Instruction Set**

- RISC characterized by the following features that simplify implementation:
  - All ALU operations apply only on registers
  - Memory is affected only by load and store
  - Instructions follow very few formats and typically are of the same size

| 31 | 26     | 21               | 16     | 11     | 6         | 0      |
|----|--------|------------------|--------|--------|-----------|--------|
|    | ор     | rs               | rt     | rd     | shamt     | funct  |
|    | 6 bits | 5 bits           | 5 bits | 5 bits | 5 bits    | 6 bits |
| 31 | 26     | 21               | 16     |        |           | 0      |
|    | ор     | rs               | rt     |        | immediate |        |
|    | 6 bits | 5 bits           | 5 bits |        | 16 bits   |        |
| 31 | 26     |                  |        |        |           | 0      |
|    | ор     | p target address |        |        |           |        |
|    | 6 bits | 26 bits          |        |        |           |        |

- R-type (register)
  - Most operations
    - add \$t1, \$s3, \$s4 # \$t1 = \$s3 + \$s4
  - rd, rs, rt all registers

- op always 0, funct gives actual function



#### I-type (immediate)

- ALU with one immediate operand
  - addi \$t1, \$s2, 32 # \$t1 = \$s2 + 32
- Load, store within ±2<sup>15</sup> of register
  - lw \$t0, 32(\$s2) # \$s1 = \$s2[32] or \*(32+s2)
- Load immediate values
  - lui \$t0, 255 # \$t0 = (255<<16)
  - li \$t0, 255

| 31 | 26     | 21     | 16     | 0         |
|----|--------|--------|--------|-----------|
|    | ор     | rs     | rt     | immediate |
|    | 6 bits | 5 bits | 5 bits | 16 bits   |

- I-type (immediate)
  - PC-relative conditional branch
  - ±2<sup>15</sup> from PC after instruction
    - beq \$s1, \$s2, L1 # goto L1 if (\$s1 = \$s2)
    - bne \$s1, \$s2, L1 # goto L1 if (\$s1 ≠ \$s2)



- J-type (jump)
  - unconditional jump
    - j L1 # goto L1
  - Address is concatenated to top bits of PC
    - Fixed addressing within 2<sup>26</sup>



## **Single-cycle Execution**



# MIPS

- **1** Instruction fetch cycle (IF)
  - IR  $\leftarrow$  Mem[PC]; NPC  $\leftarrow$  PC + 4
- Instruction decode/register fetch cycle (ID)

 $A \leftarrow \text{Regs}[\text{IR}_{6..10}]; \qquad B \leftarrow \text{Regs}[\text{IR}_{11..15}]; \qquad \text{Imm} \leftarrow ((\text{IR}_{16})^{16} \# \# \text{IR}_{16..31})$ 

Execution/effective address cycle (EX)

| Memory ref:          | ALUOutput ← A + Imm;         |                            |
|----------------------|------------------------------|----------------------------|
| Reg-Reg ALU:         | ALUOutput ← A <i>func</i> B; |                            |
| <u>Reg-Imm ALU</u> : | ALUOutput ← A op Imm;        |                            |
| Branch:              | ALUOutput                    | Cond $\leftarrow$ (A op 0) |

**O** Memory access/branch completion cycle (MEM)

| Memory ref: | LMD                                  | or | Mem(ALUOutput] $\leftarrow$ B; |
|-------------|--------------------------------------|----|--------------------------------|
| Branch:     | if (cond) PC $\leftarrow$ ALUOutput; |    |                                |

**O** Write-back cycle (WB)

Reg-Reg ALU:Regs[ $IR_{16..20}$ ]  $\leftarrow$  ALUOutput;Reg-Imm ALU:Regs[ $IR_{11..15}$ ]  $\leftarrow$  ALUOutput;Load:Regs[ $IR_{11..15}$ ]  $\leftarrow$  LMD;

## **Multi-cycle Execution**





- The load instruction is the longest
- All instructions follows at most the following five steps:
  - Ifetch: Instruction Fetch
    - Fetch the instruction from the Instruction Memory and update PC
  - Reg/Dec: Registers Fetch and Instruction Decode
  - Exec: Calculate the memory address
  - Mem: Read the data from the Data Memory
  - WB: Write the data back to the register file

# **Instruction Pipelining**

- Start handling next instruction while the current instruction is in progress
- Feasible when different devices at different stages



# **Example of Instruction Pipelining**



Ideal and upper bound for speedup is number of stages in the pipeline



- Cycle time long enough for longest instruction
- Shorter instructions waste time
- No overlap

## **Multiple Cycle**



- Cycle time long enough for longest stage
- Shorter stages waste time
- Shorter instructions can take fewer cycles
- No overlap

## **Pipeline**



- Cycle time long enough for longest stage
- Shorter stages waste time
- No additional benefit from shorter instructions
- Overlap instruction execution

## **Pipeline Performance**

- Pipeline increases the instruction throughput
  - not execution time of an individual instruction
- An individual instruction can be **slower**:
  - Additional pipeline control
  - Imbalance among pipeline stages
- Suppose we execute 100 instructions:
  - Single Cycle Machine
    - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
  - Multi-cycle Machine
    - 10 ns/cycle x 4.2 CPI (due to inst mix) x 100 inst = 4200 ns
  - Ideal 5 stages pipelined machine
    - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
- Lose performance due to fill and drain

## **Pipeline Datapath**

- Every stage must be completed in one clock cycle to avoid stalls
- Values must be latched to ensure correct execution of instructions
- The PC multiplexer has moved to the IF stage to prevent two instructions from updating the PC simultaneously (in case of branch instruction)



## **Pipeline Stage Interface**

| Stage | Any Instruction                                                                                                                                                                                                                                                         |                                                                                                                                        |                                                               |  |  |
|-------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|--|--|
| IF    | IF/ID.IR ←MEM[PC] ;<br>IF/ID.NPC,PC ← ( if ( (EX/MEM.opcode == branch) & EX/MEM.cond)<br>{EX/MEM.ALUOutput } else { PC + 4 } ) ;                                                                                                                                        |                                                                                                                                        |                                                               |  |  |
| ID    | $\begin{split} &\text{ID/EX.A = Regs[IF/ID. IR_{610}]; ID/EX.B \leftarrow Regs[IF/ID. IR_{1115}];} \\ &\text{ID/EX.NPC \leftarrow IF/ID.NPC ; ID/EX.IR \leftarrow IF/ID.IR;} \\ &\text{ID/EX.Imm} \leftarrow (IF/ID. IR_{16})^{16} \# \# IF/ID. IR_{1631}; \end{split}$ |                                                                                                                                        |                                                               |  |  |
|       | ALU                                                                                                                                                                                                                                                                     | Load or Store                                                                                                                          | Branch                                                        |  |  |
| EX    | EX/MEM.IR = ID/EX.IR;<br>EX/MEM. ALUOutput ←<br>ID/EX.A func ID/EX.B;<br>Or<br>EX/MEM.ALUOutput ←<br>ID/EX.A op ID/EX.Imm;<br>EX/MEM.cond ← 0;                                                                                                                          | EX/MEM.IR ← ID/EX.IR;<br>EX/MEM.ALUOutput ←<br>ID/EX.A + ID/EX.Imm;<br>EX/MEM.cond ← 0;                                                | EX/MEM.ALUOutput ←<br>ID/EX.NPC + ID/EX.Imm;<br>EX/MEM.cond ← |  |  |
| МЕМ   | MEM/WB.IR ←EX/MEM.IR;<br>MEM/WB.ALUOutput ←<br>EX/MEM.ALUOutput;                                                                                                                                                                                                        | EX/MEM.B ←ID/EX.B;<br>MEM/WB.IR ← EX/MEM.IR;<br>MEM/WB.LMD ←<br>Mem[EX/MEM.ALUOutput] ;<br>Or<br>Mem[EX/MEM.ALUOutput] ←<br>EX/MEM.B ; | (ID/EX.A op 0);                                               |  |  |
| WB    | Regs[MEM/WB. IR <sub>1620</sub> ] ←<br>EM/WB.ALUOutput;<br>Or<br>Regs[MEM/WB. IR <sub>1115</sub> ] ←<br>MEM/WB.ALUOutput ;                                                                                                                                              | For load only:<br>Regs[MEM/WB. IR 1115] ←<br>MEM/WB.LMD;                                                                               |                                                               |  |  |

## **Pipeline Hazards**

- Cases that affect instruction execution semantics and thus need to be detected and corrected
- Hazards types
  - Structural hazard: attempt to use a resource two different ways at same time
    - Single memory for instruction and data
  - Data hazard: attempt to use item before it is ready
    - Instruction depends on result of prior instruction still in the pipeline
  - Control hazard: attempt to make a decision before condition is evaluated
    - branch instructions
- Hazards can always be resolved by waiting



Slide: David Culler



## **Resolving Structural Hazards**

- 1. Wait
  - Must detect the hazard
    - Easier with uniform ISA
  - Must have mechanism to stall
    - Easier with uniform pipeline organization
- 2. Throw more hardware at the problem
  - Use instruction & data cache rather than direct access to memory

# Detecting and Resolving Structural Hazard

Time (clock cycles)



Slide: David Culler

## **Stalls & Pipeline Performance**

 $\begin{aligned} \text{Pipelining Speedup} &= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} \\ &= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} \end{aligned}$ 

```
Ideal CPI pipelined = 1
```

CPI pipelined = Ideal CPI+ Pipeline stall cycles per instruction

= 1+ Pipeline stall cycles per instruction

Speedup =  $\frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$ 

Assuming all pipeline stages are balanced

Speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$