ISA Change
Assume loads are 20% of the total instructions in your workload and stores are 10%. You are considering eliminating the immediate offset from the load and store instructions of a MIPS-like instruction set. That change would replace
LW R1, offset(R2)
with
ADDI R2, R2, offset
LW R1, (R2)
But 40% of the loads and stores use an offset of 0, which allows replacing
LW R1, 0(R2)
with just
LW R1, (R2)
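The effect of this change on instruction count follows from the figures above. A minimal Python sketch, assuming each load or store with a nonzero offset needs exactly one extra ADDI and nothing else in the program changes (the variable names are illustrative, not part of the problem):

```python
# Instruction-count effect of removing the offset from loads and stores.
# Assumption: each nonzero-offset load/store costs exactly one extra ADDI.
load_frac = 0.20         # loads, as a fraction of all instructions
store_frac = 0.10        # stores, as a fraction of all instructions
zero_offset_frac = 0.40  # loads/stores that already use an offset of 0

mem_frac = load_frac + store_frac
# Added ADDI instructions per original instruction.
extra_addi = mem_frac * (1.0 - zero_offset_frac)
relative_instruction_count = 1.0 + extra_addi

print(f"extra ADDI per original instruction: {extra_addi:.2f}")
print(f"new instruction count relative to old: {relative_instruction_count:.2f}")
```

Under these assumptions the program grows by about 18%, which any gain from the shorter pipeline has to outweigh.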
Organization
This change will allow you to combine the EX and DM stages of the 5-stage MIPS pipeline. Draw the organization for the new 4-stage pipeline.
Forwarding
Complete a pipeline execution timeline for the original 5-stage MIPS pipeline with forwarding for the following sequence of instructions. If forwarding occurs, indicate the stages involved.
LD R1, 0(R2)
ADD R3, R1, R1
Complete a pipeline execution timeline for the new 4-stage pipeline with forwarding for the following sequence of instructions. If forwarding occurs, indicate the stages involved.
LD R1, (R2)
ADD R3, R1, R1
Speedup
According to data collected with your application using the original ISA, 10% of loads have a load-to-ALU data hazard that the compiler does not remove. Assuming no change in clock speed, use this data, together with the other data given above, to find the overall expected speedup of the 4-stage design over the original 5-stage MIPS architecture.
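One way to set up the comparison is execution time = instruction count × CPI × cycle time, with the cycle time unchanged. The sketch below is one possible accounting, not a definitive solution. It assumes that a load-to-ALU hazard costs one stall cycle in the 5-stage pipeline with forwarding, that merging EX and DM removes that stall in the 4-stage pipeline, and that there are no other stalls. The function name and its parameters are illustrative.

```python
def speedup_4stage_over_5stage(load_frac=0.20, store_frac=0.10,
                               zero_offset_frac=0.40, load_use_frac=0.10,
                               stall_5stage=1, stall_4stage=0):
    """Return time(5-stage) / time(4-stage) at an equal clock period.

    Assumptions (not given in the problem statement): one stall cycle per
    unremoved load-to-ALU hazard in the 5-stage pipeline, none in the
    4-stage pipeline, and a base CPI of 1 with no other stalls.
    """
    # Original ISA: instruction count normalized to 1.
    cpi_old = 1.0 + load_frac * load_use_frac * stall_5stage
    time_old = 1.0 * cpi_old

    # New ISA: each nonzero-offset load/store gains one ADDI.
    ic_new = 1.0 + (load_frac + store_frac) * (1.0 - zero_offset_frac)
    # Stall cycles are counted per original load, so divide by ic_new
    # to express them per instruction of the new program.
    cpi_new = 1.0 + (load_frac * load_use_frac * stall_4stage) / ic_new
    time_new = ic_new * cpi_new

    return time_old / time_new


print(f"speedup: {speedup_4stage_over_5stage():.3f}")
```

A result below 1 means the 4-stage design is slower under these assumptions. Changing `stall_4stage` or `zero_offset_frac` shows how sensitive the conclusion is to each one.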
MIPS R4000 Pipeline
The MIPS R4000 processor has an eight-stage pipeline, with the stages shown below:
IF  IS  RF  EX  DF  DS  TC  WB
Speedup
What is the ideal pipeline speedup for this processor?
Forwarding
For data forwarding, how many additional inputs are needed for the multiplexers at the inputs to the ALU? Where does each come from? What kind of hazard(s) do these address?
Branching options
The branch delay is 3 cycles. If branches make up 20% of the total instruction mix and 14% of the branches are taken, evaluate the effective CPI and adjusted pipeline speedup for each of these options:
(a) Stall for three cycles.
(b) Expose one delay slot and stall for two cycles, where the delay slot can be filled 60% of the time and the resulting computation is useful 80% of the time.
(c) Expose two delay slots and stall for one cycle, where the first slot's statistics are the same as in (b) and the second slot is filled 10% of the time, with useful computation 40% of the time.
(d) Expose one delay slot and predict not taken for the other two cycles (what was really done).
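The four options above can be compared with a small Python sketch. It is one possible accounting under stated assumptions, not a definitive solution: the ideal CPI is 1, the adjusted speedup is the pipeline depth divided by the effective CPI, a delay slot that is not filled with useful work wastes one cycle per branch, and under predict-not-taken the two remaining cycles are lost only when the branch is taken. All names are illustrative.

```python
PIPELINE_DEPTH = 8   # stages in the pipeline, taken as the ideal speedup
BRANCH_FRAC = 0.20   # branches as a fraction of all instructions
TAKEN_FRAC = 0.14    # fraction of branches that are taken


def wasted(fill_frac, useful_frac):
    """Expected wasted cycles for one exposed delay slot (assumed model)."""
    return 1.0 - fill_frac * useful_frac


def effective_cpi(penalty_per_branch):
    """Base CPI of 1 plus the average branch penalty per instruction."""
    return 1.0 + BRANCH_FRAC * penalty_per_branch


slot1 = wasted(0.60, 0.80)  # first delay slot
slot2 = wasted(0.10, 0.40)  # second delay slot

# Average penalty cycles per branch for each option.
penalties = {
    "stall three cycles": 3.0,
    "one delay slot, stall two": slot1 + 2.0,
    "two delay slots, stall one": slot1 + slot2 + 1.0,
    "one delay slot, predict not taken": slot1 + 2.0 * TAKEN_FRAC,
}

results = {}
for name, penalty in penalties.items():
    cpi = effective_cpi(penalty)
    results[name] = (cpi, PIPELINE_DEPTH / cpi)
    print(f"{name:35s} CPI = {cpi:.3f}  speedup = {PIPELINE_DEPTH / cpi:.2f}")
```

If a different cost model is intended, for example counting an unfilled slot and a filled-but-useless slot differently, only `wasted` needs to change.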
Branches in practice
Given the strategy in (d), which of the following instructions are executed when the branch is taken? Which are executed when the branch is not taken?
BNEZ R5, Target
SUB R1, R2, R3
ADD R1, R1, R4
Target: SUB R3, R3, R1
ADD R2, R1, R4
ADD R2, R2, R3
Note that the first SUB instruction is in the branch delay slot; you do not need to try to fill the delay slot yourself.
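The behavior in question can be modeled with a short Python sketch. It assumes the delay-slot instruction always executes and that fall-through instructions fetched under the not-taken prediction are squashed when the branch turns out to be taken. The `executed` helper is hypothetical, and the model ignores timing entirely.

```python
# Program order; index 0 is the branch and index 1 is its delay slot.
program = [
    "BNEZ R5, Target",
    "SUB R1, R2, R3",   # branch delay slot
    "ADD R1, R1, R4",
    "SUB R3, R3, R1",   # Target
    "ADD R2, R1, R4",
    "ADD R2, R2, R3",
]
TARGET_INDEX = 3


def executed(taken):
    """Instructions that complete under the assumed model."""
    done = program[:2]  # the branch and its delay slot always complete
    start = TARGET_INDEX if taken else 2
    return done + program[start:]


for taken in (True, False):
    label = "taken" if taken else "not taken"
    print(f"Branch {label}:")
    for instr in executed(taken):
        print("   ", instr)
```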