Pipelining
The MIPS R4000 processor has an 8-stage pipeline with the stages shown below (IF, IS, RF, EX, DF, DS, TC, WB).
Draw a timing diagram for the following sequence of instructions, with cycles on the horizontal axis and instructions issued on the vertical axis. Show any necessary stalls and draw arrows between stages for any necessary forwarding.
LW $s1, 0($s1)
LW $s2, 0($s2)
ADD $s3, $s1, $s2
ADDI $s3, #1
LW $s4, 0($s3)
ADDI $s4, #5
SW 0($s3), $s4
Branching
The R4000 branch delay is 3 cycles. Assuming branch prediction with a BTB in the IF stage, consider the following code:
BNEZ $s5, Target
SUB $s1, $s2, $s3
Target: ADD $s1, $s1, $s4
SUB $s3, $s3, $s1
ADD $s2, $s1, $s4
ADD $s2, $s2, $s3
For each of the following, draw a timing diagram with cycles on the horizontal axis and instructions issued on the vertical axis. Clearly indicate any instructions that are issued speculatively and then abandoned.
Correct Prediction of Not Taken
Draw the diagram if the branch is predicted to not be taken, and is, in fact, not taken.
Correct Prediction of Taken
Draw the diagram if the branch is predicted to be taken, and is, in fact, taken.
Incorrect Prediction of Not Taken
Draw the diagram if the branch is predicted to not be taken, but is actually taken.
Incorrect Prediction of Taken
Draw the diagram if the branch is predicted to be taken, but is actually not taken.
Instruction Set Architecture
The Intel 4004 was a 4-bit microprocessor, and the first processor created by Intel. For this question, refer to the Intel MCS-4 Assembly Language Programming Manual.
The SUB instruction performs A = A + ~R + ~C for accumulator A, register R, and carry bit C, and the SBM instruction performs A = A + ~M + ~C for M in memory. A 2's complement negation is -R = ~R + 1, so these instructions perform a 2's complement subtract if C is initialized to 0. As seen in section 4.6 of the 4004 manual, you can chain these instructions for subtracts wider than four bits, but you need to complement the carry between successive 4-bit digits of a multi-digit subtract.
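The behavior described above can be checked with a small model. The following Python sketch is only an illustration of the stated formula A = A + ~R + ~C, not 4004 code; the function names (sub4, multi_digit_sub, to_digits, from_digits) and the little-endian digit lists are assumptions made for this example.

```python
MASK = 0xF  # the accumulator and registers are modeled as 4 bits wide

def sub4(a, r, c):
    """Model of A = A + ~R + ~C on 4-bit values; returns (result, carry_out)."""
    total = a + (~r & MASK) + (~c & 1)
    return total & MASK, total >> 4

def multi_digit_sub(a_digits, r_digits):
    """Subtract little-endian lists of 4-bit digits; returns (digits, final_carry)."""
    carry = 0  # C = 0 makes the first digit a plain 2's complement subtract
    out = []
    for i, (a, r) in enumerate(zip(a_digits, r_digits)):
        if i > 0:
            carry ^= 1  # complement the carry between digits
        digit, carry = sub4(a, r, carry)
        out.append(digit)
    return out, carry

def to_digits(value, n):
    """Split an integer into n little-endian 4-bit digits."""
    return [(value >> (4 * i)) & MASK for i in range(n)]

def from_digits(digits):
    """Reassemble little-endian 4-bit digits into an integer."""
    return sum(d << (4 * i) for i, d in enumerate(digits))

digits, carry = multi_digit_sub(to_digits(0x52, 2), to_digits(0x17, 2))
print(hex(from_digits(digits)), carry)  # prints: 0x3b 1
```

In this model a final carry of 1 means the subtract finished without a borrow. Removing the `carry ^= 1` line makes the multi-digit results wrong, which is why the carry has to be complemented between digits.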
Non-inverted Carry
If the subtract instructions instead performed A = A + ~R + C (without inverting the carry), you would need to set the carry to 1 before a single four-bit subtract or the first of a sequence, but would not need to do anything to the carry between digits of a multi-digit subtract. Rewrite the code on page 4-17 for this new version of the subtract.
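As a sanity check on the claim about the carry, here is a minimal Python model of the hypothetical non-inverting subtract. It is not the requested rewrite of the manual's assembly code; the function names and the little-endian digit lists are assumptions made for this sketch.

```python
MASK = 0xF  # values are modeled as 4 bits wide

def sub4_noninverted(a, r, c):
    """Model of A = A + ~R + C on 4-bit values; returns (result, carry_out)."""
    total = a + (~r & MASK) + c
    return total & MASK, total >> 4

def multi_digit_sub_noninverted(a_digits, r_digits):
    """Subtract little-endian lists of 4-bit digits; returns (digits, final_carry)."""
    carry = 1  # set the carry once, before the first digit
    out = []
    for a, r in zip(a_digits, r_digits):
        # The carry out of one digit feeds the next digit unchanged.
        digit, carry = sub4_noninverted(a, r, carry)
        out.append(digit)
    return out, carry

# 0x52 - 0x17 = 0x3B, written as little-endian digit lists
print(multi_digit_sub_noninverted([0x2, 0x5], [0x7, 0x1]))  # prints: ([11, 3], 1)
```

The loop body contains no carry manipulation at all, which is the property the question relies on.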
Local speedup
What is the expected speedup of just this subtract code, as a function of the number of loop iterations?
Overall speedup
Assuming 32-bit subtract operations are 5% of the total run time of a program, what is the overall speedup?
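One common way to relate a local speedup to an overall speedup is Amdahl's law. The form below is a sketch under assumed notation: f is the fraction of run time spent in the improved code (0.05 here) and S_local is the speedup of that code alone; neither symbol appears in the assignment.

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{S_{\text{local}}}}, \qquad f = 0.05
```

Under this model the overall speedup can never exceed 1 / (1 - f), which is about 1.053 for f = 0.05, however large the local speedup is.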