Data forwarding example CMSC 411 architecture
Consider the five stage pipeline architecture:
IF instruction fetch, PC is address into memory fetching instruction
ID instruction decode and register read out of two values
EX execute instruction or compute data memory address
M data memory access to store or fetch a data word
WB write back value into general register
        IF         ID        EX        M       WB
   +--+     +--+        +--+     +--+     +--+
   |  |     |  |        | A|-|\  |  |     |  |
   |  |     |  |    /---|  | \ \_|  |     |  |
   |PC|-(I)-|IR|-(R)  = |  | / / |  |-(D)-|  |--+
   |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
   +--+     +--+  |     +--+     +--+     +--+  |
    ^        ^    |      ^   ALU  ^        ^    |
    |        |    |      |        |        |    |
clk-+--------+-----------+--------+--------+    |
                  |                             |
                  +-----------------------------+
Now consider the instruction sequence:
400  lw  $1,100($0)    load general register 1 from memory location 100
404  lw  $2,104($0)    load general register 2 from memory location 104
408  nop
40C  nop               wait for register $2 to get data
410  add $3,$1,$2      add contents of registers 1 and 2, sum into register 3
414  nop
418  nop               wait for register $3 to get data
41C  add $4,$3,$1      add contents of registers 3 and 1, sum into register 4
420  nop
424  nop               wait for register $4 to get data
428  beq $3,$4,-100    branch to 314 if contents of registers 3 and 4 are equal
42C  add $4,$4,$4      add; this is the "delayed branch slot", always executed
The pipeline stage table with NO data forwarding is:
lw   IF ID EX M  WB
lw      IF ID EX M  WB
nop        IF ID EX M  WB
nop           IF ID EX M  WB
add              IF ID EX M  WB
nop                 IF ID EX M  WB
nop                    IF ID EX M  WB
add                       IF ID EX M  WB
nop                          IF ID EX M  WB
nop                             IF ID EX M  WB
beq                                IF ID EX M  WB
add                                   IF ID EX M  WB
time 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
This can be significantly improved with the addition of four
multiplexors and wiring.
        IF         ID                EX         M        WB
   +--+     +--+           +--+          +--+       +--+
   |  |     |  |           | A|-(X)--|\  |  |       |  |
   |  |     |  |    /-(X)--|  | | |  \ \_|  |       |  |
   |PC|-(I)-|IR|-(R)   | = |  | | |  / / |  |-+-(D)-|  |--+
   |  |     |  |  ^ \-(X)--| B|-(X)--|/  |  | |     |  |  |
   +--+     +--+  |    |   +--+ | |      +--+ |     +--+  |
    ^        ^    |    |    ^   | | ALU   ^   |      ^    |
    |        |    |    |    |   | |       |   |      |    |
clk-+--------+--------------+-------------+----------+    |
                  |    |        | |           |           |
                  |    +----------+-----------+           |
                  |             |                         |
                  +-------------+-------------------------+
The pipeline stage table with data forwarding is:
lw   IF ID EX M  WB
lw      IF ID EX M  WB
nop        IF ID EX M  WB  saved one nop
add           IF ID EX M  WB  $2 in WB and used in EX
add              IF ID EX M  WB  saved two nop's, $3 used
nop                 IF ID EX M  WB  saved one nop
beq                    IF ID EX M  WB  $4 in MEM and used in ID
add                       IF ID EX M  WB
time 1  2  3  4  5  6  7  8  9  10 11 12
Note the required nop from using data immediately after a load.
Note the required nop for the beq in the ID stage using an ALU result.
The data forwarding paths are shown in green with the additional
multiplexors. The control is explained below.
Green must be added to part2a.vhdl.
Blue already exists, used for discussion, do not change.
To understand the logic better, note that MEM_RD contains the register
destination of the output of the ALU and MEM_addr contains the value
of the output of the ALU for the instruction now in the MEM stage.
If the instruction in the EX stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the A side of the ALU.
(This is the A forward MEM_addr control signal.)
   EX stage           MEM stage
   add $4,$3,$1       add $3,$1,$2
          |               |
          +---------------+
If the instruction in the EX stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the B side of the ALU.
(This is the B forward MEM_addr control signal.)
   EX stage           MEM stage
   add $4,$1,$3       add $3,$1,$2
             |            |
             +------------+
To understand the logic better, note that WB_RD contains the register
destination of the output of the ALU or Memory and WB_result contains
the value of the output of the ALU or Memory for the instruction now
in the WB stage.
If the instruction in the EX stage has the WB_RD destination in
bits 25 downto 21, then WB_result must be routed to the A side of the ALU.
(This is the A forward WB_result control signal.)
If the instruction in the EX stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be routed to the B side of the ALU.
(This is the B forward WB_result control signal.)
Note that a beq instruction in the ID stage that needs a value from
the instruction in the WB stage does not need data forwarding.
If a beq instruction in the ID stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the top side of
the equal comparator.
(This is the 1 forward control signal.)
If a beq instruction in the ID stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the bottom side of
the equal comparator.
(This is the 2 forward control signal.)
   ID stage           EX stage     MEM stage
   beq $3,$4,-100     nop          add $4,$3,$1
          |                            |
          +----------------------------+
If a beq instruction in the ID stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be used by the bottom side of
the equal comparator.
(This happens by magic. Not really, two rules above apply.)
   ID stage           EX stage     MEM stage WB stage
   beq $3,$4,-100     nop          nop       lw $4,8($3)
          |                                     |
          +-------------------------------------+
The data forwarding rules can be summarized based on the
cs411 schematic, shown above.
ID stage beq data forwarding:
    default with no data forwarding is ID_read_data_1
    1 forward MEM_addr is ID_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
    default with no data forwarding is ID_read_data_2
    2 forward MEM_addr is ID_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
EX stage data forwarding:
    default with no data forwarding is EX_A
    A forward MEM_addr is EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
    A forward WB_result is EX_reg1=WB_RD and WB_RD/=0
    default with no data forwarding is EX_B
    B forward MEM_addr is EX_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
    B forward WB_result is EX_reg2=WB_RD and WB_RD/=0
Note: the entity mux32_3 is designed to handle the above.
ID_RD is 0 for ID_OP= beq, j, sw (nop, all zeros, automatic zero in RD)
thus EX_RD, MEM_RD, WB_RD = 0 for these instructions
Because register zero is always zero, we can use 0 for
a destination for every instruction that does not
produce a result in a register. Thus no data forwarding
will occur for instructions that do not produce a value
in a register.
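The two ID stage rules in the summary can be sketched as a small combinational
entity. This is only an illustration under stated assumptions: the entity name
beq_forward_ctl and the output names fwd1 and fwd2 are made up here, plain
std_logic_vector is used in place of the course types word_5 and word_6, and
the lw opcode 100011 should be checked against cs411_opcodes.txt.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical entity: the input names follow the notes, the entity name
-- and the output names fwd1/fwd2 are assumptions for this sketch.
entity beq_forward_ctl is
  port (ID_reg1 : in  std_logic_vector(4 downto 0);  -- ID_IR(25 downto 21)
        ID_reg2 : in  std_logic_vector(4 downto 0);  -- ID_IR(20 downto 16)
        MEM_RD  : in  std_logic_vector(4 downto 0);  -- destination, MEM stage
        MEM_OP  : in  std_logic_vector(5 downto 0);  -- MEM_IR(31 downto 26)
        fwd1    : out std_logic;    -- "1 forward MEM_addr"
        fwd2    : out std_logic);   -- "2 forward MEM_addr"
end entity beq_forward_ctl;

architecture behavior of beq_forward_ctl is
  -- assumed lw opcode, check this semester's opcode file
  constant LW_OP : std_logic_vector(5 downto 0) := "100011";
begin
  -- forward only a real register result (RD /= 0) that is already
  -- available in MEM, which excludes a load still waiting on data memory
  fwd1 <= '1' when ID_reg1 = MEM_RD and MEM_RD /= "00000" and MEM_OP /= LW_OP
          else '0';
  fwd2 <= '1' when ID_reg2 = MEM_RD and MEM_RD /= "00000" and MEM_OP /= LW_OP
          else '0';
end architecture behavior;
```

Each output would drive the select input of one of the two multiplexors in
front of the equal comparator. The EX stage signals have the same shape, with
EX_reg1 and EX_reg2 in place of ID_reg1 and ID_reg2.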
note: ID_reg1 is ID_IR(25 downto 21)
ID_reg2 is ID_IR(20 downto 16)
EX_reg1 is EX_IR(25 downto 21)
EX_reg2 is EX_IR(20 downto 16)
MEM_OP is MEM_IR(31 downto 26)
EX_OP is EX_IR(31 downto 26)
ID_OP is ID_IR(31 downto 26)
These shorter names can be used with VHDL alias statements:
alias ID_reg1 : word_5 is ID_IR(25 downto 21);
alias ID_reg2 : word_5 is ID_IR(20 downto 16);
alias EX_reg1 : word_5 is EX_IR(25 downto 21);
alias EX_reg2 : word_5 is EX_IR(20 downto 16);
alias MEM_OP : word_6 is MEM_IR(31 downto 26);
alias EX_OP : word_6 is EX_IR(31 downto 26);
alias ID_OP : word_6 is ID_IR(31 downto 26);
Why is the priority mux, mux32_3 needed?
mux32_3.vhdl gives priority to ct1 over ct2
Answer: Consider MEM_RD with a destination value 3 and
WB_RD with a destination value 3.
What should add $4,$3,$3 use? MEM_addr or WB_result ?
For this to happen, some program or some person would have
written code such as:
sub $3,$12,$11
add $3,$1,$2
add $4,$3,$3 double the value of $3
Well, rather obviously, the result of the sub is never used and
thus the answer to our question is that MEM_addr must be used. This
is the closest prior instruction with the required result. The
correct design is implemented using the priority mux32_3 with the
MEM_addr in the in1 priority input.
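A priority multiplexor of this kind might look like the sketch below. The
course's mux32_3.vhdl is not shown here, so the entity name mux32_3_sketch and
the port names in0, in2 and result are assumptions; only ct1, ct2 and in1 are
named in the text above.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical stand-in for mux32_3; port list is assumed, not copied.
entity mux32_3_sketch is
  port (in0    : in  std_logic_vector(31 downto 0);  -- default, e.g. EX_A
        in1    : in  std_logic_vector(31 downto 0);  -- priority, e.g. MEM_addr
        in2    : in  std_logic_vector(31 downto 0);  -- e.g. WB_result
        ct1    : in  std_logic;   -- selects in1 and wins over ct2
        ct2    : in  std_logic;   -- selects in2 only when ct1 is '0'
        result : out std_logic_vector(31 downto 0));
end entity mux32_3_sketch;

architecture behavior of mux32_3_sketch is
begin
  -- the order of the "when" clauses is what gives ct1 priority
  result <= in1 when ct1 = '1' else
            in2 when ct2 = '1' else
            in0;
end architecture behavior;
```

With both ct1 and ct2 at '1', as in the add $4,$3,$3 case, result takes in1,
so the value from the closest prior instruction is the one used.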
The control signal A forward MEM_addr may be implemented in VHDL as:
btw: 100011 in any_IR(31 downto 26) is the lw opcode in this example,
be sure to check this semester's cs411_opcodes.txt
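One possible form of that control signal, written directly from the summary
rule "A forward MEM_addr is EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw", is
sketched here. The wrapper entity afma_sketch is hypothetical and exists only
to make the fragment complete; in part2a.vhdl the single assignment would sit
in the existing architecture. The signal name AFMA is the one used by the
debug process in these notes.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical wrapper so the statement can be compiled on its own.
entity afma_sketch is
  port (EX_IR  : in  std_logic_vector(31 downto 0);
        MEM_IR : in  std_logic_vector(31 downto 0);
        MEM_RD : in  std_logic_vector(4 downto 0);
        AFMA   : out std_logic);   -- "A forward MEM_addr"
end entity afma_sketch;

architecture behavior of afma_sketch is
  -- plain std_logic_vector is used instead of word_5 / word_6
  alias EX_reg1 : std_logic_vector(4 downto 0) is EX_IR(25 downto 21);
  alias MEM_OP  : std_logic_vector(5 downto 0) is MEM_IR(31 downto 26);
begin
  -- "100011" is the assumed lw opcode, "00000" is register zero
  AFMA <= '1' when EX_reg1 = MEM_RD and MEM_RD /= "00000"
                   and MEM_OP /= "100011"
          else '0';
end architecture behavior;
```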
Here is where you may want to add a debug process. Replace AFMA
with any signal name of interest:
-- needs "use STD.textio.all;" and, to write std_logic values,
-- "use IEEE.std_logic_textio.all;" ahead of the entity
prtAFMA: process (AFMA)
           variable my_line : LINE;      -- my_line needs to be defined
         begin
           write(my_line, string'("AFMA="));
           write(my_line, AFMA);         -- or hwrite for long signals
           write(my_line, string'(" at="));
           write(my_line, now);          -- "now" is simulation time
           writeline(output, my_line);   -- outputs line
         end process prtAFMA;
part2a.chk has the _RD signals and values
cs411_opcodes.txt for op code values
Now, to finish part2a.vhdl, the jump and branch instructions must be
implemented. This is shown in green on the upper part of the schematic.
The signal out of the jump address box would be coded in VHDL as:
jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";
The adder symbol is just another instance of your Homework 4, add32.
The "shift left 2" is a simple VHDL statement:
shifted2 <= ID_sign_ext(29 downto 0) & "00";
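Putting the two statements together with the adder gives something like the
sketch below. Several things are assumed and should be checked against the
schematic: that PCP is the incremented program counter, that PCP is the value
added to shifted2, and the names branch_jump_addr and branch_addr, which are
made up here. The numeric_std "+" is only a stand-in for an add32 instance,
whose port list is not given in these notes.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

-- Hypothetical wrapper entity for the jump and branch address logic.
entity branch_jump_addr is
  port (PCP         : in  std_logic_vector(31 downto 0);  -- assumed: PC plus 4
        ID_IR       : in  std_logic_vector(31 downto 0);
        ID_sign_ext : in  std_logic_vector(31 downto 0);
        jump_addr   : out std_logic_vector(31 downto 0);
        branch_addr : out std_logic_vector(31 downto 0)); -- assumed name
end entity branch_jump_addr;

architecture behavior of branch_jump_addr is
  signal shifted2 : std_logic_vector(31 downto 0);
begin
  -- top 4 bits of PCP, 26 bit jump field, two zero bits: 32 bits total
  jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";

  -- "shift left 2": drop the top two bits, append two zeros
  shifted2 <= ID_sign_ext(29 downto 0) & "00";

  -- stand-in for the add32 instance on the schematic
  branch_addr <= std_logic_vector(unsigned(PCP) + unsigned(shifted2));
end architecture behavior;
```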
The project writeup: part2a