<- previous index next ->
The pipeline for this course with branch and jump optimized:
Note the three input mux replacing two mux in previous lecture.
Note the distributed control using the equal6 entity:
eq6j: entity WORK.equal6 port map(ID_IR(31 downto 26), "000010", jump);
jumpaddr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";
cs411_opcodes.txt look at jump
In a later lecture, we will cover data forwarding to avoid nop's in
arithmetic and automatic stall to avoid putting all nop's in source code.
For the basic machine above, we have the timing shown here.
The branch slot, programming to avoid delays (filling in nop's):
Note: beq and jump always execute the next physical instruction.
This is called the "delayed branch slot", important for HW7.
if(a==b) x=3; /* simple C code */
else x=4;
y=5;
lw $1,a # possible unoptimized assembly language
lw $2,b # no ($0) shown on memory access
nop # wait for b to get into register 2
nop # wait for b to get into register 2
beq $1,$2,lab1
nop # branch slot, always executed *********
addi $1,4 # else part
nop # wait for 4 to get into register 1
nop # wait for 4 to get into register 1
sw $1,x # x=4;
j lab2
nop # branch slot, always executed *********
lab1: addi $1,3 # true part
nop # wait for 3 to get into register 1
nop # wait for 3 to get into register 1
sw $1,x # x=3;
lab2: addi $1,5 # after if-else, always execute
nop # wait for 5 to get into register 1
nop # wait for 5 to get into register 1
sw $1,y # y=5;
Unoptimized, 20 instructions.
Now, a smart compiler would produce the optimized code:
lw $1,a # possible unoptimized assembly language
lw $2,b # no ($0) shown on memory access
addi $4,4 # for else part later
addi $3,3 # for true part later
beq $1,$2,lab1
addi $5,5 # branch slot, always executed, for after if-else
j lab2
sw $4,x # x=4; in branch slot, always executed !! after jump
lab1: sw $3,x # x=3;
lab2: sw $5,y # y=5;
Optimized, 10 instructions.
The pipeline stage diagram for a==b true is:
1 2 3 4 5 6 7 8 9 10 11 12 clock
lw $1,a IF ID EX MM WB
lw $2,b IF ID EX MM WB
addi $4,4 IF ID EX MM WB
addi $3,3 IF ID EX MM WB
beq $1,$2,L1 IF ID EX MM WB assume equal, branch to L1
addi $5,5 IF ID EX MM WB delayed branch slot
j L2
sw $4,x
L1:sw $3,x IF ID EX MM WB
L2:sw $5,y IF ID EX MM WB
1 2 3 4 5 6 7 8 9 10 11 12
The pipeline stage diagram for a==b false is:
1 2 3 4 5 6 7 8 9 10 11 12 13 clock
lw $1,a IF ID EX MM WB
lw $2,b IF ID EX MM WB
addi $4,4 IF ID EX MM WB
addi $3,3 IF ID EX MM WB
beq $1,$2,L1 IF ID EX MM WB assume not equal
addi $5,5 IF ID EX MM WB
j L2 IF ID EX MM WB jumps to L2
sw $4,x IF ID EX MM WB
L1:sw $3,x
L2:sw $5,y IF ID EX MM WB
1 2 3 4 5 6 7 8 9 10 11 12 13
if(a==b) x=3; /* simple C code */
else x=4;
y=5;
Renaming when there are extra registers that the programmer can
not assess (diagram in Alpha below) with multiple units there can be
multiple issue (parallel execution of instructions)
The architecture sees the binary instructions from the following:
lw $1,a
lw $2,b
nop
sll $3,$1,8
sll $6,$2,8
add $9,$1,$2
sw $3,c
sw $6,d
sw $9,e
lw $1,aa
lw $2,bb
nop
sll $3,$1,8
sll $6,$2,8
add $9,$1,$2
sw $3,cc
sw $6,dd
sw $9,ee
Two ALU's, each with their own pipelines, multiple issue, register renaming:
The architecture executes two instruction streams in parallel.
(Assume only 32 user programmable registers, 80 registers in hardware.)
lw $1,a lw $41,aa
lw $2,b lw $42,bb
nop nop
sll $3,$1,8 sll $43,$41,8
sll $6,$2,8 sll $46,$42,8
add $9,$1,$2 add $49,$41,$42
sw $3,c sw $43,cc
sw $6,d sw $46,dd
sw $9,e sw $49,ee
Out of order execution to avoid delays. As seen in the first example,
changing the order of execution without changing the semantics of the
program can achieve faster execution.
There can be multiple issue when there are multiple arithmetic and
other units. This will require significant hardware to detect the
amount of out of order instructions that can be issued each clock.
Now, hardware can also be pipelined, for example a parallel multiplier.
Suppose we need to have at most 8 gate delays between pipeline
registers.
Note that any and-or-not logic can be converted to use only nand gates
or only nor gates. Thus, two level logic can have two gate delays.
We can build each multiplier stage with two gate delays. Thus we can
have only four multiplier stages then a pipeline register. Using a
carry save parallel 32-bit by 32-bit multiplier we need 32 stages, and
thus eight pipeline stages plus one extra stage for the final adder.
Note that a multiply can be started every clock. Thus a multiply
can be finished every clock. The speedup including the last adder
stage is 9 as shown in:
pipemul_test.vhdl
pipemul_test.out
pipemul.vhdl
A 64-bit PG adder may be built with eight or less gate delays.
The signals a, b and sum are 64 bits. See add64.vhdl for details.
add64.vhdl
Any combinational logic can be performed in two levels with "and" gates
feeding "or" gates, assuming complementation time can be ignored.
Some designers may use diagrams but I wrote a Quine McClusky minimization
program that computes the two level and-or-not VHDL statement
for combinational logic.
quine_mcclusky.c logic minimization
eqn4.dat input data
eqn4.out both VHDL and Verilog output
there are 2^2^N possible functions of N bits
Not as practical, I wrote a Myhill minimization of a finite state machine,
a Deterministic Finite Automata, that inputs a state transition table
and outputs the minimum state equivalent machine. "Not as practical"
because the design of sequential logic should be understandable. The
minimized machine's function is typically unrecognizable.
myhill.cpp state minimization
initial.dfa input data
myhill.dfa minimized output
A reasonably complete architecture description for the Alpha
showing the pipeline is:
basic Alpha
more complete Alpha
The "Cell" chip has unique architecture:
Cell architecture
Some technical data on Intel Core Duo (With some advertising.)
Core Duo all on WEB
From Intel, with lots of advertising:
power is proportional to capacitance * voltage^2 * frequency, page 7.
tech overview
whitepaper
Intel quad core demonstrated
AMD quad core
By 2010 AMD had a 12-core available and Intel had a 8-core available.
and 24 core and 48 core AMD
IBM Power6 at 4.7GHz clock speed
Intel I7 920 Nehalem 2.66GHz not quad $279.99
Intel I7 940 Nehalem 2.93GHz quad core $569.99
Intel I7 965 Nehalem 3.20GHz quad core $999.99
Prices vary with time, NewEgg.com search Intel I7
Motherboard Asus products-motherboards-intel i7
Intel socket 1366
Supermicro.com motherboards, 12-core
local, bad formatting, in case web page goes away. Good history.
Core Duo 1
Core Duo 2
Core Duo 3
Core Duo 4
Core Duo 5
Core Duo 6
Core Duo 7
Core Duo 8
HW7 is assigned
<- previous index next ->