The most important item on all homework is YOUR NAME!
Print. No readable name, no credit.
Staple or clip pages together.
Homework must be submitted when due. You lose 10%, one grade,
the first day homework is late, then 10% each week thereafter.
Max 50% off. A zero really hurts your average. Paper or EMail
to squire@umbc.edu is acceptable. If I cannot read or understand
your homework, you do not get credit. Type or print if your handwriting
is bad. Homework is always due on a scheduled class day within 15 minutes
after the start of the class. If class is canceled then homework
is due the next time the class meets.
EMail only plain text! No word processor formats.
You may use a word processor or other software tools and
print the results and turn in paper.
Try using ^R in Pine, ~r in BSD Mail.
All parties involved in copying get a zero or worse on that assignment.
1) From the diagram passed out in class (labeled FIGURE 6.3)
a) What is the speedup of the pipelined execution (bottom)
over the single cycle execution (top) for the three instructions?
b) For a large number of instructions we consider how often
an instruction can be completed, or started. From the
figure you can see the pipelined execution starts an
instruction every 2ns while the single cycle execution
starts an instruction every 8ns. What is the speedup
when a large number of instructions are executed?
c) Make a change to both executions. Make the ALU take 4ns
rather than the 2ns as shown on the figure. Neatly
redraw the figure with the new ALU time. Remember every
time block on the pipelined execution must be the same.
d) What is the speedup for c) when a large number of
instructions are executed?
FIGURE 6.3 crudely drawn here:
Single cycle execution
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17ns
| | | | | | | | | | | | | | | | | |
+-------+---+-------+-------+---+
|IF |reg| ALU | data |reg|
+-------+---+-------+-------+---+
+-------+---+-------+-------+---+
|IF |reg| ALU | data |reg|
+-------+---+-------+-------+---+
+---
|IF
+---
Pipelined Execution
+-------+-------+-------+-------+-------+
|IF | reg| ALU | data |reg |
+-------+-------+-------+-------+-------+
+-------+-------+-------+-------+-------+
|IF | reg| ALU | data |reg |
+-------+-------+-------+-------+-------+
+-------+-------+-------+-------+-------+
|IF | reg| ALU | data |reg |
+-------+-------+-------+-------+-------+
| | | | | | | | | | | | | | | | | |
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17ns
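The speedup arithmetic for parts a) and b) can be sketched as below. This is an assumed reading of the figure: 8ns per single cycle instruction, 2ns per pipeline stage, five stages.

```python
# Speedup sketch for Figure 6.3 (assumed timings: 8 ns per single-cycle
# instruction, 2 ns per pipeline stage, 5 pipeline stages).
def single_cycle_time(n, cycle_ns=8):
    """Total time for n instructions, one per 8 ns cycle."""
    return n * cycle_ns

def pipelined_time(n, stage_ns=2, stages=5):
    """First instruction takes stages*stage_ns; each later one adds stage_ns."""
    return stages * stage_ns + (n - 1) * stage_ns

for n in (3, 1_000_000):
    s = single_cycle_time(n) / pipelined_time(n)
    print(f"{n} instructions: speedup = {s:.3f}")
```

For a large number of instructions the pipeline fill time becomes negligible and the speedup approaches the ratio of the start intervals, 8ns / 2ns.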
2) Show the register dependency by circling the register where a value
is computed and circling the register where that value is used,
connecting the circles by a line. (Spread the code out for clarity.)
add $6, $3, $4
add $4, $5, $6
add $3, $6, $4
add $3, $3, $4
3) The following four lines of code can be reduced to exactly three lines
of code that produce the same output for all possible initial register
values. Every line must be correct to get credit for this part.
Loop: lw $2, 72($3)
addi $3, $3, 4
beq $3, $4, Loop
nop
Be sure to walk through your code for initial conditions:
$3 has 4, $4 has 8, memory location 76 has 16,
memory location 80 has 32,
memory location 84 has 64.
This exercise is demonstrating that the "delayed branch slot"
does not have to contain a nop. Your code will not have a nop
but will reorder and change the instructions to have some other
instruction after the beq.
1) Note: If a data hazard is NOT prevented by data forwarding, then
the pipeline stalls. Assume data forwarding is working, does the
following code stall? "stall" means it needs an extra nop instruction
inserted. If the code stalls, then list the labels of the
instruction(s) that get stalled (delayed due to extra nop's).
L1: lw $3, 50($2)
L2: add $2, $5, $4
L3: sw $2, 20($2)
L4: lw $5, 30($3)
L5: nop
L6: lw $6, 40($5)
L7: add $5, $5, $5
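The check above can be mechanized. This sketch assumes that, with full forwarding, the only remaining stall is the load-use case: a lw whose destination register is read by the very next instruction. Instructions are encoded as (label, op, dest, sources) tuples made up for this sketch.

```python
# Load-use hazard check (sketch): with full forwarding, assume the only
# stall is a lw whose result is needed by the immediately following
# instruction.  Tuples are (label, op, dest, source registers).
code = [
    ("L1", "lw",  "$3", ["$2"]),
    ("L2", "add", "$2", ["$5", "$4"]),
    ("L3", "sw",  None, ["$2", "$2"]),   # sw reads both its registers
    ("L4", "lw",  "$5", ["$3"]),
    ("L5", "nop", None, []),
    ("L6", "lw",  "$6", ["$5"]),
    ("L7", "add", "$5", ["$5", "$5"]),
]

stalled = [nxt[0]                        # label of the waiting instruction
           for cur, nxt in zip(code, code[1:])
           if cur[1] == "lw" and cur[2] in nxt[3]]
print("stalled instructions:", stalled or "none")
```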
2) How many total clock cycles will the following code use from
start to completion in our MIPS architecture pipelined machine?
(We have full data forwarding and hazard prevention that
automatically inserts nop as needed into the pipeline.)
add $5, $6, $7
lw $6, 80($7)
sub $7, $6, $5
and $8, $7, $6
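One way to count the cycles is the sketch below, assuming a 5-stage pipeline: total cycles = n + (stages - 1) + stalls, where with full forwarding the only stall left is the one-cycle load-use case.

```python
# Cycle-count sketch for a 5-stage MIPS pipeline with full forwarding.
# Tuples are (op, dest, source registers) for the four instructions above.
instrs = [
    ("add", "$5", ["$6", "$7"]),
    ("lw",  "$6", ["$7"]),
    ("sub", "$7", ["$6", "$5"]),   # needs $6 right after the lw -> 1 stall
    ("and", "$8", ["$7", "$6"]),
]

stalls = sum(1 for cur, nxt in zip(instrs, instrs[1:])
             if cur[0] == "lw" and cur[1] in nxt[2])
cycles = len(instrs) + 4 + stalls      # n + pipeline fill + stalls
print("stalls:", stalls, "total cycles:", cycles)
```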
3) List all the inputs and outputs of the "forwarding unit" with
their respective number of bits. You may use the book's signal
name or any signal name that indicates what the input or output does.
You may have to read the text as well as reading the diagram on
Page 416 in the textbook.
All eight signals are required.
Given the following sequence of word addresses, in decimal,
1, 4, 2, 8, 23, 6, 17, 3, 5, 24, 28, 19, 57, 0, 18, 3, 11
(No modification is needed; just convert to binary:
1 is 000001, 56 is 111000, etc. These are memory addresses.)
1) Simulate an 8 word cache with one word per block, direct mapped.
a) For each address, list the six bit binary and indicate
H for hit, M for miss.
b) Draw the cache showing cache binary address and cache contents
after all addresses have been processed. Use (1) for the
contents of memory address 1, (4) for the contents of
memory address 4, etc.
2) Simulate a 16 word cache with four words per block, direct mapped.
a) For each address, list the six bit binary and indicate
H for hit, M for miss.
b) Draw the cache showing cache binary address and cache contents
after all addresses have been processed. Use (1) for the
contents of memory address 1, (4) for the contents of
memory address 4, etc.
eight word cache sixteen word cache
one word per block four words per block
00 01 10 11
+----+ +----+----+----+----+
000 | | 00 | | | | |
+----+ +----+----+----+----+
001 | | 01 | | | | |
+----+ +----+----+----+----+
010 | | 10 |( 8)|( 9)|(10)|(11)|
+----+ +----+----+----+----+
011 |(11)| 11 | | | | |
+----+ +----+----+----+----+
100 | |
+----+
101 | |
+----+
110 | |
+----+
111 | |
+----+
The last address, decimal 11, has its contents shown in the cache.
(You do not get points for this entry.)
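The direct-mapped simulation for part 1 can be sketched as below. For an 8-word, one-word-per-block cache the index is the low three address bits (address mod 8) and the tag is the rest; part 2 works the same way with a different index/tag/offset split.

```python
# Direct-mapped cache simulation sketch for part 1: 8 one-word blocks,
# index = address mod 8, tag = address // 8, 6-bit word addresses.
addresses = [1, 4, 2, 8, 23, 6, 17, 3, 5, 24, 28, 19, 57, 0, 18, 3, 11]

cache = {}                         # index -> tag currently held
trace = []
for a in addresses:
    index, tag = a % 8, a // 8
    hit = cache.get(index) == tag
    trace.append((a, f"{a:06b}", "H" if hit else "M"))
    cache[index] = tag

for a, bits, hm in trace:
    print(f"{a:3d}  {bits}  {hm}")
# Reconstruct which memory address each cache line now holds: tag*8 + index.
print("final contents:", {f"{i:03b}": t * 8 + i for i, t in sorted(cache.items())})
```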
Given three memory organizations below, for each, compute the time
in cycles to load the cache and compute the average memory latency.
a) b) c)
+-----+ +-----+ +-----+
| CPU | | CPU | | CPU |
+-----+ +-----+ +-----+
| | |
+-----+ +---+---+---+---+ +-----+
|cache| | cache | |cache|
+-----+ +---+---+---+---+ +-----+
/ \ / \ / \
| bus | | bus | | bus |
\ / \ / \ /
+-----+ +---+---+---+---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | | | | | registers
+-----+ +---+---+---+---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | |
| mem | | mem | |m0 | |m1 | |m2 | |m3 |
| | | | | | | | | | | |
+-----+ +---+---+---+---+ +---+ +---+ +---+ +---+
cache 16 words cache 16 words cache 16 words
per block per block per block
bus one word wide bus four words wide bus one word wide
memory one word wide memory four words wide memory, four independent
one word wide memories
The bus requires one clock to pass data.
Only one thing at a time can be on the bus.
The bus four words wide passes four words in one cycle.
The bus one word wide can pass only one word per cycle.
One cycle, the bus time, is required to send the address from the CPU
to the memory for all three memory organizations.
The address will always be on a 16 word boundary and the memories
know they are to send 16 words (the cache block size is 16 words).
Every memory takes 7 cycles from the time an address is applied
until the data is fetched from the memory. During this time the
address cannot be changed.
The data fetched from memory then takes one cycle on the bus
to get into the cache. The next address can be applied to the
memory at the start of this bus transfer cycle. This is known
as fetching overlapped with bus transfer. It requires a register
to hold the bits the memory has fetched.
Not shown, is the address incrementer inside the memory unit.
For a) each address is one greater and 16 fetches occur.
b) each address is four greater and 4 fetches occur.
c) m0 gets addresses 0, 4, 8, 12, four fetches occur
m1 gets addresses 1, 5, 9, 13, four fetches occur
m2 gets addresses 2, 6, 10, 14, four fetches occur
m3 gets addresses 3, 7, 11, 15, four fetches occur
fetching is overlapped, concurrent, in m0, m1, m2, m3.
Show your work to get partial credit,
e.g. list each clock cycle (or range of clock cycles) and show
what is happening or a formula for this specific case that you
derive from looking at the clock cycles.
"W0" stands for the word at the base address.
The quote marks have the meaning 'ditto' which means same as above.
a) b) c)
1 address on bus address on bus address on bus
2 fetching W0 fetching W0-W3 m0 fetching W0
3 fetching W0 fetching W0-W3 " m1 fetching W1
4 fetching W0 fetching W0-W3 " " m2 fetching W2
5 fetching W0 fetching W0-W3 " " " m3 fetching W3
6 fetching W0 fetching W0-W3 " " " "
7 fetching W0 fetching W0-W3 " " " "
8 fetching W0 fetching W0-W3 " " " "
9 word 0 on bus words 0-3 on bus word 0 on bus
fetching W1 fetching W4-W7 m0 fetching W4 plus - " " "
10 fetching W1 fetching W4-W7 word 1 on bus
m1 fetching W5 plus " - " "
11 fetching W1 fetching W4-W7 word 2 on bus
m2 fetching W6 plus " " - "
etc. (please ignore the ditto marks if you don't understand them)
You may write out the full sequence or figure out a formula that
works for this case. Include the formula if you use one.
For each memory organization give the total clock cycles to load
the cache. The last word of the cache must be loaded thus count
the last bus cycle. This number is called the "miss penalty".
The miss penalty would be divided by 16, the cache block size,
to get the average increase in CPI for a cache miss, assuming
instructions are executed sequentially.
The miss penalty divided by 16 is called the average memory latency.
Note that this is less than the 7 cycles for a single memory fetch
for cases b) and c).
Why can't we let the CPU execute an instruction when the first word
is in the cache? Well, it might be the third word in the block that
the CPU needs.
Why can't we let the CPU execute an instruction when the word it
needs is in the cache? Well, what if the CPU instruction used that
word from the cache and computed a result that went into the last
word in the cache block! The CPU would take 5 clocks to compute the
value and put it into the cache but the cache may take 10 to 20 clocks
before the last word is fetched from memory and put into the last
word of the cache block, over-writing the computed value.
So, the CPU pipeline is stalled while the cache is being loaded.
This applies to part 3 of the project. Note that a "cycle" means
a clock cycle.
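The cycle-by-cycle tables above can be collapsed into closed formulas. The sketch below uses one reading of the timing rules (1 address cycle, 7 cycles per fetch, 1 bus cycle per word or per 4-word group, fetch overlapped with the bus transfer); the exact totals depend on that reading, so check them against your own cycle listing.

```python
# Miss-penalty sketch for the three memory organizations, under the
# assumed timing: 1 address cycle, 7 cycles per memory fetch, fetch
# overlapped with the bus transfer, 16-word cache block.
BLOCK, FETCH = 16, 7

# a) one-word memory and bus: words land on the bus 7 cycles apart.
a = 1 + FETCH + 1 + (BLOCK - 1) * FETCH

# b) four-word memory and bus: 4 groups, each group 7 cycles apart.
b = 1 + FETCH + 1 + (BLOCK // 4 - 1) * FETCH

# c) four interleaved one-word memories: each group of 4 words needs
# 4 bus cycles, and each group is ready 7 cycles after the previous
# group's first bus cycle.
c = 1 + FETCH + (BLOCK // 4 - 1) * FETCH + 4

for name, cycles in (("a", a), ("b", b), ("c", c)):
    print(f"{name}) miss penalty {cycles} cycles, "
          f"average latency {cycles / BLOCK:.4f} cycles/word")
```

Note that under this reading the average latency for b) and c) does come out below the 7 cycles of a single memory fetch, as the text above says.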
Given a virtual memory system with:
virtual address 38 bits
physical address 34 bits
32KB pages (15 bit page offset)
Each page table entry has bits for valid, execute, read and dirty
(4 bits total) and bits for a physical page number.
a) How many bits in the page table? (do not answer in bytes!)
Three digit accuracy is good enough. The exponent may be
either a power of 2 or a power of 10.
The virtual address is extended to 40 bits, all else stays the same.
b) How many bits in the page table? (do not answer in bytes!)
Three digit accuracy is good enough. The exponent may be
either a power of 2 or a power of 10.
Note: There will be a page table for every process that is running,
yet the page tables are typically not completely allocated.
Only the sections of the page table being used are typically populated.
A fully associative TLB that has 64 blocks, 1 entry per block,
is needed for the page table a). The TLB must hold a page table
entry and a tag in each block.
c) How many bits in the TLB? (do not answer in bytes!)
d) Draw a two way associative TLB that has 4 blocks, 16 entries,
for the page table a). Virtual address 38 bits, physical address
34 bits, offset 15 bits.
The top will be the virtual address with the virtual page number
and virtual page offset. The bottom will be the physical address
with the physical page number and physical page offset.
Show the detail of all connections. Label the width of all
fields and signals. Refer to the textbook or class handout
for sample TLB's.
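The arithmetic for a) through c) can be sketched as below. The assumed entry layout is the one stated: 4 flag bits plus a physical page number, one entry per virtual page; for the fully associative TLB the tag is assumed to be the whole virtual page number.

```python
# Page-table size sketch (assumed layout: one entry per virtual page,
# entry = 4 flag bits + physical page number).
def page_table_bits(va_bits, pa_bits, offset_bits, flag_bits=4):
    vpn = va_bits - offset_bits          # virtual page number width
    ppn = pa_bits - offset_bits          # physical page number width
    return 2 ** vpn * (flag_bits + ppn)  # entries * bits per entry

a = page_table_bits(38, 34, 15)          # part a
b = page_table_bits(40, 34, 15)          # part b

# c) fully associative TLB, 64 blocks: each block holds a page table
# entry plus a tag, the whole virtual page number.
vpn, ppn = 38 - 15, 34 - 15
tlb = 64 * (vpn + 4 + ppn)

print(f"a) {a:.3e} bits  b) {b:.3e} bits  c) {tlb} bits")
```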
Given a memory and an I/O device on a bus as shown below:
64 bit wide bus, synchronous, 250MHz clock
==============================================================
| |
address and | two 32-bit data address and | two 32-bit data
word count | words sent word count | words received
received in | in one clock sent in one | in one clock
one clock | clock |
| |
+-----+-----+-----+-----+ +-----+-----+
| | | | | | I/O device|
+-----+-----+-----+-----+ | receiving |
| RAM memory that is | | word-count|
| four 32-bit words wide| | words of |
| and has output | | data |
| registers for four | +-----------+
| words |
+-----------------------+
The system operates by the I/O device sending an address and
word count to the memory. The memory uses the address to start
a memory access that brings four 32-bit words into the memory
output registers. In parallel, [the memory starts the access of
the next four words] and [sends two words on the bus, then two
more words on the bus, then a bus idle, then another bus idle].
Thus, the bus actions are overlapped with the next memory access.
Note that the bus may be used by other devices between these
four clock operations on the bus.
The memory access is non-uniform. Upon receiving an address,
the first memory access takes longer than the following
memory accesses. This happens in some memories due to the
extra time to charge word or block select lines. Once started
with an address and word count, the memory puts data on the bus
until the word count is satisfied.
For this exercise the first memory access requires 36 nanoseconds
and each additional memory access in the same transaction requires
24 nanoseconds.
A bus transaction starts with the sending of an address and
word count. The transaction ends when the last word and two idles
are received by the I/O device. The transaction time does not
include the one clock to send the address and word count.
Bandwidth is measured in megabytes per second.
The address and word count are not included in the byte count
and not included in the time.
For the I/O device to get 512 words when the I/O device uses
4 as the word count.
a) Compute the total time from receipt of the address to the
end of the last transaction.
b) Compute the transactions per second.
c) Compute the bandwidth.
For the I/O device to get 512 words when the I/O device uses
16 as the word count.
d) Compute the total time from receipt of the address to the
end of the last transaction.
e) Compute the transactions per second.
f) Compute the bandwidth.
Show your work as formulas or as tables.
Be consistent. Use either clock counts or nanoseconds.
Obviously a 250MHz clock uses 4 nanoseconds per clock.
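One way to organize the work is the sketch below. It uses one reading of the rules above: a 4ns clock, two 32-bit words per bus clock, each 4-word access ending with 2 data clocks plus 2 idle clocks on the bus, a 36ns (9 clock) first access and 24ns (6 clock) later accesses overlapped with the previous bus activity. The address clock is counted in the total time; check this interpretation against your own clock listing.

```python
# Bus transaction sketch under the assumed timing: 4 ns clock, first
# access 9 clocks, later accesses 6 clocks overlapped with the bus,
# each 4-word group takes 2 data clocks + 2 idle clocks on the bus.
CLK = 4                                     # ns per clock at 250 MHz

def transaction_clocks(word_count):
    """Clocks for one transaction, including the address clock."""
    accesses = word_count // 4              # memory delivers 4 words per access
    # address + first (slow) access + overlapped later accesses + last
    # bus group (2 data + 2 idle clocks)
    return 1 + 9 + (accesses - 1) * 6 + 4

def totals(word_count, total_words=512):
    n = total_words // word_count           # transactions needed
    ns = n * transaction_clocks(word_count) * CLK
    tps = n / (ns * 1e-9)                   # transactions per second
    mb_per_s = total_words * 4 / ns * 1000  # bytes per ns -> MB/s
    return ns, tps, mb_per_s

for wc in (4, 16):
    ns, tps, bw = totals(wc)
    print(f"word count {wc}: {ns} ns total, {tps:.3e} trans/s, {bw:.1f} MB/s")
```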
Reading assignments:
Sections 4.5, 3.3, B.5-6 on CD, 3.4, 3.5, 3.6
7.3, 7.4, 7.5, 8.4, 8.5, 8.6, 9.2
Homework assignments 2, 4 through 12
Project: part1 and part2 (part 2 schematic attached)
General circuit, truth table and VHDL questions.
Last updated 5/18/05