ARM Shared Memory
The ARM Cortex-A57 uses a modification of the ESI protocol we cover in class, called the MOESI protocol. It adds "Modified" and "Owned" states that allow data to be shared directly from one core's L1 cache to another core's, without going through the L2 cache. If a block in the Owned state holds dirty data, that data is written back to L2 when the block becomes Invalid (due to eviction on a cache conflict, or a write request from another core).
Here’s a summary of the states.
- Modified: Data is dirty (does not match the L2 copy) and writable. Only one core can be in this state; all others must be Invalid. If another core issues a read request, change to Owned.
- Owned: Data may be dirty, but is not writable without changing states. If another core issues a read request, forward the data to that core, but don't write back to L2 yet; this saves time waiting on the L2 cache. Only one core can be in this state; all others must be Invalid or Shared.
- Exclusive: Data is clean (matches L2) and read-only; this is the only core with a copy. Can go to Modified without telling other cores. Only one core can be in this state; all others are Invalid. If another core issues a read request, change to Owned.
- Shared: Read-only access to the data. Many cores can be in the Shared state; one other core is allowed to be in the Owned state.
- Invalid: No copy of this cache block.
Draw a state diagram for this protocol. To avoid too much state-transition spaghetti, I suggest arranging the states in a circle. Include state transitions for CR = this core's CPU requests a Read, CW = this core's CPU requests a Write, BR = see a Read on the bus from another core, and BW = see a Write on the bus from another core. On each transition, include what, if anything, the core sends on the bus: W = send Write notification, R = send Read notification, or D = send Data.
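To make the bookkeeping concrete before you draw, here is a minimal sketch (not the full answer) of the same information as a transition table in Python. It encodes only the transitions spelled out in the state descriptions above; the remaining (state, event) pairs are the exercise.

```python
# Partial MOESI transition table: (state, event) -> (next state, bus action).
# Events and bus actions use the assignment's notation (CR/CW/BR/BW, W/R/D).
# Only transitions explicitly described above are filled in.
TRANSITIONS = {
    ("Modified",  "BR"): ("Owned",    "D"),   # another core reads: become Owned,
                                              # supplying the dirty data (implied)
    ("Owned",     "BR"): ("Owned",    "D"),   # forward data core-to-core; no L2 writeback yet
    ("Exclusive", "CW"): ("Modified", None),  # silent upgrade: no other core has a copy
    ("Exclusive", "BR"): ("Owned",    "D"),   # per the Exclusive description; data supply assumed
}

def next_state(state: str, event: str):
    """Return (next_state, bus_action), or None where the table isn't filled in."""
    return TRANSITIONS.get((state, event))
```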
Profiling Data
The Core i7 has roughly these cache and memory latencies:
Access | Latency |
--- | --- |
L1 latency | 4 cycles |
L2 total latency | 10 cycles |
L2 penalty | 6 cycles |
L3 unshared latency | 40 cycles |
L3 shared in another core | 64 cycles |
L3 modified in another core | 75 cycles |
Remote L3 | 100-300 cycles |
Local DRAM | 60 ns |
Remote DRAM | 100 ns |
On a 2.6 GHz Core i7, I recorded the following stats for a test program. This is the same ray-tracing program used for the valgrind demo in the first couple of weeks of class, but this time recorded using a sampling profiler and the i7's hardware counters.
Counter | Count |
--- | --- |
Cycles | 47,926,025,373 |
Instructions | 96,604,699,195 |
Branches | 9,791,557,458 |
Mispredicted Branches | 10,399,323 |
L1 Hits | 29,788,371,721 |
L1 Misses | 236,895,473 |
L2 Misses | 58,501,962 |
L3 Misses | 20,018,953 |
The most expensive single function in the program, Sphere::intersect(), was responsible for the following subset of those totals:
Counter | Count |
--- | --- |
Cycles | 1,463,862,906 |
Instructions | 3,191,721,155 |
Branches | 322,707,628 |
Mispredicted Branches | 290,988 |
L1 Hits | 994,868,818 |
L1 Misses | 6,764,568 |
L2 Misses | 1,689,718 |
L3 Misses | 568,123 |
Basic Amdahl
- How long does this program take to run?
- How much of that time is in the Sphere::intersect() function?
- What percentage is Sphere::intersect of the total program execution time?
- If Sphere::intersect() could be made 1.15x faster, what would the overall speedup be?
- What would the new total execution time be?
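For reference, a minimal sketch of the Amdahl's-law setup in Python, using the cycle counts from the tables above and the 2.6 GHz clock; the 1.15x factor is the speedup applied to the enhanced fraction:

```python
# Amdahl's-law setup using the counts above; assumes the 2.6 GHz clock.
CLOCK_HZ = 2.6e9
TOTAL_CYCLES = 47_926_025_373
INTERSECT_CYCLES = 1_463_862_906

total_time = TOTAL_CYCLES / CLOCK_HZ          # program run time, seconds
intersect_time = INTERSECT_CYCLES / CLOCK_HZ  # time inside Sphere::intersect()
fraction = intersect_time / total_time        # fraction of time being sped up

def amdahl_speedup(fraction_enhanced: float, s: float) -> float:
    """Overall speedup when a fraction of execution runs s times faster."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / s)

overall = amdahl_speedup(fraction, 1.15)
new_time = total_time / overall
```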
CPI
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the CPI?
- This processor should be able to achieve four instructions per cycle. What would the expected execution time be if we could achieve that rate?
- What would the speedup of achieving 0.25 CPI be over the actual CPI?
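A sketch of the CPI arithmetic, shown with the whole-program counters; substituting the Sphere::intersect() counts gives the per-function answers:

```python
# CPI arithmetic for the whole-program counters; assumes the 2.6 GHz clock.
CLOCK_HZ = 2.6e9
CYCLES = 47_926_025_373
INSTRUCTIONS = 96_604_699_195

cpi = CYCLES / INSTRUCTIONS             # cycles per instruction
ideal_cpi = 0.25                        # four instructions per cycle
ideal_cycles = INSTRUCTIONS * ideal_cpi
ideal_time = ideal_cycles / CLOCK_HZ    # expected execution time at 0.25 CPI
speedup = cpi / ideal_cpi               # ideal rate vs. the actual CPI
```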
Branch Prediction
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the branch misprediction rate?
- Assuming a 17-cycle branch misprediction penalty, how many cycles are spent on branch mispredictions?
- What percentage of the execution time is spent on branch misprediction?
- What would the expected speedup and execution time be if you could eliminate the branch mispredictions?
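A sketch of the misprediction arithmetic with the whole-program counters, assuming the 17-cycle penalty from the question:

```python
# Branch-misprediction arithmetic for the whole-program counters.
CLOCK_HZ = 2.6e9
CYCLES = 47_926_025_373
BRANCHES = 9_791_557_458
MISPREDICTS = 10_399_323
PENALTY = 17                                   # cycles per misprediction

mispredict_rate = MISPREDICTS / BRANCHES
mispredict_cycles = MISPREDICTS * PENALTY
mispredict_share = mispredict_cycles / CYCLES  # fraction of execution time
speedup = CYCLES / (CYCLES - mispredict_cycles)
new_time = (CYCLES - mispredict_cycles) / CLOCK_HZ
```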
Cache Misses
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the average memory access time? Since this is a single program on a single core, use the unshared L3 time and the local DRAM time.
- How many cycles are spent on cache misses?
- What percentage of the execution time is spent on cache misses?
- What would the expected speedup and execution time be if you could eliminate the cache misses?
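Setting up AMAT takes some interpretation of the latency table. Here is a hedged sketch that treats the given 6-cycle "L2 penalty" as the extra cost beyond L1, treats the L3 and DRAM entries as total latencies (so penalties beyond the previous level are differences), and converts DRAM nanoseconds to cycles at 2.6 GHz; state whichever interpretation you use in your answer.

```python
# One possible AMAT setup for the whole-program counters. Assumptions flagged:
# L3/DRAM table entries are totals, so per-level penalties are differences,
# and 60 ns of local DRAM converts to cycles at 2.6 GHz.
CLOCK_GHZ = 2.6
L1_HITS, L1_MISSES = 29_788_371_721, 236_895_473
L2_MISSES, L3_MISSES = 58_501_962, 20_018_953

L1_TIME = 4                                      # cycles
L2_PENALTY = 6                                   # extra cycles beyond L1 (given)
L3_TOTAL = 40                                    # unshared L3 total latency, cycles
DRAM_TOTAL = 60 * CLOCK_GHZ                      # 60 ns -> 156 cycles
L3_PENALTY = L3_TOTAL - (L1_TIME + L2_PENALTY)   # beyond L2 (assumption)
DRAM_PENALTY = DRAM_TOTAL - L3_TOTAL             # beyond L3 (assumption)

accesses = L1_HITS + L1_MISSES
l1_miss_rate = L1_MISSES / accesses
l2_local_miss = L2_MISSES / L1_MISSES            # local (per-L1-miss) rate
l3_local_miss = L3_MISSES / L2_MISSES

amat = L1_TIME + l1_miss_rate * (
    L2_PENALTY + l2_local_miss * (L3_PENALTY + l3_local_miss * DRAM_PENALTY))

# Cycles beyond the L1 hit time, i.e., attributable to cache misses:
miss_cycles = (L1_MISSES * L2_PENALTY
               + L2_MISSES * L3_PENALTY
               + L3_MISSES * DRAM_PENALTY)
```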