ARM Shared Memory
The ARM Cortex-A57 uses a modification of the ESI protocol we cover in class, called the MOESI protocol. It adds "Modified" and "Owned" states that allow data to be shared directly from one core's L1 cache to another core's, without going through the L2 cache. If a block in the Owned state holds dirty data, that data is written back to L2 when the block becomes Invalid (due to eviction on a cache conflict, or a write request from another core).
Here’s a summary of the states.
- Modified: Data is dirty (does not match the L2 copy) and writable. Only one core can be in this state; all others must be Invalid. If another core issues a read request, change to Owned.
- Owned: Data may be dirty, but is not writable without changing states. If another core issues a read request, forward the data to that core, but don't write back to L2 yet; this saves time waiting on the L2 cache. Only one core can be in this state; all others must be Invalid or Shared.
- Exclusive: Data is clean (matches L2) and read-only; this is the only core with a copy. Can go to Modified without telling other cores. Only one core can be in this state; all others are Invalid. If another core issues a read request, change to Owned.
- Shared: Read-only access to the data. Many cores can be in the Shared state; one other core is allowed to be in the Owned state.
- Invalid: No copy of this cache block.
Draw a state diagram for this protocol. To avoid too much state-transition spaghetti, I suggest arranging the states in a circle. Include state transitions for CR = this core's CPU requests a Read, CW = this core's CPU requests a Write, BR = see a Read on the bus from another core, and BW = see a Write on the bus from another core. On each transition, include what, if anything, the core sends on the bus: W = send Write notification, R = send Read notification, or D = send Data.
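To make the bookkeeping concrete before you draw, here is a minimal sketch (not the full answer) of the same information as a transition table in Python. It encodes only the transitions spelled out in the state descriptions above; the remaining (state, event) pairs are the exercise.

```python
# Partial MOESI transition table: (state, event) -> (next state, bus action).
# Events and bus actions use the assignment's notation (CR/CW/BR/BW, W/R/D).
# Only transitions explicitly described above are filled in.
TRANSITIONS = {
    ("Modified",  "BR"): ("Owned",    "D"),   # another core reads: become Owned,
                                              # supplying the dirty data (implied)
    ("Owned",     "BR"): ("Owned",    "D"),   # forward data core-to-core; no L2 writeback yet
    ("Exclusive", "CW"): ("Modified", None),  # silent upgrade: no other core has a copy
    ("Exclusive", "BR"): ("Owned",    "D"),   # per the Exclusive description; data supply assumed
}

def next_state(state: str, event: str):
    """Return (next_state, bus_action), or None where the table isn't filled in."""
    return TRANSITIONS.get((state, event))
```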
Profiling Data
The Core i7 has roughly these cache and memory latencies:
Access | Latency |
--- | --- |
L1 latency | 4 cycles |
L2 total latency | 10 cycles |
L2 penalty | 6 cycles |
L3 unshared latency | 40 cycles |
L3 shared in another core | 64 cycles |
L3 modified in another core | 75 cycles |
Remote L3 | 100-300 cycles |
Local DRAM | 60 ns |
Remote DRAM | 100 ns |
On a 2.6 GHz Core i7, I recorded the following stats for a test program. This is the same ray-tracing program used for the valgrind demo in the first couple of weeks of class, but this time recorded using a sampling profiler and the i7's hardware counters.
Counter | Count |
--- | --- |
Cycles | 47,926,025,373 |
Instructions | 96,604,699,195 |
Branches | 9,791,557,458 |
Mispredicted Branches | 10,399,323 |
L1 Hits | 29,788,371,721 |
L1 Misses | 236,895,473 |
L2 Misses | 58,501,962 |
L3 Misses | 20,018,953 |
The most expensive single function in the program, Sphere::intersect(), was responsible for the following subset of those totals:
Counter | Count |
--- | --- |
Cycles | 1,463,862,906 |
Instructions | 3,191,721,155 |
Branches | 322,707,628 |
Mispredicted Branches | 290,988 |
L1 Hits | 994,868,818 |
L1 Misses | 6,764,568 |
L2 Misses | 1,689,718 |
L3 Misses | 568,123 |
Basic Amdahl
- How long does this program take to run?
- How much of that time is in the Sphere::intersect() function?
- What percentage is Sphere::intersect of the total program execution time?
- If Sphere::intersect() could be made 1.15x faster, what would the overall speedup be?
- What would the new total execution time be?
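For reference, a minimal sketch of the Amdahl's-law setup in Python, using the cycle counts from the tables above and the 2.6 GHz clock; the 1.15x factor is the speedup applied to the enhanced fraction:

```python
# Amdahl's-law setup using the counts above; assumes the 2.6 GHz clock.
CLOCK_HZ = 2.6e9
TOTAL_CYCLES = 47_926_025_373
INTERSECT_CYCLES = 1_463_862_906

total_time = TOTAL_CYCLES / CLOCK_HZ          # program run time, seconds
intersect_time = INTERSECT_CYCLES / CLOCK_HZ  # time inside Sphere::intersect()
fraction = intersect_time / total_time        # fraction of time being sped up

def amdahl_speedup(fraction_enhanced: float, s: float) -> float:
    """Overall speedup when a fraction of execution runs s times faster."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / s)

overall = amdahl_speedup(fraction, 1.15)
new_time = total_time / overall
```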
CPI
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the CPI?
- This processor should be able to achieve four instructions per cycle. What would the expected execution time be if we could achieve that rate?
- What would the speedup of achieving 0.25 CPI be over the actual CPI?
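A sketch of the CPI arithmetic, shown with the whole-program counters; substituting the Sphere::intersect() counts gives the per-function answers:

```python
# CPI arithmetic for the whole-program counters; assumes the 2.6 GHz clock.
CLOCK_HZ = 2.6e9
CYCLES = 47_926_025_373
INSTRUCTIONS = 96_604_699_195

cpi = CYCLES / INSTRUCTIONS             # cycles per instruction
ideal_cpi = 0.25                        # four instructions per cycle
ideal_cycles = INSTRUCTIONS * ideal_cpi
ideal_time = ideal_cycles / CLOCK_HZ    # expected execution time at 0.25 CPI
speedup = cpi / ideal_cpi               # ideal rate vs. the actual CPI
```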
Branch Prediction
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the branch misprediction rate?
- Assuming a 17-cycle branch misprediction penalty, how many cycles are spent on branch mispredictions?
- What percentage of the execution time is spent on branch misprediction?
- What would the expected speedup and execution time be if you could eliminate the branch mispredictions?
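A sketch of the misprediction arithmetic with the whole-program counters, assuming the 17-cycle penalty from the question:

```python
# Branch-misprediction arithmetic for the whole-program counters.
CLOCK_HZ = 2.6e9
CYCLES = 47_926_025_373
BRANCHES = 9_791_557_458
MISPREDICTS = 10_399_323
PENALTY = 17                                   # cycles per misprediction

mispredict_rate = MISPREDICTS / BRANCHES
mispredict_cycles = MISPREDICTS * PENALTY
mispredict_share = mispredict_cycles / CYCLES  # fraction of execution time
speedup = CYCLES / (CYCLES - mispredict_cycles)
new_time = (CYCLES - mispredict_cycles) / CLOCK_HZ
```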
Cache Misses
For each question, give answers both for the program as a whole and for just the Sphere::intersect() function.
- What is the average memory access time? Since this is a single program on a single core, use the unshared L3 time and the local DRAM time.
- How many cycles are spent on cache misses?
- What percentage of the execution time is spent on cache misses?
- What would the expected speedup and execution time be if you could eliminate the cache misses?
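Setting up AMAT takes some interpretation of the latency table. Here is a hedged sketch that treats the given 6-cycle "L2 penalty" as the extra cost beyond L1, treats the L3 and DRAM entries as total latencies (so penalties beyond the previous level are differences), and converts DRAM nanoseconds to cycles at 2.6 GHz; state whichever interpretation you use in your answer.

```python
# One possible AMAT setup for the whole-program counters. Assumptions flagged:
# L3/DRAM table entries are totals, so per-level penalties are differences,
# and 60 ns of local DRAM converts to cycles at 2.6 GHz.
CLOCK_GHZ = 2.6
L1_HITS, L1_MISSES = 29_788_371_721, 236_895_473
L2_MISSES, L3_MISSES = 58_501_962, 20_018_953

L1_TIME = 4                                      # cycles
L2_PENALTY = 6                                   # extra cycles beyond L1 (given)
L3_TOTAL = 40                                    # unshared L3 total latency, cycles
DRAM_TOTAL = 60 * CLOCK_GHZ                      # 60 ns -> 156 cycles
L3_PENALTY = L3_TOTAL - (L1_TIME + L2_PENALTY)   # beyond L2 (assumption)
DRAM_PENALTY = DRAM_TOTAL - L3_TOTAL             # beyond L3 (assumption)

accesses = L1_HITS + L1_MISSES
l1_miss_rate = L1_MISSES / accesses
l2_local_miss = L2_MISSES / L1_MISSES            # local (per-L1-miss) rate
l3_local_miss = L3_MISSES / L2_MISSES

amat = L1_TIME + l1_miss_rate * (
    L2_PENALTY + l2_local_miss * (L3_PENALTY + l3_local_miss * DRAM_PENALTY))

# Cycles beyond the L1 hit time, i.e., attributable to cache misses:
miss_cycles = (L1_MISSES * L2_PENALTY
               + L2_MISSES * L3_PENALTY
               + L3_MISSES * DRAM_PENALTY)
```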