Consider a processor with 64-bit words and the following cache stats when running the SPEC CPU 2006 benchmarks (numbers inspired by this hit time source and this miss rate source):
Level | Data Size | Associativity | Block Size | Cache Blocks | Tag Bits | Index Bits | Offset Bits | Hit Time | Miss Rate | Hit Rate | Miss Penalty
---|---|---|---|---|---|---|---|---|---|---|---
L1 Inst | 32 KiB | 4-way | 32 B | | | | | 3 cycles | 0.4% | |
L1 Data | 32 KiB | 8-way | 32 B | | | | | 4 cycles | 9.5% | |
L2 | 256 KiB | 8-way | 64 B | | | | | 10 cycles | 4% | |
L3 | 8 MiB | 16-way | 64 B | | | | | 40 cycles | 1% | |
Memory | — | — | — | — | — | — | — | 200 cycles | — | — | —
Fill in the missing columns in the table.
Determine the tag, index and offset for the address 0x0000000100a1af30 at each of L1, L2 and L3.

Determine the tag, index and offset for the address 0x00007fff5f1e5ad8 at each of L1, L2 and L3.

Assume a pseudo-LRU replacement policy at all cache levels using one recent bit per block, set if that block has been recently accessed. What are the true sizes of all four caches, including the tag and data as well as the valid, dirty and recent bits? Remember that the L1 instruction cache does not need a dirty bit, but all other cache levels do.
Use C++ to simulate the data-cache behavior of the following matrix/vector multiply code:
```cpp
double r[1000], v[1000], m[1000][1000];

// initialize
for (int i = 0; i != 1000; ++i) {
    r[i] = 0.0;
    v[i] = drand48(); // needs #include <stdlib.h>
    for (int j = 0; j != 1000; ++j) {
        m[j][i] = drand48();
    }
}

// multiply
for (int i = 0; i != 1000; ++i) {
    for (int j = 0; j != 1000; ++j) {
        r[j] = r[j] + m[j][i] * v[i];
    }
}
```
You can simulate this (or any) cache for the matrix multiply portion of this code by adding something like this:
```cpp
Cache cache;

// multiply
for (int i = 0; i != 1000; ++i) {
    for (int j = 0; j != 1000; ++j) {
        cache.read(&v[i]);
        cache.read(&m[j][i]);
        cache.read(&r[j]);
        cache.write(&r[j]);
        r[j] = r[j] + m[j][i] * v[i];
    }
}
```
Assume a cache with the sizes and associativities given at the beginning of this homework, using a write-allocate, write-back policy at all levels.
Assume replacement using a recent-bit pseudo-LRU method. Each access sets the recent bit for that block to true. If all of the other blocks in the set also have their recent bit set to true, clear the recent bit on those other blocks to false (leaving just the current block set). On replacement, choose the first block in the set with a false for its recent bit.
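The policy above can be sketched for a single set. This is a minimal illustration under assumed names (`Line`, `touch`, `victim`) and a hypothetical 4-way set:

```cpp
const int WAYS = 4;
struct Line { bool valid = false; bool recent = false; };

// Mark a block as recently used; if every other block in the set is also
// recent, clear their recent bits so only this block stays marked.
void touch(Line set[], int way) {
    set[way].recent = true;
    bool allRecent = true;
    for (int i = 0; i < WAYS; ++i)
        if (i != way && !set[i].recent) { allRecent = false; break; }
    if (allRecent)
        for (int i = 0; i < WAYS; ++i)
            if (i != way) set[i].recent = false;
}

// On replacement, choose the first block whose recent bit is false.
int victim(const Line set[]) {
    for (int i = 0; i < WAYS; ++i)
        if (!set[i].recent) return i;
    return 0; // unreachable: touch() always leaves at least one bit clear
}
```

Note that the clearing step in `touch` guarantees `victim` always finds a candidate.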
The Cache class should look something like this. Note that it is only necessary to track the cache metadata: the actual data is still handled by the C++ code, so the CacheLine struct does not need to contain the data itself (as a real cache would), just the tag and status bits.
```cpp
class Cache {
    // metadata for each cache line
    struct CacheLine {
        uint64_t tag;
        bool valid;
        bool dirty;
        bool recent;
    };
    CacheLine L1[L1_INDICES][L1_ASSOCIATIVITY];
    CacheLine L2[L2_INDICES][L2_ASSOCIATIVITY];
    CacheLine L3[L3_INDICES][L3_ASSOCIATIVITY];

    // cache statistics
    int L1accesses, L1hits, L1cycles;
    int L2accesses, L2hits, L2cycles;
    int L3accesses, L3hits, L3cycles;
    int memAccesses, memCycles;

public:
    Cache();              // initialize
    void read(double*);   // record a read
    void write(double*);  // record a write
    void stats();         // print stats
};
```
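Inside read() and write(), each level must split the incoming pointer into tag, index and offset. One possible sketch for a single level, where `OFFSET_BITS` and `INDEX_BITS` are placeholders for whichever widths you computed in the table (the values 5 and 8 below are only for illustration):

```cpp
#include <cstdint>

// Illustrative constants: substitute the widths worked out in the table.
constexpr int OFFSET_BITS = 5;
constexpr int INDEX_BITS  = 8;

// Extract the set index from a pointer's address bits.
uint64_t indexOf(const void* p) {
    uint64_t a = reinterpret_cast<uintptr_t>(p);
    return (a >> OFFSET_BITS) & ((uint64_t(1) << INDEX_BITS) - 1);
}

// The remaining high-order bits form the tag.
uint64_t tagOf(const void* p) {
    uint64_t a = reinterpret_cast<uintptr_t>(p);
    return a >> (OFFSET_BITS + INDEX_BITS);
}
```

With helpers like these, a per-level lookup reduces to scanning one set's ways for a valid line with a matching tag.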
Submit on paper or electronically (by pushing to your assn5 directory in your git repository) by the beginning of class on the due date. If you do the extra credit, you must submit that part electronically, together with an informative readme.txt file, including the final statistics for both cases as reported by running your program.