Consider a processor with 64-bit words and the following cache statistics when running the SPEC CPU 2006 benchmarks (numbers inspired by published hit time and miss rate measurements):
| Level | Data Size | Associativity | Block size | Cache Blocks | Tag bits | Index bits | Offset bits | Hit Time | Miss Rate | Hit Rate | Miss Penalty |
|---|---|---|---|---|---|---|---|---|---|---|---|
| L1 Inst | 32 KiB | 4-way | 32 B | | | | | 3 cycles | 0.4% | | |
| L1 Data | 32 KiB | 8-way | 32 B | | | | | 4 cycles | 9.5% | | |
| L2 | 256 KiB | 8-way | 64 B | | | | | 10 cycles | 4% | | |
| L3 | 8 MiB | 16-way | 64 B | | | | | 40 cycles | 1% | | |
| Memory | — | — | — | — | — | — | — | 200 cycles | — | — | — |
Fill in the missing columns in the table.
Then:

- Give the tag, index and offset for the address 0x0000000100a1af30 at each of L1, L2 and L3.
- Give the tag, index and offset for the address 0x00007fff5f1e5ad8 at each of L1, L2 and L3.
- Assume a pseudo-LRU replacement policy at all cache levels using one recent bit per block, set if that block has been recently accessed. What are the true sizes of all four caches, including tag and data as well as the valid, dirty and recent bits? Remember that the L1 instruction cache does not need a dirty bit, but all other cache levels do.
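As a reminder of the arithmetic involved, here is a minimal sketch for a hypothetical 16 KiB, 2-way set-associative cache with 64 B blocks and 64-bit addresses (deliberately not one of the caches in the table above):

#include <cstdint>
#include <cstdio>

// log2 of a power of two
static unsigned log2u(uint64_t x) { unsigned n = 0; while (x > 1) { x >>= 1; ++n; } return n; }

int main() {
    const uint64_t dataBytes   = 16 * 1024; // hypothetical 16 KiB of data
    const uint64_t blockBytes  = 64;        // 64 B blocks
    const uint64_t ways        = 2;         // 2-way set associative
    const uint64_t addressBits = 64;

    const uint64_t blocks     = dataBytes / blockBytes;               // 256 blocks
    const uint64_t sets       = blocks / ways;                        // 128 sets
    const unsigned offsetBits = log2u(blockBytes);                    // 6
    const unsigned indexBits  = log2u(sets);                          // 7
    const unsigned tagBits    = addressBits - indexBits - offsetBits; // 51

    // true size = data bits plus tag, valid, dirty and recent bits, per block
    const uint64_t bitsPerBlock = blockBytes * 8 + tagBits + 3;       // 566 bits
    std::printf("offset=%u index=%u tag=%u true size=%llu bits\n",
                offsetBits, indexBits, tagBits,
                (unsigned long long)(bitsPerBlock * blocks));
    return 0;
}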
Use C++ to simulate the data cache behavior of this matrix/vector multiply code:
double r[1000], v[1000], m[1000][1000];
// initialize
for (int i = 0; i != 1000; ++i) {
    r[i] = 0.0;
    v[i] = drand48(); // needs #include <stdlib.h>
    for (int j = 0; j != 1000; ++j) {
        m[j][i] = drand48();
    }
}
// multiply
for (int i = 0; i != 1000; ++i) {
    for (int j = 0; j != 1000; ++j) {
        r[j] = r[j] + m[j][i]*v[i];
    }
}
You can simulate this (or any) cache for the matrix multiply portion of this code by adding something like this:
Cache cache;
// multiply
for (int i = 0; i != 1000; ++i) {
    for (int j = 0; j != 1000; ++j) {
        cache.read(&v[i]);
        cache.read(&m[j][i]);
        cache.read(&r[j]);
        cache.write(&r[j]);
        r[j] = r[j] + m[j][i]*v[i];
    }
}
Assume a cache hierarchy with the sizes and associativities given at the beginning of this homework, using a write-allocate, write-back policy at all levels.
Assume replacement using a recent-bit pseudo-LRU method. Each access sets the recent bit for the accessed block. If all of the other blocks in the set also have their recent bits set, clear the recent bits on those other blocks (leaving just the current block marked). On replacement, choose the first block in the set whose recent bit is false.
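A minimal standalone sketch of that bookkeeping (the Line struct and the touch/chooseVictim names here are illustrative; inside your Cache class the same logic would operate on one set of L1, L2 or L3):

#include <cstdint>

struct Line { uint64_t tag; bool valid, dirty, recent; };

// Mark one way as recently used; if every other way in the set is now
// also marked recent, clear the others so only this way stays marked.
void touch(Line* set, int ways, int way) {
    set[way].recent = true;
    for (int w = 0; w < ways; ++w)
        if (w != way && !set[w].recent) return; // another way is still "old"
    for (int w = 0; w < ways; ++w)
        if (w != way) set[w].recent = false;
}

// On a miss, pick the first way whose recent bit is false. Under the
// write-back policy, a valid and dirty victim must be written back to
// the next level before its line is reused.
int chooseVictim(const Line* set, int ways) {
    for (int w = 0; w < ways; ++w)
        if (!set[w].recent) return w;
    return 0; // unreachable once touch() has run on a set with ways >= 2
}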
The Cache class should look something like this. Note that it only needs to track the cache metadata; the actual data is still handled by the C++ code itself, so the CacheLine struct holds just the tag and status bits rather than the data a real cache would store.
class Cache {
    // meta-data for each cache line: just the tag and status bits,
    // since the data itself stays in the program's own arrays
    struct CacheLine {
        uint64_t tag; // needs #include <cstdint>
        bool valid;
        bool dirty;
        bool recent;
    };
    // the L*_INDICES and L*_ASSOCIATIVITY constants come from the table above
    CacheLine L1[L1_INDICES][L1_ASSOCIATIVITY];
    CacheLine L2[L2_INDICES][L2_ASSOCIATIVITY];
    CacheLine L3[L3_INDICES][L3_ASSOCIATIVITY];
    // cache statistics
    int L1accesses, L1hits, L1cycles;
    int L2accesses, L2hits, L2cycles;
    int L3accesses, L3hits, L3cycles;
    int memAccesses, memCycles;
public:
    Cache();             // initialize: mark every line invalid, zero the counters
    void read(double*);  // record a read of the given address
    void write(double*); // record a write of the given address
    void stats();        // print the accumulated statistics
};
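For illustration only (the split helper, the AddressParts struct and the parameter names below are hypothetical, not part of the required interface), read() and write() might begin by breaking the incoming pointer into tag and index using the bit counts from the table:

#include <cstdint>

struct AddressParts { uint64_t tag, index; };

// indexBits and offsetBits come from your answers in the table above
AddressParts split(const void* p, int indexBits, int offsetBits) {
    uint64_t addr = reinterpret_cast<uint64_t>(p);
    uint64_t index = (addr >> offsetBits) & ((1ull << indexBits) - 1);
    uint64_t tag   = addr >> (offsetBits + indexBits);
    return {tag, index};
}

From there a read would search the indexed set for a valid line with a matching tag, counting a hit and the L1 hit time on success; on a miss it would probe L2, then L3, then memory, accumulating cycles, and one reasonable choice is to allocate the block at every level it missed in on the way back. A write does the same but also sets the dirty bit on the line it lands in, and a dirty victim chosen for eviction must first be written back to the next level.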
Submit on paper or electronically (by pushing to your assn5 directory in your git repository) by the beginning of class on the due date. If you do the extra credit, you must submit that part electronically, together with an informative readme.txt file, including the final statistics for both cases as reported by running your program.