Snooping Shared Memory
The ARM Cortex-A57 uses a modification of the ESI protocol we cover in class, called the MOESI protocol. This adds "Modified" and "Owned" states that allow data to be shared directly from one core's L1 cache to another core's, without having to go through the L2 cache. If the data in an Owned block is dirty, it is written back to L2 when the block becomes Invalid (because of eviction due to a cache conflict, or a write request from another core).
Here’s a summary of the states.
- Modified: Data is dirty (not matching the L2 copy) and writable. Only one core can be in this state; all others must be Invalid. If another core issues a read request, change to Owned.
- Owned: Data may be dirty, but is not writable without changing states. If another core issues a read request, forward the data to the other core, but don’t write back to L2 yet. This saves time waiting on the L2 cache. Only one core can be in this state; all others must be Invalid or Shared.
- Exclusive: Read-only, but still the only core with a copy. Can go to Modified without telling other cores. Data is clean (matching L2) and is not writable without changing states. Only one core can be in this state; all others must be Invalid. If another core issues a read request, change to Owned.
- Shared: Read-only access to data. Many cores can be in the Shared state. One other core is allowed to be in the Owned state.
- Invalid: No copy of this cache block.
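The coexistence rules in the list above can be sketched as a small checker. This is only an illustration of the constraints as stated, not part of the protocol itself; the names `State` and `is_legal` are made up for this example, and it deliberately says nothing about the transitions between states.

```python
from collections import Counter
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_legal(states):
    """Return True if these per-core states for ONE cache block can coexist.

    `states` holds one State per core (hypothetical helper for illustration).
    """
    counts = Counter(states)
    others = len(states) - 1
    # Modified or Exclusive: a single holder, and every other core is Invalid.
    for sole in (State.MODIFIED, State.EXCLUSIVE):
        if counts[sole] > 0:
            return counts[sole] == 1 and counts[State.INVALID] == others
    # Otherwise only Owned, Shared, and Invalid remain: at most one Owned,
    # with any number of Shared or Invalid copies alongside it.
    return counts[State.OWNED] <= 1
```

For example, `is_legal([State.OWNED, State.SHARED, State.INVALID])` holds, while `is_legal([State.MODIFIED, State.SHARED])` does not, because a Modified copy requires every other core to be Invalid.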
Draw a state diagram for this protocol. To avoid too much state transition spaghetti, I suggest arranging the states in a circle. Include state transitions for CR = this core's CPU requests a Read, CW = this core's CPU requests a Write, BR = see a Read on the Bus from another core, and BW = see a Write on the Bus from another core. On each transition, include what, if anything, the core sends on the bus: W = send Write notification, R = send Read notification, or D = send Data.
Directory Shared Memory
The Origin2000 was a series of ccNUMA directory protocol shared memory machines created by SGI/Cray in the late 1990s. The Origin2000 architecture could support systems with up to 1024 processors. The network organization for this system gave the fastest access within pairs of nodes, then sets of four, then eight, then 16, etc.
The Origin2000 used a 64-bit MIPS R10000 processor with a 32KB 2-way associative L1 cache with 64-byte cache blocks. The L2 cache was 4MB of external SRAM, 2-way associative with 128-byte cache blocks. At the processor speed (a whopping 195MHz), cache and memory latencies for this system were as follows:
Level | Latency (ns) | Latency (clocks) |
---|---|---|
L1 cache | 5.1 | 1 |
L2 cache | 56.4 | 11 |
local memory | 310 | 61 |
4P remote memory | 540 | 106 |
8P remote memory | 707 | 138 |
16P remote memory | 726 | 142 |
32P remote memory | 773 | 151 |
64P remote memory | 876 | 169 |
128P remote memory | 945 | 185 |
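As a sanity check on how the two columns of the table relate, here is a minimal sketch that converts each latency in nanoseconds to clock cycles at the 195MHz clock given in the text. The names `CLOCK_HZ`, `LATENCY_NS`, and `ns_to_clocks` are made up for this example. The results should land close to the clocks column, though not exactly, since the table values appear to be rounded.

```python
CLOCK_HZ = 195e6  # processor clock rate from the text (195MHz)

# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "L1 cache": 5.1,
    "L2 cache": 56.4,
    "local memory": 310,
    "4P remote memory": 540,
    "8P remote memory": 707,
    "16P remote memory": 726,
    "32P remote memory": 773,
    "64P remote memory": 876,
    "128P remote memory": 945,
}


def ns_to_clocks(ns, clock_hz=CLOCK_HZ):
    """Convert a latency in nanoseconds to (fractional) clock cycles."""
    return ns * 1e-9 * clock_hz


if __name__ == "__main__":
    for level, ns in LATENCY_NS.items():
        print(f"{level:>20}: {ns:6.1f} ns = {ns_to_clocks(ns):6.1f} clocks")
```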
Average Memory Access Time
Write an expression for the average memory access time on an 8-processor (8P) machine, using variables for the appropriate miss rates.
Directory Protocol
The Origin2000 used a more complex protocol than the 3-state one we covered in class, but for the purposes of this question, assume it used that 3-state protocol. If the memory access is not local, and another processor has the exclusive copy, what sequence of messages is necessary for a read?
Transfer of Ownership
The TLB entry for each virtual page includes which node is the remote owner for that page. The Origin2000 included the ability to migrate blocks from one owner to another to improve access locality and reduce directory traffic. To support this, they added a new directory state, poisoned. When a block migrated, the former owner marked the block as poisoned in its directory. If a request came in to the former owner for a block in the poisoned state, it could only mean some processor had an out-of-date TLB entry for the block. The former owner would return a response to invalidate the TLB entry and retry, forcing the local system to refresh that TLB entry and get the correct new owner from the OS. Write a possible sequence of CPU reads, page faults, directory protocol messages, and TLB updates that could implement this for a read to a shared block where the processor does not have a local copy of the block, but does have an outdated entry in its TLB.
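The former owner's side of this mechanism, as described above, can be sketched as follows. This models only the directory at the node a block migrated away from; it does not model the requesting processor, the OS, or the new owner. Everything here is hypothetical: the class and method names, the response strings, and the names of the three ordinary directory states (uncached, shared, exclusive), which are an assumption about the 3-state protocol.

```python
from enum import Enum


class DirState(Enum):
    # The three ordinary state names are assumed for illustration.
    UNCACHED = "uncached"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"
    # The extra state described in the text.
    POISONED = "poisoned"


class Directory:
    """Directory at one node (hypothetical model, not the real hardware)."""

    def __init__(self):
        self.state = {}  # block address -> DirState

    def migrate_out(self, block):
        """The block has moved to a new owner: mark it poisoned here."""
        self.state[block] = DirState.POISONED

    def handle_request(self, block):
        """Respond to a request that arrived at this node for `block`."""
        if self.state.get(block, DirState.UNCACHED) is DirState.POISONED:
            # The requester must have used a stale TLB entry.
            return "INVALIDATE_TLB_AND_RETRY"
        return "SERVICE_NORMALLY"
```

A request for a block that has been passed to `migrate_out` gets the retry response, while requests for any other block are serviced as usual.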