This is an architecture diagram for the ARM Cortex-A57 processor (from PC Watch by way of AnandTech). This processor can issue up to three ARM instructions per cycle (3-way Instruction Decode), and can issue up to 8 micro-operations per cycle to the execution units using a variation of Tomasulo's algorithm.
Pipeline
Pipeline depth
What is the ideal pipeline speedup over a single-cycle design for simple integer ALU instructions, treating the architecture as a simple pipeline (only one instruction issued per cycle, and no out-of-order completion)?
Multiple issue
What is the speedup over a single-cycle design for simple integer ALU instructions if you assume three instructions issued per cycle and no stalls or misprediction penalties?
Branching
Branch Penalty
What is the branch penalty in cycles?
Branch Prediction
Of the things in the Branch Prediction box, which are used to predict branch direction and which are used to predict branch target address?
CPI with branching
Assuming sustained issue of three instructions per cycle, and no stalls due to data hazards, what is the expected CPI if 20% of the instructions are branches and branch prediction achieves 95% accuracy?
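One common way to set this up is to add the average misprediction cost per instruction to the ideal CPI of 1/issue width. The sketch below assumes that model; the function name and the penalty value of 10 cycles are placeholders for illustration only, and the actual penalty has to be read off the diagram.

```python
def cpi_with_branches(issue_width, branch_fraction, accuracy, penalty_cycles):
    """CPI = ideal CPI + (branches per instruction) * (mispredict rate) * penalty."""
    ideal_cpi = 1.0 / issue_width
    mispredict_rate = 1.0 - accuracy
    return ideal_cpi + branch_fraction * mispredict_rate * penalty_cycles

# penalty_cycles=10 is a hypothetical placeholder, not the A57 value.
print(f"{cpi_with_branches(3, 0.20, 0.95, 10):.4f}")
```

Only the misprediction term depends on the pipeline, so substituting the penalty found in the Branch Penalty question gives the figure asked for here.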
Cache
L1 Cache addressing
What is the breakdown of a 48-bit virtual addresses into tag, index and offset for the L1 instruction cache? For the L1 data cache?
L2 Cache addressing
What is the breakdown of a 44-bit physical address into tag, index, and offset for a 2 MB L2 cache with 64-byte cache lines?
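The field widths follow from the cache geometry: offset bits come from the line size, index bits from the number of sets, and the tag takes whatever remains. The question does not state the L2 associativity, so the sketch below takes it as a parameter; the helper name and the 16-way value in the example are assumptions for illustration.

```python
def split_address(addr_bits, cache_bytes, line_bytes, ways):
    """Return (tag, index, offset) bit widths for a set-associative cache.

    Assumes cache_bytes, line_bytes, and the resulting set count are powers of two.
    """
    offset_bits = (line_bytes - 1).bit_length()   # log2(line size)
    sets = cache_bytes // (line_bytes * ways)
    index_bits = (sets - 1).bit_length()          # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 2 MB cache, 64-byte lines, 44-bit physical address.
# ways=16 is an assumed associativity; ways=1 models a direct-mapped cache.
print(split_address(44, 2 * 1024 * 1024, 64, 16))  # (27, 11, 6)
print(split_address(44, 2 * 1024 * 1024, 64, 1))   # (23, 15, 6)
```

The same helper applies to the L1 caches once their sizes and associativities are taken from the diagram.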
Memory access timing
Assume an A57 running at 2 GHz; L1 access time of 2 ns with a 90% hit rate; L2 hit time of 9 ns with a 95% hit rate; and memory with an access time of 154 ns. What is the average memory access time?
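A minimal sketch of the usual hierarchical model, in which each miss pays the next level's access time on top of the current one. It assumes the 95% L2 figure is a local hit rate (the fraction of L1 misses that hit in L2); the function name is illustrative, and a different reading of the hit rates would change the result.

```python
def amat_ns(l1_time, l1_hit, l2_time, l2_hit, mem_time):
    """Average memory access time in ns for a two-level cache plus main memory."""
    # Cost of an L1 miss: go to L2, and on an L2 miss go on to memory.
    l1_miss_penalty = l2_time + (1.0 - l2_hit) * mem_time
    return l1_time + (1.0 - l1_hit) * l1_miss_penalty

# Values from the question: 2 ns L1 at 90%, 9 ns L2 at 95%, 154 ns memory.
print(f"{amat_ns(2, 0.90, 9, 0.95, 154):.2f} ns")  # 3.67 ns
```

At 2 GHz each nanosecond is two cycles, so multiplying the result by 2 expresses it in clock cycles.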
CPI with branching and memory
What is the total expected CPI, including memory access stalls and branch penalties (from the CPI with branching question), for a program with 15% loads and 2% stores?
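One possible setup adds memory stall cycles per instruction to the CPI that already includes branch penalties. The sketch below assumes that only loads and stores stall, that every such access stalls for the same average time, and that stall time converts to cycles via the clock rate. The function name and the input values 0.40 and 1.5 ns are hypothetical placeholders, not answers to the earlier questions.

```python
def total_cpi(branch_cpi, load_fraction, store_fraction, stall_ns, clock_ghz):
    """CPI = CPI with branch penalties + memory accesses per instruction * stall cycles."""
    stall_cycles = stall_ns * clock_ghz           # ns * cycles per ns
    mem_fraction = load_fraction + store_fraction
    return branch_cpi + mem_fraction * stall_cycles

# branch_cpi=0.40 and stall_ns=1.5 are placeholders for illustration only.
print(f"{total_cpi(0.40, 0.15, 0.02, 1.5, 2.0):.2f}")
```

Whether stores stall the pipeline, and whether the L1 hit time counts as a stall, are modeling choices to state explicitly in the answer.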