CPU Architecture, or why Data Oriented Programming
- Models
- CS is built on simplified models
- You don't need to understand differential equations of transistor behavior to program
- PRAM for MIMD parallel, SIMD for GPU
- Allow you to reason about performance
- Sometimes the model misses something important
- Code is not as fast as it could be
- Trying to make code faster can actually make it slower
- CPU model
- Executes instructions in order
- Operations are fixed cost
- Or a few simple classes (simple vs. transcendental)
- This is pretty wrong
- Bandwidth vs. Latency (network terms)
- Latency = how long until something completes (1st byte transmission time)
- Bandwidth = how many per second (transmission rate)
- Memory
- Optimized for bandwidth
- Expensive to fetch a row; reading more from the same row is cheap
- CPUs
- Optimized for throughput (= "bandwidth"), instructions/second
- Many instructions in flight
- Pipelining
- Multiple instructions issued per cycle (7 µops on Core)
- Multiple instructions retired per cycle
- Haswell: Up to 192 instructions in flight
- Start 2-4 simple integer operations per cycle
- For CPUs, latency is the time between dependent instructions
- An instruction blocks if its inputs aren't ready, but the CPU keeps working on independent instructions (see the accumulator sketch below)
- There is also a (longer) total time in flight: issue to retire
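A minimal sketch of what "keep working on other instructions" buys you. Both loops add the same numbers; the second keeps four independent dependency chains in flight, so the CPU need not wait out each addition's latency before starting the next. Function names are illustrative, and with doubles a compiler won't reassociate the serial loop this way on its own because it changes rounding:

```cpp
// Dependency chain vs. independent accumulators (hypothetical example).
#include <cstddef>
#include <vector>

double sum_serial(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v)
        acc += x;                        // every add waits on the previous add
    return acc;
}

double sum_unrolled(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {  // four independent chains in flight
        a0 += v[i + 0];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < v.size(); ++i)            // leftover elements
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```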
- Memory
- Stats (from chadaustin.me/2015/04/thinking-about-performance)
Cycle: 0.25-0.5 ns

| Level  | Latency                  | Equivalent | Granularity           | Size      | Associativity |
|--------|--------------------------|------------|-----------------------|-----------|---------------|
| Disk   | 10 ms (~20M cycles)      |            | access by page        | TBs       | 1             |
| SSD    | 50-250 µs (~100k cycles) |            | access by page        | TBs       | 1             |
| Memory | ~200 cycles              | atan       | access by cache block | GBs       | Full          |
| L1     | 3-4 cycles               | ALU        | access by word        | 10s of KB | 1-8           |
| L2     | 10-15 cycles             | DIV r8     | access by cache block | MBs       | 2-8           |
| L3     | ~40 cycles               | DIV r64    | access by cache block | 10s of MB | 4-16          |
- What to do
- Disk & SSD: latency is far too long to hide, so do something else in the meantime
- Memory
- Dependent instructions wait
- Once a load stalls, it doesn't take long before every instruction in flight depends on it
- Cache
- Likely to reuse data (e.g. variables), so keep some in small fast memory
- Find by bottom bits of address, associativity helps with conflicts
- Speculation
- Cheap to get neighboring bytes, so do it in case you need them
- Recognize patterns (sequential fetch) & prefetch blocks
- If you're wrong, just replace them later
- Change in model
- Local, coherent memory accesses are cheap
- A random memory access blocks all instructions in flight (see the pointer-chasing sketch below)
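To make the contrast concrete, a hypothetical sketch: a sequential sweep the prefetcher can stay ahead of, versus a linked-list walk where every load's address depends on the previous load, so nothing useful can be fetched early (Node and both functions are illustrative):

```cpp
// Sequential sweep vs. pointer chasing.
#include <vector>

long sum_sequential(const std::vector<int>& data) {
    long total = 0;
    for (int v : data)                   // neighboring blocks, prefetcher stays ahead
        total += v;
    return total;
}

struct Node {
    int value;
    Node* next;
};

long sum_chased(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;               // next address unknown until this load completes
    return total;
}
```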
- Change in programming
- Group related data (see the AoS vs. SoA sketch below)
- Avoid unnecessary pointer dereferencing
- The C++ -> operator is jokingly called the "cache miss operator"
- Avoid virtual inheritance
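A hypothetical sketch of "group related data": with an array of structs, the update loop drags the cold debug_name field through the cache alongside the positions; a struct of arrays keeps only the fields the loop actually touches densely packed (the particle types and update functions are illustrative):

```cpp
// Array-of-structs vs. struct-of-arrays for a particle update.
#include <cstddef>
#include <string>
#include <vector>

struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    std::string debug_name;                          // cold field, interleaved with hot data
};

struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<std::string> debug_name;             // cold field lives in its own array
};

void update_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (ParticleAoS& p : ps) {                      // each particle straddles several cache blocks
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

void update_soa(ParticlesSoA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {  // dense, sequential, prefetch-friendly
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
        ps.z[i] += ps.vz[i] * dt;
    }
}
```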
- Example: linear vs. binary search (sketch below)
- Linear: uses every byte of each cache block it touches, prefetch-friendly
- Binary: uses only part of each cache block, hard to prefetch, but asymptotically better
- Show data: crossover about 32
- Show also quadratic vs linear
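A minimal sketch of the two searches over a sorted array; the ~32-element crossover quoted above is the notes' figure, not something this snippet measures (function names are illustrative):

```cpp
// Linear scan vs. binary search over a sorted array.
#include <algorithm>
#include <vector>

// Linear: reads whole cache blocks front to back, so the prefetcher stays ahead.
bool linear_contains(const std::vector<int>& sorted, int key) {
    for (int v : sorted) {
        if (v == key) return true;
        if (v > key)  return false;      // sorted input lets us stop early
    }
    return false;
}

// Binary: O(log n) comparisons, but each probe is a near-random access that
// uses only one element of the cache block it pulls in.
bool binary_contains(const std::vector<int>& sorted, int key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```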
- Secondary cache concerns
- Cache sets are chosen by a simple hash of the address (plus limited associativity), so power-of-2 strides can map to the same set and evict each other
- Communication with other cores is synchronized at cache-block granularity, so unrelated data sharing a block causes false sharing (see the sketch below)
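A sketch of that cache-block-level coupling, assuming a typical 64-byte block: two threads increment logically independent counters, but when the counters share a block the cores keep stealing it from each other (false sharing); padding each counter onto its own block removes the coupling. The struct and function names are illustrative:

```cpp
// Two threads bumping independent counters, with and without padding.
#include <atomic>
#include <cstdint>
#include <thread>

struct SharedBlock {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};               // likely on the same cache block as a
};

struct PaddedBlocks {
    alignas(64) std::atomic<std::uint64_t> a{0};   // 64 bytes = assumed typical block size
    alignas(64) std::atomic<std::uint64_t> b{0};   // now on its own block
};

template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&c] { for (int i = 0; i < 1000000; ++i) ++c.a; });
    std::thread t2([&c] { for (int i = 0; i < 1000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}
// hammer on a SharedBlock ping-pongs one block between cores; on PaddedBlocks it does not.
```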
- Branching
- Branch stops issue until resolved
- Figure out branch target
- Is branch taken?
- Speculation / prediction
- Make a guess
- Issue those instructions
- Don't retire until you know if you were right
- Throw out if wrong
- Mispredict penalty: up to all instructions in flight (on par with a memory access)
- Importance
- Many programs branch a lot
- gcc ~20% branches (1 in 5 instructions!)
- Strategies
- 1st time, no info: assume not taken
- Assume branch is consistent
- Assume branch is correlated with other branches
- Assume branch is a loop w/ consistent count
- Assume branch is a function return
- Change in model
- Predictable branches are cheap
- Unpredictable branches are expensive
- Change in programming
- Avoid unnecessary branching (see the branchless sketch below)
- Try to make branches more predictable
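A minimal sketch of removing a data-dependent branch, assuming random input data: the branchy count mispredicts roughly half the time, while the branchless version turns the comparison into arithmetic and leaves no control flow to predict (function names are illustrative; a modern compiler may already vectorize both):

```cpp
// Counting values above a threshold, with and without a data-dependent branch.
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t count_above_branchy(const std::vector<std::uint8_t>& data, std::uint8_t t) {
    std::size_t n = 0;
    for (std::uint8_t v : data) {
        if (v > t)                       // taken/not-taken depends on the data itself
            ++n;
    }
    return n;
}

std::size_t count_above_branchless(const std::vector<std::uint8_t>& data, std::uint8_t t) {
    std::size_t n = 0;
    for (std::uint8_t v : data)
        n += (v > t);                    // comparison result used as 0/1, no branch to predict
    return n;
}
```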
- Algorithmic analysis
- Big-O describes behavior on large inputs and assumes constant factors don't matter
- Constants are a function of data access & branching
- Changing the algorithm changes the constants
- Per-operation constants can range from ~1 cycle (cache hit, predicted branch) to ~200 cycles (memory access)
- "Best" algorithm may not win for small data
- Change in programming
- Be aware of data size
- Choose the algorithm based on data size (see the size-aware sort sketch below)
- Reorganize data to reduce constants
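A sketch of choosing the algorithm by data size, in the spirit of std::sort's own small-range cutoff: below some threshold, insertion sort's small constants beat the O(n log n) algorithm. The cutoff of 32 just echoes the crossover figure above and should be measured, not assumed:

```cpp
// Dispatch on data size; the cutoff and names are illustrative.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kSmallCutoff = 32;           // assumed crossover, tune by measuring

void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) {          // shift larger elements right
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}

void size_aware_sort(std::vector<int>& v) {
    if (v.size() <= kSmallCutoff)
        insertion_sort(v);                         // small constants win on small data
    else
        std::sort(v.begin(), v.end());             // asymptotics win on large data
}
```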