CPU Architecture, or why Data Oriented Programming
- Models
- CS is built on simplified models
- You don't need to understand differential equations of transistor behavior to program
- PRAM for MIMD parallel, SIMD for GPU
- Allow you to reason about performance
- Sometimes the model misses something important
- Code is not as fast as it could be
- Trying to make code faster can actually make it slower
- CPU model
- Executes instructions in order
- Operations are fixed cost
- Or a few simple classes (simple vs. transcendental)
- This is pretty wrong
- Bandwidth vs. Latency (network terms)
- Latency = how long until something completes (1st byte transmission time)
- Bandwidth = how many per second (transmission rate)
- Memory
- Optimized for bandwidth
- Expensive to fetch a row; reading more from the same row is cheap
- CPUs
- Optimized for throughput (= "bandwidth"), instructions/second
- Many instructions in flight
- Pipelining
- Multiple instructions issued per cycle (7 µops on Core)
- Multiple instructions retired per cycle
- Haswell: Up to 192 instructions in flight
- Start 2-4 simple integer operations per cycle
- For CPUs, latency is the time between dependent instructions
- An instruction blocks if its inputs aren't ready, but the CPU keeps working on independent instructions (see the accumulator sketch below)
- There is also a (longer) total time in flight: issue to retire
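A minimal sketch of what "keep working on other instructions" buys you. Both loops add the same numbers; the second keeps four independent dependency chains in flight, so the CPU need not wait out each addition's latency before starting the next. Function names are illustrative, and with doubles a compiler won't reassociate the serial loop this way on its own because it changes rounding:

```cpp
// Dependency chain vs. independent accumulators (hypothetical example).
#include <cstddef>
#include <vector>

double sum_serial(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v)
        acc += x;                        // every add waits on the previous add
    return acc;
}

double sum_unrolled(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {  // four independent chains in flight
        a0 += v[i + 0];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < v.size(); ++i)            // leftover elements
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```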
- Memory
- Stats (from chadaustin.me/2015/04/thinking-about-performance)
Cycle: 0.25-0.5 ns

| Level  | Latency                  | Equivalent | Granularity           | Size      | Associativity |
|--------|--------------------------|------------|-----------------------|-----------|---------------|
| Disk   | 10 ms (~20M cycles)      |            | access by page        | TBs       | 1             |
| SSD    | 50-250 µs (~100k cycles) |            | access by page        | TBs       | 1             |
| Memory | ~200 cycles              | atan       | access by cache block | GBs       | Full          |
| L1     | 3-4 cycles               | ALU        | access by word        | 10s of KB | 1-8           |
| L2     | 10-15 cycles             | DIV r8     | access by cache block | MBs       | 2-8           |
| L3     | ~40 cycles               | DIV r64    | access by cache block | 10s of MB | 4-16          |
- What to do
- Disk & SSD: latency is far too long to hide, so do something else in the meantime
- Memory
- Dependent instructions wait
- Once a load stalls, it doesn't take long before every instruction in flight depends on it
- Cache
- Likely to reuse data (e.g. variables), so keep some in small fast memory
- Find by bottom bits of address, associativity helps with conflicts
- Speculation
- Cheap to get neighboring bytes, so do it in case you need them
- Recognize patterns (sequential fetch) & prefetch blocks
- If you're wrong, just replace them later
- Change in model
- Local, coherent memory accesses are cheap
- A random memory access blocks all instructions in flight (see the pointer-chasing sketch below)
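To make the contrast concrete, a hypothetical sketch: a sequential sweep the prefetcher can stay ahead of, versus a linked-list walk where every load's address depends on the previous load, so nothing useful can be fetched early (Node and both functions are illustrative):

```cpp
// Sequential sweep vs. pointer chasing.
#include <vector>

long sum_sequential(const std::vector<int>& data) {
    long total = 0;
    for (int v : data)                   // neighboring blocks, prefetcher stays ahead
        total += v;
    return total;
}

struct Node {
    int value;
    Node* next;
};

long sum_chased(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;               // next address unknown until this load completes
    return total;
}
```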
- Change in programming
- Group related data (see the AoS vs. SoA sketch below)
- Avoid unnecessary pointer dereferencing
- The C++ -> operator is jokingly called the "cache miss operator"
- Avoid virtual inheritance
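A hypothetical sketch of "group related data": with an array of structs, the update loop drags the cold debug_name field through the cache alongside the positions; a struct of arrays keeps only the fields the loop actually touches densely packed (the particle types and update functions are illustrative):

```cpp
// Array-of-structs vs. struct-of-arrays for a particle update.
#include <cstddef>
#include <string>
#include <vector>

struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    std::string debug_name;                          // cold field, interleaved with hot data
};

struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<std::string> debug_name;             // cold field lives in its own array
};

void update_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (ParticleAoS& p : ps) {                      // each particle straddles several cache blocks
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

void update_soa(ParticlesSoA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {  // dense, sequential, prefetch-friendly
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
        ps.z[i] += ps.vz[i] * dt;
    }
}
```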
- Example: linear vs. binary search (sketch below)
- Linear: uses every byte of each cache block it touches, prefetch-friendly
- Binary: uses only part of each cache block, hard to prefetch, but asymptotically better
- Show data: crossover about 32
- Show also quadratic vs linear
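A minimal sketch of the two searches over a sorted array; the ~32-element crossover quoted above is the notes' figure, not something this snippet measures (function names are illustrative):

```cpp
// Linear scan vs. binary search over a sorted array.
#include <algorithm>
#include <vector>

// Linear: reads whole cache blocks front to back, so the prefetcher stays ahead.
bool linear_contains(const std::vector<int>& sorted, int key) {
    for (int v : sorted) {
        if (v == key) return true;
        if (v > key)  return false;      // sorted input lets us stop early
    }
    return false;
}

// Binary: O(log n) comparisons, but each probe is a near-random access that
// uses only one element of the cache block it pulls in.
bool binary_contains(const std::vector<int>& sorted, int key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```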
- Secondary cache concerns
- Cache sets are chosen by a simple hash of the address (plus limited associativity), so power-of-2 strides can map to the same set and evict each other
- Communication with other cores is synchronized at cache-block granularity, so unrelated data sharing a block causes false sharing (see the sketch below)
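A sketch of that cache-block-level coupling, assuming a typical 64-byte block: two threads increment logically independent counters, but when the counters share a block the cores keep stealing it from each other (false sharing); padding each counter onto its own block removes the coupling. The struct and function names are illustrative:

```cpp
// Two threads bumping independent counters, with and without padding.
#include <atomic>
#include <cstdint>
#include <thread>

struct SharedBlock {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};               // likely on the same cache block as a
};

struct PaddedBlocks {
    alignas(64) std::atomic<std::uint64_t> a{0};   // 64 bytes = assumed typical block size
    alignas(64) std::atomic<std::uint64_t> b{0};   // now on its own block
};

template <class Counters>
void hammer(Counters& c) {
    std::thread t1([&c] { for (int i = 0; i < 1000000; ++i) ++c.a; });
    std::thread t2([&c] { for (int i = 0; i < 1000000; ++i) ++c.b; });
    t1.join();
    t2.join();
}
// hammer on a SharedBlock ping-pongs one block between cores; on PaddedBlocks it does not.
```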
- Branching
- Branch stops issue until resolved
- Figure out branch target
- Is branch taken?
- Speculation / prediction
- Make a guess
- Issue those instructions
- Don't retire until you know if you were right
- Throw out if wrong
- Mispredict penalty: up to all instructions in flight (on par with a memory access)
- Importance
- Many programs branch a lot
- gcc ~20% branches (1 in 5 instructions!)
- Strategies
- 1st time, no info: assume not taken
- Assume branch is consistent
- Assume branch is correlated with other branches
- Assume branch is a loop w/ consistent count
- Assume branch is a function return
- Change in model
- Predictable branches are cheap
- Unpredictable branches are expensive
- Change in programming
- Avoid unnecessary branching (see the branchless sketch below)
- Try to make branches more predictable
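A minimal sketch of removing a data-dependent branch, assuming random input data: the branchy count mispredicts roughly half the time, while the branchless version turns the comparison into arithmetic and leaves no control flow to predict (function names are illustrative; a modern compiler may already vectorize both):

```cpp
// Counting values above a threshold, with and without a data-dependent branch.
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t count_above_branchy(const std::vector<std::uint8_t>& data, std::uint8_t t) {
    std::size_t n = 0;
    for (std::uint8_t v : data) {
        if (v > t)                       // taken/not-taken depends on the data itself
            ++n;
    }
    return n;
}

std::size_t count_above_branchless(const std::vector<std::uint8_t>& data, std::uint8_t t) {
    std::size_t n = 0;
    for (std::uint8_t v : data)
        n += (v > t);                    // comparison result used as 0/1, no branch to predict
    return n;
}
```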
- Algorithmic analysis
- Big-O describes behavior on large inputs and assumes constant factors don't matter
- Constants are a function of data access & branching
- Changing the algorithm changes the constants
- Per-operation constants can range from ~1 cycle (cache hit, predicted branch) to ~200 cycles (memory access)
- "Best" algorithm may not win for small data
- Change in programming
- Be aware of data size
- Choose the algorithm based on data size (see the size-aware sort sketch below)
- Reorganize data to reduce constants
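A sketch of choosing the algorithm by data size, in the spirit of std::sort's own small-range cutoff: below some threshold, insertion sort's small constants beat the O(n log n) algorithm. The cutoff of 32 just echoes the crossover figure above and should be measured, not assumed:

```cpp
// Dispatch on data size; the cutoff and names are illustrative.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kSmallCutoff = 32;           // assumed crossover, tune by measuring

void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) {          // shift larger elements right
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}

void size_aware_sort(std::vector<int>& v) {
    if (v.size() <= kSmallCutoff)
        insertion_sort(v);                         // small constants win on small data
    else
        std::sort(v.begin(), v.end());             // asymptotics win on large data
}
```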