# **CMSC 611: Advanced Computer Architecture**

**Complex Parallel Systems** 

## Some Graphics Examples

- Pixel-Planes 4
- Pixel-Planes 5
- Pixel-Flow
- NVIDIA GeForce 6 series
- NVIDIA GeForce 8 series
- Intel Larrabee

#### Pixel-Planes 4

- 512x512 SIMD array (full screen)



#### Pixel-Planes 5

- Message-passing
- ~40 i860 CPUs
- ~20 128x128 SIMD arrays (~80 tiles/screen)
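The ~80 tiles/screen figure can be checked with a little arithmetic. The sketch below assumes a 1280x1024 display (the resolution is an assumption, not stated on the slide) and uses a hypothetical helper, `tiles_per_screen`, to count how many 128x128 regions cover it.

```python
def tiles_per_screen(width, height, tile_w, tile_h):
    """Count the tile_w x tile_h regions needed to cover a width x height screen."""
    cols = -(-width // tile_w)    # ceiling division without floats
    rows = -(-height // tile_h)
    return cols * rows

# Assumed 1280x1024 screen, 128x128 SIMD arrays as on the slide
pp5_tiles = tiles_per_screen(1280, 1024, 128, 128)
print(pp5_tiles)  # 80
```

With roughly 20 SIMD arrays available, each array would have to process several of these regions per frame under this assumption.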



Fuchs, et al., "Pixel-Planes 5: A Heterogeneous Multiprocessor Graphics System Using Processor Enhanced Memories", SIGGRAPH 89


#### Pixel-Flow

- Message-passing
- ~35 nodes, each with
  - 2 HP-PA 8000 CPUs
  - 128x64 SIMD array (~160 tiles/screen)
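With the screen split into 128x64 tiles, a primitive only needs to be processed by the tiles it overlaps. The sketch below is a generic illustration of that idea and is not taken from the PixelFlow paper. The helper name `tiles_for_bbox` and the inclusive pixel bounding box are assumptions made for the example.

```python
def tiles_for_bbox(x0, y0, x1, y1, tile_w=128, tile_h=64):
    """Return (col, row) indices of every tile touched by an inclusive pixel bounding box."""
    cols = range(x0 // tile_w, x1 // tile_w + 1)
    rows = range(y0 // tile_h, y1 // tile_h + 1)
    return [(c, r) for r in rows for c in cols]

# A primitive whose bounding box spans pixels (100, 50) to (300, 70)
tiles = tiles_for_bbox(100, 50, 300, 70)
print(tiles)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]

# Assumed 1280x1024 screen: 10 columns x 16 rows of 128x64 tiles
total_tiles = (1280 // 128) * (1024 // 64)
print(total_tiles)  # 160
```

Under the assumed 1280x1024 resolution, the tile count matches the ~160 tiles/screen on the slide.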



Eyles, et al., "PixelFlow: The Realization", Graphics Hardware 1997


# PC Graphics Cards



#### **NVIDIA 7800 / G70**






#### **NVIDIA G80**



# **Streaming Processors**



## Intel Larrabee

Block diagram (the left and right blocks span the full height of the chip):

| Fixed Function Logic | In-Order<br>CPU core        | In-Order<br>CPU core |  | In-Order<br>CPU core | In-Order<br>CPU core | Memory & I/O Interfaces |
|----------------------|-----------------------------|----------------------|--|----------------------|----------------------|-------------------------|
|                      | Interprocessor Ring Network |                      |  |                      |                      |                         |
|                      | Coherent<br>L2 cache        | Coherent<br>L2 cache |  | Coherent<br>L2 cache | Coherent<br>L2 cache |                         |
|                      | Coherent<br>L2 cache        | Coherent<br>L2 cache |  | Coherent<br>L2 cache | Coherent<br>L2 cache |                         |
|                      | Interprocessor Ring Network |                      |  |                      |                      |                         |
|                      | In-Order<br>CPU core        | In-Order<br>CPU core |  | In-Order<br>CPU core | In-Order<br>CPU core |                         |

#### Larrabee Core



## Larrabee: In Order Core

| # CPU cores              | 2 out-of-order | 10 in-order    |
|--------------------------|----------------|----------------|
| Instruction issue        | 4 per clock    | 2 per clock    |
| VPU lanes per core       | 4-wide SSE     | 16-wide vector |
| Single-stream throughput | 4 per clock    | 2 per clock    |
| Vector throughput        | 8 per clock    | 160 per clock  |

In-order cores are small, so more of them fit on a chip
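The Vector row in the table is consistent with multiplying the core count by the VPU width. The sketch below assumes each core issues one full-width vector instruction per clock. That assumption and the helper name `peak_vector_ops_per_clock` are illustrative, not from the slide.

```python
def peak_vector_ops_per_clock(cores, lanes_per_vpu):
    """Peak vector lanes processed per clock, assuming one vector instruction per core per clock."""
    return cores * lanes_per_vpu

out_of_order = peak_vector_ops_per_clock(cores=2, lanes_per_vpu=4)    # 4-wide SSE
in_order = peak_vector_ops_per_clock(cores=10, lanes_per_vpu=16)      # 16-wide vector

print(out_of_order, in_order)   # 8 160
print(in_order // out_of_order) # 20
```

The in-order design loses single-stream throughput (2 vs. 4 per clock), but has 20x the peak vector throughput in this comparison.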

#### Larrabee ISA

- x86 base
- Cache control (instructions & modes)
  - prefetch
  - early eviction
  - Operands direct from L1, about as fast as registers
- Exposed dual issue
  - Restricted instruction set for the second issued instruction
- 4 threads w/ independent registers
- Vector instructions

#### **Larrabee Fixed Function**

- Extra application-specific units
- Texture filtering
  - 12-40x faster than software
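To give a sense of the work involved, here is a minimal software sketch of bilinear texture filtering, one common filtering mode. It is not Larrabee's actual filtering logic. The function name `bilinear_sample`, the grayscale list-of-rows texture layout, and the clamp-at-edge behavior are all assumptions made for the example.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a grayscale texture (list of rows) at continuous texel coordinates (u, v)."""
    h, w = len(texture), len(texture[0])
    # Integer texel at or below the sample point, clamped to the texture
    x0 = min(max(math.floor(u), 0), w - 1)
    y0 = min(max(math.floor(v), 0), h - 1)
    # Neighboring texel, clamped at the right and bottom edges
    x1 = min(x0 + 1, w - 1)
    y1 = min(y0 + 1, h - 1)
    # Fractional position between the texels
    fx = min(max(u - x0, 0.0), 1.0)
    fy = min(max(v - y0, 0.0), 1.0)
    # Blend horizontally on both rows, then vertically
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

tex = [[0.0, 10.0],
       [20.0, 30.0]]
center = bilinear_sample(tex, 0.5, 0.5)
print(center)  # 15.0
```

Every sample costs four texel reads plus address clamping and several multiply-adds. That per-pixel overhead is the kind of work a dedicated texture unit takes off the cores.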

(Figure: Larrabee block diagram showing Fixed Function Logic, in-order CPU cores, interprocessor ring network, coherent L2 cache, and Memory & I/O Interfaces.)

#### Larrabee Size



#### **Larrabee Bandwidth**



# Larrabee Processing

