By Jon "Hannibal" Stokes
Wednesday, April 05, 2006
Intel hasn't yet released much detailed information on Core's pipeline. What we do know is that it clocks in at 14 stages—the same length as the PowerPC 970's pipeline, about half the length of the Pentium 4 Prescott's ~30-stage pipeline, and a bit longer than the P6 core's 12-stage pipeline. This means that Core is designed for a steady and incremental succession of clockspeed improvements, and not the kind of rapid clockspeed scaling that characterized the Pentium 4.
If I had to guess at the actual makeup of Core's pipeline, I'd say it's essentially the same as the P6 pipeline, but with two wire-delay stages added to allow for signal propagation and clockspeed scaling. Alternately, the new stages could be an extra predecode and/or decode stage added to accommodate the front-end features described below, like macro-fusion, micro-ops fusion, and the beefed-up decoding hardware. We'll find out the identity of these stages eventually, when Intel releases more information.
Because Core's back end is so much wider than that of its predecessors, its reorder buffer (ROB) has been enlarged to 96 entries, up from 40 on the Pentium M. Core's unified reservation station has also been enlarged to accommodate more in-flight instructions and more execution units.
Not only has Core's instruction window (ROB + RS) been physically enlarged, but it has been "virtually enlarged," as well. Macro-fusion and micro-ops fusion, both described below, enable Core to track more instructions with less bookkeeping hardware. Thus Core's instruction window is functionally larger than the absolute number of ROB and RS entries would indicate.
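To make the idea of "virtual enlargement" concrete, here's a minimal Python sketch of the bookkeeping arithmetic. It is purely illustrative and not Intel's actual tracking scheme; the rob_entries function, the instruction tuples, and the fusion rules (a compare followed by a conditional jump shares one entry via macro-fusion, and a two-micro-op load-op instruction shares one entry via micro-ops fusion) are all my own simplifications.

# Toy illustration only: count how many ROB entries a short x86 sequence
# would occupy with and without fusion. Not Intel's actual bookkeeping.

def rob_entries(instructions, fusion=True):
    """Each item is (mnemonic, micro_op_count).

    Simplified rules: with fusion enabled, a cmp followed immediately by a
    conditional jump is tracked as one entry (macro-fusion), and a
    two-micro-op load-op instruction is tracked as one entry (micro-ops
    fusion). With fusion disabled, every micro-op gets its own entry.
    """
    entries = 0
    i = 0
    while i < len(instructions):
        name, uops = instructions[i]
        if (fusion and name == "cmp" and i + 1 < len(instructions)
                and instructions[i + 1][0].startswith("j")):
            entries += 1          # cmp + jcc fused into one tracked entry
            i += 2
            continue
        entries += 1 if (fusion and uops == 2) else uops
        i += 1
    return entries

code = [
    ("add_load", 2),   # add eax, [mem]: load micro-op + add micro-op
    ("cmp", 1),        # cmp eax, 10
    ("jne", 1),        # jne target
]

print(rob_entries(code, fusion=False))   # 4 entries
print(rob_entries(code, fusion=True))    # 2 entries

The same three x86 instructions occupy four entries without fusion but only two with it, which is the sense in which the window is functionally larger than its raw entry count suggests.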
Populating this large instruction window with a steady flow of new instructions is quite a task. Core's front end sports a number of innovations that let it keep the instruction window and execution core full of code.
Core sports a number of important new features in its front end, the most conspicuous of which is a new decoding unit that increases the number of x86 instructions per cycle that the processor can convert into micro-ops.
The following diagram shows the original P6 core's decoding hardware, which consists of two simple/fast decoders and one complex/slow decoder. The two simple/fast decoders handle x86 instructions that translate into exactly one micro-op, a class of instructions that makes up the great majority of the x86 instruction set. Each simple/fast decoder can send one micro-op per cycle to the micro-op buffer.
The one complex/slow decoder is responsible for handling x86 instructions that translate into two to four micro-ops. For the very small number of rarely used legacy instructions (like the string-manipulation instructions) that translate into more than four micro-ops, the complex decoder farms the job out to a microcode engine that can output streams of micro-ops into the micro-op buffer.
All told, the P6 core's three decoders can output a maximum of six micro-ops per cycle into the micro-op buffer, and the decoding unit as a whole can send up to three micro-ops per cycle on to the ROB.
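As a rough way to visualize the arithmetic behind that six-micro-op figure, here's a hypothetical Python sketch of a single decode cycle. The decode_cycle function and its rules are my own simplification, not a description of the real hardware: each instruction is reduced to its micro-op count, each simple/fast decoder takes one single-micro-op instruction per cycle, the lone complex/slow decoder takes one two-to-four-micro-op instruction, and anything larger is assumed to be handed off to the microcode engine (ignored here).

# Toy model of one decode cycle; an illustration, not Intel's implementation.

def decode_cycle(pending, simple_decoders):
    """Return the number of micro-ops emitted in one cycle.

    pending: micro-op counts of the x86 instructions waiting to decode.
    simple_decoders: how many simple/fast decoders the core has.
    """
    simple_free = simple_decoders   # slots for 1-micro-op instructions
    complex_free = True             # one complex/slow decoder
    emitted = 0
    for uops in pending:
        if uops == 1 and simple_free > 0:
            simple_free -= 1
            emitted += 1
        elif 2 <= uops <= 4 and complex_free:
            complex_free = False
            emitted += uops
        # >4 micro-ops: deferred to the microcode engine, not counted here
    return emitted

# P6: two simple/fast decoders plus the complex/slow decoder.
# Best case is 1 + 1 + 4 = 6 micro-ops in one cycle.
print(decode_cycle([1, 1, 4], simple_decoders=2))   # 6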
Because Core's dispatch width and execution core have been widened considerably, the old P6 decoding hardware would have been inadequate to keep the rest of the processor fed with micro-ops. Intel needed to increase the decode rate so that more micro-ops/cycle could reach the back end, so Core's designers did a few things to achieve this goal.
The first thing they did was add another simple/fast decoder, which means that Core's decoding hardware can send up to seven micro-ops per cycle to the micro-op queue, which in turn can pass up to four micro-ops per cycle on to the ROB. This new decoder is depicted in the figure below.
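In terms of the toy decode_cycle sketch from earlier (again, just an illustration), the extra simple/fast decoder shows up as one more single-micro-op slot per cycle:

# Core: three simple/fast decoders plus the complex/slow decoder.
# Best case is 1 + 1 + 1 + 4 = 7 micro-ops in one cycle.
print(decode_cycle([1, 1, 1, 4], simple_decoders=3))   # 7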
Also, more types of instructions can now use the simple/fast decoders. Specifically, memory instructions and SSE instructions that formerly used the complex/slow decoder can now use the simple/fast ones, thanks to micro-ops fusion and the new SSE hardware (both described below). Thus the new design appears to bring Intel much closer to the goal of one micro-op per x86 instruction, a goal that's important for reasons I'll go into shortly.