Six execution units and dynamic branch prediction highlight IBM/Motorola's new processor
Bob Ryan and Tom Thompson
With the backing of parents IBM, Motorola, and Apple, the PowerPC architecture has the best chance of taking significant market share away from the 80x86. The appeal of the original PowerPC--the 601--was that it offered the performance of the original Pentium for the price of a 486. The latest PowerPC, the 604, raises the stakes by offering performance that is 50 percent better than the latest Pentium, the 100-MHz P54C.
Pricing hadn't been established at the time of this writing, but Russell Stanphill, codirector of the IBM/Motorola Somerset design facility, says that 604 pricing will maintain the price/performance advantage that the PowerPC in general enjoys over the 80x86. The P54C currently sells for $995 per thousand and will likely be under $900 by fall. Despite its higher performance, the 604 will have to beat this price significantly to make reasonably priced 604-based personal computers possible.
At 100 MHz and with 1 MB of secondary cache and a 66-MHz bus, 604 simulations provide 160 SPECint92 and 165 SPECfp92, easily eclipsing the 100 SPECint92 and 80 SPECfp92 of the P54C. The design team achieved this performance by incorporating six separate execution units into the processor, three more than the 601 and two more than the 603. In addition, the 604 employs dynamic branch prediction to keep the execution units filled with instructions, yet it retains full compatibility with the other PowerPC processors.
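To put the simulated figures in proportion, the short calculation below works out the 604's lead over the P54C on each benchmark. It uses only the four SPEC92 numbers quoted above; the function name `advantage` is ours, chosen for illustration.

```python
# SPEC92 figures quoted in the text (the 604 numbers come from simulations).
ppc604 = {"SPECint92": 160, "SPECfp92": 165}
p54c = {"SPECint92": 100, "SPECfp92": 80}


def advantage(new, old):
    """Return the percentage by which `new` exceeds `old`."""
    return (new - old) * 100 / old


for benchmark in ppc604:
    lead = advantage(ppc604[benchmark], p54c[benchmark])
    print(f"{benchmark}: 604 leads the P54C by {lead:.2f} percent")
```

The floating-point gap is far wider than the integer gap, even though integer performance is the design's stated priority.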
The 604 will sample in the third quarter of this year and ship in at least limited-volume quantities in the fourth quarter. You can expect to see systems from Apple and IBM based on the part late this year or early next year.
Inside 604
With personal computers--as opposed to workstation-class machines--integer performance is everything, and the 604 microarchitecture betrays its PC leanings. Of its six execution pipelines, three are dedicated to integer functions. Two of the integer units handle single-cycle, register-to-register instructions, while the third is a more complex three-stage pipeline that handles integer multiplies and divides. The 604 is the first MPU (microprocessor unit) to offer three integer pipelines.
The other execution units in the 604 are an IEEE 754-compatible FPU, a load/store unit that moves data between registers and memory, and a branch unit that handles changes in the flow of instructions into the processor. The execution units are fed instructions and data by separate 16-KB, four-way set-associative instruction and data caches, which in turn communicate off-chip through the bus interface unit.
The 604 fetches up to four instructions per cycle from the instruction cache. It deposits these into an eight-entry prefetch/decode buffer. The bottom four entries of this buffer decode the instructions--that is, they determine the resources that each instruction requires.
Once decoded, up to four instructions per cycle move to the dispatch buffer. Here, the dispatch logic assigns a rename buffer as the destination for any writes to a register that an instruction makes, and it reads the instruction's operands from architectural registers or previously assigned rename buffers. The 604 contains 12 rename buffers for the 32 GPRs (general-purpose registers), and eight for the 32 FPRs (floating-point registers). To handle all the register accesses required by the multiple execution units, the 604 provides eight integer, three floating-point, and one condition register read ports. The use of rename buffers keeps instructions that are executing speculatively from updating architectural registers.
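The renaming scheme described above can be sketched in a few lines. This is a toy model, not the 604's actual logic: the class `RenameFile`, its method names, and the oldest-first reuse of free buffers are all assumptions made for illustration. Only the 32 GPRs and 12 rename buffers come from the text.

```python
class RenameFile:
    """Toy model of GPR renaming: 32 architectural registers, 12 rename buffers."""

    def __init__(self, num_arch=32, num_rename=12):
        self.arch = [0] * num_arch           # architectural (user-visible) registers
        self.free = list(range(num_rename))  # rename buffers not currently in use
        self.values = [None] * num_rename    # None means "result not produced yet"
        self.latest = {}                     # register -> newest rename buffer for it

    def allocate(self, dest):
        """Assign a rename buffer to a register write; None means dispatch stalls."""
        if not self.free:
            return None
        buf = self.free.pop(0)
        self.values[buf] = None
        self.latest[dest] = buf
        return buf

    def read(self, reg):
        """Read an operand from the newest rename buffer, else the architectural register."""
        buf = self.latest.get(reg)
        if buf is None:
            return ("value", self.arch[reg])
        if self.values[buf] is None:
            return ("pending", buf)          # producer has not finished executing
        return ("value", self.values[buf])

    def write_result(self, buf, value):
        """An execution unit deposits its result in the rename buffer."""
        self.values[buf] = value

    def retire(self, dest, buf):
        """At completion, copy the result to the architectural register and free the buffer."""
        self.arch[dest] = self.values[buf]
        if self.latest.get(dest) == buf:
            del self.latest[dest]
        self.free.append(buf)
```

The point to notice is that `write_result` never touches `arch`. A speculative result sits in its rename buffer until `retire` is called, so discarding it costs nothing more than freeing the buffer.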
The dispatch logic also assigns each instruction an entry in the 16-entry reorder buffer, which tracks the status of every instruction--including whether the instruction is executing speculatively--from dispatch to completion. Thus, the 604 can have no more than 16 instructions executing--speculatively or otherwise--at any one time. If the reorder buffer is full, dispatch stops until one or more entries become available.
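A minimal sketch of that capacity limit follows. The `ReorderBuffer` class and its field names are hypothetical, and retirement is reduced to freeing the oldest finished entry; only the 16-entry size and the stall-when-full behavior come from the text.

```python
from collections import deque

ROB_ENTRIES = 16  # reorder-buffer size quoted in the text


class ReorderBuffer:
    def __init__(self, size=ROB_ENTRIES):
        self.size = size
        self.entries = deque()  # oldest instruction sits at the left

    def dispatch(self, name, speculative=False):
        """Try to allocate an entry; return False if dispatch must stall."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"name": name, "speculative": speculative, "done": False})
        return True

    def retire_oldest(self):
        """Free the oldest entry once it has finished (simplified completion)."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()["name"]
        return None


rob = ReorderBuffer()
accepted = sum(rob.dispatch(f"insn{n}") for n in range(20))
print(accepted, "instructions accepted before dispatch stalled")
```

Once the oldest entry is marked done and retired, a slot opens and dispatch can resume.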
Dispatch and Execution
From the dispatch buffer, instructions move to the execution units--up to four per cycle, although no more than one per execution unit per cycle. Instructions are dispatched in program order, and no instruction can dispatch after a branch instruction in the same cycle.
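These dispatch rules are simple enough to express as a small function. The version below is a sketch under stated assumptions: the unit names and the `dispatch_group` helper are ours, the two single-cycle integer units are treated as interchangeable, and the availability of rename buffers and reorder-buffer entries is ignored.

```python
# Execution units available per cycle; the two simple integer units share a pool.
UNITS_PER_CYCLE = {"int": 2, "int_complex": 1, "fpu": 1, "load_store": 1, "branch": 1}
MAX_DISPATCH = 4


def dispatch_group(pending):
    """Return the names dispatched this cycle from `pending`, a program-ordered
    list of (name, unit_kind) pairs."""
    used = {kind: 0 for kind in UNITS_PER_CYCLE}
    group = []
    for name, kind in pending:
        if len(group) == MAX_DISPATCH:
            break
        if used[kind] == UNITS_PER_CYCLE[kind]:
            break  # in-order dispatch: a blocked instruction blocks those behind it
        used[kind] += 1
        group.append(name)
        if kind == "branch":
            break  # nothing dispatches after a branch in the same cycle
    return group


print(dispatch_group([("add", "int"), ("bne", "branch"), ("lwz", "load_store")]))
```

In this example the load waits for the next cycle even though the load/store unit is idle, because it follows a branch.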
Each execution unit is fronted with a two-stage reservation station to keep a stalled execution unit from blocking dispatch to other units. If the execution stage is busy, instructions wait here until the first stage is clear. They also wait here if any of their operands are not available. A data-forwarding mechanism feeds the reservation stations, allowing operands to be made available to follow-on instructions before the instructions that produce them complete writeback. Note that if an instruction with all its operands available is dispatched to an idle execution unit whose reservation station is empty, the instruction issues immediately to the execution stage.
Instructions issue from the reservation stations to the individual execution units. With the branch, floating-point, and load/store units, instructions issue in order from the station to the unit. With the integer units, an instruction can issue out of order from the station to the unit. Thus, the 604 supports in-order dispatch; in-order issue within the branch, load/store, and floating-point units; out-of-order issue in the integer units; and out-of-order execution.
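The difference between in-order and out-of-order issue comes down to which reservation-station entry is allowed to go next. The sketch below is illustrative only: `select_issue` and the `ready` flag are hypothetical names, and picking the oldest ready entry in the out-of-order case is an assumption, since the text does not say how the integer units choose.

```python
def select_issue(station, out_of_order):
    """Return the index of the entry to issue this cycle, or None.

    `station` lists the waiting instructions, oldest first; each is a dict
    whose "ready" flag is True once all operands are available.
    """
    if not station:
        return None
    if out_of_order:
        # Integer units: any ready entry may go (here, the oldest ready one).
        for index, entry in enumerate(station):
            if entry["ready"]:
                return index
        return None
    # Branch, floating-point, and load/store units: only the oldest may go.
    return 0 if station[0]["ready"] else None


waiting = [{"name": "add", "ready": False}, {"name": "sub", "ready": True}]
print("integer unit issues entry:", select_issue(waiting, out_of_order=True))
print("in-order unit issues entry:", select_issue(waiting, out_of_order=False))
```

With the same two entries waiting, the integer unit issues the younger instruction while an in-order unit issues nothing that cycle.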
The 604 uses the reorder buffer to ensure that instructions complete in program order, thus ensuring the integrity of the architectural model. For example, if an instruction executing out of order causes an exception, the exception is noted in the reorder buffer and isn't handled until the instruction is retired from the buffer. Exceptions are always handled in program order.
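A short sketch shows why noting the exception, rather than acting on it at once, preserves program order. The function `retire_in_order` and the dictionary fields are hypothetical, and the "overflow" exception is an arbitrary example.

```python
def retire_in_order(rob):
    """Walk the reorder buffer from the oldest entry.

    Returns (names retired, pending exception or None). An exception is
    reported only when its instruction reaches the head of the buffer.
    """
    retired = []
    for entry in rob:
        if not entry["done"]:
            break  # an older instruction is unfinished, so nothing younger retires
        if entry.get("exception"):
            return retired, (entry["name"], entry["exception"])
        retired.append(entry["name"])
    return retired, None


rob = [
    {"name": "lwz", "done": True},
    {"name": "mullw", "done": False},                      # still executing
    {"name": "addo", "done": True, "exception": "overflow"},  # finished early, faulted
]
print(retire_in_order(rob))  # the exception is not yet visible
rob[1]["done"] = True
print(retire_in_order(rob))  # now it surfaces, after lwz and mullw
```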
Instruction Reflow
Unlike the 601 and 603, the 604 uses dynamic branch prediction to minimize delay when the normal sequential flow of instructions is changed by a branch instruction. Dynamic branch prediction is more complex than the static kind, but it adapts to the run-time behavior of a program and therefore generally yields better predictions.
Branch-prediction logic makes educated guesses about the outcome of a branch and begins fetching instructions from the predicted branch address before the condition the branch is based on is even tested. A misprediction, of course, results in a delay, but correctly predicted branches can result in zero-delay branching--the nirvana of every CPU designer.
Both the 601 and 603 rely on a "hint" bit in the coding of branch instructions to determine the direction of a branch. The 604 ignores this bit. Instead, it contains a 512-entry, direct-mapped BHT (branch history table) that maps four branch-prediction states: strong taken, weak taken, weak not-taken, and strong not-taken. The predicted state for a particular branch instruction is set and modified based on the history of the instruction.
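Four states of this kind are conventionally implemented as a two-bit saturating counter, and the sketch below assumes that scheme. The text does not spell out the 604's state transitions, index function, or reset state, so the word-address indexing and the weak not-taken starting point here are assumptions, as are the class and method names.

```python
BHT_ENTRIES = 512
STATES = ["strong not-taken", "weak not-taken", "weak taken", "strong taken"]


class BranchHistoryTable:
    """Direct-mapped table of two-bit saturating counters (assumed scheme)."""

    def __init__(self):
        self.counters = [1] * BHT_ENTRIES  # assumed reset state: weak not-taken

    def _index(self, address):
        # Assumed index: low bits of the word address (instructions are 4 bytes).
        return (address >> 2) % BHT_ENTRIES

    def predict(self, address):
        """True means the branch is predicted taken."""
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """Move one step toward the actual outcome, saturating at either end."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def state(self, address):
        return STATES[self.counters[self._index(address)]]


bht = BranchHistoryTable()
for outcome in (True, True, False):  # a loop branch: taken twice, then falls through
    bht.update(0x1000, outcome)
print(bht.state(0x1000), "-> predict taken:", bht.predict(0x1000))
```

The value of the "strong" states is visible here: a single not-taken outcome at the end of a loop weakens the prediction but does not flip it, so the branch is still predicted taken the next time the loop runs.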
The BHT feeds the BTAC (branch target address cache) with both the address of a branch instruction and the target address of the branch. The BTAC--a fully associative, 64-entry cache--stores both the address and the target of previously executed branch instructions. During the fetch stage, this cache is accessed by the fetch logic. If the current fetch address--the address used to get the next instruction from the cache--matches an address in the BTAC, then the branch target address associated with the fetch address is used instead of the fetch address to fetch instructions from the cache. Talk about cutting out the middle man.
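The BTAC lookup amounts to an associative search keyed on the fetch address. In the sketch below, a dictionary stands in for the 64 fully associative entries. The class and method names are ours, and the oldest-first replacement policy is an assumption, since the text does not say how the 604 chooses an entry to evict.

```python
from collections import OrderedDict

BTAC_ENTRIES = 64


class BranchTargetAddressCache:
    def __init__(self):
        self.entries = OrderedDict()  # branch address -> target address

    def record(self, branch_address, target_address):
        """Remember where a branch went."""
        if branch_address not in self.entries and len(self.entries) >= BTAC_ENTRIES:
            self.entries.popitem(last=False)  # evict the oldest entry (assumed policy)
        self.entries[branch_address] = target_address

    def fetch_from(self, fetch_address):
        """On a hit, fetch from the stored target instead of the fetch address."""
        return self.entries.get(fetch_address, fetch_address)


btac = BranchTargetAddressCache()
btac.record(0x100, 0x400)
print(hex(btac.fetch_from(0x100)), hex(btac.fetch_from(0x104)))
```

A hit redirects the fetch in the same stage that would otherwise have fetched the branch's sequential successors, which is where the zero-delay branch comes from.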
Instructions fetched and executed based on a branch prediction are considered speculative until the branch is resolved; that is, until it is known whether the branch prediction was accurate. No speculative instruction is permitted to update the architectural state. The 604 lets instructions execute speculatively through writeback, but only nonspeculative instructions can be retired by the completion unit.
Instruction Finale
The 604 lets instructions execute out of order and execute with up to two levels of speculation, but it doesn't let such instructions affect the architectural state of the processor. In other words, such instructions cannot affect any user-visible registers or memory locations. Registers and memory can be touched only by nonspeculative instructions in program order. This rule is enforced by the completion unit. (The exceptions to this rule are the Counter and Link registers used by branch instructions. These employ shadow registers to back out of mispredicted branches.)
The completion unit uses the reorder buffer to retire instructions. It retires instructions in program order, up to four instructions per cycle. It won't retire an instruction that is labeled speculative, nor will it retire one that executed out of order unless all previous instructions have been retired. The completion unit knows the order of instructions because this information is supplied to the reorder buffer when the instruction is dispatched.
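The retirement rules can be condensed into a loop over the head of the reorder buffer. This is a simplified sketch with hypothetical names (`retire_cycle`, `done`, `speculative`); it models only the three conditions given in the text: program order, no speculative instructions, and at most four per cycle.

```python
MAX_RETIRE = 4  # instructions retired per cycle, per the text


def retire_cycle(rob):
    """Remove and return the instructions retired this cycle.

    `rob` is a list of dicts in program order, oldest first.
    """
    retired = []
    while rob and len(retired) < MAX_RETIRE:
        head = rob[0]
        if not head["done"] or head["speculative"]:
            break  # everything younger must wait behind this instruction
        retired.append(rob.pop(0)["name"])
    return retired


rob = [
    {"name": "add", "done": True, "speculative": False},
    {"name": "lwz", "done": True, "speculative": True},   # branch not yet resolved
    {"name": "sub", "done": True, "speculative": False},
]
print(retire_cycle(rob))
```

Only the first instruction retires in this example. The finished instruction behind the speculative load has to wait, because retiring it would break program order.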
Likewise, the 604 has an internal mechanism to mark instructions executing speculatively and remove the marking when the speculative branch is resolved. Of course, if the branch is found to be mispredicted, the mechanism must be able to expunge the speculative instructions from the pipeline and the reorder buffer and to invalidate writes such instructions made to the rename buffers. Due to pending patent applications, the Somerset design team declined to give details of this internal tracking mechanism other than to characterize it as relatively straightforward.
In the future, you will see faster versions of the 604 as IBM and Motorola improve their processor technology. The 604 also gives strong indications of how the PowerPC 620--the workstation member of the PowerPC line due in the fall--will play out. The 620 will issue six instructions per cycle, at least two of which will undoubtedly be floating-point instructions. At present, however, the 604's integer performance is a perfect fit with the types of applications that dominate desktop computing.
604 Facts
-- 100-MHz processor clock
-- 1x, 2/3x, 1/2x, 1/3x bus clock
-- 32-bit address bus
-- 64-bit data bus
-- 3.3 V
-- Less than 10 W
-- 3.6 million transistors
-- 196-mm² die
-- 0.5-micron 4-metal layer
-- 304-pin CQFP
-- Six execution units
-- Split caches
    physically indexed
    physical tags
    MESI cache coherency
    128-entry two-way TLBs
-- Load/store queues
Illustration: The 604 Microarchitecture
Bob Ryan is a BYTE technical editor, and Tom Thompson is a BYTE senior technical editor at large. You can contact them on the Internet or BIX at b.ryan@bix.com or tom_thompson@bix.com, respectively.