Numerous function units and smart load/store processing make the PA-8000 the fastest RISC processor
Dick Pountain
Less than a year after its PA-7200 announcement, Hewlett-Packard again claims to have the world's fastest RISC architecture in its new PA-8000. Once again, the claim looks plausible. The PA-8000 design aims for "sustainable superscalar" operation by employing multiple function units and a radical out-of-order execution strategy that executes four instructions simultaneously most of the time. It's intended to handle clock speeds of up to 200 MHz, which, if achieved, suggests a throughput of 360 SPECint92 and 550 SPECfp92, ahead of both the PowerPC 620 and the Mips R10000. At the time of this writing, the PA-8000 has just been taped out for HP's 0.5-micron, 3.3-volt CMOS process, but there's no silicon yet.
The PA-8000's microarchitecture is wholly new, borrowing almost nothing from the PA-7200 implementation, except the blindingly fast (960 MBps) 64-bit Runway processor bus and the strategy of keeping the instruction and data caches off-chip. The new chip is 64-bit throughout, with 64-bit flat addressing, plus 64-bit floating-point and integer arithmetic. Nevertheless, for compatibility it executes all old 32-bit PA RISC code.
Out-of-Order Execution
The first step towards sustaining the execution rate of four instructions per cycle is to provide ample hardware resources. To this end, the PA-8000 supplies 10 function units: two integer ALUs, two IUs (integer shift/merge units), two floating-point MAC (multiply accumulate) units, two floating-point divide/square-root units, and two load/store units. See the figure "The PA-8000's Microarchitecture" for details. The MAC units have a three-cycle latency and are fully pipelined to deliver up to four FLOPs per cycle. The divide units have 17-cycle latency and are not pipelined, but they run concurrently with the MACs.
The real trick is to keep most of these units busy, and it's here that the PA-8000 gets radical. It employs hardware scheduling to extract the maximum parallelism from the instruction stream. Previous two-way superscalar HP designs like the 7200 left scheduling issues to the compiler, but with a four-way superscalar design, this solution is no longer sufficient. That's because four sequential instructions are likely to contain data dependencies that can't be resolved at compile time. Accordingly, the PA-8000 has a deep IRB (instruction reorder buffer), which examines the 56 most current instructions to find four that can execute simultaneously.
The PA-8000's instruction-fetch unit fetches blocks of four quadword-aligned instructions per cycle (exactly matching the maximum execution rate) from the external I-cache. The fetch unit passes them to a sort unit that in turn feeds them into the IRB. The IRB consists of two 28-slot buffers: an ALU buffer that holds instructions destined for the integer units and FPUs, and a memory buffer that holds load/store instructions. Certain instruction types, such as load-and-modifies and branch instructions, go into both buffers.
Once an instruction arrives in an IRB slot, the hardware monitors the instruction stream to the execution units to see whether any of them supplies operands for the stored instruction. This instruction can request to be dispatched only after the last instruction for which it has dependencies has been dispatched. Each of the IRB buffers dispatches two instructions per cycle, and the paired function units are coupled to odd- and even-numbered slots (for example, all even slots use ALU0, and the odd slots use ALU1). In all cases, it's the oldest instruction in the buffer that gets dispatched. When an instruction has been successfully executed--or its trap status becomes known--it's retired from the IRB in program order. Up to four instructions per cycle can be retired.
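The dispatch rule can be made concrete with a toy model. The Python sketch below is only an illustration, not HP's logic: the names Entry and pick_dispatch are invented here, and it assumes that each buffer picks the oldest ready instruction among its even slots and the oldest ready instruction among its odd slots, one for each of the paired function units.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    slot: int     # position in the 28-slot buffer (hypothetical model)
    age: int      # program-order sequence number; smaller means older
    ready: bool   # True once every instruction it depends on has dispatched

def pick_dispatch(buffer):
    """Choose up to two entries for this cycle: the oldest ready entry in
    an even slot (say, for ALU0) and the oldest ready one in an odd slot
    (for ALU1). This pairing is an assumption of the sketch."""
    chosen = []
    for parity in (0, 1):
        candidates = [e for e in buffer if e.ready and e.slot % 2 == parity]
        if candidates:
            chosen.append(min(candidates, key=lambda e: e.age))
    return chosen

buffer = [
    Entry(slot=0, age=10, ready=False),  # oldest, but still waiting on an operand
    Entry(slot=1, age=11, ready=True),
    Entry(slot=2, age=12, ready=True),
    Entry(slot=3, age=13, ready=True),
    Entry(slot=4, age=14, ready=True),
]
picked = pick_dispatch(buffer)
print([(e.slot, e.age) for e in picked])
```

Here the instruction in slot 0 is the oldest but is not ready, so the even-slot unit takes the instruction in slot 2 while the odd-slot unit takes the one in slot 1. The stalled instruction simply waits its turn without blocking the younger ones.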
The PA-8000 employs register renaming via 56 rename registers (one for each IRB slot) and 64 architectural registers (32 integer and 32 floating-point). This enables the PA-8000 to execute (but not retire) many instructions speculatively, without corrupting the processor state if the speculation proves false and all of them have to be scrapped. This is used to hide branch delays and other latencies (see "The PA-8000's Microarchitecture"). Exception traps also get signaled at retire time, which means that the PA-8000 can maintain a precise exception model despite its out-of-order execution.
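The idea of executing speculatively but committing only at retire time can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions: the class RenameFile and its methods are hypothetical names, results sit in a list standing in for the rename registers, and operand forwarding to dependent instructions is left out.

```python
class RenameFile:
    """Toy model: results wait in rename storage until they retire."""

    def __init__(self):
        self.arch = {}      # architectural register name -> committed value
        self.pending = []   # (age, dest_reg, value) held in rename registers

    def execute(self, age, dest, value):
        # A speculatively executed result; architectural state is untouched.
        self.pending.append((age, dest, value))

    def retire_up_to(self, age):
        # Commit results in program order, up to and including `age`.
        keep = []
        for a, dest, value in sorted(self.pending):
            if a <= age:
                self.arch[dest] = value
            else:
                keep.append((a, dest, value))
        self.pending = keep

    def flush_from(self, age):
        # Speculation failed: discard this instruction and all younger ones.
        self.pending = [p for p in self.pending if p[0] < age]

rf = RenameFile()
rf.execute(1, "r1", 10)
rf.execute(2, "r2", 20)   # speculative, past a predicted branch
rf.execute(3, "r1", 30)   # speculative
rf.retire_up_to(1)        # instruction 1 is known good
rf.flush_from(2)          # the branch was mispredicted
print(rf.arch, rf.pending)
```

After the flush, only the result of instruction 1 is visible in the architectural registers. The two speculative results vanish without ever having touched the processor state, which is the property that makes precise exceptions possible.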
Loads and Stores
The PA-8000 tries hard to eliminate the performance penalties that load-store dependencies cause. The commercial data processing applications that HP targets with its PA chips use large data sets and require correspondingly large data caches (up to 4 MB on the PA-8000) to achieve good throughput. With such a large external cache, loading data from the cache requires several cycles. This means that an instruction that needs the result of that load may have to wait, which on an in-order machine would stall the pipeline. Out-of-order and speculative execution can hide these delays.
When a load or store instruction in an IRB slot has received all its operands, it requests to be dispatched, just like an ALU instruction, but the destination is one of the address adders, to calculate its effective address. The calculated address gets stored into a third 28-slot buffer, called the ARB (address reorder buffer), whose slots are associated one-to-one with the slots of the IRB's memory buffer (see the figure). The effective address also goes to the TLB (translation look-aside buffer), which returns a physical address that's placed into the same ARB slot.
With its address in the ARB, the load/store instruction starts arbitrating for access to one of the banks of synchronous SRAM (static RAM) that make up the dual-ported data cache. The instruction tries again each successive cycle until it wins access. (Arbitration is based on the age of the original load/store instruction, not the time its address has been in the ARB, with priority to the oldest.) If access is granted on the first attempt, load data arrives on chip three cycles after the dispatch of the address calculation.
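Age-based arbitration is easy to picture as a small Python function. The sketch below is an assumption-laden illustration: the function name arbitrate is invented, requests are plain (age, bank) pairs, and it assumes each bank grants exactly one access per cycle while the losers retry on the next cycle.

```python
def arbitrate(pending):
    """One arbitration cycle over a list of (age, bank) requests.

    Grants at most one request per bank (an assumption of this sketch),
    always to the oldest instruction, i.e., the smallest age. Returns
    (winners, still_waiting)."""
    granted = {}                         # bank -> age of the winning request
    for age, bank in sorted(pending):    # oldest instruction first
        if bank not in granted:
            granted[bank] = age
    winners = sorted((age, bank) for bank, age in granted.items())
    waiting = [req for req in pending if req not in winners]
    return winners, waiting

requests = [(7, 0), (3, 0), (5, 1), (9, 1)]
first, leftover = arbitrate(requests)
second, leftover2 = arbitrate(leftover)
print(first, second, leftover2)
```

In the first cycle the instructions of age 3 and 5 win banks 0 and 1. The two younger requests try again and win in the second cycle, regardless of how long their addresses have been sitting in the ARB.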
Other operand-dependent instructions in the IRB are kept informed of the status of loads in progress, and they won't dispatch themselves until their load wins cache access. This leaves the function units free to run any younger instructions whose operands are ready, and so the load delay can usually be concealed.
The ARB hardware also checks for store-to-load dependencies. Whenever a store instruction has its effective address calculated, it's compared to the addresses of any younger load instructions that have completed their cache accesses (by executing out-of-order). If it's the same address, then that load and all younger instructions are flushed from the IRB and reexecuted. Similarly, whenever a load instruction calculates its address, the addresses of all older stores in the IRB are compared. In the event of a match, the load waits until the store data becomes available. These mechanisms ensure that out-of-order execution can't cause stale data to be read.
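Both checks amount to address comparisons filtered by age. The following Python sketch shows the two rules under simplifying assumptions: MemOp, on_store_address, and load_must_wait are hypothetical names, addresses are compared for exact equality, and the ARB is modeled as a plain list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOp:
    kind: str                   # "load" or "store"
    age: int                    # program order; smaller means older
    addr: Optional[int] = None  # None until the address adder has run
    done: bool = False          # for loads: cache access already completed

def on_store_address(arb, store):
    """A store's address has just been calculated. Return the age from which
    to flush and reexecute, or None if no younger completed load matches."""
    hits = [op.age for op in arb
            if op.kind == "load" and op.age > store.age
            and op.done and op.addr == store.addr]
    return min(hits) if hits else None

def load_must_wait(arb, load):
    """A load's address has just been calculated. True if an older store to
    the same address is still in flight."""
    return any(op.kind == "store" and op.age < load.age
               and op.addr == load.addr for op in arb)

store = MemOp("store", age=1)
arb = [store,
       MemOp("load", age=2, addr=0x100, done=True),   # ran ahead of the store
       MemOp("load", age=3, addr=0x200, done=True)]
store.addr = 0x100                                    # address now known
flush_age = on_store_address(arb, store)
waits = load_must_wait(arb, MemOp("load", age=4, addr=0x100))
print(flush_age, waits)
```

The load of age 2 read address 0x100 before the older store's address was known, so everything from age 2 onward is flushed. A later load to the same address sees the matching store and waits instead of speculating.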
When a store instruction retires, its value gets copied from a register into the Store Queue, a FIFO (first in/first out) write buffer with room for 11 doublewords for each cache bank. This queue's contents get written out to the data cache during idle cycles or when other stores are performing tag lookups. Using these otherwise wasted cycles reduces the likelihood that a load will be held up due to cache contention with a store.
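A FIFO write buffer of this kind can be modeled with a deque. The Python sketch below covers a single cache bank and is illustrative only: StoreQueue, retire_store, and drain_one are invented names, the cache is a dictionary, and the behavior when the queue is full (the retire is refused) is an assumption, since the hardware's response isn't described here.

```python
from collections import deque

class StoreQueue:
    """Toy FIFO write buffer for one cache bank."""
    CAPACITY = 11   # doublewords per bank, as stated in the text

    def __init__(self):
        self.fifo = deque()

    def retire_store(self, addr, value):
        # Called when a store retires. Refusing when full is an assumption.
        if len(self.fifo) >= self.CAPACITY:
            return False
        self.fifo.append((addr, value))
        return True

    def drain_one(self, cache):
        # Called on an otherwise idle cache cycle: write the oldest entry.
        if self.fifo:
            addr, value = self.fifo.popleft()
            cache[addr] = value

sq = StoreQueue()
cache = {}
accepted = [sq.retire_store(addr, addr * 2) for addr in range(12)]
sq.drain_one(cache)
print(accepted.count(True), cache, len(sq.fifo))
```

Eleven stores fit and the twelfth is refused. One idle cycle then writes the oldest entry to the cache, freeing a slot without competing with a load for the bank.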
Predictions and Speculations
Like most of the current generation processor architectures, including Intel's P6, the PA-8000 uses target address caching, branch prediction, and speculative execution to minimize the pipeline breaks caused by changes of control flow. Branch prediction is implemented via both static and dynamic schemes.
The PA-8000 indulges in several forms of speculative execution. It executes instructions from the predicted arm of every conditional branch but doesn't retire them until the branch condition is resolved. It executes younger instructions before it knows the exception status of older ones. And it executes younger load instructions while an older store is still pending. In the event of failure (i.e., the branch was predicted wrongly, the older instruction trapped, or the store was to the load address), all younger instructions in the IRB must be discarded. These cases are rare, and most often "playing the hunch" pays off in time savings.
The Last PA RISC?
In view of the new partnership between HP and Intel, it's likely that the PA-8000 will be the last PA architecture from HP. There already seems to be some convergence between Intel's P6 and HP's PA-8000, particularly in the area of the out-of-order execution hardware, and it becomes less difficult to imagine a hybrid between the two architectures.
It's interesting to note that both Intel and HP have made a decisive move toward intelligent hardware scheduling and that both are relying less on smart compiler technology than are many other RISC vendors. This is the exact reverse of the trend toward VLIW (very long instruction word)--which relies heavily on smart compilers--that many market watchers predicted when the companies first announced their partnership. The likely reason is that both firms have large installed bases of legacy software. Hardware scheduling, as implemented by the PA-8000's IRB, boosts old code performance more simply and more effectively than compiler-based tricks can, as witnessed by the generally disappointing performance of recompiled Pentium programs.

The large instruction reorder buffer (center) tracks instruction dependencies and only dispatches those with resolved operands to the function units.

The ARB (address reorder buffer), whose slots are associated one-to-one with the slots of the IRB's memory buffer.
Dick Pountain is a BYTE contributing editor based in London, U.K. You can reach him on the Internet or BIX at dickp@bix.com.