The Alpha 21164 puts DEC firmly at the top of the performance pyramid
Bob Ryan
When talking about the new Alpha 21164 from DEC, it's impossible to avoid using superlatives. The 21164 is the fastest microprocessor in the world. It contains the most transistors and, coincidentally, also has the largest-capacity on-chip caches. It's the first general-purpose MPU (microprocessor unit) with an on-board second-level cache. Finally, it has the fastest clock of all commercial microprocessors.
At 300 SPECint92 and 510 SPECfp92, the 21164 far outclasses current-generation microprocessors such as the HP-PA 7200, the IBM Power2, and DEC's own Alpha 21064A, all of which deliver in the neighborhood of 175 SPECint92. The 21164 delivers three times the integer performance of the 100-MHz Pentium and 66 percent more floating-point power than the Mips R8000/8010, a processor specifically designed for floating-point-intensive operations. DEC likes to point out that the 21164 can perform 600 transactions per second, compared to 241 for a dual 66-MHz Pentium-based Compaq ProLiant 2000.
In short, the 21164 is a "take no prisoners" microprocessor. It's the first to execute over 1 billion instructions per second (1.2 BIPS, to be as exact as you can be with a measure as elusive as instructions per second).
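The 1.2-BIPS figure is consistent with a simple peak calculation: issue width times clock rate. The sketch below assumes the number is the theoretical peak of a four-issue machine at the announced clock speeds; sustained rates on real code will be lower. The function name is illustrative only.

```python
# Peak instruction rate = instructions issued per cycle x clock frequency.
# Assumption: the quoted BIPS figure is this theoretical peak, not a measured rate.
ISSUE_WIDTH = 4            # maximum instructions issued per clock cycle
CLOCKS_MHZ = (266, 300)    # the two announced clock speeds


def peak_bips(issue_width, clock_mhz):
    """Return the theoretical peak in billions of instructions per second."""
    return issue_width * clock_mhz / 1000


for mhz in CLOCKS_MHZ:
    print(f"{mhz} MHz: {peak_bips(ISSUE_WIDTH, mhz):.3f} BIPS peak")
```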
By the Numbers
The 21164 has 9.3 million transistors, most of which are for cache memory. Like other Alphas, it has an 8-KB direct-mapped instruction cache and an 8-KB direct-mapped data cache. What makes the 21164 different is its 96-KB, three-way set-associative, unified L2 (level 2) cache. Putting the L2 cache on-chip greatly reduces the average latency of a memory access that misses the primary caches.
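To see what "three-way set-associative" means for a 96-KB cache, it helps to work out how many sets an address can map to. The sketch below is only an illustration: the block sizes (32 bytes for the primary caches, 64 bytes for the L2) are assumptions, not figures given here, and the function names are hypothetical.

```python
# Cache geometry sketch. Block sizes are ASSUMED for illustration.
L1_BLOCK_BYTES = 32   # assumption
L2_BLOCK_BYTES = 64   # assumption


def cache_sets(capacity_bytes, ways, block_bytes):
    """Number of sets = capacity / (associativity x block size)."""
    return capacity_bytes // (ways * block_bytes)


def set_index(address, capacity_bytes, ways, block_bytes):
    """Set an address maps to; the block may sit in any of the set's ways."""
    return (address // block_bytes) % cache_sets(capacity_bytes, ways, block_bytes)


# Direct-mapped is the one-way case: each address has exactly one home.
print("L1 sets:", cache_sets(8 * 1024, 1, L1_BLOCK_BYTES))
# Three ways of 32 KB each: each address has three candidate locations.
print("L2 sets:", cache_sets(96 * 1024, 3, L2_BLOCK_BYTES))
print("L2 set for 0x12345:", set_index(0x12345, 96 * 1024, 3, L2_BLOCK_BYTES))
```

Under these assumptions, each L2 way holds 32 KB, so two addresses that would collide in a direct-mapped cache of that size can still coexist in different ways of the same set.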
The 21164 is a refinement of DEC's RISC philosophy. More than any other company, DEC keeps its instructions and processing pipelines simple. This keeps the latency of any stage in the pipeline low and lets DEC raise the clock speed to boost performance. The 21164 is available at two speeds: 266 and 300 MHz. The external bus can run at any integer divisor of the processor clock from 1 to 15. The processor also provides support for an L3 cache.
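The divisor rule translates directly into a table of possible external bus speeds. This is a minimal sketch that takes the 1-to-15 range at face value; which of these ratios a given system design would actually use is not covered here.

```python
def bus_clock_options(cpu_mhz, min_div=1, max_div=15):
    """Map each integer divisor to the resulting external bus clock in MHz."""
    return {d: cpu_mhz / d for d in range(min_div, max_div + 1)}


for cpu_mhz in (266, 300):
    print(f"CPU clock {cpu_mhz} MHz")
    for divisor, bus_mhz in bus_clock_options(cpu_mhz).items():
        print(f"  divide by {divisor:2d}: {bus_mhz:6.1f} MHz bus")
```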
How It Works
The 21164 contains four execution units and can issue up to four instructions--two integer and two floating-point--per clock cycle. The two integer units are not identical, although each has an ALU and both perform loads. One unit--E0 in DEC nomenclature--has the necessary circuitry to perform stores, shifts, and integer multiplies. The other unit, E1, handles branch processing in addition to common integer instructions.
The FPUs also differ from one another. The floating-point add pipeline, FA, handles addition, division, and floating-point conditional branches; FM, the multiplication pipeline, does the multiplying. The 21164 contains both an integer-register and a floating-point-register file. To handle multiple, simultaneous accesses from the execution units, the integer-register file has four read ports and two write ports, while the floating-point-register file has five read ports and four write ports.
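Because the four execution units are not interchangeable, whether a group of instructions can issue together depends on the mix. The sketch below encodes the unit capabilities described above and finds the longest in-order prefix of a group that can be assigned to distinct units. The operation names and the search strategy are illustrative assumptions; the real slotting logic has rules this toy model ignores.

```python
# Which operation classes each unit accepts, per the description above.
# Operation names ("alu", "imul", ...) are hypothetical labels.
UNIT_CAPABILITIES = {
    "E0": {"alu", "load", "store", "shift", "imul"},
    "E1": {"alu", "load", "branch"},
    "FA": {"fadd", "fdiv", "fbranch"},
    "FM": {"fmul"},
}


def assign(ops, free_units):
    """Return a list of units, one per op, all distinct; None if impossible."""
    if not ops:
        return []
    first, rest = ops[0], ops[1:]
    for unit in free_units:
        if first in UNIT_CAPABILITIES[unit]:
            remaining = [u for u in free_units if u != unit]
            tail = assign(rest, remaining)
            if tail is not None:
                return [unit] + tail
    return None


def slot(ops):
    """Longest in-order prefix of ops that fits on distinct units in one cycle."""
    units = list(UNIT_CAPABILITIES)
    for n in range(min(len(ops), len(units)), 0, -1):
        units_for_prefix = assign(ops[:n], units)
        if units_for_prefix is not None:
            return list(zip(ops[:n], units_for_prefix))
    return []


print(slot(["alu", "store", "fadd", "fmul"]))   # a mix that fills all four units
print(slot(["store", "shift", "alu", "fmul"]))  # store and shift both need E0
```

In the second group, only the store can go in the first cycle, because the shift that follows it competes for the same unit and instructions are not reordered.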
Like earlier Alphas, the 21164 features fairly deep pipelines. The first four stages are common to all instructions and occur in the instruction unit. The integer units add three stages to instruction processing, for a total of seven stages; the floating-point units require five stages to perform their functions.
The instruction unit consists of the following stages: instruction prefetch; buffer and decode, including branch prediction; slotting; and instruction issue. In the prefetch stage, the instruction unit retrieves four instructions at a time from the instruction cache. It next checks for branches and predicts them based on 2 history bits. The third stage of instruction processing slots four instructions for issuing. If these four instructions can't issue to four different execution units, the second stage stalls until all four of the current instructions are issued. The instruction unit's final stage checks operand registers for dependencies and reads the integer-register file. Again, all preceding stages will stall if any instruction in this stage can't be issued. All source operands must be available by the end of this stage for the instruction to be able to move to execution.
The four stages in the instruction unit are static; instructions can remain stalled there for as long as necessary to clear any functional or data dependencies. But the execution units are dynamic. Once issued to an execution unit, only those instructions with multicycle latencies spend more than one cycle in each stage.
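The effect of these static issue rules can be illustrated with a toy scheduler. The sketch below is a simplified model under stated assumptions: it tracks only register dependencies (not execution-unit conflicts), instructions issue strictly in order, and a new group of four cannot start issuing until every instruction in the current group has gone. The one-cycle integer latency is an assumption; the register names and tuple format are hypothetical.

```python
# Result latencies in cycles. "add" = 1 is an ASSUMPTION for illustration.
LATENCY = {"add": 1, "ld": 2, "fmul": 4}


def issue_cycles(groups, latency=LATENCY):
    """Return (op, dst, issue_cycle) for each instruction.

    groups: list of groups; each group is a list of (op, dst, srcs).
    """
    ready_at = {}     # register -> first cycle its value can be consumed
    cycle = 0
    schedule = []
    for group in groups:
        for op, dst, srcs in group:
            # In-order issue: never earlier than the previous instruction,
            # and not before every source operand is available.
            cycle = max([cycle] + [ready_at.get(r, 0) for r in srcs])
            schedule.append((op, dst, cycle))
            ready_at[dst] = cycle + latency[op]
        # The next group waits until the whole current group has issued.
        cycle += 1
    return schedule


program = [
    [("ld", "r1", []), ("add", "r2", ["r1"]), ("add", "r3", []), ("add", "r4", ["r2"])],
    [("add", "r5", ["r3"])],
]
for op, dst, cyc in issue_cycles(program):
    print(f"cycle {cyc}: {op} -> {dst}")
```

Note how the independent add to r3 is held back behind the add that waits on the load: nothing is reordered, so one stalled instruction delays everything after it.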
Execution Time
Because it never issues an instruction before all of its dependencies are satisfied and never issues instructions out of order, the 21164 has a very simple back end. Processors such as the PowerPC 604 can issue instructions out of order and use rename buffers and registers to avoid data dependencies; the 21164's execution units, by contrast, update the architectural registers directly.
The 21164 doesn't need a complicated mechanism to track instructions or a completion unit to ensure that architectural registers are updated in the proper order. Its direct approach to retiring instructions is in tune with the Alpha philosophy of pushing clock speeds to increase performance.
Waiting for instructions to proceed to the writeback stage before making their results available to subsequent instructions can introduce bubbles into the execution pipelines, especially considering the strict rules about issuing instructions only when all operands are available. To avoid such bottlenecks, the 21164 comes with bypass routes that make operands available before the writeback stage occurs. These bypasses are analogous to--though more extensive than--the feed-forwarding techniques used in other processors, and they are important to Alpha operation.
With its faster clock, larger number of execution units, and greater instruction-issue rate, the 21164 has a lot going for it compared to the 21064 and 21064A. DEC didn't stop there, however; it also improved the performance of some key operations. For example, the 21164 reduces the latency of floating-point operations from six cycles to four, and L1 data-cache accesses have been cut from three cycles to two.
Such cycle counts may still seem high compared to those of other processors--many take just one cycle to access the data cache, for example--but remember that the 21164's clock ticks much faster. Two cycles on the 21164 take less time than one cycle on the 100-MHz PowerPC 604, which means that cache lookup is actually faster on the 21164. Of course, because the PowerPC 604 has larger, more complex caches, it has a higher hit rate. Such are the trade-offs that microprocessor designers face.
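The comparison is easy to check by converting cycle counts to wall-clock time. This short sketch uses only the clock speeds and cycle counts quoted above; the function name is illustrative.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count at a given clock rate to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


cases = [
    ("21164 at 300 MHz, 2 cycles", 2, 300),
    ("21164 at 266 MHz, 2 cycles", 2, 266),
    ("PowerPC 604 at 100 MHz, 1 cycle", 1, 100),
]
for label, cycles, mhz in cases:
    print(f"{label}: {cycles_to_ns(cycles, mhz):.2f} ns")
```

At either clock speed, the 21164's two-cycle access finishes in under 8 ns, against 10 ns for a single cycle at 100 MHz.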
To Market
The 21164 comes in a 499-pin ceramic PGA (pin-grid array) with an integrated slug for mounting a heat sink. It's built with the same 0.5-micron process (for a 0.35-micron effective line length) used for the 21064A. Samples will ship in October, with the 266-MHz version available in at least limited volumes in January. The 300-MHz version will be available in volume in March.
DEC believes it can meet this aggressive schedule because the 21164 is being produced on a tried-and-true process. DEC will also have a core logic/PCI (Peripheral Component Interconnect) chip set available at the same time as the 266-MHz version of the 21164, and an evaluation board will be available in December.
The 266-MHz version of the 21164 will sell for $1865 each in lots of 5000, while the 300-MHz version will go for $2669 each, about what you'd currently pay for three 100-MHz Pentiums. This pricing reflects DEC's strategy to offer single-chip performance that no other vendor can.
While the 21164's performance advantage will shrink soon with expected announcements about new UltraSparc, Mips, and PowerPC processors, it's highly unlikely that any of these will best 300 SPECint92. The Alpha's performance lead seems secure for a long time. Also, a move to DEC's 0.35-micron process, which should be on-line sometime next year, should provide the 21164 with a nice midlife die shrink, which will certainly make it less expensive to produce and may lead to increased performance.
While the 21164 will undoubtedly appear in DEC systems that run Unix and VMS, the company is concentrating its merchant chip efforts on Windows NT. The Alpha architecture leads Mips in the number of supported NT applications, and it enjoys an 18-month to two-year advantage over NT on the PowerPC. If a high-end desktop-and-server market for NT does develop, then DEC's future will be brighter than its immediate past.
Illustration: 21164 Microarchitecture
With the Alpha 21164, DEC keeps things clean and simple, relying on a fast clock rather than more complex instructions (and instruction processing) to keep performance high. The most striking aspect of the 21164 is its on-board L2 cache.
Illustration: Alpha Pipelines
The 21164's instruction unit enforces all issue and execution rules. Once an instruction moves to an execution unit, it's ready to fly through the pipeline.
Bob Ryan is a BYTE senior technical editor. You can reach him on the Internet or BIX at b.ryan@bix.com.