Alpha Rides High


October 1994 / Core Technologies / Alpha Rides High

The Alpha 21164 puts DEC firmly at the top of the performance pyramid

Bob Ryan

When talking about the new Alpha 21164 from DEC, it's impossible to avoid using superlatives. The 21164 is the fastest microprocessor in the world. It contains the most transistors and, coincidentally, also has the largest-capacity on-chip caches. It's the first general-purpose MPU (microprocessor unit) with an on-board second-level cache. Finally, it has the fastest clock of all commercial microprocessors.

At 300 SPECint92 and 510 SPECfp92, the 21164 far outclasses current-generation microprocessors such as the HP-PA 7200, the IBM Power2, and DEC's own Alpha 21064A, all of which deliver in the neighborhood of 175 SPECint92. The 21164 delivers three times the integer performance of the 100-MHz Pentium and 66 percent more floating-point power than the Mips R8000/8010, a processor specifically designed for floating-point-intensive operations. DEC likes to point out that the 21164 can perform 600 transactions per second, compared to 241 for a dual 66-MHz Pentium-based Compaq ProLiant 2000.

In short, the 21164 is a "take no prisoners" microprocessor. It's the first to execute over 1 billion instructions per second (actually 1.2 BIPS, to be as exact as you can be with such an elusive measure as instructions per second).
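The 1.2-BIPS figure is consistent with a simple peak-rate calculation: the four-instruction issue width described later in this article multiplied by the 300-MHz clock. The sketch below shows only that arithmetic; the function name is illustrative, and the result is a theoretical ceiling rather than a measured throughput.

```python
# Peak issue rate = instructions issued per cycle x clock frequency.
ISSUE_WIDTH = 4  # up to four instructions per clock cycle (from the article)


def peak_bips(clock_mhz, issue_width=ISSUE_WIDTH):
    """Return the theoretical peak rate in billions of instructions per second."""
    return issue_width * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    for mhz in (266, 300):
        print(f"{mhz} MHz: {peak_bips(mhz):.3f} BIPS peak")
```

Real code rarely sustains four instructions every cycle, which is one reason instructions per second is such an elusive measure.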

By the Numbers

The 21164 has 9.3 million transistors, most of which are for cache memory. Like other Alphas, it has an 8-KB direct-mapped instruction cache and an 8-KB direct-mapped data cache. What makes the 21164 different is its 96-KB, three-way set-associative, unified L2 (level 2) cache. Putting the L2 cache on-chip greatly reduces the average latency of a memory access that misses the primary caches.
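To make the cache organization concrete, here is a rough sketch of how a cache's size and associativity determine its number of sets and how an address splits into tag, index, and offset. The 32-byte line size is an assumption made for illustration (the article does not give one), and the function names are hypothetical.

```python
LINE_BYTES = 32  # assumed line size for illustration; not stated in the article


def cache_geometry(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return (number of sets, offset bits, index bits) for a cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1         # log2 of a power of two
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    # 8-KB direct-mapped primary cache (one way per set).
    print("L1:", cache_geometry(8 * 1024, ways=1))
    # 96-KB three-way set-associative L2 cache.
    print("L2:", cache_geometry(96 * 1024, ways=3))
```

Under that assumed line size, the three-way organization is what lets a 96-KB cache have a power-of-two number of sets: each of the three ways holds 32 KB.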

The 21164 is a refinement of DEC's RISC philosophy. More than any other company, DEC keeps its instructions and processing pipelines simple. This keeps the latency of any stage in the pipeline low and lets DEC raise the clock speed to boost performance. The 21164 comes in two speeds: 266 and 300 MHz. The external bus can run at any integer divisor of the processor clock from 1 to 15. The processor also provides support for an L3 cache.
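As a quick illustration of the bus-clock rule, the sketch below lists the external bus speeds that result from dividing the processor clock by an integer. It simply applies the 1-to-15 range given above; the function name is hypothetical.

```python
def bus_clock_mhz(cpu_mhz, divisor):
    """Return the external bus clock for an integer divisor of the CPU clock."""
    if not 1 <= divisor <= 15:
        raise ValueError("divisor must be an integer from 1 to 15")
    return cpu_mhz / divisor


if __name__ == "__main__":
    for divisor in (1, 3, 4, 15):
        print(f"300 MHz / {divisor} = {bus_clock_mhz(300, divisor):.1f} MHz")
```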

How It Works

The 21164 contains four execution units and can issue up to four instructions--two integer and two floating-point--per clock cycle. The two integer units are not identical, although each has an ALU and both perform loads. One unit--E0 in DEC nomenclature--has the necessary circuitry to perform stores, shifts, and integer multiplies. The other unit, E1, handles branch processing in addition to common integer instructions.
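The division of labor among the four units can be sketched as a small lookup table plus a check for whether a group of instructions can all issue in the same cycle. This is a deliberate simplification: the instruction-class names (such as "ialu" for common integer operations) are invented for illustration, the floating-point assignments follow the description of the FA and FM pipelines in this article, and the real slotting rules are more detailed than a simple one-instruction-per-unit match.

```python
from itertools import permutations

# Which (hypothetical) instruction classes each unit accepts, per the article.
UNITS = {
    "E0": {"ialu", "load", "store", "shift", "imul"},
    "E1": {"ialu", "load", "branch"},
    "FA": {"fadd", "fdiv", "fbranch"},
    "FM": {"fmul"},
}


def assign_units(group):
    """Return one unit per instruction, or None if the group cannot
    issue together (two instructions would need the same unit)."""
    for units in permutations(UNITS, len(group)):
        if all(op in UNITS[unit] for op, unit in zip(group, units)):
            return list(units)
    return None


if __name__ == "__main__":
    print(assign_units(["load", "branch", "fadd", "fmul"]))  # all four issue
    print(assign_units(["store", "shift"]))                  # both need E0
```

A group such as two stores cannot issue together in this model, because only E0 performs stores.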

The FPUs also differ from one another. The floating-point add pipeline, FA, handles addition, division, and floating-point conditional branches; FM, the multiplication pipeline, does the multiplying. The 21164 contains both an integer-register and a floating-point-register file. To handle multiple, simultaneous accesses from the execution units, the integer-register file has four read ports and two write ports, while the floating-point-register file has five and four ports, respectively.

Like earlier Alphas, the 21164 features fairly deep pipelines. The first four stages are common to all instructions and occur in the instruction unit. The integer units add three stages to instruction processing, for a total of seven stages; the floating-point units require five stages to perform their functions.

The instruction unit consists of the following stages: instruction prefetch, buffer, and decode--including branch prediction, slotting, and instruction issue. In the prefetch stage, the instruction unit retrieves four instructions at a time from the instruction cache. It next checks for branches and predicts them based on 2 history bits. The third stage of instruction processing slots four instructions for issuing. If these four instructions can't issue to four different execution units, the second stage stalls until all four of the current instructions are issued. The instruction unit's final stage checks operand registers for dependencies and reads the integer-register file. Again, all preceding stages will stall if any instruction in this stage can't be issued. All source operands must be available by the end of this stage for the instruction to be able to move to execution.
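The article says only that branches are predicted from 2 history bits. One common way to use two bits is a saturating counter, sketched below. Treat this as an illustration of the general technique, not as a description of DEC's exact circuit; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor=None):
    """Run a sequence of branch outcomes (True = taken) through the predictor."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken nine times, then falls through
    print(count_mispredictions(loop_branch))
    print(count_mispredictions(loop_branch * 2))
```

The appeal of two bits over one is that a single surprise, such as a loop exit, does not flip the prediction for the next time the loop runs.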

The four stages in the instruction unit are static; instructions can remain stalled there for as long as necessary to clear any functional or data dependencies. But the execution units are dynamic. Once issued to an execution unit, only those instructions with multicycle latencies spend more than one cycle in each stage.

Execution Time

Because it neither issues an instruction before all of its dependencies are satisfied nor issues instructions out of order, the 21164 has a very simple back end. Processors such as the PowerPC 604 can issue instructions out of order and use rename buffers and registers to avoid data dependencies; the 21164's execution units, by contrast, update the architectural registers directly.

The 21164 doesn't need a complicated mechanism to track instructions or a completion unit to ensure that architectural registers are updated in the proper order. Its direct approach to retiring instructions is in tune with the Alpha philosophy of pushing clock speeds to increase performance.

Waiting for instructions to proceed to the writeback stage before making their results available to subsequent instructions can introduce bubbles into the execution pipelines, especially considering the strict rules about issuing instructions only when all operands are available. To avoid such bottlenecks, the 21164 comes with bypass routes that make operands available before the writeback stage occurs. These bypasses are analogous to--though more extensive than--the feed-forwarding techniques used in other processors, and they are important to Alpha operation.
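The effect of bypassing can be seen in a toy in-order issue model. The sketch below assumes one instruction issues per cycle, a one-cycle execution latency, and a hypothetical two-cycle gap between execution and writeback; none of these numbers come from the article, and the function name is invented for illustration.

```python
def issue_cycles(deps, latency=1, writeback_delay=2, bypass=True):
    """Return the issue cycle of each instruction in an in-order pipeline.

    deps[i] is the index of the earlier instruction that instruction i
    depends on, or None. With bypassing, a result is usable `latency`
    cycles after issue; without it, the consumer also waits out
    `writeback_delay` cycles until the register file is updated.
    """
    ready_at = []  # cycle at which each instruction's result can be read
    issued = []
    cycle = 0
    for dep in deps:
        earliest = cycle
        if dep is not None:
            earliest = max(earliest, ready_at[dep])
        issued.append(earliest)
        available = earliest + latency
        if not bypass:
            available += writeback_delay
        ready_at.append(available)
        cycle = earliest + 1  # in order: the next instruction issues later
    return issued


if __name__ == "__main__":
    chain = [None, 0, 1]  # each instruction depends on the one before it
    print("with bypass:   ", issue_cycles(chain, bypass=True))
    print("without bypass:", issue_cycles(chain, bypass=False))
```

In this model, a chain of three dependent instructions issues on consecutive cycles with bypassing, while without it each consumer stalls until its producer has written back.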

With its faster clock, larger number of execution units, and greater instruction-issue rate, the 21164 has a lot going for it compared to the 21064 and 21064A. DEC didn't stop there, however; it also improved the performance of some key operations. For example, the 21164 reduces the latency of floating-point operations from six cycles to four, and L1 data-cache accesses have been cut from three cycles to two.

Such cycle counts may still seem high compared to those of other processors--many take just one cycle to access the data cache, for example--but remember that the 21164's clock ticks much faster. Two cycles on the 21164 take less time than one cycle on the 100-MHz PowerPC 604, which means that cache lookup is actually faster on the 21164. Of course, because the PowerPC 604 has larger, more complex caches, it has a higher hit rate. Such are the trade-offs that microprocessor designers face.
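The comparison is easy to check by converting cycle counts to nanoseconds. The sketch below uses only the clock speeds and cycle counts quoted in this article; the function name is illustrative.

```python
def access_time_ns(cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    print(f"21164 at 300 MHz, 2 cycles:      {access_time_ns(2, 300):.2f} ns")
    print(f"21164 at 266 MHz, 2 cycles:      {access_time_ns(2, 266):.2f} ns")
    print(f"PowerPC 604 at 100 MHz, 1 cycle: {access_time_ns(1, 100):.2f} ns")
```

At either clock speed, the 21164's two-cycle data-cache access completes in less than the 10 ns that a single 100-MHz cycle takes.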

To Market

The 21164 comes in a 499-pin ceramic PGA (pin-grid array) with an integrated slug for mounting a heat sink. It's built with the same 0.5-micron process (for a 0.35-micron effective line length) used for the 21064A. Samples will ship in October, with the 266-MHz version available in at least limited volumes in January. The 300-MHz version will be available in volume in March.

DEC believes it can meet this aggressive schedule because the 21164 is being produced on a tried-and-true process. DEC will also have a core logic/PCI (Peripheral Component Interconnect) chip set available at the same time as the 266-MHz version of the 21164, and an evaluation board will be available in December.

The 266-MHz version of the 21164 will sell for $1865 each in lots of 5000, while the 300-MHz version will go for $2669 each, about what you'd currently pay for three 100-MHz Pentiums. This pricing reflects DEC's strategy to offer single-chip performance that no other vendor can.

While the 21164's performance advantage will shrink soon with expected announcements about new UltraSparc, Mips, and PowerPC processors, it's highly unlikely that any of these will best 300 SPECint92. The Alpha's performance lead seems secure for a long time. Also, a move to DEC's 0.35-micron process, which should be on-line sometime next year, should provide the 21164 with a nice midlife die shrink, which will certainly make it less expensive to produce and may lead to increased performance.

While the 21164 will undoubtedly appear in DEC systems that run Unix and VMS, the company is concentrating its merchant chip efforts on Windows NT. The Alpha architecture leads Mips in the number of supported NT applications, and it enjoys an 18-month to two-year advantage over NT on the PowerPC. If a high-end desktop-and-server market for NT does develop, then DEC's future will be brighter than its immediate past.


Illustration: 21164 Microarchitecture. With the Alpha 21164, DEC keeps things clean and simple, relying on a fast clock rather than more complex instructions (and instruction processing) to keep performance high. The most striking aspect of the 21164 is its on-board L2 cache.
Illustration: Alpha Pipelines. The 21164's instruction unit enforces all issue and execution rules. Once an instruction moves to an execution unit, it's ready to fly through the pipeline.
Bob Ryan is a BYTE senior technical editor. You can reach him on the Internet or BIX at b.ryan@bix.com.
