
PowerPC 620 Soars


November 1994 / State Of The Art / PowerPC 620 Soars

Its faster logic, shorter pipelines, and high-speed interface endow it with processing power that raises it to workstation and server caliber

Tom Thompson and Bob Ryan

In 1991, Apple, IBM, and Motorola formed an alliance whose goal was to create a new hardware and software standard for personal computing. The hardware portion of the standard is centered around the PowerPC architecture, a 64-bit machine environment that uses a single-chip RISC processor. This architecture serves as a template for a growing family of PowerPC processors, each with a design carefully tailored to address the computing needs of a different market.

The IBM/Motorola Somerset facility, located in Austin, Texas, is the work site of the engineers responsible for creating the various processor designs. Since 1991, Somerset design teams working in parallel have introduced new versions of the PowerPC processor with relentless regularity. The PowerPC 601, a low-cost 32-bit implementation of the PowerPC architecture, was introduced in early 1993. Fall of that year saw first silicon on the PowerPC 603, a low-power 32-bit processor suitable for notebook computers. Early this year, the PowerPC 604, a high-performance 32-bit processor designed for high-end desktop systems, made its debut.

At the Microprocessor Forum in October, IBM and Motorola jointly announced first silicon on the PowerPC 620, the first 64-bit implementation of the PowerPC architecture in a processor. While the previous members of the PowerPC family were targeted for desktop PCs, the 620 is instead crafted for workstations and high-speed servers.

Based on simulations at 133 MHz with 4 MB of level-2 cache clocked at 66.5 MHz, the PowerPC 620 posts performance marks of 225 SPECint92 and 300 SPECfp92. Key design features, such as 64-bit internal data paths, 64 KB of on-chip cache, six independent execution units, and a high-speed bus interface, provide the high level of performance required by simulations and transaction processing. The 620 is code-compatible with earlier PowerPC processors and can execute existing 32-bit PowerPC programs, as well as new 64-bit programs written specifically to exploit features on the 620.

Sampling of the PowerPC 620 begins in the second quarter of 1995, and the chip should be available in limited quantities by the second half of 1995. Pricing was not set at press time, but it was expected to follow the competitive pricing set by other members of the PowerPC family.

Processor Basics

The PowerPC 620 uses a 0.5-micron, four-metal-layer CMOS fabrication technology similar to that used in the PowerPC 604. However, the 620 fabrication process also uses an improved transistor design that switches faster, thereby improving overall performance. The PowerPC 620 operates at 3.3 V, the same as the PowerPC 603 and 604.

As with these other two processors, an on-chip PLL (phase-locked loop) on the PowerPC 620 acquires the processor clock, and the processor's bus interface can operate at one half, one third, or one fourth the speed of the processor clock to support slower memory or devices. At 133 MHz, the PowerPC 620 dissipates 30 W in a worst-case scenario. The PowerPC 620 also sports the same power management features as the 603 and 604, which can be used to reduce power consumption and build an energy-efficient computer.
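
The clock ratios described above reduce to simple division. The following sketch works them out for the 133-MHz processor clock cited in this article; the function and constant names are ours and do not come from any PowerPC documentation.

```python
# Illustrative arithmetic only; all names here are hypothetical.
PROCESSOR_CLOCK_MHZ = 133.0
BUS_DIVIDERS = (2, 3, 4)  # one half, one third, one fourth


def bus_clock_mhz(processor_mhz, divider):
    """Return the bus-interface clock for one of the listed dividers."""
    if divider not in BUS_DIVIDERS:
        raise ValueError("the article lists only 1/2, 1/3, and 1/4 ratios")
    return processor_mhz / divider


for divider in BUS_DIVIDERS:
    mhz = bus_clock_mhz(PROCESSOR_CLOCK_MHZ, divider)
    print(f"1/{divider} speed: {mhz:.2f} MHz")
```

At half speed the result is the 66.5 MHz quoted earlier for the simulated level-2 cache clock.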

However, the PowerPC 620's resemblance to other members of the PowerPC family ends here. The chip's design uses 7 million transistors--nearly double the number in the 604 design. To house that many circuits requires a large (for a PowerPC) 311-mm² die.

These extra transistors implement several key features. First, because the 620 is a 64-bit processor, additional hardware is required to support 64-bit data types and 64-bit addressing. This means that many of the processor's internal data buses and buffers, as well as the GPRs (general-purpose registers) and FPRs (floating-point registers), must be 64 bits wide.

The second original feature of the 620 is the presence of two massive, 32-KB on-chip caches. The 620, like the 603 and 604, implements a Harvard architecture with separate code and data paths. One cache handles the code path, and the other handles the data. Each cache has its own MMU (memory management unit) and functions independently of the other.

Third, the 620 employs an aggressive branch-prediction mechanism that requires prediction logic plus 64-bit rename buffers and reservation stations to store speculated branch results. This, in turn, uses more transistors.

Finally, the processor's bus interface has been beefed up: The data bus is 128 bits wide, and direct support for a level-2 cache is built in. All these new features work in concert to boost the 620's performance.

620 Interiors

At first glance, the heart of the PowerPC 620 looks identical to that of the 604. Both have the same six independent execution units: a load/store unit, a branch unit, an FPU, and three integer units. This enables up to four instructions to be fetched and dispatched at each tick of the processor clock. Because this and other 620 features resemble those of the 604, some comparisons to the 604 are necessary.

While the 620 uses a superscalar RISC core similar to that of the 604, specific design enhancements endow the 620 with its workstation-caliber performance. The major difference between the 620 and the 604 is that the 620 uses an improved bus-interface unit and memory subsystem to pump data into and out of the processor faster. The 620 also has a 128-bit data interface, so it fetches two doublewords (64 bits each) of data during every bus access. The bus interface has 40 address lines, which enable the processor to access 1 TB of physical memory.

Note that although the 620 uses only 40 bits of address, internally it uses 64-bit effective addressing and thus supports 80-bit virtual addresses. Needless to say, the wide data path and additional address lines mean that the 620 is decidedly not pin-compatible with the 604: It has 482 pins, versus the 604's 304 pins. The 620 uses a BGA (ball grid array) package.
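
The address-space and bus-width figures follow directly from powers of two. Here is a small sketch that checks them; the constant and function names are our own, and 1 TB is taken to mean 2^40 bytes.

```python
# Illustrative arithmetic only; names are hypothetical.
ADDRESS_LINES = 40          # physical address bits on the bus
VIRTUAL_ADDRESS_BITS = 80   # virtual address width cited in the text
DATA_BUS_BITS = 128         # width of the external data bus


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


physical_bytes = addressable_bytes(ADDRESS_LINES)
physical_tb = physical_bytes // (1 << 40)    # 2**40 bytes per TB
bytes_per_access = DATA_BUS_BITS // 8        # bytes moved per bus access
doublewords_per_access = DATA_BUS_BITS // 64

print(f"Physical memory: {physical_tb} TB")
print(f"Per bus access: {bytes_per_access} bytes "
      f"({doublewords_per_access} 64-bit values)")
print(f"Virtual space is 2**{VIRTUAL_ADDRESS_BITS - ADDRESS_LINES} "
      f"times the physical space")
```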

The PowerPC 620's bus interface includes integral support for a unified (i.e., both code and data) level-2 cache, whose size is configurable from 1 to 128 MB. The cache-interface signals can run at the same speed as the processor clock or at one half or one quarter of that speed, which allows the construction of a level-2 cache from slower RAM. This on-chip cache interface eliminates the extra clock cycles normally required to drive any external level-2 cache logic.

For a single-processor system, the level-2 cache interface improves performance by moving the data through the processor faster. In a multiprocessor system, the level-2 cache interface minimizes bus traffic. It does so by using a bus protocol that's designed to be tightly coupled with snoop-response pipelining. This improves the rate at which addresses issue onto the bus, without the latency of bus-snooping activity. The result is faster shared-memory access, which is vital in an environment where two or more processors exchange data or access shared semaphores and flags.

Inside the PowerPC 620, fetched code and data land in the internal 32-KB caches. The data cache supports both write-through and write-back modes and uses the MESI (Modified, Exclusive, Shared, Invalid) protocol to maintain cache coherency. On the code side of the street, instructions pass through a predecoder on their way to the internal code cache (see the figure ``The 620 Microarchitecture''). The predecoded instructions reside in the code cache until the dispatch/completion unit fetches them.
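
To make the MESI states concrete, here is a minimal sketch of the textbook state transitions for a single cache line. It is not the 620's actual implementation, whose details this article does not give; the event names and the `others_have_copy` flag are our own assumptions.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, others_have_copy=False):
    """Textbook MESI transition for one cache line (a simplification).

    event is one of: 'local_read', 'local_write' (this processor),
    'snoop_read', 'snoop_write' (another processor seen on the bus).
    """
    if event == "local_read":
        if state is State.INVALID:
            # A read miss loads the line; it is Shared if another
            # cache already holds it, Exclusive otherwise.
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    if event == "local_write":
        # Any local write leaves this cache with the only valid,
        # dirty copy (other copies are invalidated on the bus).
        return State.MODIFIED
    if event == "snoop_read":
        if state in (State.MODIFIED, State.EXCLUSIVE):
            # Another reader appears; a Modified line is written
            # back first, then both caches share it.
            return State.SHARED
        return state
    if event == "snoop_write":
        # Another processor is taking ownership of the line.
        return State.INVALID
    raise ValueError(f"unknown event: {event}")
```

For example, a line that starts Invalid, is read with no other holders, and is then written moves from I to E to M, and a subsequent read by another processor moves it to S.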

Because of this up-front predecoding, the remaining decoder logic is merged into the dispatch stage of the processor pipelines. This effectively shortens the 620's pipelines from six stages to five (fetch, decode/dispatch, execute, complete, and writeback). The shorter pipelines mean that each instruction completes in fewer clock cycles, resulting in faster overall code execution.

Once fetched by the decode/dispatch unit, each instruction is assigned a rename buffer that temporarily holds any instruction results, such as a value to be written to a register. The rename buffers make possible the speculative execution of instructions based on branch prediction, since an operation's results remain in this buffer until the outcome of a branch instruction is resolved. If the branch prediction is correct, the rename-buffer contents are written to the architectural registers. If not, the rename-buffer contents are discarded.
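
The commit-or-discard behavior just described can be sketched in a few lines. This is a deliberately simplified model with a single level of speculation and an unbounded set of rename slots; the class and method names are hypothetical and do not describe the 620's actual hardware.

```python
class RenamedRegisterFile:
    """Toy model: speculative results wait in rename slots."""

    def __init__(self, num_registers=32):
        self.architectural = [0] * num_registers
        self.rename = {}  # register index -> speculative value

    def speculative_write(self, reg, value):
        # The result lands in a rename slot, not the real register.
        self.rename[reg] = value

    def read(self, reg):
        # Later instructions see the newest (speculative) value.
        return self.rename.get(reg, self.architectural[reg])

    def resolve_branch(self, predicted_correctly):
        # Correct prediction: copy results to architectural state.
        # Wrong prediction: throw the speculative results away.
        if predicted_correctly:
            for reg, value in self.rename.items():
                self.architectural[reg] = value
        self.rename.clear()


regs = RenamedRegisterFile()
regs.speculative_write(3, 42)
regs.resolve_branch(predicted_correctly=False)
print(regs.read(3))  # the discarded write never reached register 3
```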

As with the 604, a 16-entry reorder buffer in the 620 tracks the status of an instruction from dispatch to completion. Significantly, the 620 can release up to four rename buffers per cycle, versus just two for the 604. This makes the existing rename buffers more readily available to other instructions in the pipeline. Furthermore, the shorter pipelines process instructions faster. The combination of these two features means fewer rename buffers are needed to store the intermediate results of speculative executions. Therefore, the 620 has only 16 rename buffers (eight for GPRs and eight for FPRs) in total, versus 20 (12 for GPRs and eight for FPRs) for the 604.

Next, the decode/dispatch unit issues instructions to the six execution units. Up to four instructions are dispatched per cycle to the execution units. Each unit has two or more reservation stations, which act as temporary storage for those dispatched instructions that depend on the results of other instructions. The reservation stations thus keep the instruction-dispatch bus clear so that the dispatch unit can continue to issue instructions to other execution units. If there are sufficient reservation stations available, an execution unit that stalls because of code dependencies won't impede the instruction dispatch or the operation of those execution units (e.g., the integer units) that can execute instructions out of order.

To this end, the PowerPC 620 has several more reservation stations than the 604: The 620's branch unit has four (versus two on the 604), and the 620's load/store unit has three, as opposed to the 604's two. The 620 provides in-order instruction dispatch and out-of-order execution. The reorder buffer weaves the instruction results together so that instructions ultimately complete in program order.
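
The way a reorder buffer restores program order can be sketched as a queue that retires only from its oldest end. The 16-entry capacity comes from the text; the limit of four retirements per cycle, the class name, and the method names are our own assumptions for illustration.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: in-order dispatch, out-of-order finish, in-order retire."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = deque()  # oldest instruction on the left

    def dispatch(self, name):
        """Add an instruction in program order; False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def finish(self, name):
        """Mark an instruction as executed, in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self, max_per_cycle=4):
        """Retire finished instructions from the oldest end only."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < max_per_cycle):
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
for name in ("load", "add", "mul"):
    rob.dispatch(name)
rob.finish("mul")    # finishes first, but is youngest
print(rob.retire())  # nothing retires until "load" is done
```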

Like the PowerPC 604, the 620 implements dynamic branch prediction. But the 620 has more aggressive branch-prediction logic that can speculatively execute up to four unresolved branch instructions, versus only two on the 604. To accomplish this, the 620 uses a larger 2048-entry BHT (branch history table) that records and tracks the usage history of each branch instruction encountered. Also, the 620 has a larger, 256-entry BTAC (branch-target address cache) in which it caches the instruction and target addresses. By contrast, the 604's BHT holds 512 entries, and its BTAC has only 64 entries.
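
One common way to build a branch history table is with 2-bit saturating counters, sketched below. The 2048-entry size comes from the text, but the counter scheme, the indexing by instruction address, and the initial counter value are assumptions made for illustration; the article does not describe the 620's exact mechanism.

```python
class BranchHistoryTable:
    """Toy BHT using 2-bit saturating counters (an assumed scheme)."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, branch_address):
        # PowerPC instructions are 4 bytes, so drop the low 2 bits.
        return (branch_address >> 2) % self.entries

    def predict(self, branch_address):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BranchHistoryTable()
for outcome in (True, True, False):
    bht.update(0x1000, outcome)
print(bht.predict(0x1000))  # one stray not-taken does not flip the prediction
```

The appeal of a saturating counter is that a loop branch taken many times survives a single exit without losing its history.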

Simulations run by the PowerPC's designers show that the branch-prediction logic is 90 percent accurate, which translates to little or no execution delay on program branches most of the time. In those cases where a bad branch prediction occurs, the penalty for recovering the thread of execution is typically four clock cycles. On the first cycle, the PowerPC 620 completes all instructions up to and including the branch and calculates the address of the correct branch path. (This operation sometimes takes more than one cycle.) The second cycle flushes the pipelines and fetches the correct instructions. The third cycle dispatches the instructions; the fourth cycle executes them.
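
Combining the two figures above gives a rough average cost per branch. This is back-of-the-envelope arithmetic that assumes every misprediction costs exactly the typical four cycles; the function name is ours.

```python
def expected_branch_penalty(accuracy, penalty_cycles):
    """Average cycles lost per branch, given prediction accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - accuracy) * penalty_cycles


# Figures from the text: 90 percent accuracy, four-cycle penalty.
cost = expected_branch_penalty(0.90, 4)
print(f"Average penalty: {cost:.2f} cycles per branch")
```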

Fast Floating-Point Performance and Modes

The 604's PC-based design emphasizes integer performance, as dictated by the needs of its applications. But workstation applications anticipated to run on the 620--such as data capture and visualization, scientific simulations, and real-time analysis of market trends--make heavy use of exotic equations to compute thousands of results per second. Thus, they require rip-roaring floating-point performance.

As the SPECmarks mentioned at the beginning of this article indicate, the 620 easily serves up floating-point performance that's much better than its integer performance. The PowerPC's designers achieved this by implementing key improvements in certain execution units and by fine-tuning the RISC core's throughput.

In the 620's FPU, the engineers worked to improve the speed of the divide (fdiv) and square-root instructions (fsqrt). The divide instruction is a computationally expensive instruction and is used frequently, so any enhancement in its execution speed has an impact on all floating-point computations. The engineers decided to also speed up the square-root instruction because of its high frequency of use. The fdiv instruction, which takes 32 clock cycles on the 604, takes just 18 on the 620. The fsqrt instruction, which was emulated in software on the 604, now executes in 22 clock cycles. For the load/store unit, one clock cycle was shaved off floating-point data accesses.
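
The cycle counts for fdiv translate into a per-instruction speedup at equal clock rates. The sketch below is simple arithmetic on the figures quoted above; the function name is ours, and it ignores any difference in clock speed between the two chips.

```python
def cycle_speedup(old_cycles, new_cycles):
    """Ratio of old to new latency, assuming the same clock rate."""
    return old_cycles / new_cycles


FDIV_604_CYCLES = 32  # from the text
FDIV_620_CYCLES = 18  # from the text

speedup = cycle_speedup(FDIV_604_CYCLES, FDIV_620_CYCLES)
print(f"fdiv completes {speedup:.2f}x faster per instruction")
```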

All these improvements add up to better floating-point throughput. However, the designers obtained the most significant performance gains by engineering the processor to get data through the RISC core faster. The PowerPC 620's wider data paths, shorter pipelines, on-chip caches, and level-2 cache support all contribute to shipping large amounts of floating-point data in, through, and out of the FPU.

Over time, 64-bit applications will be written to take advantage of the huge address space the 620 offers. To this end, the 620 implements 38 new instructions as part of the 64-bit PowerPC architecture. Specific 64-bit instructions that such applications might use are load/store instructions that access doublewords of data, such as load doubleword (ld) and store doubleword (std). For compatibility with the existing base of 32-bit PowerPC applications (such as it is now), the 620 can execute them without modification.

A mode bit in the processor's MSR (machine state register) indicates which mode the PowerPC 620 operates in (32- or 64-bit). There's no penalty for running the processor in either mode; in 32-bit mode, the bits in the lower half of the 620's 64-bit registers are guaranteed to correspond to those in a 32-bit PowerPC processor. Furthermore, the mode bit in the MSR is under software control, so it's possible for a 64-bit operating system to change the processor environment on-the-fly to execute 32-bit applications. There would be some overhead on the part of the operating system to manage the mode switch.
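
The guarantee about the lower half of the registers can be illustrated with modular arithmetic: the low 32 bits of a 64-bit sum equal the 32-bit sum of the low halves. The sketch below demonstrates this for addition only; the helper names are hypothetical, and it does not model the MSR or any other processor state.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add64(a, b):
    """Add as a 64-bit register would, wrapping on overflow."""
    return (a + b) & MASK64


def add32(a, b):
    """Add as a 32-bit register would, wrapping on overflow."""
    return (a + b) & MASK32


def low_half(value):
    """Lower 32 bits of a 64-bit register value."""
    return value & MASK32


a, b = 0xFFFFFFFF, 0x00000002  # overflows a 32-bit register
wide = add64(a, b)
narrow = add32(a, b)
print(hex(wide), hex(narrow), low_half(wide) == narrow)
```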

One unique feature found in all PowerPC processors--but which the 620 should be able to put to good use--is the ability to assume either big-endian (Motorola) or little-endian (Intel) address modes under software control. One bit in the MSR determines the addressing mode; another bit indicates the addressing mode of an interrupt handler. This lets a big-endian operating system run little-endian applications. When a hardware interrupt occurs for an operating-system service, the addressing mode can be switched to big-endian for the duration of the interrupt handler's execution. A 620-based workstation could thus host application code from different operating systems (say, a Unix operating system running Windows applications) with respectable performance.
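
For readers unfamiliar with the two byte orders, this short sketch shows how the same 32-bit value is laid out in memory under each convention. It uses only Python's built-in integer methods and says nothing about how the processor implements the mode switch; the function name is ours.

```python
def store_word(value, little_endian):
    """Return the 4 bytes of a 32-bit value in the chosen byte order."""
    order = "little" if little_endian else "big"
    return value.to_bytes(4, order)


value = 0x12345678
big = store_word(value, little_endian=False)
little = store_word(value, little_endian=True)

print("big-endian:   ", big.hex())     # 12345678
print("little-endian:", little.hex())  # 78563412

# Reading the bytes back with the wrong order scrambles the value.
misread = int.from_bytes(big, "little")
print(hex(misread))                    # 0x78563412
```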

Future Directions

The PowerPC 620 is a promising addition to the PowerPC family of processors, offering workstation-class throughput and paying special attention to floating-point performance. Its speed and power consumption are comparable to those of other RISC processors. However, processing speed is always a moving target in this fast-paced business: By the time the first 620-based system appears in the latter half of 1995, we'll be witnessing the first silicon on a new generation of faster chips from other RISC vendors.

While the 620 is the last of the publicly announced processors, the Somerset engineers are busily working on next-generation processors and enhancements to existing designs. The PowerPC alliance is understandably quiet about information on future processors, but its efforts to enhance existing designs are already well known, as is evidenced by the 601+.

In the 601+ processor, a 0.5-micron, five-metal-layer, local-interconnect process shrinks the die size from the original 121 mm² to 74 mm². It also reduces the operating voltage from 3.6 V to 2.5 V, so the 601+ dissipates 4 W at 100 MHz. This is half the power consumption of the original PowerPC 601 operating at 66 MHz, and close to the maximum dissipation of the PowerPC 603 (3 W at 80 MHz). Expect similar improvements to appear in the PowerPC 603, 604, and 620 designs.
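
One way to compare these chips on equal footing is power per megahertz. The sketch below uses the figures in the preceding paragraph; the 8-W value for the original 601 is inferred from the statement that the 601+ uses half its power, and the function name is ours.

```python
def watts_per_mhz(watts, mhz):
    """Power dissipated per MHz of clock speed."""
    return watts / mhz


chips = {
    "601+ at 100 MHz": (4.0, 100.0),
    "601 at 66 MHz": (8.0, 66.0),   # 8 W inferred: twice the 601+ figure
    "603 at 80 MHz": (3.0, 80.0),
}

for name, (watts, mhz) in chips.items():
    print(f"{name}: {watts_per_mhz(watts, mhz) * 1000:.1f} mW/MHz")
```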

Key to the PowerPC's future survival is its acceptance by users. Initial activity in this area, though limited, is promising. The only PowerPC-based systems on the market at this writing are Apple's line of Power Macs, which use the PowerPC 601. Apple sold over 345,000 of these systems in just four months, becoming the largest RISC-computer vendor on the planet. If this trend continues, especially when high-speed PowerPC 603-, 604-, and 620-based systems from Apple, IBM, and other vendors appear, then the alliance's hopes of creating a new standard for desktop computers might succeed after all.


Features of the PowerPC 620

-- 133-MHz processor clock
-- Half-, third-, or quarter-speed bus-interface clock
-- 128-bit data bus
-- 40-bit address bus
-- 64-bit GPRs and FPRs
-- 3.3-V operation
-- 7 million transistors
-- 311-mm² die
-- Six execution units
-- Split 32-KB caches
-- Built-in level-2 cache interface




Figure: The 620 Microarchitecture. A block diagram of the PowerPC 620. Although it closely resembles the PowerPC 604 in structure, the GPRs, FPRs, and internal buses are 64 bits wide. The predecoder unit sits in front of the code cache and helps shorten the processor pipeline by one stage.
Illustration: The PowerPC 620 uses an improved, faster-switching transistor design.
Tom Thompson is a BYTE senior technical editor at large with a B.S.E.E. from Memphis State University. You can contact him on the Internet or BIX at tom_thompson@bix.com. Bob Ryan is a BYTE senior editor. You can contact him on the Internet or BIX at b.ryan@bix.com.

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service