
SPARC Strikes Back


November 1994 / State Of The Art / SPARC Strikes Back

UltraSparc moves SPARC to 64 bits while providing a host of video and graphics capabilities

Peter Wayner

Several years ago, the world order was very regimented: computers were computers, phones were phones, and TVs were TVs. Today, that structure is rapidly blurring as microprocessors wiggle their way into more and more places. The next version of Sun Microsystems' (Mountain View, CA) venerable SPARC line is aimed at being both faster than ever and better suited to the diverse roles that a microprocessor will play in the video world.

Sun made substantial and varied changes to SPARC. The most exciting one is an on-chip collection of image-processing functions that can operate on up to 8 pixels at once. The designers have also tweaked the structure of context switching so that SPARC can better run multithreaded operating systems, and they have improved the instruction set to allow better optimization by compilers. Some changes are necessary to bring SPARC into the 64-bit world that Alpha and Mips, at least, have inhabited for years. Other changes are more specialized and detailed.

The changes to SPARC come in two forms: At the abstract level, Sun has issued the SPARC-V9 revisions to the SPARC architecture that spell out in detail what constitutes a SPARC-compatible chip. This permits companies such as Fujitsu to produce their own versions of the latest definition of SPARC. SPARC-V9 is the first major set of revisions to emerge since the commercial SPARC architecture was announced in the 1980s as the SPARC-V7. SPARC-V8 contained relatively minor revisions to the original architecture.

On the more concrete level, Sun has announced UltraSparc, the first implementation of the SPARC-V9 architecture. This chip will be produced in a partnership with Texas Instruments. Although the final numbers are not available at this writing, Sun hopes to produce a chip that delivers between 250 and 300 SPECint92 at a clock speed of about 200 MHz. Sun expects the processing ability to scale linearly with the clock speed. This chip should be available in samples by early 1995 and in quantity by the end of 1995.

Pieces of Eight

The built-in instructions for handling video tasks are perhaps the most novel addition to the architecture. Digital manipulation of graphics and video data in computing environments is growing exponentially, and many companies are developing specialized chips to handle these tasks. UltraSparc represents one of the first general-purpose processors to integrate these functions. It has specialized hardware to process image data packed in the typical RGB and alpha format. Each of these components can be represented with either 8, 16, or 32 bits. The instruction set includes new instructions that will load and manipulate data in 64-bit blocks. The pixels in one 64-bit block can be either added to or multiplied by the pixels in another 64-bit block in a single operation.

The image manipulation is performed in the FPU, which normally operates on 64-bit floating-point quantities. In the case of pixel addition, the chip simply forgets to carry a bit every 8, 16, or 32 bits. The process is more complicated for pixel multiplication and relies heavily on shifters that are used extensively in floating-point arithmetic.
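The idea of an adder that "forgets to carry" at lane boundaries can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a description of UltraSparc's actual instructions; the function name partitioned_add and its parameters are hypothetical.

```python
MASK64 = (1 << 64) - 1


def partitioned_add(a: int, b: int, lane_bits: int = 16) -> int:
    """Add two 64-bit words lane by lane, dropping the carry out of each lane.

    Hypothetical helper: models the effect described in the text, where no
    carry crosses an 8-, 16-, or 32-bit boundary.
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16, or 32")
    a &= MASK64
    b &= MASK64
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Keep only the low lane_bits; the carry is discarded, not propagated.
        result |= (lane_sum & lane_mask) << shift
    return result
```

With 16-bit lanes, a lane holding 0xFFFF plus 1 wraps to 0 without disturbing the lane above it, which is exactly what an ordinary 64-bit add would get wrong.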

The pixel operations can also be split to increase processing parallelism. For example, one of the new instructions can multiply four 16-bit numbers by four 8-bit numbers without choking the pipeline. If you want to do 16- by 16-bit multiplies, then you use two of these instructions and combine the results with an addition.
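The arithmetic behind combining two 16- by 8-bit multiplies into one 16- by 16-bit multiply is simple to show. The sketch below assumes unsigned values and full-width products; how the real instructions scale, round, or truncate their results is not covered in this article, so the helper names and behavior here are assumptions for illustration only.

```python
def mul16x8_lanes(a16, b8):
    """Hypothetical stand-in for one four-lane 16- by 8-bit multiply."""
    return [x * y for x, y in zip(a16, b8)]


def mul16x16_lanes(a16, b16):
    """Build four 16- by 16-bit products from two 16- by 8-bit passes.

    Uses the identity a * b = ((a * b_high) << 8) + (a * b_low),
    where b_high and b_low are the upper and lower bytes of b.
    """
    low = mul16x8_lanes(a16, [b & 0xFF for b in b16])
    high = mul16x8_lanes(a16, [(b >> 8) & 0xFF for b in b16])
    # The final step is the single addition that combines the two passes.
    return [(h << 8) + l for h, l in zip(high, low)]
```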

The ability to perform up to eight operations at once in parallel is useful when compressing and decompressing video images. The most time-consuming part of the MPEG algorithm is trying to analyze the motion of the image by comparing each part of the current frame against the previous frame. The UltraSparc comes with a special instruction that will do the eight subtractions, eight absolute values, and eight additions this comparison takes--as well as the final work of aligning the information--all in a single graphics-unit operation. A special memory system automatically loads pixels in 8-byte blocks without a separate instruction. When these specialized instructions are pipelined, the chip will sustain one operation per cycle.
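The comparison described above amounts to a sum of absolute differences over eight pixels. Here is a minimal Python sketch of that computation and of how a motion search might use it; sad8 and best_offset are hypothetical names, and the one-dimensional search is a simplification of real MPEG motion estimation.

```python
def sad8(current: bytes, previous: bytes) -> int:
    """Sum of absolute differences over eight 8-bit pixels.

    Eight subtractions, eight absolute values, and the additions that
    accumulate them, which the text says the chip does in one operation.
    """
    if len(current) != 8 or len(previous) != 8:
        raise ValueError("both blocks must hold exactly 8 pixels")
    return sum(abs(c - p) for c, p in zip(current, previous))


def best_offset(current_block: bytes, previous_row: bytes) -> int:
    """Return the offset in previous_row whose 8 pixels best match the block."""
    candidates = range(len(previous_row) - 7)
    return min(
        candidates,
        key=lambda off: sad8(current_block, previous_row[off:off + 8]),
    )
```

A software encoder calls something like sad8 for every candidate position of every block, which is why collapsing it into a single operation matters so much.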

The benefit of these special instructions is enormous. Sun hopes that the UltraSparc will be able to deliver performance that is up to 80 times faster than other RISC processors on pixel manipulation operations. Sun estimates that the chip will be able to decompress two MPEG-2 video streams and perform video processing in real time. On the other end, the chip should be able to provide real-time MPEG-2 compression.

It is not clear whether these instructions will provide similar gains to other applications. Several more specialized scientific applications may be able to use the parallelism to speed up the work substantially. The ones that will benefit will be those that can operate with low-precision, fixed-point values.

Memory Improvements

The greatest headache for any modern processor designer is getting data on and off the chip. The UltraSparc has several features that should significantly improve the memory performance of the chip. Some of these changes will boost multimedia performance, and others are aimed at helping average system tasks.

The biggest change, at least in the volume of bits moved, is a new block move instruction that circumvents the normal cache structure. Using this instruction, you can move up to 600 MBps across the processor/memory bus. This lets the main processor act as the video processor by blitting data on and off the screen. This block move also comes in handy in other applications that shuffle memory. Sun's system architects say that they've watched the TCP/IP networking software move packets of data up to eight times before it reaches its final destination. Given that most UltraSparc machines will be networked, the block move instruction can help hold down networking overhead.

The other parts of the memory interface are fairly standard. The UltraSparc has split primary caches. The data cache is 16 KB and direct-mapped, while the instruction cache also holds 16 KB, though it is two-way set-associative. Both caches have their own TLB (translation look-aside buffer). The UltraSparc also comes with an on-chip cache manager for an off-chip second-level cache. You only need to add SRAM (static RAM) to have a fully functioning second-level cache.
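The practical difference between a direct-mapped and a two-way set-associative cache of the same size shows up when two addresses map to the same slot. The sketch below assumes a 32-byte line size, which this article does not state; the function names are likewise hypothetical.

```python
from collections import Counter

CACHE_BYTES = 16 * 1024  # from the text: both caches hold 16 KB
LINE_BYTES = 32          # assumption: line size is not given in the article


def set_index(address: int, ways: int) -> int:
    """Return the cache set an address maps to for the given associativity."""
    sets = CACHE_BYTES // (LINE_BYTES * ways)
    return (address // LINE_BYTES) % sets


def can_coexist(addresses, ways: int) -> bool:
    """True if all the addressed lines fit in the cache without an eviction."""
    lines = {address // LINE_BYTES for address in addresses}
    per_set = Counter(set_index(line * LINE_BYTES, ways) for line in lines)
    return all(count <= ways for count in per_set.values())
```

Under these assumptions, two addresses exactly 16 KB apart evict each other in the direct-mapped data cache but can sit side by side in the two-way instruction cache.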

Instruction fetching is tightly integrated with the first-level instruction cache. The instructions stored in the cache are predecoded to speed their processing when they enter the execution pipeline. Every two instructions in the cache are associated with 2 bits that are used to predict branches taken by the instructions. The 2 bits keep track of four different states that encode the last two paths taken by these instructions.

The prefetching mechanism uses the bits to dynamically predict branches. Sun's preliminary studies show that the UltraSparc is able to predict the right path in 88 percent of the branches taken in the SPECint92 test suite and 94 percent of the time in the SPECfp92 set.
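One common way to use 2 bits of history per branch is a four-state saturating counter. The article does not spell out UltraSparc's exact state machine, so the sketch below is an assumption about the general technique rather than a description of the chip; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Four-state saturating counter: 0 and 1 predict not taken, 2 and 3 taken."""

    def __init__(self, state: int = 2):
        if state not in (0, 1, 2, 3):
            raise ValueError("state must be 0..3")
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, state: int = 2) -> int:
    """Count how many branch outcomes the predictor guesses correctly."""
    predictor = TwoBitPredictor(state)
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct
```

The appeal of two bits over one is that a loop branch taken nine times and then not taken once costs a single misprediction per pass, because one surprise is not enough to flip the prediction.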

Into the Pipeline

The execution pipeline is the backbone of a modern chip, and its structure defines the performance limits. The UltraSparc comes with a nine-stage pipeline that can issue up to four instructions per cycle. The first two stages are standard: The instructions are fetched and then decoded.

The third stage groups together any possible instructions that are available for issue to the execution units. The chip will not issue the instructions out of order, and Sun is confident that its compilers will be able to do a good job scheduling the instructions to maximize throughput. There are particular rules about which instructions may be bundled together. There is a limit of two integer operations, two floating-point or graphics operations, one load/store memory access, and one branch that can be issued each tick of the clock. Even though this adds up to six possible instructions, only four can go at once.
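The bundling rules can be expressed as a small in-order grouping routine. This sketch models only the per-class limits and the four-instruction cap given above; it ignores data dependencies and any other constraints the real hardware applies, and the names used are hypothetical.

```python
# Per-cycle limits taken from the text; the dictionary keys are made-up labels.
LIMITS = {"int": 2, "fp": 2, "mem": 1, "branch": 1}
MAX_GROUP = 4


def group_instructions(kinds):
    """Split an in-order instruction stream into issue groups.

    A group closes as soon as the next instruction would exceed either the
    four-instruction cap or the limit for its class. Instructions are never
    reordered, so a blocked instruction also blocks everything behind it.
    """
    groups = []
    current = []
    used = {}
    for kind in kinds:
        if kind not in LIMITS:
            raise ValueError(f"unknown instruction class: {kind}")
        full = len(current) == MAX_GROUP or used.get(kind, 0) == LIMITS[kind]
        if current and full:
            groups.append(current)
            current, used = [], {}
        current.append(kind)
        used[kind] = used.get(kind, 0) + 1
    if current:
        groups.append(current)
    return groups
```

Feeding it three integer instructions in a row shows why compiler scheduling matters: the third one starts a new group even though two slots in the first group went unused.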

In addition, this stage is responsible for getting the information from the registers. If the information is not ready, it will stall any instruction that depends on it until it is ready. Sun says that it is closing in on the magical one-instruction-per-cycle average.

After issue, the pipeline splits into two parts. One fork handles the integer and memory instructions and the other handles the floating-point and graphics instructions. The floating-point instructions travel down a three-stage pipeline that is tuned to handle everything except floating-point division and square roots. A separate functional unit attacks these without stalling the pipeline. The chip issues instructions in order, but they don't need to finish in the same order.

Basic integer instructions execute in one cycle. Others such as integer multiplication and division have variable latencies. For example, the UltraSparc executes 2 bits of the multiplicand or 1 bit of the divisor per cycle (the chip is thus very human in its performance: Bigger numbers take longer). Once an integer instruction executes, a bypass mechanism makes its results available immediately, rather than after the writeback stage.
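As a rough model of "bigger numbers take longer," the cycle counts can be estimated from the number of significant bits in the operand. This sketch takes the 2-bits-per-cycle and 1-bit-per-cycle figures at face value and ignores any fixed setup or completion overhead, which the article does not give; the function names are hypothetical.

```python
def multiply_cycles(multiplicand: int) -> int:
    """Estimate cycles for a multiply at 2 bits of the multiplicand per cycle."""
    bits = max(1, abs(multiplicand).bit_length())
    return (bits + 1) // 2  # round up to whole cycles


def divide_cycles(divisor: int) -> int:
    """Estimate cycles for a divide at 1 bit of the divisor per cycle."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    return abs(divisor).bit_length()
```

Under this model an 8-bit multiplicand takes about 4 cycles and a full 32-bit one about 16, while a divide by the same values takes twice as long.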

The rest of the integer/memory pipeline is devoted to handling the loads and stores. These can occasionally take a long time if the data is not available in the on-chip cache. Sun worked to keep these stages in the integer pipeline the same size as the floating-point pipeline so that the results from the two can be rejoined in the final stage when the information is written back to memory or registers. These loads and stores do not have to finish in their programmed order, which significantly adds to pipeline throughput.

Context Switching

The pipeline structure governs how well a chip will do on a straight-line segment of code, but it says little about how a chip will perform on a desktop when it is often forced to handle a number of different programs. The ability to switch quickly between different blocks of code (i.e., context-switching) is becoming more important than ever because both modern multithreaded operating systems and OOP (object-oriented programming) are slicing the programs into smaller bits or contexts.

SPARC is the only RISC architecture on the market that uses register windows. Instead of 32 basic registers, the chip offers eight overlapping windows of 24 registers each. The theory behind these windows is that when a new procedure or thread begins, the windows would obviate the need to write the old information out to memory; the new context would simply be a new "window" of registers. In practice, many compiler writers found that they would quickly exhaust the supply of windows, so they needed to pause and write the information to memory anyway.
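Why the supply of windows runs out can be seen by counting memory traffic for a chain of nested calls. The model below is deliberately simplified: it assumes all eight windows are usable by the call chain and that one window is written out or reloaded at a time, details this article does not specify. The function name and the event labels are hypothetical.

```python
def count_window_traffic(trace, windows: int = 8):
    """Count window spills and fills for a sequence of 'call'/'return' events.

    Returns (spills, fills): how many windows had to be written to memory
    on calls and read back from memory on returns.
    """
    resident = 1  # windows currently held in registers for the call chain
    spilled = 0   # windows currently saved in memory
    spills = fills = 0
    for event in trace:
        if event == "call":
            if resident == windows:
                # No free window: write the oldest one out and reuse its slot.
                spills += 1
                spilled += 1
            else:
                resident += 1
        elif event == "return":
            if resident > 1:
                resident -= 1
            elif spilled > 0:
                # The caller's window is in memory and must be reloaded.
                fills += 1
                spilled -= 1
            else:
                raise ValueError("return without a matching call")
        else:
            raise ValueError(f"unknown event: {event}")
    return spills, fills
```

A call chain five deep costs nothing in this model, while one ten deep pays for three spills on the way down and three fills on the way back.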

Register windows have caused Sun some grief. Other RISC designers were able to produce nice small sets of 32 registers with a much simpler design that would run faster. Also, other compiler designers found they didn't need many of the advantages of the overlapping nature of the registers because they could simply compile short procedures in-line. Sun couldn't abandon windows without losing backward-compatibility with the old SPARC software. For this reason, with UltraSparc, it concentrated on adding several different improvements for handling context switches.

One of the neater solutions is providing another fresh window of registers every time a trap, an interrupt, or an MMU (memory management unit) trap is sprung. Anyone writing a software routine implemented as a trap must ordinarily save all the information in the registers so that the routine does not destroy the results of the process that was interrupted. The UltraSparc provides eight fresh registers that can be used without worry in these cases. This should significantly improve the speed at which the UltraSparc handles code of multithreaded operating systems that use many traps and interrupts.

Onward, Upward

In its early years, Sun recognized that the main demands on a desktop Unix box were to do simple integer pointer arithmetic and move the data around. So it produced a RISC instruction set that did just this and nothing more. Now that the system demands on a desktop machine are no longer as significant, Sun is changing the instruction mix to supply what it hopes desktop users will want: hot graphics and video processing.

The graphics instructions and the fast block data transfers should let Sun build low-cost desktop systems that offer stunning video processing. The graphics-processing instructions will be able to speed up video processing and graphics generation. This should become a tantalizing addition to the desktop and may even let Sun make substantial inroads in the potential market for set-top computers.


UltraSPARC Highlights

-- Multimedia instructions can process up to 8 pixels at once for
   MPEG decompression.
-- Upgrades SPARC to 64-bit architecture.
-- Single-cycle branch prediction integrated into first-level cache.
-- Fabricated on Texas Instruments' 0.5-micron CMOS, 3.3-V
   four-layer metalization process.
-- Sun claims 250 to 300 SPECint92 at 200 MHz.
-- Samples in the first quarter of 1995; ships in 1995.
-- Integrated second-level cache support.
-- Speculative loads allow a load that might fail because of a nil
   pointer to be done without testing.
-- Up to 600-MBps block data transfer without affecting caches.
-- Nested traps.
-- Logic units: two integer, two floating-point/graphics addition,
   two floating-point/graphics multiplication, one floating-point
   division/square root, one branch, and one load/store.
-- 16-KB instruction cache; 16-KB data cache.




Illustration: Inside UltraSparc The UltraSparc is unusual in that it devotes so many internal resources to graphics and video processing. In addition to the two pixel processing engines in the FPU, the UltraSparc also can handle block moves of up to 600 MBps that bypass the caches. Sun is pointing squarely at a multimedia future with this chip.
Illustration: UltraSparc Pipelines The SPARC pipeline breaks into two paths for integer and floating-point instructions. The key work is done in the third stage that groups together up to four instructions that can be executed simultaneously. Although the instructions must start in order, they can finish out of order without stalling the pipeline. The long tail of the integer pipeline is needed to handle memory access to data not found in the cache. Integer results don't have to wait until writeback, however; a bypass mechanism makes them available immediately after their execution stage.
Peter Wayner is a BYTE consulting editor living in Baltimore, Maryland. You can contact him on the Internet or BIX at pwayner@bix.com.
