
VLIW Questions


November 1994 / Core Technologies / VLIW Questions

Will VLIW mean ``very long investment window'' for Hewlett-Packard and Intel?

Peter Wayner

Over the last 10 years, the notion of RISC made its way from the labs of computer architects to the word processors of the marketplace. Along the way, it brought great performance gains to the companies that invested in it (Hewlett-Packard, IBM, Mips, Motorola, Sun, and, to some extent, everyone else) and cemented itself as the guiding philosophy for microprocessor design. And now, just as RISC has won the mind-share war over CISC, along come Intel and HP to roil the waters.

The hot news in Silicon Valley is HP and Intel's announced plan to jointly design a new chip that will run both Intel x86 software and HP Precision Architecture code. Just as important, the companies describe the technology they plan to use as ``post-RISC.'' Based on the fact that HP had already announced its interest in VLIW (very long instruction word) and that it has many engineers on-board from VLIW vendor Multiflow, informed opinion is that post-RISC equates to VLIW.

VLIW is a logical extension of RISC. Like a superscalar RISC processor, a VLIW machine executes several simple operations at a time. The difference is where you put the smarts to deal with the dependency issues that arise when you perform several operations in parallel. With VLIW, the smarts come from the compiler, which is responsible for packing many simple instructions into one long instruction word. VLIW compilers are responsible for determining which instructions depend on others. For instance, the compiler can put R1+R2->R3 and R4+R5->R6 together into the same instruction word, because they do not use the same registers. It cannot bundle R1+R2->R3 and R3+R4->R5 together because the second instruction needs to wait for the results of the first to be posted to R3.
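The register test described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's compiler logic: the `Op` tuple and the `can_bundle` function are hypothetical names, and the sketch conservatively assumes that any shared register involving a write rules out bundling.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """A simple operation of the form src1 + src2 -> dest (hypothetical encoding)."""
    src1: str
    src2: str
    dest: str


def can_bundle(first: Op, second: Op) -> bool:
    """Return True if the two operations may share one instruction word.

    Assumption: any register written by one operation and touched by
    the other is treated as a conflict.
    """
    # The second operation reads a result the first has not yet posted.
    read_after_write = first.dest in (second.src1, second.src2)
    # Both operations write the same register.
    write_after_write = first.dest == second.dest
    # The second operation overwrites a register the first still reads.
    write_after_read = second.dest in (first.src1, first.src2)
    return not (read_after_write or write_after_write or write_after_read)


if __name__ == "__main__":
    # R1+R2->R3 and R4+R5->R6 share no registers, so they can be bundled.
    print(can_bundle(Op("R1", "R2", "R3"), Op("R4", "R5", "R6")))
    # R3+R4->R5 needs the R3 produced by R1+R2->R3, so it must wait.
    print(can_bundle(Op("R1", "R2", "R3"), Op("R3", "R4", "R5")))
```

A real VLIW compiler would apply a test of this kind across whole blocks of code rather than to a single pair.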

The Parallel Question

VLIW is a new way to attack an old problem. Scalar RISC and CISC chips that employ pipelining have to deal with many of the same problems of inter-instruction dependency. Like VLIW compilers, compilers for pipelined processors try to rearrange code and spread out interdependent instructions so they do not follow each other down the pipeline. If this isn't done, the CPU must wait until the first instruction is finished before executing the second, and this delay largely destroys the value of the pipeline. The overriding difference between the approaches lies in which piece of the puzzle--the compiler or the chip--takes primary responsibility for instruction scheduling. Conventional technology says the chip does the final, real-time scheduling; VLIW says to leave that job to the compiler.

This debate was common in the mid-1980s when computer architects had to decide the next natural path to take to speed up basic RISC machines. At that time, heavily pipelined machines that handled dependencies in hardware were easier to build. VLIW machines required constructing multiple logic units to handle the extra instructions packed into a wider word. That meant committing a substantial piece of silicon real estate--especially if a logic unit had to handle something like integer multiplication.

Deep pipelines for RISC machines, on the other hand, can be built by finding a way to split up the stages of the computation into smaller stages. The basic tasks of fetching the information, decoding the instruction, performing the computation, and returning the value are natural choices for pipeline stages. These simple four-stage pipelined machines can, in theory, execute four times as many instructions as a nonpipelined processor can, as long as the interdependence between instructions does not delay the execution. The pipelined approach won out in the end because it was doable in the transistor budgets of the day. As evidence of this success, today you find some RISC processors whose pipelines have five or six stages.
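The "four times" figure can be checked with a little arithmetic. The sketch below is a simplified model, not a description of any real chip: it assumes every stage takes one clock cycle, that a nonpipelined processor spends one cycle per stage on each instruction, and that dependency delays can be lumped into a single `stall_cycles` count. The function names are made up for this illustration.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles needed when each instruction passes through every stage alone."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int, stall_cycles: int = 0) -> int:
    """Cycles needed when a new instruction can enter the pipeline each cycle.

    The first instruction takes `stages` cycles to drain; each later one
    finishes one cycle after its predecessor, plus any stall cycles.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + stall_cycles


def speedup(instructions: int, stages: int, stall_cycles: int = 0) -> float:
    """Ratio of nonpipelined to pipelined cycle counts."""
    return unpipelined_cycles(instructions, stages) / pipelined_cycles(
        instructions, stages, stall_cycles
    )


if __name__ == "__main__":
    # With no stalls, a four-stage pipeline approaches a fourfold gain.
    print(speedup(1000, 4))
    # If interdependent instructions cost 500 stall cycles, the gain shrinks.
    print(speedup(1000, 4, stall_cycles=500))
```

Under these assumptions, 1000 instructions take 4000 cycles without pipelining and 1003 cycles with it, and every stall cycle eats directly into that advantage.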

As budgets increased, designers started putting multiple execution units on-chip--the superscalar approach--but left the work of handling most dependencies to hardware. They did this because one of the most important advantages to the hardware approach is that any code created for one generation of an architecture can still be used in the next generation, which might have a different, better pipeline or a different number and mix of functional units. Although such code might benefit from recompiling, the precise FIFO (first-in/first-out) ordering enforced a simple discipline that was easy to maintain across generations. This is a major issue in an age when people are still running software on the latest, greatest machines that was written for their original Macintosh or PC.

The Price We Pay

The cost of hardware scheduling and its inherent intergenerational flexibility is complexity. The decode/issue logic must be very intelligent to filter out problems created by running older code on a newer processor or by running scalar code on a superscalar processor. The number of transistors required to implement this level of intelligence is substantial--witness the complex instruction tracking mechanisms used in the PowerPC 620, the AMD K5, and the Mips T5, for example--and the time it takes to execute this work also adds significant overhead to the pipeline. Simpler decode and issue stages would permit clock rates to soar, as these stages normally have the longest latency in current superscalar and superpipelined processors.

This is the promise of VLIW: By removing complexity from the hardware, you create simple processors that let you increase performance far more simply than you can with current processors. On the one hand, simple hardware lets you increase clock speeds more aggressively than is possible with today's complex RISC chips. On the other, you can easily add more functional units to wring out all the parallelism that exists in your code.

If VLIW machines are to work well, they require smart compilers that are responsible for identifying which operations can run in parallel. This decision is made at compile time and frozen in place when the operations are packed into instruction words. In essence, the compiler makes many of the interference decisions that are currently made on the run by the decoding stage of a pipelined, superscalar processor.
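A minimal sketch of this compile-time packing follows. It is an assumption-laden toy, not the algorithm of any production VLIW compiler: `Op`, `conflicts`, and `pack` are hypothetical names, every operation is assumed to take one cycle, and an operation is placed in the earliest word that comes after every earlier operation it conflicts with and still has a free slot.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    """A simple operation of the form src1 + src2 -> dest (hypothetical encoding)."""
    src1: str
    src2: str
    dest: str


def conflicts(earlier: Op, later: Op) -> bool:
    """True if `later` must be issued in a later word than `earlier`."""
    return (
        earlier.dest in (later.src1, later.src2)    # later reads earlier's result
        or earlier.dest == later.dest               # both write the same register
        or later.dest in (earlier.src1, earlier.src2)  # later overwrites an input
    )


def pack(ops: List[Op], width: int) -> List[List[Op]]:
    """Greedily pack operations, in program order, into words of `width` slots."""
    words: List[List[Op]] = []
    for op in ops:
        # The op may not share or precede any word holding a conflicting op.
        earliest = 0
        for index, word in enumerate(words):
            if any(conflicts(prev, op) for prev in word):
                earliest = index + 1
        # Take the first word at or after `earliest` with a free slot.
        slot = earliest
        while slot < len(words) and len(words[slot]) >= width:
            slot += 1
        if slot == len(words):
            words.append([])
        words[slot].append(op)
    return words


if __name__ == "__main__":
    program = [
        Op("R1", "R2", "R3"),
        Op("R4", "R5", "R6"),
        Op("R3", "R4", "R5"),  # depends on the first operation
        Op("R7", "R8", "R9"),
    ]
    for width in (1, 2, 4):
        print(width, pack(program, width))
```

The layout is fixed once `pack` returns, and packing the same four operations for a different `width` produces a different set of words.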

Compiler Imperatives

Is compiler technology ready for VLIW? There certainly has been no lack of research on the topic. For example, in the mid-1980s, IBM sponsored a research project to develop a test VLIW machine. The research-grade compiler used with it was able to find as many as 10 operations to run concurrently--and this was in nonscientific code. The compiler achieved this level of parallelism by unrolling loops and then percolating the operations up the path of instruction as far as they could go before they encountered interference.

More vexing are questions of adaptability. Although simplified decoding electronics leads to significant gains in speed, simplified decoders do not have the ability to adapt as well to dynamic run-time situations, such as those you encounter when a branch instruction executes. Even more important, because a VLIW compiler must know the details of the microarchitecture of a target chip, any code that it produces will run well only on the target chip. In a pure VLIW world, moving from one generation of a processor family to another one means that you have to recompile all your code.

It is possible to design an instruction set in which the number of instructions per word varies from chip implementation to chip implementation and that does not require recompilation. What is unknown is how much complexity this introduces in the processor implementations. Will maintaining binary compatibility across VLIW generations mean trading the devil we know--hardware scheduling--for one we don't know?

One thing is certain: History shows that users place a great deal of emphasis on binary compatibility. The initial success of sales of the Power Macs, Apple's RISC-based Macintosh systems, is in part due to the fact that these computers run existing CISC binaries. In fact, users accepted some loss in performance in exchange for binary compatibility and the promise of faster native applications down the road. Any planned VLIW implementation will have to take binary compatibility into serious consideration, despite the risks.

Why VLIW?

Given the unknowns, there is reason to wonder why HP and Intel chose to stake their CPU futures on VLIW. The key may be that the chips that come out of this agreement must be able to run Intel x86 CISC instructions and run them just as fast as, or even faster than, products from competing x86 vendors, such as AMD and Cyrix. One compelling viewpoint is that CISC instructions are essentially several RISC instructions bundled into one--that is, low-rent VLIW. The PUSH instruction, for instance, both accesses memory and decrements a pointer. VLIW provides a natural way to split up the CISC instructions into the basic RISC-like operations that would then be executed by the different logical units of the VLIW machine.
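To make the PUSH example concrete, here is a toy sketch of splitting one CISC-style instruction into two RISC-like operations. Everything in it is an assumption made for illustration: the micro-operation names (`sub_imm`, `store`), the register names, the 4-byte word size, and the dictionary-based machine model. It says nothing about how HP and Intel actually plan to decode x86 code.

```python
from typing import Dict, List, Tuple

# A micro-operation is (name, register, operand); the encoding is hypothetical.
MicroOp = Tuple[str, str, str]


def expand_push(src_reg: str, word_size: int = 4) -> List[MicroOp]:
    """Split PUSH src_reg into a pointer decrement and a memory store."""
    return [
        ("sub_imm", "SP", str(word_size)),  # decrement the stack pointer
        ("store", "SP", src_reg),           # write src_reg to memory at [SP]
    ]


def run(micro_ops: List[MicroOp], regs: Dict[str, int], memory: Dict[int, int]) -> None:
    """Execute micro-operations one at a time on a toy register file and memory."""
    for name, reg, operand in micro_ops:
        if name == "sub_imm":
            regs[reg] -= int(operand)
        elif name == "store":
            memory[regs[reg]] = regs[operand]
        else:
            raise ValueError(f"unknown micro-operation: {name}")


if __name__ == "__main__":
    regs = {"SP": 100, "AX": 7}
    memory: Dict[int, int] = {}
    run(expand_push("AX"), regs, memory)
    print(regs, memory)
```

In this sketch the store uses the updated pointer, so the two operations depend on each other; a scheduler would have to account for that when assigning them to logical units.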

How to get there from here is unclear. If the chip devotes substantial resources to breaking up these CISC instructions, then it may be effectively introducing a large decoding operation that would nullify many of the reasons for using VLIW. The HP/Intel alliance might consider doing a one-time cross-compile for the x86 code that would do most of the translation ahead of time, but this would create substantial headaches for the base of installed software and users. Equally important, there is no indication as to how, in the brave new world of VLIW, the companies plan to make one generation of processors binary-compatible with the next. Finally, no one outside of HP and Intel knows how they plan to support three instruction sets (x86, PA-RISC, and native VLIW) on one chip.

The first fruits of the HP/Intel alliance won't be available until 1997 or 1998. Until then, questions will remain concerning the viability of VLIW as a mainstream commercial processor technology. The burden of proof is on HP and Intel. They say it can be done, but don't be surprised if Intel keeps a pure x86 project going on the side--just in case.


VLIW Technology



PRO
-- The compiler handles instruction interdependencies.
-- Faster clock speeds are possible.
-- Added execution units don't increase the complexity of the processor.
-- Similarities to CISC may provide better x86 performance.
CON
-- Very intelligent and complex compilers are required.
-- The compilers work best when they are tuned to a specific microarchitecture.
-- There is less flexibility in handling dynamic run-time events.
-- There is no native software base.


Peter Wayner is a BYTE consulting editor based in Baltimore, Maryland. In the past, he worked at IBM's T. J. Watson Research Center on a VLIW compiler. You can reach him on the Internet or BIX at pwayner@bix.com.

Copyright © 2005 CMP Media LLC