Will VLIW mean ``very long investment window'' for Hewlett-Packard and Intel?
Peter Wayner
Over the last 10 years, the notion of RISC made its way from the labs of computer architects to the word processors of the marketplace. Along the way, it brought great performance gains to the companies that invested in it (Hewlett-Packard, IBM, Mips, Motorola, Sun, and, to some extent, everyone else) and cemented itself as the guiding philosophy for microprocessor design. And now, just as RISC has won the mind-share war over CISC, along come Intel and HP to roil the waters.
The hot news in Silicon Valley is HP and Intel's announced plan to jointly design a new chip that will run both Intel x86 software and HP Precision Architecture code. Just as important, the companies describe the technology they plan to use as ``post-RISC.'' Based on the fact that HP had already announced its interest in VLIW (very long instruction word) and that it has many engineers on board from VLIW vendor Multiflow, informed opinion is that post-RISC equates to VLIW.
VLIW is a logical extension of RISC. Like a superscalar RISC processor, a VLIW machine executes several simple operations at a time. The difference is where you put the smarts to deal with the dependency issues that arise when you perform several operations in parallel. With VLIW, the smarts come from the compiler, which is responsible for packing many simple instructions into one long instruction word. VLIW compilers are responsible for determining which instructions depend on others. For instance, the compiler can put R1+R2->R3 and R4+R5->R6 together into the same instruction word, because they do not use the same registers. It cannot bundle R1+R2->R3 and R3+R4->R5 together because the second instruction needs to wait for the results of the first to be posted to R3.
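The register check described above can be sketched in a few lines of Python. This is a toy model, not the algorithm of any real VLIW compiler: the `Op` type, the `conflicts` test, the greedy in-order `pack` routine, and the two-operation word width are all assumptions made for illustration. The conflict test is deliberately conservative and treats any shared register that involves a write as a reason to keep two operations apart.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """One simple operation: read the `srcs` registers, write `dest`."""
    dest: str
    srcs: Tuple[str, ...]


def conflicts(earlier: Op, later: Op) -> bool:
    """True if `later` may not share an instruction word with `earlier`."""
    reads_result = earlier.dest in later.srcs    # later needs earlier's result
    same_target = earlier.dest == later.dest     # both write the same register
    overwrites_input = later.dest in earlier.srcs
    return reads_result or same_target or overwrites_input


def pack(ops: List[Op], width: int = 2) -> List[List[Op]]:
    """Greedy, in-order packing of operations into instruction words.

    An operation joins the current word only if the word has a free slot
    and the operation conflicts with nothing already in it.
    """
    words: List[List[Op]] = []
    for op in ops:
        current = words[-1] if words else None
        if (current is not None and len(current) < width
                and not any(conflicts(prev, op) for prev in current)):
            current.append(op)
        else:
            words.append([op])
    return words


first = Op("R3", ("R1", "R2"))
independent = Op("R6", ("R4", "R5"))
dependent = Op("R5", ("R3", "R4"))

print(len(pack([first, independent])))  # one word holds both operations
print(len(pack([first, dependent])))    # two words: the second op must wait
```

A production scheduler would also reorder operations and account for latencies and functional-unit types; the sketch only shows the bundling decision itself.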
The Parallel Question
VLIW is a new way to attack an old problem. Scalar RISC and CISC chips that employ pipelining have to deal with many of the same problems of inter-instruction dependency. Like VLIW compilers, compilers for pipelined processors try to rearrange code and spread out interdependent instructions so they do not follow each other down the pipeline. If this isn't done, the CPU must wait until the first instruction is finished before executing the second, and this delay largely destroys the value of the pipeline. The overriding difference between the approaches lies in which piece of the puzzle--the compiler or the chip--takes primary responsibility for instruction scheduling. Conventional technology says the chip does the final, real-time scheduling; VLIW says to leave that job to the compiler.
This debate was common in the mid-1980s when computer architects had to decide the next natural path to take to speed up basic RISC machines. At that time, heavily pipelined machines that handled dependencies in hardware were easier to build. VLIW machines required constructing multiple logic units to handle the extra instructions packed into a wider word. That meant committing a substantial piece of silicon real estate--especially if a logic unit had to handle something like integer multiplication.
Deep pipelines for RISC machines, on the other hand, can be built by finding a way to split up the stages of the computation into smaller stages. The basic tasks of fetching the information, decoding the instruction, performing the computation, and returning the value are natural choices for pipeline stages. These simple four-stage pipelined machines can, in theory, execute four times as many instructions in a given time as a nonpipelined processor can, as long as the interdependence between instructions does not delay the execution. The pipelined approach won out in the end because it was doable in the transistor budgets of the day. As evidence of this success, today you find some RISC processors whose pipelines have five or six stages.
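The four-times figure follows from simple cycle counting, sketched below. The model is an idealization and rests on assumptions not stated in the text: every stage takes exactly one clock cycle, a nonpipelined processor spends one cycle per stage on each instruction, and dependency delays are lumped into a single `stall_cycles` count. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, stages: int = 4) -> int:
    """Each instruction passes through every stage before the next begins."""
    return n_instructions * stages


def pipelined_cycles(n_instructions: int, stages: int = 4,
                     stall_cycles: int = 0) -> int:
    """The first instruction fills the pipeline; after that, one instruction
    completes per cycle, plus any cycles lost to dependency stalls."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions: int, stages: int = 4,
            stall_cycles: int = 0) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return (unpipelined_cycles(n_instructions, stages)
            / pipelined_cycles(n_instructions, stages, stall_cycles))


# With no stalls, the speedup approaches the number of stages.
print(round(speedup(1000), 2))
# If every instruction costs one stall cycle, about half the gain is lost.
print(round(speedup(1000, stall_cycles=1000), 2))
```

Under these assumptions, the speedup approaches the stage count only for long, stall-free instruction streams, which is why dependencies between neighboring instructions matter so much.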
As budgets increased, designers started putting multiple execution units on-chip--the superscalar approach--but left the work of handling most dependencies to hardware. They did this because one of the most important advantages to the hardware approach is that any code created for one generation of an architecture can still be used in the next generation, which might have a different, better pipeline or a different number and mix of functional units. Although such code might benefit from recompiling, the precise FIFO (first-in/first-out) ordering enforced a simple discipline that was easy to maintain across generations. This is a major issue in an age when people are still running software written for their original Macintosh or PC on the latest, greatest machines.
The Price We Pay
The cost of hardware scheduling and its inherent intergenerational flexibility is complexity. The decode/issue logic must be very intelligent to filter out problems created by running older code on a newer processor or by running scalar code on a superscalar processor. The number of transistors required to implement this level of intelligence is substantial--witness the complex instruction tracking mechanisms used in the PowerPC 620, the AMD K5, and the Mips T5, for example--and the time it takes to execute this work also adds significant overhead to the pipeline. Simpler decode and issue stages would permit clock rates to soar, as these stages normally have the longest latency in current superscalar and superpipelined processors.
This is the promise of VLIW: By removing complexity from the hardware, you create simple processors that let you increase performance far more simply than you can with current processors. On the one hand, simple hardware lets you increase clock speeds more aggressively than is possible with today's complex RISC chips. On the other, you can easily add more functional units to wring out all the parallelism that exists in your code.
If VLIW machines are to work well, they require smart compilers that are responsible for identifying which operations can run in parallel. This decision is made at compile time and frozen in place when the operations are packed into instruction words. In essence, the compiler makes many of the interference decisions that are currently made on the run by the decoding stage of a pipelined, superscalar processor.
Compiler Imperatives
Is compiler technology ready for VLIW? There certainly has been no lack of research on the topic. For example, in the mid-1980s, IBM sponsored a research project to develop a test VLIW machine. The research-grade compiler used with it was able to find as many as 10 operations to run concurrently--and this was in nonscientific code. The compiler achieved this level of parallelism by unrolling loops and then percolating the operations up the path of instruction as far as they could go before they encountered interference.
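To see why unrolling helps, consider a toy model in which each operation is hoisted to the earliest instruction word in which all of its inputs are ready. This is only an illustration of the general idea, not the IBM compiler's algorithm. It assumes registers have already been renamed so that only true read-after-write dependencies matter, and the three-operation loop body (load, multiply, store) and all names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# An operation is (destination, sources); names are symbolic registers.
Operation = Tuple[str, Tuple[str, ...]]


def earliest_words(ops: List[Operation]) -> List[int]:
    """Place each op in the earliest word after the ops that produce its inputs.

    Only read-after-write dependencies are tracked (renaming is assumed).
    """
    ready_after = {}   # register name -> word in which it is written
    placement = []
    for dest, srcs in ops:
        word = 1 + max((ready_after.get(s, 0) for s in srcs), default=0)
        ready_after[dest] = word
        placement.append(word)
    return placement


def unrolled_body(factor: int) -> List[Operation]:
    """`factor` copies of a loop body: load b[i], multiply by c, store a[i]."""
    ops: List[Operation] = []
    for i in range(factor):
        ops.append((f"t{i}", (f"b{i}",)))       # load
        ops.append((f"u{i}", (f"t{i}", "c")))   # multiply
        ops.append((f"a{i}", (f"u{i}",)))       # store
    return ops


def widest_word(ops: List[Operation]) -> int:
    """Largest number of operations that land in the same instruction word."""
    return max(Counter(earliest_words(ops)).values())


print(widest_word(unrolled_body(1)))  # a single iteration is a serial chain
print(widest_word(unrolled_body(4)))  # four iterations run side by side
```

In this model a single iteration offers no parallelism at all, while each additional unrolled copy adds one more operation to every word, provided the iterations are independent of one another.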
More vexing are questions of adaptability. Although simplified decoding electronics leads to significant gains in speed, simplified decoders do not have the ability to adapt as well to dynamic run-time situations, such as those you encounter when a branch instruction executes. Even more important, because a VLIW compiler must know the details of the microarchitecture of a target chip, any code that it produces will run well only on the target chip. In a pure VLIW world, moving from one generation of a processor family to another one means that you have to recompile all your code.
It is possible to design an instruction set in which the number of instructions per word varies from chip implementation to chip implementation and that does not require recompilation. What is unknown is how much complexity this introduces in the processor implementations. Will maintaining binary compatibility across VLIW generations mean trading the devil we know--hardware scheduling--for one we don't know?
One thing is certain: History shows that users place a great deal of emphasis on binary compatibility. The initial sales success of the Power Macs, Apple's RISC-based Macintosh systems, is in part due to the fact that these computers run existing CISC binaries. In fact, users accepted some loss in performance in exchange for binary compatibility and the promise of faster native applications down the road. Any planned VLIW implementation will have to take binary compatibility into serious consideration, despite the risks.
Why VLIW?
Given the unknowns, there is reason to wonder why HP and Intel chose to stake their CPU futures on VLIW. The key may be that the chips that come out of this agreement must be able to run Intel x86 CISC instructions and run them just as fast as, or even faster than, products from competing x86 vendors, such as AMD and Cyrix. One compelling viewpoint is that CISC instructions are essentially several RISC instructions bundled into one--that is, low-rent VLIW. The PUSH instruction, for instance, both accesses memory and decrements a pointer. VLIW provides a natural way to split up the CISC instructions into the basic RISC-like operations that would then be executed by the different logical units of the VLIW machine.
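As a rough illustration of this view, the sketch below splits PUSH and POP into simpler operations. Nothing here reflects how HP and Intel actually plan to handle x86 code: the `crack` function, the tuple encoding, and the primitive operation names (SUB, ADD, STORE, LOAD) are all hypothetical, and a 4-byte stack slot is assumed.

```python
from typing import List, Tuple

SLOT_BYTES = 4  # assumed stack slot size (32-bit operands)


def crack(instruction: Tuple[str, str]) -> List[tuple]:
    """Split one CISC-style stack instruction into RISC-like operations.

    The returned tuples use a made-up encoding:
      ("SUB"/"ADD", dest, src, constant)  adjusts a register
      ("STORE", value_reg, address_reg)   writes memory
      ("LOAD", dest_reg, address_reg)     reads memory
    """
    opcode, reg = instruction
    if opcode == "PUSH":
        # Decrement the stack pointer, then store the register at the new top.
        return [("SUB", "ESP", "ESP", SLOT_BYTES), ("STORE", reg, "ESP")]
    if opcode == "POP":
        # Load from the top of the stack, then increment the stack pointer.
        return [("LOAD", reg, "ESP"), ("ADD", "ESP", "ESP", SLOT_BYTES)]
    raise ValueError(f"no decomposition sketched for {opcode}")


for operation in crack(("PUSH", "EAX")):
    print(operation)
```

Whether the resulting operations can share an instruction word depends on how their dependencies are handled, which the sketch does not attempt to model.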
How to get there from here is unclear. If the chip devotes substantial resources to breaking up these CISC instructions, then it may be effectively introducing a large decoding operation that would nullify many of the reasons for using VLIW. The HP/Intel alliance might consider doing a one-time cross-compile for the x86 code that would do most of the translation ahead of time, but this would create substantial headaches for the base of installed software and users. Equally important, there is no indication as to how, in the brave new world of VLIW, the companies plan to make one generation of processors binary-compatible with the next. Finally, no one outside of HP and Intel knows how they plan to support three instruction sets (x86, PA-RISC, and native VLIW) on one chip.
The first fruits of the HP/Intel alliance won't be available until 1997 or 1998. Until then, questions will remain concerning the viability of VLIW as a mainstream commercial processor technology. The burden of proof is on HP and Intel. They say it can be done, but don't be surprised if Intel keeps a pure x86 project going on the side--just in case.
VLIW Technology
PRO
-- The compiler handles instruction interdependencies.
-- Faster clock speeds are possible.
-- Added execution units don't increase the complexity of the processor.
-- Similarities to CISC may provide better x86 performance.
CON
-- Very intelligent and complex compilers are required.
-- The compilers work best when they are tuned to a specific microarchitecture.
-- There is less flexibility in handling dynamic run-time events.
-- There is no native software base.
Peter Wayner is a BYTE consulting editor based in Baltimore, Maryland. In the past, he worked at IBM's T. J. Watson Research Center on a VLIW compiler. You can reach him on the Internet or BIX at pwayner@bix.com.