
The Word on VLIW


April 1996 / Features / The Word on VLIW

Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC

Dick Pountain

The dust has barely settled in the CISC vs. RISC battle (late score: CISC won by stealing RISC's clothes). The next big one is between very long instruction words (VLIWs) and RISC. While VLIW ideas have been around since the dawn of computing--Turing designed a VLIW computer in 1946--none has been commercially successful. Yet now an Intel/Hewlett-Packard partnership intends to exploit VLIW ideas in next-generation processors.

Can even these industry giants make the concept viable? Maybe not, because VLIW, though promising massive speed gains, involves moving intelligence out of hardware and into the compiler. Success becomes a software problem--and that's a problem.

VLIW: Hardware plus Software

VLIW represents the ultimate in internal parallelism in microprocessor designs. You can do two things to make a microprocessor run faster: Speed up its clock or make it perform more operations during each clock cycle. Speeding up the clock requires inventing ever-faster (read: smaller) fabrication processes and adopting architectural features such as deep pipelines to keep the silicon busy. Performing more operations per cycle means both building multiple function units on the same chip and executing enough instructions concurrently--and safely--to keep those units busy.

Safely in this context means producing the correct result. For example, consider two expressions that have a data dependency, such as A := B + C and B := D + E. The value of variable A differs depending on which executes first--and only one of these is what the programmer intended. If you execute these expressions in parallel, how do you guarantee the right result?
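
The hazard is easy to reproduce. The following is a minimal Python sketch; the starting values for B, C, D, and E are arbitrary and chosen only for illustration, and the names `run`, `S1`, and `S2` are not from the article.

```python
def run(order):
    """Execute the two assignments in the given order and return A."""
    env = {"B": 1, "C": 2, "D": 10, "E": 20}  # arbitrary illustrative values
    statements = {
        "S1": lambda e: e.update(A=e["B"] + e["C"]),  # A := B + C
        "S2": lambda e: e.update(B=e["D"] + e["E"]),  # B := D + E
    }
    for name in order:
        statements[name](env)
    return env["A"]


a_program_order = run(["S1", "S2"])  # S1 reads the old value of B
a_swapped_order = run(["S2", "S1"])  # S1 reads the B that S2 just wrote
print(a_program_order, a_swapped_order)
```

With these values the two orders leave 3 and 32 in A, so a scheduler that issues both statements at once has to make sure the first one still sees the original B.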

This scheduling problem is the crux of modern processor design. Superscalar processors such as Intel's Pentium and Pentium Pro (P6) or HP's PA8000 employ special hardware (and lots of it) to uncover instruction dependencies. The Pentium Pro's reorder buffer is one example. However, this approach goes only so far, since the amount of scheduling hardware grows geometrically with the number of function units and eats up more chip real estate. Superscalar design already bogs down at around five or six instructions dispatched per cycle.

The alternative is to let software do all the scheduling, and that's precisely what a VLIW design does. A smart compiler can examine a program, find all instructions with no dependencies, string them together in very long batches, and execute them concurrently on an equally big array of function units. Very long instructions are typically between 256 and 1024 bits wide. Such meta-instructions contain many smaller fields, each of which directly encodes an operation for a particular function unit (see the figure "Inside a VLIW Processor").
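
As a rough illustration of the packing step, here is a greedy sketch in Python. It is not any vendor's algorithm: it assumes every operation takes one cycle, that any slot can hold any operation, and that `deps` already lists every ordering constraint. The function name and the example operations are hypothetical.

```python
def pack_into_words(ops, deps, width):
    """Greedy packing of operations into long instruction words.

    ops   -- operation names in program order (producers before consumers)
    deps  -- maps an op to the ops that must complete before it
    width -- number of operation slots per long instruction word
    """
    word_of = {}  # op -> index of the word it was placed in
    words = []    # each word is a list of ops issued together
    for op in ops:
        # Earliest legal word is one past the latest producer
        # (unit latency is an assumption of this sketch).
        earliest = max((word_of[d] + 1 for d in deps.get(op, ())), default=0)
        w = earliest
        while w < len(words) and len(words[w]) >= width:
            w += 1  # that word is full, try the next one
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    return words


ops = ["t1", "t2", "t3", "t4", "t5"]
deps = {"t3": {"t1", "t2"}, "t5": {"t3", "t4"}}
words = pack_into_words(ops, deps, width=4)
print(words)
```

Here t1, t2, and t4 share the first word, t3 waits for the second, and t5 for the third. Most slots stay empty, which is why the compiler has to hunt for more independent operations than a single basic block usually offers.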

In hardware terms a VLIW processor is very simple, consisting of little more than a collection of function units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches. This is good news for semiconductor manufacturers for two reasons. First, more silicon goes to the actual processing (rather than being spent on branch prediction, for example), so you get more bang for the buck. Second, a VLIW processor should run fast, as the only limit is the latency of the function units themselves.

Another attraction to firms like Intel: VLIW may implement old CISC instruction sets more effectively than RISC can. Why? Because programming a VLIW chip is very much like writing microcode. Back when memory was expensive, you could conserve program size by using complex instructions, like the 8086's STOS and LODS (indirect store and load). CISC implements such instructions as microprograms in a microcode ROM on the chip. Microcode is the ultimate low-level language: synchronizing gates and buses and passing data between function units.

RISC eliminated microcode in favor of hard-wired instructions. VLIW, on the other hand, is like taking that microcode off the chip and putting it into the compiler. As a result, it should be possible to emulate 80x86 instructions like STOS very efficiently as a set of macros.

The trouble is that writing microcode is unbelievably hard. VLIW becomes viable only if a smart compiler can write it for you. This difficulty has thus far confined VLIW machines to niches such as scientific array processing and signal processing (see the sidebar "Short History of Long Instructions").

VLIW Compiler Techniques

Behind the renewed interest in VLIW architectures for general-purpose computing lie significant advances in compiler design over the last decade. A VLIW compiler packs groups of independent operations into very long instruction words in a way that uses all the function units efficiently during each cycle. The compiler discovers all the data dependencies, then determines how to resolve these dependencies--probably reordering the whole program by moving blocks of code around.

This process differs from that of a superscalar CPU, which uses special hardware to determine dependencies dynamically at run time. (Optimizing compilers can certainly improve the performance of a superscalar CPU, but the CPU does not depend upon them.) Most superscalar processors will detect dependencies, and schedule parallel execution, only within basic blocks (a group of consecutive statements with no halting or branching except at the end). Some reordering systems, such as those in the Pentium Pro and PA8000, are beginning to reach further afield.

To find more parallelism, a VLIW machine must look for operations from different basic blocks to pack into the same instruction. Trace scheduling is a common technique to do this.

A trace is a possible path through a program--the way execution may go for some set of input data. A trace scheduling compiler optimizes at the level of whole traces rather than basic blocks. For VLIW, as for RISC, branching is the enemy of efficient execution: Typical nonscientific code contains a branch about every six instructions.

While RISC predicts branches with hardware, VLIW leaves it up to the compiler. The compiler, in turn, uses information gathered by profiling the program (though future VLIW processors might add a little hardware to collect run-time branch statistics for the compiler). The compiler predicts the most likely trace and schedules it like a big basic block, then repeats this process for all other possible branch outcomes. The compiler may also perform other sophisticated code analyses and tricks, such as loop unrolling and IF-conversion (which temporarily removes all branches from the section being scheduled). Where a RISC might speculatively execute code, a VLIW compiler actually moves that code up before the predicted branch, while preserving enough program state to undo the moved code if necessary.
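
To make the trace-picking step concrete, here is a simplified Python sketch. It only follows the most frequently taken edge forward from an entry block. Production trace schedulers are more elaborate, and the block names and profile counts below are invented for illustration.

```python
def pick_trace(entry, edge_counts):
    """Follow the hottest outgoing edge from `entry` until the path ends
    or would revisit a block.

    edge_counts maps a basic block to {successor: times the edge was taken},
    as a profiling run might record it.
    """
    trace, seen = [entry], {entry}
    block = entry
    while edge_counts.get(block):
        successors = edge_counts[block]
        succ = max(successors, key=successors.get)
        if succ in seen:
            break  # stop at a loop back edge
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


profile = {  # hypothetical profile of an if/else diamond
    "entry": {"then": 90, "else": 10},
    "then": {"join": 90},
    "else": {"join": 10},
    "join": {},
}
hot_trace = pick_trace("entry", profile)
print(hot_trace)
```

The compiler would schedule entry, then, and join as if they were one long basic block, and afterward repeat the process for the path through else, adding compensation code where operations were moved across the branch.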

Proper VLIW hardware design can offer some support to the compiler. For example, a multiway branch operation allows several branches to go into a single wide instruction and execute during the same cycle. Also, conditionally executed operations, whose execution depends on the results of a previous operation, can replace many explicit software branches altogether.
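
Conditional execution can be modeled in a few lines of Python. This is a toy interpreter, not a description of any real instruction set: each operation carries an optional guard register, and an operation whose guard is false writes nothing. The names `run_predicated` and `abs_diff` are hypothetical.

```python
def run_predicated(ops, regs):
    """Execute ops in order. Each op is (guard, dest, fn); an op whose guard
    register holds False is squashed and leaves its destination untouched."""
    for guard, dest, fn in ops:
        if guard is None or regs[guard]:
            regs[dest] = fn(regs)
    return regs


def abs_diff(x, y):
    """Branch-free form of: if x > y then r := x - y else r := y - x."""
    regs = {"x": x, "y": y}
    ops = [
        (None, "p",  lambda r: r["x"] > r["y"]),  # p  := x > y
        (None, "np", lambda r: not r["p"]),       # np := not p
        ("p",  "r",  lambda r: r["x"] - r["y"]),  # if p:  r := x - y
        ("np", "r",  lambda r: r["y"] - r["x"]),  # if np: r := y - x
    ]
    return run_predicated(ops, regs)["r"]


print(abs_diff(7, 3), abs_diff(3, 7))
```

Because the two guarded subtractions no longer sit behind a branch, a compiler is free to place them in the same long instruction word.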

The price to pay for VLIW's increased execution speed is much slower compilation and more expensive compilers. One of the few currently available, Archelon's Rocket C for Sun, costs $10,000.

The Downside of VLIW

VLIW faces other big obstacles. A VLIW compiler must have an intimate knowledge of the hardware details of its processor, down to the number of function units and even their individual latencies. So launching your next-generation CPU with more (or even just faster) units will probably break all the old software, which will require recompiling everything. Had the 486 forced everyone to throw out their 386 software, Intel's balance sheet would undoubtedly have reflected the change.
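
A small Python check shows why a schedule is tied to one set of latencies. It assumes a machine with no hardware interlocks, so a consumer that issues before its producer's result is ready simply reads a stale value. The operation names, cycle numbers, and both latency tables are hypothetical, and the sketch shows only the simplest case, a unit whose latency got longer.

```python
def schedule_is_safe(schedule, deps, latency):
    """Return True if every consumer issues no earlier than the cycle in
    which each of its producers' results becomes available.

    schedule -- op -> cycle in which the compiler issued it
    deps     -- op -> list of producer ops
    latency  -- op -> cycles from issue until the result can be read
    """
    return all(
        schedule[op] >= schedule[p] + latency[p]
        for op, producers in deps.items()
        for p in producers
    )


schedule = {"mul": 0, "add": 2}   # compiled for a 2-cycle multiplier
deps = {"add": ["mul"]}
old_chip = {"mul": 2, "add": 1}   # latencies the compiler assumed
new_chip = {"mul": 3, "add": 1}   # hypothetical successor, different multiplier

print(schedule_is_safe(schedule, deps, old_chip))
print(schedule_is_safe(schedule, deps, new_chip))
```

The same binary is correct on the first chip and wrong on the second, and nothing in the hardware would notice.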

VLIW advocates suggest a two-stage compilation process. All software would come in a hardware-independent intermediate code that translates into native code only after installation on the user's machine. The OSF's Architecture-Neutral Distribution Format (ANDF) shows that such a system can work. However, while cross-platform software is a desirable goal, PC software developers are often slow to adopt radically new technologies.

Another issue arises over the static nature of VLIW compiler optimizations. How well will such programs perform when faced with dynamic run-time events (such as waiting for I/O) unforeseen at compile time? VLIW arose to meet the needs of scientific number crunching, but it might prove less capable on the sorts of object-oriented and event-driven programs that are more common in the PC community. Not only that: How can you verify that a compiler performing such extensive transformations will preserve the correctness of your programs? The truth is, nobody knows. VLIW compilers are still primarily an objet de recherche.

So will the Intel/HP VLIW gamble pay off? They've already started to hedge their bets about moving to a purely VLIW architecture. Intel now intends to produce a version of the P7 that's a straight successor to the Pentium Pro, directly executing x86 instructions. HP will work on a VLIW version of P7 that emulates both x86 and PA-RISC instructions. Target speed: 1 billion instructions per second.

Should Intel/HP's VLIW adventure not pan out, it certainly won't be the first time--nor will it be the last. The intricacies of coordinating VLIW hardware and software offer challenges that have eluded researchers before. It should come as no surprise that the lure of ever-greater speed may sometimes lead down blind alleys.


Inside A VLIW Processor


A VLIW processor like the generic one illustrated above should execute eight operations per cycle on most cycles--with a 200-MHz clock it would be 50 to 100 percent faster than current superscalar chips. Unfortunately, such performance requires the compiler to know intimate hardware details, like the latency of each function unit.

A: Adding extra function units can increase performance (by reducing resource conflicts), with little effect on overall complexity. However, physical limits restrict such expansion: limited read and write ports onto the register file (which requires simultaneous access from all function units), and interconnections that rise geometrically with the number of function units. Also, the compiler must find enough parallelism in the program to warrant any extra units.

B: This hypothetical 256-bit-wide instruction word has eight operation fields, each one a traditional three-operand RISC-like instruction. In practice, extra bits may hold immediate values. Each operation field can directly drive a specific function unit with minimal decoding.
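
The arithmetic behind such a word is simple: 256 bits divided among eight fields gives 32 bits per operation. The Python sketch below packs and unpacks a word of that shape. The split of each field into four 8-bit subfields (opcode, destination, two sources) is purely an assumption made for illustration, as are all the function names.

```python
FIELD_BITS = 32       # 256 bits / 8 operation fields
FIELDS_PER_WORD = 8


def encode_op(opcode, dest, src1, src2):
    """Pack one three-operand operation into 32 bits.
    The 8-bit subfield layout is an assumption of this sketch."""
    for value in (opcode, dest, src1, src2):
        if not 0 <= value < 256:
            raise ValueError("subfield out of range")
    return (opcode << 24) | (dest << 16) | (src1 << 8) | src2


def pack_word(fields):
    """Concatenate eight 32-bit fields; slot 0 occupies the lowest bits."""
    if len(fields) != FIELDS_PER_WORD:
        raise ValueError("need exactly eight operation fields")
    word = 0
    for slot, field in enumerate(fields):
        word |= field << (slot * FIELD_BITS)
    return word


def unpack_slot(word, slot):
    """Extract (opcode, dest, src1, src2) for one function unit's slot."""
    field = (word >> (slot * FIELD_BITS)) & 0xFFFFFFFF
    return (field >> 24, (field >> 16) & 0xFF, (field >> 8) & 0xFF, field & 0xFF)


NOP = encode_op(0, 0, 0, 0)
word = pack_word([encode_op(1, 3, 4, 5)] + [NOP] * 6 + [encode_op(2, 9, 3, 3)])
print(unpack_slot(word, 0), unpack_slot(word, 7))
```

Each function unit only has to pull its own fixed slice out of the word, which is what keeps decoding minimal. The unused slots still occupy space, so sparse schedules inflate code size.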


Dick Pountain is a BYTE contributing editor based in London. You can reach him at dickp@bix.com.

Copyright © 2005 CMP Media LLC