Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC
Dick Pountain
The dust has barely settled in the CISC vs. RISC battle (late score: CISC won by stealing RISC's clothes). The next big one is between very long instruction words (VLIWs) and RISC. While VLIW ideas have been around since the dawn of computing--Turing designed a VLIW computer in 1946--none has been commercially successful. Yet now an Intel/Hewlett-Packard partnership intends to exploit VLIW ideas in next-generation processors.
In hardware terms a VLIW processor is very simple, consisting of little more than a collection of function units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches. This is good news for semiconductor manufacturers for two reasons. First, more silicon goes to the actual processing (rather than being spent on branch prediction, for example), so you get more bang for the buck. Second, a VLIW processor should run fast, as the only limit is the latency of the function units themselves.
Another attraction to firms like Intel: VLIW may implement old CISC instruction sets more effectively than RISC can. Why? Because programming a VLIW chip is very much like writing microcode. Back when memory was expensive, you could conserve program size by using complex instructions, like the 8086's STOS and LODS (indirect store and load). CISC implements such instructions as microprograms in a microcode ROM on the chip. Microcode is the ultimate low-level language: synchronizing gates and buses and passing data between function units.
RISC eliminated microcode in favor of hard-wired instructions. VLIW, on the other hand, is like taking that microcode off the chip and putting it into the compiler. As a result, it should be possible to emulate 80x86 instructions like STOS very efficiently as a set of macros.
The trouble is that writing microcode is unbelievably hard. VLIW becomes viable only if a smart compiler can write it for you. This difficulty has thus far confined VLIW machines to niches such as scientific array processing and signal processing (see the sidebar "Short History of Long Instructions").
VLIW Compiler Techniques
Behind the renewed interest in VLIW architectures for general-purpose computing lie significant advances in compiler design over the last decade. A VLIW compiler packs groups of independent operations into very long instruction words in a way that uses all the function units efficiently during each cycle. The compiler discovers all the data dependencies, then determines how to resolve these dependencies--probably reordering the whole program by moving blocks of code around.
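The packing step can be sketched in a few lines of Python. This is only an illustration of greedy list scheduling, not the algorithm of any particular VLIW compiler: the Op record, the unit names ("mem", "alu", "mul"), and the latencies are all assumptions invented for the example, and the dependency graph is assumed to be acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical operation record (not from any real compiler)."""
    name: str
    unit: str          # kind of function unit the operation needs
    deps: tuple = ()   # names of operations whose results it reads
    latency: int = 1   # cycles before its result can be used


def schedule(ops, units):
    """Greedily pack ops into instruction words, one word per cycle.

    `units` maps a unit kind to how many of that unit the machine has.
    Returns a list of words; each word is a list of op names, and an
    empty word is a cycle in which nothing could be issued.
    """
    names = {op.name for op in ops}
    for op in ops:
        if units.get(op.unit, 0) < 1:
            raise ValueError(f"no function unit of kind {op.unit!r}")
        if op.latency < 1:
            raise ValueError("latency must be at least one cycle")
        if not set(op.deps) <= names:
            raise ValueError(f"unknown dependency in {op.name!r}")

    ready_at = {}          # op name -> first cycle its result is usable
    pending = list(ops)
    words = []
    cycle = 0
    while pending:
        free = dict(units)  # every unit is available again each cycle
        word = []
        for op in list(pending):
            deps_ready = all(
                ready_at.get(d, float("inf")) <= cycle for d in op.deps
            )
            if deps_ready and free[op.unit] > 0:
                free[op.unit] -= 1
                ready_at[op.name] = cycle + op.latency
                word.append(op.name)
                pending.remove(op)
        # Nothing issued and nothing still in flight: no progress is possible.
        if not word and all(t <= cycle for t in ready_at.values()):
            raise ValueError("cyclic dependencies")
        words.append(word)
        cycle += 1
    return words


example_ops = [
    Op("load_a", "mem"),
    Op("load_b", "mem"),
    Op("mul", "mul", deps=("load_a", "load_b"), latency=2),
    Op("add", "alu", deps=("load_a",)),
    Op("store", "mem", deps=("mul", "add")),
]
example_units = {"mem": 2, "alu": 1, "mul": 1}

for n, word in enumerate(schedule(example_ops, example_units)):
    print(n, word)
```

Note the empty word in the third cycle: the store must wait for the two-cycle multiply, and with nothing else to issue the machine idles. Filling such holes is exactly why the compiler needs to look further afield for independent work.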
This approach differs from that of a superscalar CPU, which uses special hardware to determine dependencies dynamically at run time. (Optimizing compilers can certainly improve the performance of a superscalar CPU, but the CPU does not depend upon them.) Most superscalar processors will detect dependencies, and schedule parallel execution, only within basic blocks (a group of consecutive statements with no halting or branching except at the end). Some reordering systems, such as those in the Pentium Pro and PA8000, are beginning to reach further afield.
To find more parallelism, a VLIW machine must look for operations from different basic blocks to pack into the same instruction. Trace scheduling is a common technique to do this.
A trace is a possible path through a program--the way execution may go for some set of input data. A trace scheduling compiler optimizes at the level of whole traces rather than basic blocks. For VLIW, as for RISC, branching is the enemy of efficient execution: Typical nonscientific code contains a branch about every six instructions.
While RISC predicts branches with hardware, VLIW leaves it up to the compiler. The compiler, in turn, uses information gathered by profiling the program (though future VLIW processors might add a little hardware to collect run-time branch statistics for the compiler). The compiler predicts the most likely trace and schedules it like a big basic block, then repeats this process for all other possible branch outcomes. The compiler may also perform other sophisticated code analyses and tricks, such as loop unrolling and IF-conversion (which temporarily removes all branches from the section being scheduled). Where a RISC might speculatively execute code, a VLIW compiler actually moves that code up before the predicted branch, while preserving enough program state to undo the moved code if necessary.
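As a rough sketch of the trace-picking step only, the following Python follows the most probable branch out of each basic block, then repeats for the blocks left over. It is a deliberately simplified, forward-only version: the control-flow graph format, the block names, and the 0.9/0.1 probabilities are made up for illustration, and a real trace scheduler also has to insert compensation code where it moves operations across branches.

```python
def pick_trace(cfg, entry, scheduled=()):
    """Follow the likeliest successor from `entry` until the path ends
    or reaches a block that already belongs to a trace.

    `cfg` maps a block name to {successor: profiled probability}.
    """
    seen = set(scheduled)
    trace = []
    block = entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        succs = cfg.get(block, {})
        block = max(succs, key=succs.get) if succs else None
    return trace


def all_traces(cfg, entry):
    """Pick the most likely trace, then cover the remaining blocks."""
    scheduled = set()
    worklist = [entry]
    traces = []
    while worklist:
        start = worklist.pop(0)
        if start in scheduled:
            continue
        trace = pick_trace(cfg, start, scheduled)
        traces.append(trace)
        scheduled.update(trace)
        # Off-trace successors become starting points for later traces.
        for block in trace:
            for succ in cfg.get(block, {}):
                if succ not in scheduled:
                    worklist.append(succ)
    return traces


# Hypothetical if/else diamond; the profile says "then" runs 90% of the time.
example_cfg = {
    "entry": {"then": 0.9, "else": 0.1},
    "then": {"join": 1.0},
    "else": {"join": 1.0},
    "join": {},
}

print(all_traces(example_cfg, "entry"))
```

Each trace returned would then be handed to the scheduler as if it were one big basic block.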
Proper VLIW hardware design can offer some support to the compiler. For example, a multiway branch operation allows several branches to go into a single wide instruction and execute during the same cycle. Also, conditionally executed operations, whose execution depends on the results of a previous operation, can replace many explicit software branches altogether.
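To make conditionally executed operations concrete, here is a toy Python model of IF-converted code. The tuple format and register names are assumptions for the sketch, not any real instruction set. The source statement "if a > b then m = a - b else m = b - a" becomes three straight-line operations: one sets a predicate, and the other two are guarded by it.

```python
import operator


def run_predicated(ops, regs):
    """Run every operation in order, with no branches.

    Each op is (dest, function, source registers, guard). A guard is
    either None (always execute) or (predicate register, required value);
    a guarded op writes its result only when the predicate matches.
    """
    for dest, fn, srcs, guard in ops:
        if guard is None or regs[guard[0]] == guard[1]:
            regs[dest] = fn(*(regs[s] for s in srcs))
    return regs


# IF-converted form of: if a > b: m = a - b  else: m = b - a
abs_diff = [
    ("p", operator.gt, ("a", "b"), None),          # p = a > b
    ("m", operator.sub, ("a", "b"), ("p", True)),  # (p)  m = a - b
    ("m", operator.sub, ("b", "a"), ("p", False)), # (!p) m = b - a
]

print(run_predicated(abs_diff, {"a": 7, "b": 3})["m"])
print(run_predicated(abs_diff, {"a": 3, "b": 7})["m"])
```

Both guarded subtractions depend only on the predicate, not on each other, so a scheduler is free to place them in the same long instruction word.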
The price to pay for VLIW's increased execution speed is much slower compilation and more expensive compilers. One of the few currently available, Archelon's Rocket C for Sun, costs $10,000.
The Downside of VLIW
VLIW faces other big obstacles. A VLIW compiler must have an intimate knowledge of the hardware details of its processor, down to the number of function units and even their individual latencies. So launching your next-generation CPU with more (or even just faster) units will probably break all the old software, which will require recompiling everything. Had the 486 forced everyone to throw out their 386 software, Intel's balance sheet would undoubtedly have reflected the change.
VLIW advocates suggest a two-stage compilation process. All software would come in a hardware-independent intermediate code that translates into native code only after installation on the user's machine. The OSF's Architecture-Neutral Distribution Format (ANDF) shows that such a system can work. However, while cross-platform software is a desirable goal, PC software developers are often slow to adopt radically new technologies.
Another issue arises over the static nature of VLIW compiler optimizations. How well will such programs perform when faced with dynamic run-time events (such as waiting for I/O) unforeseen at compile time? VLIW arose to meet the needs of scientific number crunching, but it might prove less capable on the sorts of object-oriented and event-driven programs that are more common in the PC community. Not only that: How can you verify that a compiler performing such extensive transformations will preserve the correctness of your programs? The truth is, nobody knows. VLIW compilers are still primarily an objet de recherche.
So will the Intel/HP VLIW gamble pay off? They've already started to hedge their bets about moving to a purely VLIW architecture. Intel now intends to produce a version of the P7 that's a straight successor to the Pentium Pro, directly executing x86 instructions. HP will work on a VLIW version of P7 that emulates both x86 and PA-RISC instructions. Target speed: 1 billion instructions per second.
Should Intel/HP's VLIW adventure not pan out, it certainly won't be the first time--nor will it be the last. The intricacies of coordinating VLIW hardware and software offer challenges that have eluded researchers before. It should come as no surprise that the lure of ever-greater speed may sometimes lead down blind alleys.
A VLIW processor like the generic one illustrated above should execute eight operations per cycle on most cycles--with a 200-MHz clock it would be 50 to 100 percent faster than current superscalar chips. Unfortunately, such performance requires the compiler to know intimate hardware details, like the latency of each function unit.
A:
Adding extra function units can increase performance (by reducing resource conflicts), with little effect on overall complexity. However, physical limits restrict such expansion: limited read and write ports onto the register file (which requires simultaneous access from all function units), and interconnections that rise geometrically with the number of function units. Also, the compiler must find enough parallelism in the program to warrant any extra units.
B:
This hypothetical 256-bit-wide instruction word has eight operation fields, each one a traditional three-operand RISC-like instruction. In practice, extra bits may hold immediate values. Each operation field can directly drive a specific function unit with minimal decoding.
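A 256-bit word split into eight fields leaves 32 bits per operation. The Python sketch below packs and unpacks such a word. The split inside each field (an 8-bit opcode followed by three 8-bit register numbers) is purely an assumption made for the example, since the article does not give the bit layout, and the opcode values are arbitrary.

```python
FIELDS = 8        # operation fields per instruction word
FIELD_BITS = 32   # 256 bits / 8 fields


def encode_op(opcode, dest, src1, src2):
    """Build one 32-bit field. The 8/8/8/8 layout is hypothetical."""
    for value in (opcode, dest, src1, src2):
        if not 0 <= value < 256:
            raise ValueError("each sub-field must fit in 8 bits")
    return (opcode << 24) | (dest << 16) | (src1 << 8) | src2


def decode_op(field):
    """Split a 32-bit field back into (opcode, dest, src1, src2)."""
    return (
        (field >> 24) & 0xFF,
        (field >> 16) & 0xFF,
        (field >> 8) & 0xFF,
        field & 0xFF,
    )


def pack_word(fields):
    """Place field n in bits 32n..32n+31 of one 256-bit integer."""
    if len(fields) != FIELDS:
        raise ValueError("a word needs exactly eight operation fields")
    word = 0
    for slot, field in enumerate(fields):
        word |= field << (slot * FIELD_BITS)
    return word


def unpack_word(word):
    """Recover the eight 32-bit fields, slot 0 first."""
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (slot * FIELD_BITS)) & mask for slot in range(FIELDS)]


# Slot n would drive function unit n; unused slots hold a no-op (opcode 0).
NOP = encode_op(0, 0, 0, 0)
example_word = pack_word(
    [encode_op(1, 2, 3, 4), encode_op(5, 6, 7, 8)] + [NOP] * 6
)

print([decode_op(f) for f in unpack_word(example_word)][:2])
```

Because each slot sits at a fixed position, routing a field to its function unit is a matter of wiring rather than decoding, which is the point the figure makes.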
Dick Pountain is a BYTE contributing editor based in London. You can reach him at dickp@bix.com.