New computers using multiple CPUs may increase throughput considerably --providing the compiler understands how to build code for the specific machine
Oliver Sharp
Two heads are better than one--or at least they can be. With this in mind, computer designers and programmers have long sought to harness processors together so that time-consuming applications will execute faster. During the past few decades, researchers and manufacturers have created wildly different parallel-processing architectures. This has kept life interesting for applications developers, who must learn how to use the new machines. The primary burden, however, falls on systems developers and compiler makers to create tools that exploit the new capabilities.
Four main types of parallel machines are VLIW
(very long instruction word), distributed-memory, shared-memory, and data-parallel (see the text box "Types of Parallel Machines"). Each type has its own set of implications for compiler design and use.
A variety of programming models have been proposed for building parallel applications. In some cases, the programmer does all the hard work, while in others, the compiler is responsible. Some models leave the parallelism exposed and obvious, some force the programmer to express it manually, and others require the compiler to dig it out. The same holds for scheduling onto the available processors.
Dusty Deck Model
The simplest solution is the dusty deck model, the one programmers like best. It provides a -parallel option on the compiler. If you feed an existing program through the compiler, efficient code pops out for your parallel machine. This idea is the holy grail of parallel compilation and has been enthusiastically pursued by the research community. Unfortunately, the results have not be
en encouraging. (The term dusty deck is a rueful homage to the thousands of huge, time-consuming applications, written on punched cards in languages such as FORTRAN, which have been running for decades.)
Recompiling these dusty decks is hard because it is difficult to figure out what they do, how to transform them, and how to decompose them for good parallel performance. Before the compiler can exploit parallelism, it needs to know how the existing code behaves. This is a major problem with most languages now in use. It's easy to translate array operations into correct executable code but much harder to summarize array behavior at a higher level.
In a complicated loop, for example, the compiler is faced with dozens of lines of code that invoke procedures, walk over arrays, and jump around unpredictably. What data does each iteration touch? Could iterations be executed in parallel? How balanced is the cost of execution across iterations? These are hard questions to answer.
Nor is it enough
to just understand the code as it stands. Anyone who has "parallelized" applications knows that good performance usually requires changing the code's behavior. Some parts must be recoded entirely, data structures modified, or array traversal changed. And compilers aren't good at any of those tasks.
Even after you have fully analyzed and transformed a program, you aren't done. You must still decompose operations for the parallel architecture, making scheduling decisions and embedding them into the executable file (see the text box "Converting an Application for Parallel Processing"). The conversion requires the programmer, the compiler, or both to make a lot of nontrivial decisions about how to break up the problem.
Although the fully automatic approach hasn't worked well, researchers have developed many systems for interactively analyzing, transforming, and decomposing sequential applications. These make the task less onerous but are far from being turnkey solutions that shield the programmer f
rom the complexities of parallelism.
Message-Passing Libraries
Message-passing libraries move you from one extreme to the other. With them, you forget the compiler and manage the parallelism yourself. Most applications that run on distributed-memory machines today use message-passing libraries that give the programmer full control over decomposition and scheduling. These libraries provide routines to distribute tasks to processors and send data back and forth across the machine. Some libraries support features such as asynchronous communication; you hand them a callback routine that will be invoked the next time a message arrives.
The compiler doesn't do much with these kinds of applications; they are treated like ordinary sequential programs. Writing an application using message passing is like programming in assembly language; you get control of the system whether or not you want it. You also spend much time manually shipping data around the system.
Debugging becomes a terrible job.
A program's behavior can change every time it runs. Worse, many bugs depend on timing. If a message arrived in time, the program reads correct data; if not, you get garbage. Also, the bug may occur on some runs but not others. Adding debugging code can change the timing enough to mask it. If you take the test code out, the bug reappears; this can be maddening.
The Shared-Memory Model
The shared-memory model is the usual way to program a shared-memory machine. A copy of your executable file runs on each processor; all share the same data. They communicate with one another the same way multiple threads do; they can set or release locks or use semaphores.
A "bag of tasks" is a simple strategy that's easy to code with the shared-memory model. Divide the work into separate tasks and put the tasks into a bag, or a list. The different processors draw work from the list; when one processor finishes a task, it picks up another from the list. To access the list safely, you must use a semaphore to reque
st exclusive access. When the request returns, you own the list and can remove the first item on it. You release the lock and start on the task.
The advantage to this model is that the programmer doesn't have to move data. Just by referring to a memory location, you arrange for the data to be available to your processor. But don't ignore locality; even on a shared-memory architecture, there is a cost to data transfer, and you get poor performance if you don't pay attention to it.
Debugging is easier on a shared-memory model, because data is immediately accessible, and programs are less cluttered with explicit communication routines. Still, timing-dependent behavior is a serious problem.
As with message-passing libraries, the compiler usually doesn't help much. The programmer has to assign work to processors and manage the interactions, although the code is more compact than for a distributed-memory application. Some researchers have looked at using compiler transformations to improve perf
ormance; the problem is related to cache management on a sequential machine.
While this model is most commonly available on shared-memory machines, an ambitious goal is to supply it on other architectures. Here, the compiler becomes an integral player because a simple-minded implementation would yield dismal performance.
Hints from the Programmer
This method lets the programmer help the compiler. That's the idea behind a number of recent languages, including FORTRAN D, which provides keywords that let the programmer specify how to allocate data across a parallel machine.
Every processor gets a copy of the program and execute it together. Based on the programmer's directives, a processor owns some of the data. On a four-processor machine, for example, each CPU might own every fourth column of a large matrix.
As the program executes, the system decides which processor performs each computation according to the "owner-computes" rule. When the value of a computation is assigned to a
specific memory location, that location's owner performs the operation. Other processors must send any required data to the location's owner.
Now under construction, the FORTRAN D compiler faces the hard task of minimizing communications overhead. A modern processor can execute a floating-point operation in a tenth of a microsecond, but a message between processors takes much longer--tens or hundreds of microseconds. It is impractical for each instruction to wait for a message to provide data. The solution is for the compiler to look at a block of code (typically a loop) and know beforehand what data each owner will require. Before the application starts executing that code block, it arranges for each processor to make a bundle of data and send it to the owners that need it. Each owner can perform its computations without further communication. Because the programmer can easily produce code that is hard to analyze, the FORTRAN D user must learn to write programs the compiler can handle effectively.
Data-Parallel Programming Model
The data-parallel programming model provides the programmer with primitive operations that can be applied to an entire set of data at once. By composing these operations, you can perform complex and powerful manipulations of large amounts of data in a few lines of code. If you are familiar with APL, you have a good idea of what data-parallel programs can look like.
Data parallelism is appealing for many reasons. Since all the parallelism comes from the operations, the application has only one thread of control and is much simpler to debug. Programming languages look similar to their sequential cousins, although they need special features for defining data structures and parallel operations. There is one fly in the ointment, however: You must make sure that your program does most of its computation using data-parallel operations, because the rest of the time, you are just executing sequentially. Most existing applications must be completely redesigned, and many wi
ll not adapt well to a data-parallel style.
Compiling data-parallel languages can be easy or challenging. If your target machine is data parallel, you can design the language to provide the same primitives as the architecture. Compilation is a snap; you simply translate each data-parallel operation in the program into the corresponding machine-language instruction.
On the other hand, compiling a data-parallel program for a different architecture can be difficult. The figure "Stencil for Computing the Value of A[2,3]" shows a simple data-parallel operation in which each array element is computed based on the value of itself and of the two elements below it. The equivalent in a conventional language would be as follows:
for i = 1 to n-2
for j = 1 to n
A[i,j]=f(A[i,j], A[i+1,j],
A[i+2,j])
The figure also shows a shaded area representing the data used to compute A[2,3]. This data-parallel operation is very regular; the shape of the shaded area will be the same regard
less of the array element being computed. The most important task in compiling a data-parallel computation is to figure out that shape, which is called the stencil.
Here, the stencil includes no neighbors to the left or right, showing that columns are independent of each other. There is no advantage to allocating two neighboring columns to the same processor. Rows are a different story, however; assigning rows randomly to processors would be an extremely bad allocation strategy.
The stencil tells the compiler how much communication will be required by any data decomposition. The compiler must minimize communication overhead and balance load. However, compilers can deal with data-parallel models much more easily than conventional programs because the stencil can usually be determined automatically with good accuracy. Languages like FORTRAN often make it very hard (and sometimes impossible) to determine the stencil.
Other Language Models
Conventional languages are hard to analyze for par
allelism, and their storage model is mathematically inelegant. Why not toss them out entirely and start over? Many language designers feel that way. One school favors functional languages like Haskell and ML; it is easy to find parallelism in them. Others rely on unification with logic-programming languages inspired by Prolog. So far, none of these alternatives has become widely adopted for parallel programming, but improved implementations in the future may gain converts.
The Future of Parallel Compilation
It isn't easy to produce a program that runs efficiently on a parallel machine. If programming continues unchanged, compilers are unlikely to solve the problem automatically. Researchers will try to help by offering tools that ease the transition; they will also do their best to offer effective new programming models. There's nothing computer scientists like better than designing languages; they won't give up.
Figure: Stencil for Computing the Value of A[2,3]
An example of a simple da
ta-parallel operation.
Oliver Sharp works for Heuristicrats Research in Berkeley, California. He is completing his Ph.D. at the University of California at Berkeley, investigating compilation for parallel architectures. You can reach him on the Internet at
oliver@heuristicrat.com
or on BIX c/o "editors."