Imagine that you are trying to implement a simple climate model on a distributed-memory parallel machine. There is no need to describe climate modeling in detail here; you can look at parallelization without understanding much about the physics being modeled. This example outlines a simple-minded approach to the problem and explains why it may or may not perform well.
Creating the Model
First, divide up the earth's surface using a 2-D grid. Each grid "square" represents, say, 5 degrees of latitude and 9 degrees of longitude, giving a total of 1440 grid locations covering the earth. (They aren't actually squares, of course.) The model will track various parameters for each square: water vapor, temperature, radiation absorption, and the amount of
sunlight reflected by the surface. (This simplified model might bother a real climate researcher, because their 2-D models usually divide the earth into latitudinal bands and the atmosphere into vertical layers.)
In this model, each iteration corresponds to a given amount of simulated time--say one hour--and consists of two steps. First, neighboring grid squares exchange information. A hot square might leak some of its heat to a cooler neighbor, or perhaps wind currents lead to the movement of water vapor between squares. Then, after each square gets the information it needs from its neighbors, you have the system figure out what happens within the square and update its local state. To keep track of what the model is doing, various summary results for each iteration are computed and stored; you can view those results after the program finishes to find out what happened.
Analyzing the Parallel Possibilities
The first step in parallelizing the model is to figure out how to decompose the problem
. This application, like many physical models, has good locality--a grid square directly influences its neighbors, not squares on the other side of the world. To take advantage of that property when decomposing the application onto different processors, you allocate contiguous blocks of grid squares. If you chose instead to allocate individual squares randomly, this would introduce much more communication when the squares tried to talk to their neighbors.
This example uses a simple parallel architecture with a master host processor that controls 16 slave processors. The costs are about equal for communicating between any pair of processors. The listing "Master/Slave Parallel Processing" shows how the application might be structured on such a machine. So how much faster will this application run on 16 processors rather than on just one? That depends.
Your biggest ally is locality; the "area-perimeter" rule says that if you only communicate at the edges of a rectangle, the ratio of communication t
o computation improves as the rectangle gets bigger. That's because the amount of data sent is proportional to the length of the perimeter (2*A + 2*B for an A by B rectangle), while the amount of work grows proportionally to the area (A*B). You divided the problem up into a single region per processor, so the rectangles are the largest possible size for your machine.
On the other hand, three problems might trip you up. The first is the cost of communicating between processors. Suppose it takes 1 second for every processor to exchange data with the processors handling neighboring grid points. Ideally, the cost of computing a timestep will be much more than a second. If it were exactly 1 second, you spend only half your time doing useful work. If it takes one-fifteenth of a second, then even if all host-slave communication is free, the 16 processors working together won't go any faster than just one processor acting alone.
The second potential problem is that control is centralized at the host pro
cessor. The slave processors have to wait until the host tells them what to do, and all global processing between timesteps happens on the host while the slaves remain idle. Even if the host is fast enough to keep 16 slaves busy most of the time, that may not remain true when more slave processors are available.
Balancing the Workload
Finally, load-balancing problems can easily cut overall performance in half, or worse. You have allocated an equal number of grid squares to each processor; if each square takes the same amount of time to compute, that was a good decision. Even if they vary a bit, you have many grid squares per processor so that will help even things out. But suppose you have an elaborate strategy for modeling ice; when the temperature is low enough, you spend a lot of time figuring out ice formation, thickness, and the fraction of land covered. Since ice doesn't cover much of the Sahara, a few parts of the world will require a lot of processing for the ice model while others will not
need any. The processors that cover the Pacific Ocean or Africa will finish their processing quickly and sit around waiting while the processors handling the poles work on the ice model. This unbalanced load hurts performance.
There are two ways to improve load balance: Do a better static decomposition of the problem, or move to dynamic scheduling. Static means that you divide the problem once at the beginning and never change your mind. You might be able to finesse the problem with ice, because the parts of the world that are covered with it are fairly predictable. Rather than giving each processor the same number of grid squares, you could give the ones managing land near the poles fewer squares. That evens out the load so that there won't be as much variance in the time it takes each processor to finish.
A more elaborate strategy for balancing load requires a much more complicated way to manage parallelism. Dynamic scheduling shuffles grid squares between processors as the application runs; w
hen one processor is getting its squares done much faster than another, some of the work is shifted, so the load evens out. Dynamic scheduling is tricky to implement well, though, and it can lead to a lot of overhead if you are not careful.
Listing: Master/Slave Parallel Processing
master_code()
{
/* setup */
start up one slave per slave processor
broadcast to each slave processor the grid squares it will
handle
distribute initial data to each slave
/* start modeling */
for i = 1 to max_timesteps
call slave_code() ;tell each slave to compute one timestep
wait until slaves finish and report results
compute and store summary information from the results
tell each slave to die
}
slave_code()
{
/* setup */
wait for master to say which grid squares to handle
wait for initial data
/* start modeling */
do forever:
get message from host
if asked to die, commit suicide
compute_timestep()
send local-state info to host
}
compute_timestep() {
send interaction information to each neighbor
wait until each neighbor's information arrives
model physics in each grid square, using neighbor info
}