Complex queries on distributed databases can consume massive resources. Here's one strategy for making them more efficient.
John L. Cuadrado
"Where's my data? I need it now!" That's the plaintive cry -- or maybe the angry demand -- of managers everywhere, who need more and more data to keep their organizations running. More sophisticated about data than ever, these managers are asking questions that have become more complicated and involve more factors. This is particularly true in areas such as decision support and deductively augmented database systems, and answering these queries often requires a large number of joins.
But performing those complex queries on large databases can be very time-consuming. Distribute those databases among multiple machines, and the problems multiply -- but so do the p
ossibilities. Multiple machines give us the ability to execute many operations in parallel. And we're now beginning to encounter multiprocessor computers that do parallel processing themselves, as well as new microprocessors that employ on-chip parallel pipelines.
More Efficient Joins
To take full advantage of this new multiprocessing capability, however, we have to arrange joins and other operations efficiently. Software that takes advantage of parallel processing is hard to find at the moment, but it's beginning to appear. We can expect the major database vendors to offer parallel versions of their database engines in the near future. The goal is always to achieve a radical speed increase in query response. This quick tour of some of the problems in the area of parallel queries will show some of the strategies you can use to determine the most efficient way to execute your queries.
Much of this discussion will be in the realm of abstract mathematical notation. To make
it a bit more concrete, we'll first consider a real-world example, a scenario in which large amounts of data are gathered and several complex queries must be run on a regular basis.
Let's assume you're a database administrator at a large commercial weather organization. At its headquarters, your company collects data every half hour on the local conditions from 10,000 weather stations worldwide. For this example, we'll simplify the data and just consider the temperature, barometric pressure, and humidity components. But even this limited view means that every day you receive 480,000 data transmissions, and for each one you need to record the time, the sending station, and the relevant weather parameters.
Due to the kinds of queries that your analysts ask of this database and the kinds of computing hardware available, you have decided to organize the data as four separate tables (location, temperature, pressure, and humidity), as shown in the table "Weather-Station Database." The actual physical
distribution of these relations among a set of processors -- in other words, where you actually store the data -- is a different matter that we'll discuss later.
With this data, you might make such requests as the following.
-- Find all stations reporting temperatures below 10 degrees C and
report on their relevant weather parameters.
-- Find all stations between the latitudes of 30 degrees north and
50 degrees north that show temperature fluctuations of 10 degrees
or more during the hours of 5:00 a.m. and 7:00 a.m. over the past
three months.
But you can also imagine far more complex queries that look for more subtle patterns: For example, you might need to find clusters of weather stations that have comparable readings over a given period. Queries like these can be extraordinarily expensive in terms of time and resources. This is where the ability to do queries in parallel can provide a distinct advantage. But how do we structure these queries so th
at they can be done in parallel? This is a central problem.
Notation for Joins
Here's an introduction of some notational conventions we can use to sort out the operations and relationships involved.
A, B
, and
C
are attributes (e.g., temperature or longitude in our example). We denote relations between attributes as
r(A,B)
and
s(B,C)
. Next, we define the natural join of
r
and
s
, denoted
r % (subscript B) s
, to be a relation on
A,B,C
that contains all the tuples (or rows) that result from concatenating tuples in
r
with those tuples in
s
that have identical values for the attribute. For example, in the weather-station database, we might be interested in taking a look at StationID = 123566789 and asking for Location
% (subscript StationID)
Temperature.
Now, a query generally involves creating a new relation using a join; we can represent it in the form
q = r % (subscript B)
s
. Three basic strategies have been developed over the past 20 years to compute the new relation: the nested-loops join, the sort-merge join, and the hash-based join. (These are called
uniprocessor join
strategies, since they assume that a single computer is performing the operations.) Many variations on these basic strategies have been developed that take into account page sizes and caching schemes of varying orders of sophistication. Here, we'll consider only the following basic algorithms:
-- The
nested-loops join algorithm
is based on the definition of
the join and is computed using two nested loops that sweep
through the two relations.
-- The
sort-merge join algorithm
first sorts the two relations to
be joined and then merges the results using the matching tuples
as the selection criteria.
-- Finally, the
hash-based join algorithm
consists of partitioning
the
r
relation into
n
buckets, using hashing o
n the attribute,
and doing the same for the
s
relation. We then make passes over
each bucket.
(For a thorough analysis of these join algorithms,
see reference 1
.) From this point on, we'll simplify the notation and omit the attributes over which we're computing the join. That is, for two relations
r (subscript 1)(A,B)
and
r (subscript 2)(B,C)
, instead of
r (subscript 1)(A,B) % (subscript B) r (subscript 2)(B,C)
, we'll simply write
r (subscript 1) % r (subscript 2)
.
Defining Our Model
For this example, we make the basic assumption that the parallel system consists of a set of homogeneous processors
P (subscript 1), P (subscript 2), ..., P (subscript m)
communicating over a high-speed, fully connected data network. No further assumptions about the processors or the network are made. It may surprise some that we are using this simple distributed model using off-the-shelf processors
. However, one of the lessons we've learned in the past 20 years is that we don't gain much from special-purpose database machines. On the contrary, the current trend is toward the so-called share-nothing architectures. (For a detailed review,
see reference 2
.)
We begin with a database
R
, which has a number of relations
r (subscript 1), r (subscript 2), ..., r (subscript n)
that are distributed among the processors
P (subscript i)
such that each relation is fully contained in one of the processors. A given processor may contain more than one relation. A typical configuration is shown in the figure
"Distribution of Relations"
.
We have four relations that are stored in three processors. For example, the relations
r
could be Location, Temperature, Pressure, and Humidity in our weather-station database. For the time being, we'll assume that to compute the join of any of two relations in the system, both must reside in the s
ame processor. Thus, to compute the join of
r (subscript 1)
and
r (subscript 2)
, we can just go ahead and use one of the uniprocessor join techniques described above, since both relations are already stored in the same processor.
However, if we want to compute the join of
r (subscript 3)
and
r (subscript 4)
, then we have to move one of the relations to a different processor. Either we move
r (subscript 3)
to
P (subscript 3)
or we move
r (subscript 4)
to
P (subscript 2)
. Which move should we make? This can make a significant difference in the amount of work required to perform the computation. If relation
r (subscript 3)
, for example, is much larger than relation
r (subscript 4)
, then it'll be far more cost-effective to move
r (subscript 4)
.
This is the nub of the general problem here--the order of computing multiway joins. To keep this presentation simple, we'll continue to work with the example in
the figure
"Transition System for a Three-Processor, Four Relation Multijoin"
. Our goal is to compute the join of the four relations
r (subscript 1) % r (subscript 2) % r (subscript 3) % r (subscript 4)
. Since the join operation is associative, there are many different ways to compute the multiway join operation. So, even in this simple case, we could proceed in different ways. For example, we could compute
r (subscript 1) % r (subscript 2)
, join the result with
r (subscript 3)
, and finally join this result with
r (subscript 4)
. Another strategy would be to compute the join of
r (subscript 3)
and
r (subscript 4)
on one machine, in parallel compute the join of
r (subscript 1)
and
r (subscript 2)
on another machine, and finally join their results. For more information, look up Catalan numbers in a good book about algorithms.
Parallel Join Strategies
To determine the best course of act
ion, we need to model what it means for one solution to be better than another. We need to assign a cost to each solution, which means we need a method to compute the costs of intermediate steps in a given computation. One way to do this is to set up a transition system, where each state of the system corresponds to an intermediate state, and a prescription that tells us how to go from one state to another.
With this aim in mind, let's go back to our query
q(X (subscript 0)) = r (subscript 1) % r (subscript 2) % r (subscript 3) % r (subscript 4)
. This
q(X (subscript 0))
notation will become clearer in a moment. Each state in our transition system is represented by an assignment of relations to processors. For example, the state represented in the figure
"Distribution of Relations"
is
X (subscript 0) =
. That is, relations
r (subscript 1)
and
r (subscript 2)
reside in
processor
P (subscript 1)
, relation
r (subscript 3)
is in processor
P (subscript 2)
, and relation
r (subscript 4)
is in processor
P (subscript 3)
. This represents the initial state in our transition system.
Now, where can we go from this state? One possible move would be to the state
. Here we have moved relation
r (subscript 3)
to processor
P (subscript 3)
and performed the join with relation
r (subscript 4)
. The figure
"Transition Space"
represents a small section of the transition space we are describing here.
We also need to associate a cost with each transition. From our current state
X
, there are a finite number of states we can move to. Let's call the next state we want
Y
. Now we need to describe the cost of going from state
X
to state
Y
, which we'll call
(X,Y)
. In gene
ral, the cost depends on the sizes of the relations to be joined and on the costs of one or more relations to the appropriate processors. The transition model we are discussing can take into account the parallelism that is potentially available. In the above example, for example, we can perform the joins
r (subscript 1) % r (subscript 2)
and
r (subscript 3) % r (subscript 4)
in parallel.
So, now we have a transition system associated with our query
q(X (subscript 0)) = r (subscript 1) % r (subscript 2) % r (subscript 3) % r (subscript 4)
. Furthermore, each transition has an associated cost. A final state in the transition system consists of a state that has the join relation
r (subscript 1) % r (subscript 2) % r (subscript 3) % r (subscript 4)
residing in one of the processors. For the current query, we have three final states; that is, the answer can be in processor
P (subscript 1)
,
P (subscript 2)
, or
P (subscript 3)
. Our goal is to find
a path that takes us from the initial state
X (subscript 0)
to one of the three final states containing the answer--and that does so at minimum cost.
Finding a Minimum-Cost Solution
First, let's consider the entire three-level transition system, as shown in the figure
"Transition System for a Three-Processor, Four-Relation Multijoin."
There is one initial state, 11 intermediate states, and three final states. We'll present a cost model for transitions and establish criteria for determining a minimum-cost path through the system.
A realistic cost model for each transition has to take into account both the cost of performing a local join and the cost of the transmission of data among the different processors. In general, we will allow only transitions that accomplish some processing that gets us closer to one of the final states. That is, in going from a state
X
to a state
Y
, we must either perform some local joins or move
some relations from one processor to another or perhaps do a combination of both actions. To make the example more concrete, assume that the cardinalities (i.e., the number of tuples) of each of the relations are as shown in the table
"Size of Relations"
.
Let's illustrate how the cost computation proceeds here. We'll use the convention that the initial state of the system is labeled
X (subscript 0)
, the three final states are labeled
F (subscript 1), F (subscript 2)
, or
F (subscript 3)
(depending on where the final join relation
r (subscript 1) % r (subscript 2) % r (subscript 3) % r (subscript 4)
winds up -- in processor
P (subscript 1), P (subscript 2)
, or
P (subscript 3)
), and the intermediate states are labeled
Y (subscript 1)
through
Y (subscript 11)
. We will assume that, associated with each join
r % s
, is a cost
alpha | r | X | s |
, where
alpha
is a constant that measures the sel
ectivity of the join. Also, the costs associated with moving relation
r
from one processor to another will be
beta |r|
, where
beta
is a constant that measures the cost of transmission in the network. With these notations understood, we can compute the following costs:
<click here for equation one>
In going from state
X (subscript 0)
to state
Y (subscript 2)
, we only incur the cost of performing the join between
r (subscript 1)
and
r (subscript 2)
located at processor
P (subscript 1)
.
In going from state
X (subscript 0)
to state
Y (subscript 1)
, we do the join of
r (subscript 1)
and
r (subscript 2)
at processor
P (subscript 1)
and, in parallel, move relation
r (subscript 4)
from processor
P (subscript 3)
to
P (subscript 2)
and do the join of
r (subscript 3)
and
r (subscript 4)
to obtain
<click here for equation two>
As a final example, in going from
Y (subscript 4)
to
Y (subscript 9)
, we move
r (subscript 2) % r (subscript 3)
from processor
P (subscript 3)
to
P (subscript 4
and do the join
r (subscript 2) % r (subscript 3) % r (subscript 4)
, so the cost becomes
<click here for equation three>
Having computed all the one-step transitions, we can now use any of a variety of algorithms to find a minimum path. This problem has been studied for many years, and books on algorithms provide detailed descriptions of the techniques.
Next Steps
One possible approach is to use the technique of dynamic programming, where we find optimal solutions for subproblems by finding optimal solutions to sub-subproblems, and so on. At some point, we get down to one-step transitions, and we just look up their costs in a table.
Learning the details of the
dynamic programming approach is worthwhile, since this is a very general approach that finds application in many areas of computer science and operations research. There is a stochastic version of the material we've discussed in this article, and here, too, the dynamic programming approach has been used to solve this class of problems.
It may seem as though a lot of setup and computation goes into producing a strategy for one of these multijoin queries. This is definitely true. And it's important to remember that, in queries that require the joins of many relations (over 10, say), the brute-force approach of simply enumerating all the possible options and then choosing the minimum-cost strategy is basically impossible. Also, the job we do in selecting a good strategy can make the difference as to whether we get any answer back at all. In many large systems, it is relatively easy to generate queries that take days to complete. Therefore, it's important to plan multijoin queries intelligently.
More Work Ahead
The example given here was intended to provide a flavor for the issues that are involved in creating an efficient set of parallel query mechanisms. At the present time, there is a relatively large amount of literature dealing with many of the topics discussed in setting up our simple three-processor/four-relation example. (For a review of some of the products that are currently available in this market,
see reference 3
.) Techniques similar to those in the dynamic programming approach that we used to compute an optimal query strategy are applicable in the area of distributed object-oriented databases.
In an ODBMS (object-oriented database management system) we have objects that are often stored across several processors. For example, we might have a large image database that contains not only the raw images but also semantic information associated with each image. A typical query might ask for all images that satisfy certain semantic criteria. Su
ch a query could be handled most efficiently using techniques similar to those discussed in our simple example.
Although considerable work remains to be done in this area of query optimization, many companies and academic centers are working on the problem, and funding in this area appears to be quite good.
REFERENCES
1. Valduriez, P. and G. Gardarin. "Join and Semijoin Algorithms for a
Multiprocessor Database Machine," ACM Transactions on Database
Systems, Vol. 9, no. 1, March 1984: 133-61.
2. DeWitt, D. and J. Gray. "Parallel Database Systems: Future of High
Performance Database Systems," Communications of the ACM, Vol. 35,
no. 6, June 1992.
3. Ferguson, M. "Parallel Database: The Shape of Things to Come,"
Database Programming and Design, October 1994: 32-44.
WEATHER-STATION DATABASE
10,000 stations collected dat
a every half hour. Results are listed
below.
LOCATION
=========================================================================
STATIONID NAME LATITUDE LONGITUDE ELEVATION
=========================================================================
123566789 NOAH, Evergreen 12 30 N 78 W 200
456334456 SPRITE, Cabo Verde 70 S 80 W 1000
TEMPERATURE
=========================================================================
STATIONID DATE TIME TEMPERATURE (C)
=========================================================================
123566789 10/30/94 2100 10.6
456334456 10/16/94 1500 -20.7
PRESSURE
=========================================================================
STATIONID DATE TIME PRESSURE
=======================================================================
==
123566789 10/30/94 2100 31.78
456334456 10/16/94 1500 29.04
HUMIDITY
=========================================================================
STATIONID DATE TIME HUMIDITY
=========================================================================
123566789 10/30/94 2100 60.7
456334456 10/16/94 1500 80.9
illustration_link (19 Kbytes)

Here's a complete graphical representation of all the ways in which we can move from a given i
nitial state X (subscript 0), where the data is in four relations that are physically stored in three different systems, to one of three possible end states (F (subscript 1), F (subscript 2), F (subscript 3)) in which all the relations have been joined into a single relation residing in one processor.
Along the way, there are 11 possible intermediate states. We move from one state to another by performing one of three possible actions: joining two relations already in a single processor, moving a relation to another processor and joining it with a relation already there, or doing any two of these in parallel.
For simplicity's sake, this diagram uses 1 to represent the relation r (subscript 1), and so on. While some paths from X (subscript 0) to the end states are clearly shorter than others, we can find the most efficient path (and thereby the most efficient end state) only by calculating the costs of the operations involved for each state change. For example, in joining a small relatio
n with a large one, where each resides on a different machine, it's obviously more efficient to move the small one than to move the large one.
illustration_link (8 Kbytes)

Four relations are split up among three different processors connected either directly or through a network.
illustration_link (2 Kbytes)

This shows how we can move from one state to another by moving one relation and/or joining it with another.
illustration_link (1 Kbytes)

illustration_link (2 Kbytes)

illustration_link (1 Kbytes)

illustration_link (4 Kbytes)

The cardinalities of each of the relations.
John L. Cuadrado is an independent consultant who lives in Maine. His primary areas of interest include distributed database systems, scien
tific visualization, applied AI, and theories of parallel computation. You can reach him on the Internet or on BIX c/o
editors@bix.com
.