analytical processing (MOLAP) data cubes. The data volumes used for these applications can easily amount to several gigabytes.
Currently, MDD typically appears in physical experiments and medical-imaging, census, and industrial-research applications. But it's also becoming increasingly relevant for a large number of business applications, such as image retrieval from multimedia databases.
Traditional DBMSes employ binary large objects (BLObs), which are essentially flat files under the control of the DBMS, to hold MDD sets. But BLObs have a drawback: They don't mirror the array structures; they simply squeeze all items into 1-D byte strings that do not allow for any operations other than the reading and writing of entire files. Data-manipulation operations have to be performed by applications outside the DBMS. "BLObs fail to give the service that advanced users need," explains Jian Zhou, a researcher at the Fraunhofer Institute for Computer Graphics in Darmstadt, Germany.
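To make the BLOb drawback concrete, here is a minimal Python sketch of what an application has to do itself when a 2-D array is stored as a flat byte string. The row-major layout, the 4-byte little-endian cell format, and the function names `to_blob` and `extract_rect` are assumptions chosen for illustration, not the layout of any particular DBMS.

```python
import struct

CELL_FORMAT = "<i"                      # assumed cell type: 4-byte little-endian int
CELL_SIZE = struct.calcsize(CELL_FORMAT)

def to_blob(rows):
    """Flatten a 2-D list of ints into a row-major byte string."""
    return b"".join(struct.pack(CELL_FORMAT, v) for row in rows for v in row)

def extract_rect(blob, width, r0, r1, c0, c1):
    """Cut rows r0..r1 and columns c0..c1 (inclusive) out of the flat bytes.

    The DBMS only hands over the whole blob; the offset arithmetic that
    recovers the array structure lives in the application.
    """
    count = c1 - c0 + 1
    result = []
    for r in range(r0, r1 + 1):
        start = (r * width + c0) * CELL_SIZE
        end = start + count * CELL_SIZE
        result.append(list(struct.unpack("<%di" % count, blob[start:end])))
    return result

# A 3 x 4 "image" whose cell value encodes its own position (row*10 + col).
image = [[r * 10 + c for c in range(4)] for r in range(3)]
blob = to_blob(image)                   # what the database would store
sub = extract_rect(blob, width=4, r0=1, r1=2, c0=1, c1=2)
```

Note that the array width has to be tracked outside the blob, because the byte string itself carries no shape information.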
MDBMSes are still in their infancy and lack a generally accepted formal framework. But even though the field may seem purely academic, it's important to remember that SQL started out as just a theory, too. And the stakes are high for the retrieval of MDD sets in database arrays. Says Frank Olken, database expert at the Lawrence Berkeley Labs, "Doing discrete Fourier transform [DFT] inside the database is my touchstone."
Cubical Worlds
Today's most prominent MDD application is OLAP, because it involves variable dimensions and data volumes in the gigabyte, and even terabyte, range. The main difference between OLAP and raster-imaging applications, another important application field of MDD, is that OLAP data cubes may well consist of 95 percent empty space. One of the main issues in OLAP is achieving efficient compression of sparse cubes.
MOLAP systems, such as Arbor Software's Essbase and Oracle's Express, had a dazzling initial success. These systems offer compression and achieve fast aggregation through precomputed and stored subaggregates. Another advantage of MOLAP over relational OLAP (ROLAP) is direct representation of multidimensional information.
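The idea of precomputed and stored subaggregates can be sketched in a few lines of Python. This is only an illustration of the general technique, not the method used by Essbase or Express; the `ALL` marker, the `precompute` function, and the sample cells are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "*"   # hypothetical marker meaning "rolled up over this dimension"

def precompute(cells):
    """Store the sum for every combination of kept and rolled-up dimensions.

    cells maps a coordinate tuple to a measure. Each cell contributes to
    2**n subaggregates, where n is the number of dimensions.
    """
    aggregates = defaultdict(float)
    for coords, value in cells.items():
        for mask in product((False, True), repeat=len(coords)):
            key = tuple(ALL if rolled else c for c, rolled in zip(coords, mask))
            aggregates[key] += value
    return dict(aggregates)

# Sparse sales cube: (product category, country, month) -> amount
cells = {
    ("books", "DE", "Jan"): 10.0,
    ("books", "FR", "Jan"): 5.0,
    ("music", "DE", "Feb"): 7.5,
}
aggregates = precompute(cells)

# At query time an aggregate is a single lookup instead of a scan.
books_total = aggregates[("books", ALL, ALL)]
germany_total = aggregates[(ALL, "DE", ALL)]
```

The trade-off is storage: the number of stored subaggregates grows quickly with the number of dimensions, which is one reason compression matters for such systems.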
The ROLAP approach uses a relational DBMS (RDBMS) and stores each array cell value together with its array coordinates in a separate tuple. Several dimension tables holding the category hierarchies surround the central table with the cell array. This technique is effective because the arrays are highly sparse: Only the non-empty cells need a tuple.
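As a rough illustration of this layout, the following Python sketch builds a tiny central table and two dimension tables in an in-memory SQLite database. The table names, column names, and sample rows are invented for the example; real ROLAP schemas are far larger. Note that the query needs one join per dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold the category hierarchies.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, country  TEXT);
    -- Central table: one tuple per non-empty cell (coordinates + value).
    CREATE TABLE sales (product_id INTEGER, region_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "books"), (2, "books"), (3, "music")])
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [(1, "DE"), (2, "FR")])
# Only three of the six possible cells are occupied, so only three rows exist.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5)])

def totals_by_category_and_country(connection):
    """Aggregate the cell values along both category hierarchies."""
    return connection.execute("""
        SELECT p.category, r.country, SUM(s.amount)
        FROM sales s
        JOIN product p ON p.product_id = s.product_id
        JOIN region  r ON r.region_id  = s.region_id
        GROUP BY p.category, r.country
        ORDER BY p.category, r.country
    """).fetchall()

totals = totals_by_category_and_country(conn)
```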
A central problem with ROLAP, however, is the large number of joins between dimension tables and the central table, which researchers see as a clear sign of a semantic gap between the application and the database model. In fact, a similar gap between engineering applications and databases stimulated the takeoff of object-oriented DBMSes 10 years ago.
Arrays Are Not Objects
You might think that object-oriented database technology, which is basically a persistent extension of programming languages such as C++ and Smalltalk, offers a better array concept. Unfortunately, there is no array concept in these languages except for single-cell access. For arrays that are larger than a few kilobytes, object-oriented DBMSes (OODBMSes) offer only BLObs. As a result, OODBMSes do not support even the most basic MDD operations, such as extracting a rectangular area from an n-dimensional value.
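For reference, the basic operation mentioned above, cutting a rectangular area out of an n-dimensional value, can be sketched as follows. Nested Python lists stand in for the stored array, and the name `trim` and the inclusive `(lo, hi)` bounds convention are assumptions made for this example.

```python
def trim(array, bounds):
    """Return the sub-array selected by one inclusive (lo, hi) pair per dimension.

    array is a nested list with one nesting level per dimension.
    """
    if not bounds:
        return array                    # all dimensions consumed: a single cell
    (lo, hi), rest = bounds[0], bounds[1:]
    return [trim(sub, rest) for sub in array[lo:hi + 1]]

# A 2 x 3 x 4 cube whose cell value encodes its coordinates as 100*i + 10*j + k.
cube = [[[100 * i + 10 * j + k for k in range(4)]
         for j in range(3)]
        for i in range(2)]

# Keep i = 0, j = 1..2, k = 2..3.
box = trim(cube, [(0, 0), (1, 2), (2, 3)])
```

With only single-cell access or whole-BLOb reads available, logic of this kind ends up in every application instead of in the database.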
What about object-relational technology? In brief, it allows for the implementation of new, array-like data types (which are basically nested 1-D records), but not for real data-type constructors instantiated with the corresponding spatial domain and a proper base type. Today's commercial object-relational DBMSes, such as Informix's Illustra and IBM's DB2, introduce a separate data type for 1-D arrays, 2-D arrays, and so forth.
Representing a substantial step forward in object-relational technology is Cornell University's Predator experimental system. Its extended abstract data types (E-ADTs) allow you to define dedicated query sublanguages, advanced optimization rules, and storage-layout policies. In fact, it might turn out that systems such as Predator will eventually offer the fundamental mechanisms required for flexible and fast MDD retrieval.
Storage Hierarchies
Some MDD applications, such as those used in high-energy physics, generate data volumes of 10TB per day in a complex multiuser environment. Thus, MDBMSes have to meet enormous performance requirements that can be fulfilled only with a sophisticated storage concept.
NCR recently implemented a 24-TB data warehouse based on a farm of hard disks. However, hard disks are no longer sufficient in large research projects that generate tens of petabytes (PB) of data. Hard disk arrays cannot hold such vast amounts of data, if only for the sheer disk-failure rate. Even with the larger disk capacities that are promised for the near future, hard disks will not be able to satisfy such demands.
The addition of on-line storage space on tertiary storage devices, such as tape cabinets, is therefore indispensable -- not as a backup medium, but as an additional storage medium as part of the database. The result is a storage hierarchy where access times and volume both grow as distance from the CPU increases.
Currently, researchers are experimenting to determine the best data-distribution algorithms for tapes and cabinets. They are thinking about replicating frequently accessed "hot spots" on secondary storage while keeping the bulk of data on tertiary storage. For example, experiments with intelligent tertiary storage management at the University of California-Berkeley have demonstrated that several orders of magnitude in storage performance can be gained over conventional sequential tape storage.
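The hot-spot idea can be illustrated with a toy two-tier store in Python. This is not the Berkeley algorithm, which the article does not describe; it is a simple frequency-based replacement policy assumed for the example, with a dictionary standing in for the tape library and the class name `TieredStore` invented here.

```python
from collections import Counter

class TieredStore:
    """Keep the most frequently read tiles replicated on a small fast tier."""

    def __init__(self, tertiary, fast_capacity):
        self.tertiary = tertiary        # tile id -> data; stands in for tape
        self.fast = {}                  # replicas on secondary storage
        self.capacity = fast_capacity
        self.reads = Counter()          # access count per tile
        self.tertiary_reads = 0         # number of slow accesses

    def read(self, tile_id):
        self.reads[tile_id] += 1
        if tile_id in self.fast:
            return self.fast[tile_id]   # served from the fast tier
        self.tertiary_reads += 1
        data = self.tertiary[tile_id]
        self._maybe_replicate(tile_id, data)
        return data

    def _maybe_replicate(self, tile_id, data):
        if len(self.fast) < self.capacity:
            self.fast[tile_id] = data
            return
        # Replace the least-read replica only if the new tile is hotter.
        coldest = min(self.fast, key=lambda t: self.reads[t])
        if self.reads[tile_id] > self.reads[coldest]:
            del self.fast[coldest]
            self.fast[tile_id] = data

store = TieredStore({"a": b"A", "b": b"B", "c": b"C"}, fast_capacity=1)
for tile in ["a", "a", "a", "b"]:
    store.read(tile)
```

In this run, tile "a" stays on the fast tier because the single read of "b" is not enough to displace it, so only two of the four reads touch the slow tier.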
RasDaMan Reggae
Some research projects have recognized the need for comprehensive database-array support. One such project is RasDaMan (short for Raster Data Management in Databases), sponsored by the European Commission's Esprit program.
The conceptual model of RasDaMan centers around the notion of an n-dimensional array (in the programming-language sense) of any dimension, size, and array-cell type. RasDL, RasDaMan's definition language, supports any valid C++ type or structure. Each dimension's lower and upper boundary can be either fixed at data-definition time or variable. The Raster Query Language, or RasQL (see the sidebar "The RasDaMan Query Language"), extends SQL-92 with array operators and is capable of performing OLAP and statistical and imaging operations. RasDaMan interfaces to other applications through an ODMG 2.0-compliant C++ library called RasLib.
RasDaMan uses a client/server architecture with server-based query evaluation. An intelligent query optimizer and a streamlined storage manager minimize network traffic and storage access. The storage concept is based on a combination of flexible MDD subdivision, spatial indexing, and transparent compression.
While there may be no fixed structure for all MDD operations and objects, RasDaMan shows that subdivision of MDD data sets into arbitrary multidimensional rectangular tiles allows for efficient execution of the most common MDD operations. RasDaMan's rectangular tiling concept is based on usage statistics and most common access patterns. For each object, RasDaMan creates a spatial index that maintains all information about the tiles of the object and the corresponding spatial data.
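The benefit of tiling can be sketched in Python as follows. The sketch makes two simplifying assumptions that differ from the description above: it uses a regular grid rather than arbitrary tiles chosen from usage statistics, and a linear scan over the tile list stands in for a real spatial index. All function names are hypothetical and are not part of RasDaMan.

```python
def make_tiles(shape, tile_shape):
    """Cover a 2-D domain with rectangular tiles.

    Each tile is ((row_lo, row_hi), (col_lo, col_hi)) with inclusive bounds;
    tiles on the far edges are clipped to the domain.
    """
    rows, cols = shape
    tile_rows, tile_cols = tile_shape
    tiles = []
    for r in range(0, rows, tile_rows):
        for c in range(0, cols, tile_cols):
            tiles.append(((r, min(r + tile_rows, rows) - 1),
                          (c, min(c + tile_cols, cols) - 1)))
    return tiles

def intersects(box_a, box_b):
    """Two boxes overlap if their intervals overlap in every dimension."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b))

def tiles_for_query(tiles, query_box):
    """Return only the tiles that must be read to answer the query."""
    return [tile for tile in tiles if intersects(tile, query_box)]

tiles = make_tiles(shape=(100, 100), tile_shape=(40, 40))
needed = tiles_for_query(tiles, ((30, 45), (0, 10)))
```

Here the query touches two of the nine tiles, so the storage manager would read two tiles instead of the whole object.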
The RasDaMan system is based on O2 Technology's (Versailles, France) O2 OODBMS. However, because all the array semantics are resolved inside the MDBMS, any DBMS can run underneath. This way, whatever DBMS an enterprise runs, the integrated management of array and conventional data in an MDBMS is a practical alternative.
No Commercial Systems Yet
Today's MDBMSes are either specialized for particular applications, such as OLAP, or are still in the prototype stage. However, the first crop of general-purpose systems, such as RasDaMan, will become commercially available sometime during this year.
In addition to the improved retrieval functionality that they offer, fully fledged array DBMSes will result in a performance boost and the easier management of large data volumes. Furthermore, they will allow for thin clients running on inexpensive hardware instead of requiring costly data-crunching workstations.
Where to Find
Arbor Software
Bracknell, Berkshire, U.K.
Phone: +44 1344 664000
Fax: +44 1344 664001
Internet: http://www.arborsoft.com/