
Sorting Terabytes


January 1998 / International Features / Sorting Terabytes

Multidimensional database systems provide greatly improved performance with large data volumes.

Peter Baumann

Multidimensional database management systems (MDBMSes) are starting to receive serious attention. Why bother about MDBMSes? Simply put, they enable important image and data-retrieval applications and are easier to manage than other database types.

Data arrays of arbitrary size and dimension, which are known as multidimensional discrete data (MDD) sets, span a remarkably rich manifold of variants, including simple one-dimensional time series, 2-D and 3-D raster images, and multidimensional on-line analytical processing (MOLAP) data cubes. The data volumes used for these applications can easily amount to several gigabytes.

Currently, MDD typically appears in physical experiments and medical-imaging, census, and industrial-research applications. But it's also becoming increasingly relevant for a large number of business applications, such as image retrieval from multimedia databases.

Traditional DBMSes employ binary large objects (BLObs), which are essentially flat files under the control of the DBMS, to hold MDD sets. But BLObs have a drawback: They don't mirror the array structures; they simply squeeze all items into 1-D byte strings that do not allow for any operations other than the reading and writing of entire files. Data-manipulation operations have to be performed by applications outside the DBMS. "BLObs fail to give the service that advanced users need," explains Jian Zhou, a researcher at the Fraunhofer Institute for Computer Graphics in Darmstadt, Germany.
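To make the BLOb drawback concrete, here is a minimal Python sketch of what flattening does to a 2-D array. It assumes row-major ordering and one storage unit per cell. The function names and the 4 x 6 example array are illustrative and do not come from any particular DBMS.

```python
from itertools import product

def linear_offset(coords, shape):
    """Row-major offset of one cell in the flattened 1-D string."""
    offset = 0
    for c, extent in zip(coords, shape):
        offset = offset * extent + c
    return offset

def box_offsets(lo, hi, shape):
    """Offsets of every cell in the box lo..hi (inclusive bounds)."""
    ranges = [range(l, h + 1) for l, h in zip(lo, hi)]
    return [linear_offset(c, shape) for c in product(*ranges)]

def contiguous_runs(offsets):
    """Count the separate contiguous fragments that a set of offsets forms."""
    if not offsets:
        return 0
    ordered = sorted(offsets)
    runs = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

shape = (4, 6)                                  # hypothetical 4 x 6 raster
offsets = box_offsets((1, 2), (2, 4), shape)    # rows 1..2, columns 2..4
print(offsets)                                  # [8, 9, 10, 14, 15, 16]
print(contiguous_runs(offsets))                 # 2
```

Even this small rectangle lands in two separate fragments of the byte string. A store that knows only "read the whole object" cannot fetch just those fragments, so the application has to load everything and cut out the region itself.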

MDBMSes are still in their infancy and lack a generally accepted formal framework. But even though it may seem purely academic, it's important to remember that SQL started out as just a theory, too. And the stakes are high for the retrieval of MDD sets in database arrays. Says Frank Olken, database expert at the Lawrence Berkeley Labs, "Doing discrete Fourier transform [DFT] inside the database is my touchstone."

Cubical Worlds

Today's most prominent MDD application is OLAP, because it involves variable dimensions and data volumes in the gigabyte, and even terabyte, range. The main difference between OLAP and raster-imaging applications, another important application field of MDD, is that OLAP data cubes may well consist of 95 percent empty space. One of the main issues in OLAP is achieving efficient compression of sparse cubes.

MOLAP systems, such as Arbor Software's Essbase and Oracle's Express, had a dazzling initial success. These systems offer compression and achieve fast aggregation through precomputed and stored subaggregates. Another advantage of MOLAP over relational OLAP (ROLAP) is direct representation of multidimensional information.
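The idea behind precomputed subaggregates can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the cube is a dictionary of non-empty cells, the aggregate is a sum, and every combination of dimensions is precomputed. It does not describe how Essbase or Express store their data, and the function name and sample figures are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def precompute_subaggregates(cells, ndim):
    """Sum the cell values for every subset of dimensions.

    cells maps coordinate tuples to values. The result maps each tuple of
    kept dimension indices to a dict from the kept coordinates to the sum.
    """
    aggregates = {}
    for k in range(ndim + 1):
        for kept in combinations(range(ndim), k):
            sums = defaultdict(float)
            for coords, value in cells.items():
                key = tuple(coords[d] for d in kept)
                sums[key] += value
            aggregates[kept] = dict(sums)
    return aggregates

# Hypothetical cube with dimensions (product, region, month).
cells = {
    ("tea", "north", 1): 10.0,
    ("tea", "south", 1): 5.0,
    ("coffee", "north", 2): 7.0,
}
agg = precompute_subaggregates(cells, 3)

print(agg[()][()])              # grand total: 22.0
print(agg[(0,)][("tea",)])      # all tea sales: 15.0
print(agg[(1,)][("north",)])    # all sales in the north: 17.0
```

Once the table is built, an aggregation query is a single lookup instead of a scan over the cube. The price is storage: n dimensions yield 2^n groupings, which is why real systems have to choose which subaggregates to keep.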

The ROLAP approach uses a relational DBMS (RDBMS) and stores each array cell value together with its array coordinates in a separate tuple. Several dimension tables holding the category hierarchies surround the central table with the cell array. This technique is effective due to the high sparsity of arrays.

A central problem with ROLAP, however, is the large number of joins between dimension tables and the central table, which researchers see as a clear sign of a semantic gap between the application and the database model. In fact, a similar gap between engineering applications and databases stimulated the takeoff of object-oriented DBMSes 10 years ago.
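The following sketch shows the ROLAP layout and the joins it implies, using Python's built-in sqlite3 module with an in-memory database. The table names, columns, and sales figures are made up for illustration. Only the non-empty cells of a 3 x 3 cube are stored as tuples in the central table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE sales   (product_id INTEGER, region_id INTEGER, amount REAL);
""")

# Dimension tables hold the category hierarchies.
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, "tea", "beverage"), (2, "coffee", "beverage"), (3, "bread", "bakery"),
])
conn.executemany("INSERT INTO region VALUES (?, ?, ?)", [
    (1, "Bavaria", "Germany"), (2, "Hesse", "Germany"), (3, "Berkshire", "U.K."),
])

# Central table: one tuple per non-empty cell, keyed by its coordinates.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 1, 10.0), (2, 1, 7.0), (3, 3, 4.0),
])

# Rolling up to (category, country) needs one join per dimension.
rows = conn.execute("""
    SELECT p.category, r.country, SUM(s.amount)
    FROM sales s
    JOIN product p ON p.product_id = s.product_id
    JOIN region  r ON r.region_id  = s.region_id
    GROUP BY p.category, r.country
    ORDER BY p.category, r.country
""").fetchall()
print(rows)   # [('bakery', 'U.K.', 4.0), ('beverage', 'Germany', 17.0)]
```

Three of the nine possible cells are stored, which is where the sparsity advantage comes from. Every additional dimension in a query adds another join, however, and that is the cost the researchers point to.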

Arrays Are Not Objects

You might think that object-oriented database technology, which is basically a persistent extension of programming languages such as C++ and Smalltalk, offers a better array concept. Unfortunately, there is no array concept in these languages except for single-cell access. For arrays that are larger than a few kilobytes, object-oriented DBMSes (OODBMSes) offer only BLObs. As a result, OODBMSes do not support even the most basic MDD operations, such as extracting a rectangular area from an n-dimensional value.
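For reference, the rectangular extraction mentioned above is a small operation once the array structure is known. The sketch below assumes the array is held as nested Python lists and that bounds are inclusive. The name trim is chosen for this example only.

```python
def trim(array, lo, hi):
    """Extract the box lo..hi (inclusive) from an n-dimensional nested list."""
    if not lo:
        return array          # all dimensions consumed: this is a single cell
    return [trim(sub, lo[1:], hi[1:]) for sub in array[lo[0]:hi[0] + 1]]

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
print(trim(image, (1, 1), (2, 2)))   # [[5, 6], [9, 10]]
```

The same function works for any number of dimensions because it peels off one dimension per recursion step. A BLOb-based store cannot offer it, since it has no record of the dimensions in the first place.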

What about object-relational technology? In brief, it allows for the implementation of new, array-like data types (which are basically nested 1-D records), but not for real data-type constructors instantiated with the corresponding spatial domain and a proper base type. Today's commercial object-relational DBMSes, such as Informix's Illustra and IBM's DB2, introduce a separate data type for 1-D arrays, 2-D arrays, and so forth.

Cornell University's experimental Predator system represents a substantial step forward in object-relational technology. Its extended abstract data types (E-ADTs) allow you to define dedicated query sublanguages, advanced optimization rules, and storage-layout policies. In fact, it might turn out that systems such as Predator will eventually offer the fundamental mechanisms required for flexible and fast MDD retrieval.

Storage Hierarchies

Some MDD applications, such as those used in high-energy physics, generate data volumes of 10TB per day in a complex multiuser environment. Thus, MDBMSes have to meet enormous performance requirements that can be fulfilled only with a sophisticated storage concept.

NCR recently implemented a 24-TB data warehouse based on a farm of hard disks. However, hard disks are no longer sufficient in large research projects that generate tens of petabytes (PB) of data. Hard disk arrays cannot hold such vast amounts of data, if only for the sheer disk-failure rate. Even with the larger disk capacities that are promised for the near future, hard disks will not be able to satisfy such demands.

The addition of on-line storage space on tertiary storage devices, such as tape cabinets, is therefore indispensable -- not as a backup medium, but as an additional storage medium as part of the database. The result is a storage hierarchy where access times and volume both grow as distance from the CPU increases.

Currently, researchers are experimenting to determine the best data-distribution algorithms for tapes and cabinets. They are thinking about replicating frequently accessed "hot spots" on secondary storage while keeping the bulk of data on tertiary storage. For example, experiments with intelligent tertiary storage management at the University of California-Berkeley have demonstrated that several orders of magnitude in storage performance can be gained over conventional sequential tape storage.
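The effect of keeping hot spots on secondary storage can be illustrated with a toy simulation. Everything in it is an assumption made for the example: the least-recently-used replacement policy, the cost figures (1 unit per disk read, 1,000 per tape read), and the class name TieredStore. It is not the algorithm used in the Berkeley experiments.

```python
from collections import OrderedDict

class TieredStore:
    """Toy model: a small disk area in front of a large tape archive."""

    def __init__(self, disk_slots, disk_cost=1, tape_cost=1000):
        self.disk = OrderedDict()      # block ids currently replicated on disk
        self.disk_slots = disk_slots
        self.disk_cost = disk_cost     # hypothetical cost units
        self.tape_cost = tape_cost
        self.total_cost = 0

    def read(self, block_id):
        if block_id in self.disk:
            self.disk.move_to_end(block_id)    # mark as most recently used
            self.total_cost += self.disk_cost
            return "disk"
        # Miss: fetch from tape and keep a replica on disk.
        self.total_cost += self.tape_cost
        self.disk[block_id] = True
        if len(self.disk) > self.disk_slots:
            self.disk.popitem(last=False)      # evict least recently used
        return "tape"

store = TieredStore(disk_slots=2)
trace = [store.read(b) for b in [1, 2, 1, 1, 3, 1, 2]]
print(trace)             # ['tape', 'tape', 'disk', 'disk', 'tape', 'disk', 'tape']
print(store.total_cost)  # 4003
```

Block 1 is the hot spot in this trace. After its first read it is always served from disk, so the tape is touched four times instead of seven.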

RasDaMan Reggae

Some research projects have recognized the need for comprehensive database-array support. One such project is RasDaMan (short for Raster Data Management in Databases), sponsored by the European Commission's Esprit program.

The conceptual model of RasDaMan centers around the notion of an n-dimensional array (in the programming-language sense) of any dimension, size, and array-cell type. RasDL, RasDaMan's definition language, supports any valid C++ type or structure. Each dimension's lower and upper boundary can be either fixed at data-definition time or variable. The Raster Query Language, or RasQL (see the sidebar "The RasDaMan Query Language"), extends SQL-92 with array operators and is capable of performing OLAP and statistical and imaging operations. RasDaMan interfaces to other applications through an ODMG 2.0-compliant C++ library called RasLib.

RasDaMan uses a client/server architecture with server-based query evaluation. An intelligent query optimizer and a streamlined storage manager minimize network traffic and storage access. The storage concept is based on a combination of flexible MDD subdivision, spatial indexing, and transparent compression.

While there may be no fixed structure for all MDD operations and objects, RasDaMan shows that subdivision of MDD data sets into arbitrary multidimensional rectangular tiles allows for efficient execution of the most common MDD operations. RasDaMan's rectangular tiling concept is based on usage statistics and most common access patterns. For each object, RasDaMan creates a spatial index that maintains all information about the tiles of the object and the corresponding spatial data.
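Why tiling helps can be seen from a simplified sketch. It assumes regular tiles of one fixed shape and non-negative coordinates, whereas RasDaMan allows arbitrary rectangular tiles and keeps them in a spatial index. The function name and the sizes are hypothetical.

```python
from itertools import product

def tiles_for_query(lo, hi, tile_shape):
    """Grid indices of all tiles that intersect the box lo..hi (inclusive).

    Assumes regular tiling that starts at the origin and non-negative
    coordinates; tile (i, j, ...) covers cells i*t0..(i+1)*t0-1 and so on.
    """
    ranges = [range(l // t, h // t + 1) for l, h, t in zip(lo, hi, tile_shape)]
    return list(product(*ranges))

# Hypothetical 1,000 x 1,000 image stored as 100 tiles of 100 x 100 cells.
hits = tiles_for_query(lo=(150, 20), hi=(260, 130), tile_shape=(100, 100))
print(hits)        # [(1, 0), (1, 1), (2, 0), (2, 1)]
print(len(hits))   # 4
```

The query touches 4 of the 100 tiles, so only about 4 percent of the object has to be read. With irregular tiles the arithmetic is replaced by a lookup in the spatial index, but the principle is the same.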

The RasDaMan system is based on O2 Technology's (Versailles, France) O2 OODBMS. However, because all the array semantics are resolved inside the MDBMS, any DBMS can run underneath. This way, whatever DBMS an enterprise runs, the integrated management of array and conventional data in an MDBMS is a practical alternative.

No Commercial Systems Yet

Today's MDBMSes are either specialized for particular applications, such as OLAP, or are still in the prototype stage. However, the first crop of general-purpose systems, such as RasDaMan, will become commercially available sometime during this year.

In addition to the improved retrieval functionality that they offer, fully fledged array DBMSes will result in a performance boost and the easier management of large data volumes. Furthermore, they will allow for thin clients running on inexpensive hardware instead of requiring costly data-crunching workstations.


Where to Find


Arbor Software

Bracknell, Berkshire, U.K.
Phone:    +44 1344 664000
Fax:      +44 1344 664001
Internet: http://www.arborsoft.com/



Cornell University

Ithaca, NY, U.S.A.
Phone:    +1 607 255 7316
Internet: http://www.cs.cornell.edu/Info/Projects/PREDATOR/



FORWISS

Munich, Germany
Phone:    +49 89 48095 200
Fax:      +49 89 48095 203
Internet: http://www.forwiss.de/



IBM

Herrenberg, Germany
Fax:      +49 70 32 15 33 00
Internet: http://www.ibm.de



Informix

Ismaning, Germany
Phone:    +49 89 996130
Fax:      +49 89 99613 800


Oracle Germany

Munich
Phone:    +49 89 1 49770
Fax:      +49 89 149 77 570


O2 Technology

Versailles, France
Phone:    +33 1 30847777
Fax:      +33 1 30847790
Internet: http://www.o2tech.fr






Peter Baumann is assistant research group head at FORWISS (Munich, Germany) and technical manager of the RasDaMan project. You can contact him by sending e-mail to baumann@forwiss.de.
