Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesWarehouse Cornerstones


January 1997 / State Of The Art / Warehouse Cornerstones

Successful data warehouses start when you choose and successfully integrate three key elements.

Jay-Louise Wheldon

Assembling the bricks and mortar of a conventional warehouse looks easy compared to integrating the server hardware and DBMSes that frame a data warehouse. On the hardware side, you must mix and match server platforms and configurations while deciding how to take advantage of almost-constant leaps in raw processing power. On the software side, the complexity and high cost of DBMSes force you to make tough decisions and inevitable trade-offs with regard to integration issues, support requirements, performance, efficiency, and reliability.

What if you choose incorrectly? Your data warehouse becomes an enterprise-wide problem that is difficult to work around, expe nsive to fix, and hard to justify. To get your warehouse implementation off to a successful start, you need to focus on three key building blocks: the overa ll warehouse architecture, the server architectures, and DBMSes. Here are some guidelines for making the right choices for your enterprise.

The Blueprint

Data-warehouse development begins with the logical and physical structure of the warehouse database plus the services required to operate and maintain it. This choice drives the selection of the other two fundamental items: the server hardware and the DBMS.

The physical platform can be centralized in one location or distributed regionally, nationally, or internationally. One scheme for housing your company's data, which might be gathered from multiple internal and external sources, is to consolidate the database in one integrated data warehouse (see the figure "Warehouse Foundation" ). The consolidated approach helps to ma ximize your available processing power. A second approach, the federated architecture, distributes information by function, with financial data on a server at one site, marketing data at another location, and manufacturing data at a third venue (see the figure "Two Alternatives to a Centralized Architecture" ).

The third approach, a tiered architecture, houses highly summarized data on one user's workstation, with more-detailed summaries on a second server, and the most detailed information on a third. The first-tier workstation handles most requests for data, with successively fewer requests passing on to tiers 2 and 3 for resolution. Computers at the first tier can be optimized for heavy user loads and low data volume, while servers at the other tiers are more adept at processing heavy data volumes but lighter user loads.

As you're deciding on a centralized or distributed warehouse structure, you also need to consider the servers that will hold and deliver the data. The size of your implementation -- and your enterprise's needs for scalability, availability, and systems management -- all influence your choice of server architecture (see the figure "Server Architectures" ).

Single-processor servers are the easiest to manage, but they offer limited processing power and scalability. In addition, a single server presents a single point of failure, limiting the guaranteed availability of the warehouse. You can expand single-server networks via distributed architectures that make use of middleware, such as Distributed Computing Environment (DCE) or Common Object Request Broker Architecture (CORBA), to distribute traffic across multiple servers. These architectures increase availability as well, since operations can be shifted to a backup server if one server fails, but systems management is more complex.

Symmetric multiprocessing (SMP) machines increase throughput by adding processors that share the server's internal memory and disk-storage devices. You can bu y most SMP in minimal configurations (i.e., with two processors) and upgrade when necessary to handle growth in your processing needs. The scalability of an SMP machine reaches its limit at the maximum number of processors supported by the connection mechanisms (i.e., the backplane and the shared bus).

A massively parallel processing (MPP) machine connects an array of processors via a high-speed, high-bandwidth link. Each node is a server, complete with its own (possibly SMP) processor and internal memory. To optimize an MPP architecture, applications must be "parallelized" (i.e., designed to operate in separate, parallel pieces).

This architecture is ideal for searching large databases. However, the DBMS that you select must be one that offers a parallel version. And even then, substantial design and tuning are required to achieve optimum data distribution and to prevent hot spots or data skew (where a disproportionate amount of the processing is shifted to one processing node due to the par tition of data under its control).

The difficulty of moving applications and DBMSes to clustered or truly paral-lel environments has led to newer architectures, such as nonuniform memory access (NUMA). NUMA creates a single, large SMP machine by connecting multiple SMP nodes into a single (though physically distributed) memory pool and a single instance of the OS. NUMA allows the SMP approach to achieve the performance benefits of large MPP machines (with 32 or more processors) while maintaining the simplicity and management advantages of a standard SMP environment. Most appealing of all, existing DBMSes and applications can be moved from single-processor or SMP platforms to NUMA with no modifications.

DBMS Decisions

Data warehouses (along with decision-support systems [DSSes] and client/server applications) were the first successes for the relational DBMS (RDBMS). While most operational systems were legacy applications mired in old data structures, warehouses and DSSes capitalize d on RDBMS for their flexibility and ability to perform ad hoc queries.

RDBMSes are most flexible when they are used with a normalized data structure. In a normalized database, data structures are nonredundant and represent the basic entities and relationships described by the data (e.g., products, markets, and sales transactions). But a typical on-line analytical processing (OLAP) query that crosses several structures ("How many blue widgets did we sell in stores in the Midwest?") requires several join operations to put such data back together. The performance of traditional RDBMSes is better for key-based queries ("Find customer account #1234") than for content-based queries ("Find all customers with income over $50,000 who have bought an automobile in the last six months").

To support large-scale warehouses and improve responsiveness for OLAP applications, vendors have added new features to the traditional RDBMS. These so-called super-relational features include support for specialized database har dware, such as the Teradata database machine. Super-relational features also support extensions to storage formats and relational operations (offered by vendors such as RedBrick) and specialized indexing schemes, such as those used by Sybase IQ. These techniques can improve performance for content-based retrievals by pre-joining tables using indexes or through the use of fully inverted index lists.

Most data-warehouse-access tools exploit the multidimensional nature of warehouse data. For example, marketing analysts need to look at sales volumes by product, by market, by time period, by promotions and advertising levels, and by combinations of these different aspects. Structuring data in a traditional relational database to facilitate queries and analyses along different dimensions (e.g., star or snowflake schemata) has become commonplace. These schemata might use multiple tables and pointers to simulate a multidimensional structure. Some DBMS products, such as Essbase and Gentium, implement storage tech niques and operators that support multidimensional data structures.

While multidimensional databases (MDDBs) help you directly manipulate multidimensional data objects (e.g., the easy rotation of data to view along different dimensions, or drill-down operations that successively expose more-detailed levels of data), you must identify these dimensions when building the database structure. Thus, adding a new dimension, or changing the views desired, can be cumbersome and costly. Some MDDBs require a complete reload of the database when restructuring occurs.

New Dimensions

A limitation of both RDBMS and MDDB is a lack of support for nontraditional data types, such as images, documents, and video/audio clips. If you need these types of objects in your data warehouse, look to an object-relational DBMS, such as Informix's Illustra. Focused on coded data values, most database systems can accommodate these data types only with somewhat-awkward reference-based extensions, such as pointers to files containing the objects. Many RDBMSes store complex data as binary large objects (BLObs). In this format, the objects cannot be indexed, sorted, or searched by the server.

Object-relational DBMSes, on the other hand, store complex data as native objects and can support the extensive data structures found in an object-oriented environment. These database systems naturally accommodate not only special data types but also the processing methods that are unique to each (e.g., a COMPARE operation on a picture might return another picture with similar features).

But a disadvantage to the object-relational approach is that the encapsulation of data within special data types -- a series of stock prices over time in each row of a stock table, for example -- requires specialized operators for what were previously simple searches (e.g., "Find all stocks that showed a decline in price from April to May 1996").

DBMS selection is also tied to the server hardware you use. Some RDBMSes, such as DB2 Parall el, Informix XPS, and Oracle Parallel, offer versions that support parallel operations. Parallel software splits queries, joins across multiple processors, and runs these operations simultaneously to improve performance.

Parallelism is required for the best performance on large MPP and clustered SMP servers. It's not yet an option with MDDBs or object-relational DBMSes.

The table "How DBMSes Compare" summarizes the pros and cons of the different types of DBMSes for warehouse operations. The table "Data-Warehouse Decision Matrix" contains some examples of how these decision criteria affect the choice of a server/data-warehouse architecture.

Piecing It Together

To select the right combination of server architecture and DBMS, you first need to understand your company's business requirements, its user population, and the skills of the support staff.

Data-warehouse implementations vary considerably in scope. Some are designed to suppo rt specific analysis needs for a single department or functional area of an organization, such as finance, sales, or marketing. Other implementations bring together data from across the entire enterprise to support a variety of user groups and functions. As a general rule, the broader the scope of the warehouse, the more power and functionality required of the server and DBMS.

Warehouse-usage patterns are also a factor. Prestructured queries and report views often satisfy informational users while exacting fewer demands on the DBMS and the processing power of the server. Complex analysis, which is typical of decision-support environments, requires more power and flexibility from all server components. Massive searches of large data warehouses favor parallelism in both DBMS and server.

Dynamic environments, with their ever-changing requirements, are served best by a simple, easily changeable data architecture (e.g., a highly normalized relational structure) rather than an intricate structure that requi res rebuilding after every change (e.g., a multidimensional structure). The required data-refresh rate indicates how responsive the data warehouse must be to new and changing data. Large data volumes that are refreshed at frequent intervals favor a physically centralized architecture to support efficient data capture and minimize data-transport time.

A user profile should identify who your data warehouse's users are, where they're located, and how many you need to support. Information about how each group expects to use warehouse data will help you analyze the various usage styles.

Knowing the physical location of your users helps you determine how and to what extent you need to distribute warehouse data. A tiered architecture might use servers already in place on LANs. Or you may need a centralized approach to support mobile workers who tap into the warehouse from their laptops.

The total number of users and their connection patterns determine the size of your warehouse servers. Memory sizes and I/O throughput must support the anticipated number of concurrent users under normal conditions as well as at peak usage times at your organization.

Finally, you must factor in support-staff sophistication. The IS resources that are available within your organization might limit the complexity or sophistication of your server architecture. Without skilled in-house staff or outside consultants, an architecture that requires server-platform parallelism (MPP or clustered SMP, for example) is difficult to create and maintain successfully.

Expansion Plans

As your warehouse matures and the data it contains becomes more accessible, employees outside the warehouse might also discover the value of its data. By linking your data warehouse to other systems -- both internal and external to the organization -- you can share information with other business entities with little or no custom development. E-mail messages, Web servers, and intranet/Internet connections can deliver inventory levels t o your suppliers or order status to your business partners.

As data warehouses continue to grow in sophistication and usability, the data accumulated within an enterprise will become more organized, more interconnected, more accessible, and more generally available to more employees. The result? Better business decisions, more business opportunities, and a more enlightened work force.


How DBMSes Compare

Feature/Function     Relational   Super-     Multidi-   Multidi-   Object-
                                Relational  mensional  mensional  Relational
                                            (logical)  (physical)
Normalized structures    y          y                                 y
Abstract data types                                                   y
Parallelism              y            
Multidimensional                    y           y          y   
 structures
Drill-down                                      y          y          y
Rotation
                                        y          y          y
Data-dependent                                                        y
 operations


Key

y = yes




Data-Warehouse Decision Matrix


For this environment ...

Business requirements        User population         Systems support
====================================================================
Scope: departmental          Small;                  Minimal local;   
Uses: data analysis          single location         average central      


Choose ...

Architecture                 Server                  DBMS
=========================================================
Consolidated;                Single-processor        MDDB
turnkey package              or SMP



For this environment ... Business requirements User population Systems support ==================================================================== Scope: departmental Large; analysts at Minimal local; Uses: analysis plus single location; average central informational informational users dispersed Choose ... Architecture Server DBMS ========================================================= Tiered; Clustered SMP RDBMS for detail at central; for central; central; MDDB summary at local SP or SMP for local for local
For this environment ... Business requirements User population Systems support ==================================================================== Scope: enterprise Large; geographically Strong central Uses: analysis plus dispersed informational Choose ... Architecture Server DBMS ========================================================= Centralized Clustered SMP Object-relational; Web support
For this environment ... Business requirements User population Systems support ==================================================================== Scope: departmental Small; few sites Strong central Uses: exploratory Choose ... Architecture Server DBMS ========================================================= Centralized MPP RDBMS with parallel support

Warehouse Foundation

illustration_link (20 Kbytes)

A centralized data-warehouse architecture provides efficiencies in both processing power and support costs.


Two Alternatives to a Centralized Architecture

illustration_link (25 Kbytes)


Server Architectures

illustration_link (34 Kbytes)


Jay-Louise Weldon (New York, NY) heads the Data Warehouse Practice within the U.S. Eastern region of MCI Systemhouse, a global-systems-integration firm. You can contact her at jweldon@shl.com .

Up to the State Of The Art section contentsGo to previous article: Go to next article: Better Clients, Better DecisionsSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM   Copyright
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network