ll warehouse architecture, the server architectures, and DBMSes. Here are some guidelines for making the right choices for your enterprise.
The Blueprint
Data-warehouse development begins with the logical and physical structure of the warehouse database plus the services required to operate and maintain it. This choice drives the selection of the other two fundamental items: the server hardware and the DBMS.
The physical platform can be centralized in one location or distributed regionally, nationally, or internationally. One scheme for housing your company's data, which might be gathered from multiple internal and external sources, is to consolidate the database in one integrated data warehouse (see the figure
"Warehouse Foundation"
). The consolidated approach helps to ma
ximize your available processing power. A second approach, the federated architecture, distributes information by function, with financial data on a server at one site, marketing data at another location, and manufacturing data at a third venue (see the figure
"Two Alternatives to a Centralized Architecture"
).
The third approach, a tiered architecture, houses highly summarized data on one user's workstation, with more-detailed summaries on a second server, and the most detailed information on a third. The first-tier workstation handles most requests for data, with successively fewer requests passing on to tiers 2 and 3 for resolution. Computers at the first tier can be optimized for heavy user loads and low data volume, while servers at the other tiers are more adept at processing heavy data volumes but lighter user loads.
As you're deciding on a centralized or distributed warehouse structure, you also need to consider the servers that will hold and deliver the data. The size of
your implementation -- and your enterprise's needs for scalability, availability, and systems management -- all influence your choice of server architecture (see the figure
"Server Architectures"
).
Single-processor servers are the easiest to manage, but they offer limited processing power and scalability. In addition, a single server presents a single point of failure, limiting the guaranteed availability of the warehouse. You can expand single-server networks via distributed architectures that make use of middleware, such as Distributed Computing Environment (DCE) or Common Object Request Broker Architecture (CORBA), to distribute traffic across multiple servers. These architectures increase availability as well, since operations can be shifted to a backup server if one server fails, but systems management is more complex.
Symmetric multiprocessing (SMP) machines increase throughput by adding processors that share the server's internal memory and disk-storage devices. You can bu
y most SMP in minimal configurations (i.e., with two processors) and upgrade when necessary to handle growth in your processing needs. The scalability of an SMP machine reaches its limit at the maximum number of processors supported by the connection mechanisms (i.e., the backplane and the shared bus).
A massively parallel processing (MPP) machine connects an array of processors via a high-speed, high-bandwidth link. Each node is a server, complete with its own (possibly SMP) processor and internal memory. To optimize an MPP architecture, applications must be "parallelized" (i.e., designed to operate in separate, parallel pieces).
This architecture is ideal for searching large databases. However, the DBMS that you select must be one that offers a parallel version. And even then, substantial design and tuning are required to achieve optimum data distribution and to prevent hot spots or
data skew
(where a disproportionate amount of the processing is shifted to one processing node due to the par
tition of data under its control).
The difficulty of moving applications and DBMSes to clustered or truly paral-lel environments has led to newer architectures, such as nonuniform memory access (NUMA). NUMA creates a single, large SMP machine by connecting multiple SMP nodes into a single (though physically distributed) memory pool and a single instance of the OS. NUMA allows the SMP approach to achieve the performance benefits of large MPP machines (with 32 or more processors) while maintaining the simplicity and management advantages of a standard SMP environment. Most appealing of all, existing DBMSes and applications can be moved from single-processor or SMP platforms to NUMA with no modifications.
DBMS Decisions
Data warehouses (along with decision-support systems [DSSes] and client/server applications) were the first successes for the relational DBMS (RDBMS). While most operational systems were legacy applications mired in old data structures, warehouses and DSSes capitalize
d on RDBMS for their flexibility and ability to perform ad hoc queries.
RDBMSes are most flexible when they are used with a normalized data structure. In a normalized database, data structures are nonredundant and represent the basic entities and relationships described by the data (e.g., products, markets, and sales transactions). But a typical on-line analytical processing (OLAP) query that crosses several structures ("How many blue widgets did we sell in stores in the Midwest?") requires several join operations to put such data back together. The performance of traditional RDBMSes is better for key-based queries ("Find customer account #1234") than for content-based queries ("Find all customers with income over $50,000 who have bought an automobile in the last six months").
To support large-scale warehouses and improve responsiveness for OLAP applications, vendors have added new features to the traditional RDBMS. These so-called super-relational features include support for specialized database har
dware, such as the Teradata database machine. Super-relational features also support extensions to storage formats and relational operations (offered by vendors such as RedBrick) and specialized indexing schemes, such as those used by Sybase IQ. These techniques can improve performance for content-based retrievals by pre-joining tables using indexes or through the use of fully inverted index lists.
Most data-warehouse-access tools exploit the multidimensional nature of warehouse data. For example, marketing analysts need to look at sales volumes by product, by market, by time period, by promotions and advertising levels, and by combinations of these different aspects. Structuring data in a traditional relational database to facilitate queries and analyses along different dimensions (e.g., star or snowflake schemata) has become commonplace. These schemata might use multiple tables and pointers to simulate a multidimensional structure. Some DBMS products, such as Essbase and Gentium, implement storage tech
niques and operators that support multidimensional data structures.
While multidimensional databases (MDDBs) help you directly manipulate multidimensional data objects (e.g., the easy rotation of data to view along different dimensions, or drill-down operations that successively expose more-detailed levels of data), you must identify these dimensions when building the database structure. Thus, adding a new dimension, or changing the views desired, can be cumbersome and costly. Some MDDBs require a complete reload of the database when restructuring occurs.
New Dimensions
A limitation of both RDBMS and MDDB is a lack of support for nontraditional data types, such as images, documents, and video/audio clips. If you need these types of objects in your data warehouse, look to an object-relational DBMS, such as Informix's Illustra. Focused on coded data values, most database systems can accommodate these data types only with somewhat-awkward reference-based extensions, such as pointers
to files containing the objects. Many RDBMSes store complex data as binary large objects (BLObs). In this format, the objects cannot be indexed, sorted, or searched by the server.
Object-relational DBMSes, on the other hand, store complex data as native objects and can support the extensive data structures found in an object-oriented environment. These database systems naturally accommodate not only special data types but also the processing methods that are unique to each (e.g., a COMPARE operation on a picture might return another picture with similar features).
But a disadvantage to the object-relational approach is that the encapsulation of data within special data types -- a series of stock prices over time in each row of a stock table, for example -- requires specialized operators for what were previously simple searches (e.g., "Find all stocks that showed a decline in price from April to May 1996").
DBMS selection is also tied to the server hardware you use. Some RDBMSes, such as DB2 Parall
el, Informix XPS, and Oracle Parallel, offer versions that support parallel operations. Parallel software splits queries, joins across multiple processors, and runs these operations simultaneously to improve performance.
Parallelism is required for the best performance on large MPP and clustered SMP servers. It's not yet an option with MDDBs or object-relational DBMSes.
The table
"How DBMSes Compare"
summarizes the pros and cons of the different types of DBMSes for warehouse operations. The table
"Data-Warehouse Decision Matrix"
contains some examples of how these decision criteria affect the choice of a server/data-warehouse architecture.
Piecing It Together
To select the right combination of server architecture and DBMS, you first need to understand your company's business requirements, its user population, and the skills of the support staff.
Data-warehouse implementations vary considerably in scope. Some are designed to suppo
rt specific analysis needs for a single department or functional area of an organization, such as finance, sales, or marketing. Other implementations bring together data from across the entire enterprise to support a variety of user groups and functions. As a general rule, the broader the scope of the warehouse, the more power and functionality required of the server and DBMS.
Warehouse-usage patterns are also a factor. Prestructured queries and report views often satisfy informational users while exacting fewer demands on the DBMS and the processing power of the server. Complex analysis, which is typical of decision-support environments, requires more power and flexibility from all server components. Massive searches of large data warehouses favor parallelism in both DBMS and server.
Dynamic environments, with their ever-changing requirements, are served best by a simple, easily changeable data architecture (e.g., a highly normalized relational structure) rather than an intricate structure that requi
res rebuilding after every change (e.g., a multidimensional structure). The required data-refresh rate indicates how responsive the data warehouse must be to new and changing data. Large data volumes that are refreshed at frequent intervals favor a physically centralized architecture to support efficient data capture and minimize data-transport time.
A user profile should identify who your data warehouse's users are, where they're located, and how many you need to support. Information about how each group expects to use warehouse data will help you analyze the various usage styles.
Knowing the physical location of your users helps you determine how and to what extent you need to distribute warehouse data. A tiered architecture might use servers already in place on LANs. Or you may need a centralized approach to support mobile workers who tap into the warehouse from their laptops.
The total number of users and their connection patterns determine the size of your warehouse servers. Memory sizes and
I/O throughput must support the anticipated number of concurrent users under normal conditions as well as at peak usage times at your organization.
Finally, you must factor in support-staff sophistication. The IS resources that are available within your organization might limit the complexity or sophistication of your server architecture. Without skilled in-house staff or outside consultants, an architecture that requires server-platform parallelism (MPP or clustered SMP, for example) is difficult to create and maintain successfully.
Expansion Plans
As your warehouse matures and the data it contains becomes more accessible, employees outside the warehouse might also discover the value of its data. By linking your data warehouse to other systems -- both internal and external to the organization -- you can share information with other business entities with little or no custom development. E-mail messages, Web servers, and intranet/Internet connections can deliver inventory levels t
o your suppliers or order status to your business partners.
As data warehouses continue to grow in sophistication and usability, the data accumulated within an enterprise will become more organized, more interconnected, more accessible, and more generally available to more employees. The result? Better business decisions, more business opportunities, and a more enlightened work force.
Feature/Function Relational Super- Multidi- Multidi- Object-
Relational mensional mensional Relational
(logical) (physical)
Normalized structures y y y
Abstract data types y
Parallelism y
Multidimensional y y y
structures
Drill-down y y y
Rotation
y y y
Data-dependent y
operations
Key
y = yes
For this environment ...
Business requirements User population Systems support
====================================================================
Scope: departmental Small; Minimal local;
Uses: data analysis single location average central
Choose ...
Architecture Server DBMS
=========================================================
Consolidated; Single-processor MDDB
turnkey package or SMP
For this environment ...
Business requirements User
population Systems support
====================================================================
Scope: departmental Large; analysts at Minimal local;
Uses: analysis plus single location; average central
informational informational users
dispersed
Choose ...
Architecture Server DBMS
=========================================================
Tiered; Clustered SMP RDBMS for
detail at central; for central; central; MDDB
summary at local SP or SMP for local for local
For this environment ...
Business requirements User population Systems support
====================================================================
Scope: enterprise Large; geographically Strong central
Uses:
analysis plus dispersed
informational
Choose ...
Architecture Server DBMS
=========================================================
Centralized Clustered SMP Object-relational;
Web support
For this environment ...
Business requirements User population Systems support
====================================================================
Scope: departmental Small; few sites Strong central
Uses: exploratory
Choose ...
Architecture Server DBMS
=========================================================
Centralized MPP RDBMS with
parallel support
illustration_link (20 Kbytes)

A centralized data-warehouse architecture provides efficiencies in both processing power and support costs.
illustration_link (25 Kbytes)

illustration_link (34 Kbytes)

Jay-Louise Weldon (New York, NY) heads the Data Warehouse Practice within the U.S. Eastern region of MCI Systemhouse, a global-systems-integration firm. You can contact her at
jweldon@shl.com
.