Archives
 
 
 
  Special
 
 
 
  About Us
 
 
 

Newsletter
Free E-mail Newsletter from BYTE.com

 
    
           
Visit the home page Browse the four-year online archive Download platform-neutral CPU/FPU benchmarks Find information for advertisers, authors, vendors, subscribers Request free information on products written about or advertised in BYTE Submit a press release, or scan recent announcements Talk with BYTE's staff and readers about products and technologies

ArticlesAttention, Data-Mart Shoppers


July 1997 / Features / Attention, Data-Mart Shoppers

For access to your most valuable data, you can build data marts from either the bottom up or the top down.

Karen Watterson

Scenario #1: Your large corporation has an enterprise-wide data warehouse that contains all the information you need for your department -- somewhere. Problem is, to isolate what you need, you're doing more slicing and dicing than Julia Child, more pivoting than Nijinsky, and more diving than Jacques Cousteau. Plus, your network administrator is giving you dirty looks for all the back-and-forth to the production server you're using that everyone wants a piece of.

Scenari o #2: Y our department or small firm would love to leverage the data in your database. But the price tag and effor t of setting up a data warehouse keep you at the yearning stage.

Data warehousing is hot today because businesses realize that they can use information as a competitive weapon. The idea is to consolidate and aggregate data from production systems into centralized or distributed data warehouses or data marts, where users can get at it. Users can then use the data they obtain to, say, provide better customer service, do better analysis in a more timely fashion, or look at data in new ways that let them spot otherwise unknown problems or opportunities.

Two other compelling incentives for creating data warehouses and data marts are the ability to do data mining and then offering the resultant warehoused data over the Internet. The good news is that a data mart can meet the needs from either end: local access to pertinent data, for a cost and an amount of effort that won't kill you.

Warehouses vs. Marts

Although word is spreading that data marts are definitely the way to go -- they're less expensive and easier to build than enterprise data warehouses -- the terminology itself can be an emotional issue (see the table "Data Mart or Data Warehouse?" ). When Bill Inmon, cofounder of Prism Solutions and now head of his own data-warehousing company, Pine Cone Systems, first defined a data warehouse in 1990 as a "subject-oriented, integrated, time-variant, and nonvolatile collection of data supporting management's decisions," his vision of data warehouses was on an enterprise scale.

Today's more process-oriented definitions focus on warehousing rather than on warehouses or marts. Warehousing refers to a set of processes or an architecture that merges related data from many operational systems to provide an integrated view of data that can span multiple business divisions.

Data marts, on the other hand, tend to be subject- or department-oriented. They can be subsets of a larger warehouse (in which case they are called dependent data marts), but that's not a requirement, and many independent, "stovepipe" data marts exist. You can't even define data marts in terms of size -- they aren't necessarily smaller than corporate warehouses. It's entirely possible, for example, for business-unit analysis to require so much historical depth that it's larger than the summary-level data available in the corporate warehouse.

The data-warehouse-versus-data-mart debate is one of those issues for which there's no simple answer. Some organizations insist that data marts are subsets of data warehouses and that you can't (and shouldn't) have any data marts until you pay your dues and create an enterprise data warehouse (and data model). Others say that, in today's fiercely competitive environment, you'd be crazy to embark on a multiyear warehousing project as a prerequisite to deploying any data marts. They believe it makes more sense to simply align a data-mart pilot project with an organization's core competency -- the one that accounts for 90 percent of the firm's profits -- and go for the more immediate results.

And given the relatively modest price tags of data marts, doesn't it make sense to view your first data mart as a throwaway? Fred Brooks, in his classic Mythical Man Month , says that "where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time." In other words, you should actually plan to throw one away -- chances are you will anyhow.

Thus, a data mart is similar to a data warehouse that contains all of an organization's data, but it's more limited in scope. It typically focuses on the needs of a specific business unit or function and is less expensive and faster to implement than an enterprise-wide data warehouse. On the other hand, although data marts are less expensive, easier t o start (especially given all the 30- to 90-day quick-start bundles), and often offer better performance than a gigantic data warehouse, they're often harder to scale up.

The Three M's: Models, Methodologies, & Metadata

Still, there's a powerful incentive to get things right the first time. Where do you start? For simple data marts with few data sources, you might want to consider your primary relational database vendor, especially if the source data resides in an IBM, Informix, Oracle, or Sybase database already. Each of those vendors offers data-mart bundles for less than $100,000, scaling from Windows NT up to massively parallel processing (MPP) systems (see the table "Representative Vendors with Data-Mart Packages" ).

However, the majority of data-warehouse and data-mart projects start with meetings that attempt to define scope, products, platforms, time lines, and budgets all at once. Sometimes it helps to have a me thodology to follow. Most vendors offer some sort of methodology, perhaps in the guise of industry-specific templates or expertise, or their own best practices, but a few generic methodologies have also emerged.

Some methodologies, like Earl Hadden's, have even been licensed by data-mart vendors. Hadden's Data Warehouse Framework, based on John Zachman's framework for systems architecture, consists of a matrix with rows for owners, architects, designers, and builders, and columns for data, process, location, organization, event, and business driver. The Hadden Data Warehouse Method consists of three basic stages: architect, implement, and operate/enhance. Another well-known methodology is Prism's ( http://www.prismsolutions.com ) Iterations.

The architecture stage is typically the hardest, since it consists of identifying data sources, constructing a data model at both the conceptual and logical levels, and mapping the production data into the data warehouse or mart (i.e., constructing the physical model). Not surprisingly, traditional data-model vendors are beginning to offer help in designing data warehouses and marts. Powersoft's ( http://www.powersoft.com ) Power-Designor (formerly called S-Designor), for example, includes a new WarehouseArchitect module for designing and building Sybase IQ data warehouses and data marts. Cayenne Software ( http://www.cayennesoft.com ) is another data-modeling vendor that offer s help in designing data warehouses and data marts.

If you've heard about star schemas, on-line analytical processing (OLAP), or multidimensional database engines, you probably wonder where they fit into the data model. The term OLAP was coined by Ted Codd and Chris Date. Relational on-line analytical processing, or ROLAP, refers to engines and tools that add multidimensionality to data in relational database management systems (RDBMSes), using techniques such as star schemas.

The term star schemas refers to special database designs that make relational databases suitable for ad hoc slice-and-dice queries. Star schemas consist of fact and dimension tables, the idea being to mimic a multidimensional database with views of data (e.g., sales by region or time) that are likely to be explored. Snowflaking refers to splitting up the dimensions (i.e., decomposing them). Both star schemas and snowflaking result in databases that are denormalized and consist of redundant and precalcula ted data. Denormalization, a bad word in the production database arena, results in faster performance in data warehousing.

All warehouse products stash their metadata somewhere -- often in the DBMS itself -- but there aren't any widely accepted formats yet for metadata (data about data) interoperability and exchange. Fortunately, there are efforts currently under way (see http://www.metadata.org ) to define such standards -- something that will become increasingly useful as organizations seek to consolidate independent data marts, integrate business-to-business data sharing over the Internet, and handle "inherited" data marts and data warehouses that result from mergers or acquisitions. As you'd expect, the growing field of data-warehouse management tools focuses on metadata.

It should be obvious, but th e more time you spend on doing a good, centrally administered data model, the easier it will be to deploy additional data marts and consolidate them into a consistent data warehouse -- if that's your organization's ultimate goal. The danger with deploying multiple independent data marts with inconsistent data models is that not only must they be independently refreshed, but you'll also have to maintain multiple sets of probably inconsistent metadata.

Process Steps: Data Cleansing and Loading

You might think that doing the data modeling and mapping is the hardest part of building a data warehouse or mart. But more often than not, the data-cleansing step is even more time consuming. The sad truth is that most corporate data contains not only erroneous data, but also data with inconsistencies that are extremely hard to resolve.

One of the hardest things to accept when building a data warehouse or mart is deciding how clean is clean enough; that's something that varies, depending on what the data warehouse or mart is to be used for. Some implementers simply rely on the classic "80/20" rule. Although the idea of 20 percent "bad" data might shock you, it's not uncommon for real data warehouses to start with 40 percent problematic data. Keep in mind also that data cleansing isn't a one-time deal. After all, whenever you refresh, or update, your data warehouse or mart, you have to go through the same data-cleansing steps to maximize data quality.

Like data cleansing, data loading is something that happens more than once. It's also something that you have to monitor, because it can impact operational systems (after all, you use production-server CPU cycles to copy data out to the data warehouses or marts) and the network traffic. Users want data that's as up-to-date as possible, but network managers and on-line transaction processing (OLTP) system managers undoubtedly encourage moderation. It takes a lot of bandwidth to maintain near-real-time data marts.

If you think about how data marts and warehouses get loaded up with data -- hourly, daily, or weekly -- you can see how fast they can grow. That's one reason why it's smart to have an upward migration plan in hand from the onset. You'll probably also have to archive old data-mart or warehouse data, or simply decide to keep only the most recent data -- say, three months' worth -- in your system.

By now it should be clear that building a data warehouse or mart isn't easy and that it's an ongoing task. Users want more data, faster response, and perhaps even increasingly powerful tools as they become more adept at slicing, dicing, and otherwise analyzing your organization's data. And sooner or later, someone will want to Web-enable your data warehouse or mart. Next aisle, please.


Data Mart or Data Warehouse?

Data Mart or Data Warehouse?
Data mart Data warehouse
Size Averages 50 GB; possibly as large as 250 GB Often terabyte +
Purpose Subject-area-specific Enterprise repository
How initiated Bottom-up Top-down, along with formed enterprise data model
Control Departmental IS
Time line 3 to 6 months 1 to 2 years
Cost $10,000 to $1,000,000 Millions of dollars


Representative Vendors with Data-Mart Packages

Representative Vendors with Data-Mart Packages
Vendor Package Starting price Web address
Arbor Software Essbase $37,000 http://www.arborsoft.com
Digital Equipment Corp. Digital Data Mart (uses Informatica's PowerMart) $99,500 http://www.digital.com
Fiserv, Inc. InformEnt (for banks) Client-specific http://www.fiserv.com/inform
IBM Corp. Visual Warehouse for Windows NT $22,500 http://www.software.ibm.com
IBM Corp. AS/400 $7995 http://www.as400.ibm.com
Informatica PowerMart $50,000 http://www.informatica.com
Information Advantage DecisionSuite $29,995 http://www.infoadvan.com
Information Builders, Inc. SmartMart $55,000 http://www.ibi.com
Informix Software Informix FastStart Data Mart (includes MetaCube) $62,500 (Informix Online Work-group Server alone starts at $1475) http://www.informix.com
JDA Software Group, Inc. Retail Ideas (AS/400 solution) Varies http://www.jda.com
Kenan Technologies Acumate ES $25,000 http://www.kenan.com
Microsoft Corp. SQL Server $1499 http://www.microsoft.com
MicroStrategy DSS Server $50,000 http://www.strategy.com
Oracle Oracle Server for NT/Oracle Express Server for NT $1475/$3995 http://www.oracle.com
Prism Solutions Prism Scalable Data Mart $70,000 http://www.prisminc.com
Red Brick Red Brick Warehouse $5000 http://www.redbrick.com
Sagent Technology Sagent Data Mart Solution $50,000 http://www.sagenttech.com
SAS Rapid Warehouse Assessment, SAS/Warehouse Administrator $15,000 http://www.sas.com
Seagate/Holistic Systems Holos $88,000 http://www.holos.com
Sequent Computer Systems CompleteWarehouse $500,000 (hardware and software) http://www.sequent.com
Siemens Pyramid Information Systems Smart Warehouse $25,000 http://www.siemens-pyramid.com
Software AG Dmart and SourcePoint $25,000 http://www.sagus.com
Sybase WarehouseNow, formerly called QuickStart DataMart (uses Informatica's PowerMart) $95,000 http://www.sybase.com
Virtual Integration Technology designManager and deliveryManager $50,000 per server http://www.vit.com
VMark Software DataStage $37,500 http://www.vmark.com


Data Marts


If you  want a simple data mart and have few data sources:


* Consider your primary relational database vendor, especially if the
   desired data is already on the database. IBM, Informix, Oracle, and
   Sybase offer low-price, scalable data marts.
* Look at other vendors' packages.
* Goals: simplicity, compatibility, low expense, expandability.


If you're building from scratch:


* Consider a vendor-supported methodology.
* Create an architecture, data models, and data mapping.
* Goals: low cost, suitable to task.


...or from the top down


If you have an enterprise-wide data warehouse:

* Choose data appropriate to your task.
* Set up a loading data mart with a subset of the data warehouse.
* Goals: Use an existing system, and select data that you can use.




Data-Marts Road Map

illustration_link (66 Kbytes)


Karen Watterson (San Diego, CA) is an independent writer and consultant who specializes in client/server and data-warehousing issues. You can contact her by sending e-mail to 1119390@mcimail.com .

Up to the Features section contentsGo to next article: Additional ResourcesSearchSend a comment on this articleSubscribe to BYTE or BYTE on CD-ROM  
Flexible C++
Matthew Wilson
My approach to software engineering is far more pragmatic than it is theoretical--and no language better exemplifies this than C++.

more...

BYTE Digest

BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin, and dozens of other CMP publications—bringing you critical news and information about wireless communication, computer security, software development, embedded systems, and more!

Find out more

BYTE.com Store

BYTE CD-ROM
NOW, on one CD-ROM, you can instantly access more than 8 years of BYTE.
 
The Best of BYTE Volume 1: Programming Languages
The Best of BYTE
Volume 1: Programming Languages
In this issue of Best of BYTE, we bring together some of the leading programming language designers and implementors...

Copyright © 2005 CMP Media LLC, Privacy Policy, Your California Privacy rights, Terms of Service
Site comments: webmaster@byte.com
SDMG Web Sites: BYTE.com, C/C++ Users Journal, Dr. Dobb's Journal, MSDN Magazine, New Architect, SD Expo, SD Magazine, Sys Admin, The Perl Journal, UnixReview.com, Windows Developer Network