A data warehouse does more than collect data. It reflects a valid and consistent image of the business.
Richard Hackathorn
You gaze with pride at your latest handiwork, the company's new data warehouse. At 11:35, the phone rings. You scoop it up. The voice on the other end says, "This is Marlena Price, VP of finance. I just got the first reports from your new warehouse system. They look nice, but the numbers that you generated don't jibe with our standard re
ports. Would you meet with me after lunch to explain?"
Your blood runs cold. Calmly, you reply, "Sure. Where exactly are you seeing problems?" She explains that some of the regional reports have questionable results. You ask her a few more questions and then hang up.
Next you dial Eric, the data administrator. "We've got a problem," you begin. "The VP of finance is questioning our reports."
The popularity of data warehousing (DW) has, unfortunately, obscured its basic objectives. Many people construct DW systems for simplistic reasons and with unrealistic expectations. Neglecting to establish and maintain credibility in the reported data until it's too late has caused many DW efforts to fail in their early stages. To deal with this credibility crisis, you must grapple with one big question: Does the data warehouse convey a consistent and valid image of your business operations?
A Constant Struggle
The term
data warehous
ing
emphasizes an overall process and architecture, while the terms
data warehouse
and
data mart
refer to a collection of data managed within a database. A data mart is a data warehouse that deals with a specific organizational function. In this context, DW is the discovery, management, and dissemination of informational data. This data, which is derived from various sources, provides a consistent and valid image of the business that it documents.
Maintaining this image presents a constant struggle of integrating an ever-changing variety of models that attempt to describe the business. Because of this, you must confront validity issues right from the start when constructing a data warehouse.
First, there's the challenge of understanding the warehouse's data properly from a business perspective. The ability to audit any data element is an essential for instilling confidence in the data. Second, there's the challenge of consistency in data usage. Two users might think they're deba
ting over similar data elements when in fact the elements indicate different aspects of the business, either because of naming inconsistencies or nonsynchronized data. Third, there's the problem of properly building SQL queries. Slight variations in SQL syntax can produce widely differing results from the same set of data. Most of SQL's syntax is poorly understood, requiring an expert to use it properly. Fourth, there's the challenge of expanding access to the data warehouse to people who might be unfamiliar about the precise business interpretation of a specific data element. Subsequent analyses based on such a data element can easily lead to erroneous business decisions. Finally, there's the ever-present demand of rapidly inventing new queries and reports as business requirements change.
Eric taps furiously at a workstation's keyboard. "Here's the drill-down program I whipped up when it took us forever to track down that problem with the currency conversions in the overseas reports," he says. "You
enter SQL queries here, and the results are colored-coded according to tolerances that you set here," he adds, pointing at a pane on the screen. You nod, sit at another workstation, start the program, and type in a SQL query. Rows of data begin to scamper up the screen. You roll your chair to another workstation, and soon it, too, has data streaming up-screen, the result of a slightly different SQL query.
On one of the workstations, several rows of data glow angry red, and the display halts. Eric clicks on a field in the offending row, and a pop-up window displays an exhaustive trace of the data's origins. You jab at the window: "Got it." You type in more commands, and paper starts coming out of a printer.
Checks for Validity
To ensure that your data warehouse provides a consistent and valid image of business reality, several types of validity checks must be designed into the system from the beginning. You must be able to audit the contents and structure of data flows. Such checkpoint
s must be located at various points in the flows from data sources to data consumption, as shown in the figure
"Validity Checks for Data-Warehouse Flows"
.
There should be a
uniformity check
that ensures data values are within prescribed limits. This should be a basic component when acquiring new data for the warehouse. Based on predefined or statistically set parameters, filters should detect whether data values are within expected limits and also report any exceptions. The uniformity check occurs during in-flow, where data creation or capture flows into the warehouse. (For a more-detailed description of in-flow, see the glossary below.) The in-flow consists of a sequence of steps from data capture to validation. As the figure shows, these steps might require repair and transformation operations before the data is loaded into the database.
A
version check
is similar to the uniformity check. It also occurs during the in-flow and detects changes in the meta-data s
pecifications. For example, an unexpected change in the encoding of data values, such as a shift from a 1 or 2 encoding to
M
or
F
, should be caught during comparisons with the metadata specification. Other examples include new or missing columns in relational tables and an increased width in a fixed field size.
Situations that are difficult to catch include changes in the timing of data capture, such as one region reporting on a calendar year while another uses the fiscal year. Many exceptions to the version check imply the need for new transformation routines during data loading and possibly the restructuring of the warehouse itself.
A
completeness check
determines if aggregations of data are complete and correct. It occurs during data analysis and reporting in the up-flow, where value is added through summarization (the aggregations) and packaging (e.g., spreadsheets). Aggregations are useful but may hide (i.e., destroy) important data. For example, an average sales figure
may be misleading if values are missing (i.e., null) from critical sales regions because of data-collection problems. A completeness check is more difficult if the aggregation is a complex procedure rather than a simple average or total.
A
conformity check
does proper correlation of data with standard sources. It occurs during data analysis and reporting and validates that the data conforms with other data sources and reports. An important use for the conformity check is the correlation between key financials reported to the IRS and the SEC and various internal indicators. Another use for some companies is a historical correlation in sales for various regions. A sudden change in this correlation could indicate a fundamental shift in the business or simply bad data collection or faulty analysis.
Finally, when all else fails, a
genealogy check
provides a complete audit trail to the data source. It occurs during data reporting when the consumer of the information questions the validity
of the data. This check should generate a trace of the information through its transformations and back to its sources. With on-line analytical processing (OLAP) tools, a genealogy check on data values is referred to as a "drill-down" upon a specific aggregation so that data values at a finer granularity can be viewed. Likewise, you should be able to perform a drill-down into the metadata all the way back to the source.
Instilling Confidence
Ms. Price looks closely at the printouts, then at you. You explain, "Over the years, we've added new products and removed obsolete ones. Furthermore, some of the accounting procedures have been revised. This is especially evident in the overseas results, which require different recording and reporting of sales. Each division developed its own ad hoc solution for these changes, but we had to set up some new calculations to correct for these situations." Because of your quick and thorough explanation of the discrepancy, the VP of finance becomes an ally o
f yours.
Many well-executed DW efforts fail simply because the implementers neglect to establish and maintain confidence in the generated data. The basic issue with any DW system is: Does the warehouse's information represent a consistent and valid image of business reality?
Every manager of a DW effort should be equipped to answer this question easily and quickly. The alternative is a continuing series of credibility crises that will tarnish your DW efforts.
In-flow: Describes the data flow from creation or capture into
the data warehouse. Data sources can be systems that provide internal
information on business operations (e.g., sales) plus various
external sources (e.g., e-mail and the Internet).
Up-flow: Involves summarizing data into a form relevant to
users. Through the use of projections, aggregation functions, and
groups, detailed data is refined and packaged into views that focus
on specific business problems.
Metadata: Deals with the modeling and control of data in the
warehouse. It handles system modeling, which is the technical
analysis of those systems that act as data sources. Metadata is also
the analysis of usage patterns, which guides the design of
information objects, which in turn models business processes.
illustration_link (26 Kbytes)

Information entering the data warehouse must be monitored for consistency and completeness.
Richard Hackathorn is president of Bolder Technology, Inc. (Boulder, CO), a company specializing in enterprise connectiv
ity and data warehousing. You can reach him by sending e-mail to
richardh@bolder.com
or to his company's site at
http://www.bolder.com/
.