associated with data warehousing, and take a look into the future.
Success Story
Prior to the data warehouse, when there were only legacy, operational applications, integration of data and information was only a dream. Each application had its unique view of who a customer was, what a product was, and what an order was. No two applications necessarily agreed on anything, and a corporate perspective of information was virtual fiction. In addition, legacy applications looked at and contained only very current data. Historical data didn't exist in any organized manner. And summary data was never anything but a very small part of the operational environment.
Data warehouses squarely address these inadequacies of the operational environment by integrating data, providing historical data, and providing detailed as well as summary data.
Integrated Data.
Data in a warehouse is integrated, so it can support a corporate perspective of
data. With a warehouse, an executive can immediately look at corporate information. The integration required to build a data warehouse is not an easy process. John Ladley of the Meta Group states that integration takes up to 75 percent of the development dollars for the building of the data warehouse. Integration is complex, painful, and requires much thinking -- but it pays off handsomely.
Historical Data.
A second fundamental characteristic of data warehousing is that the warehouse contains historical data. Typically, data warehouses contain five to 10 years of such accumulated data.
By looking across multiple quarters, corporations can start to see the forest from the trees. They can start to understand the seasonality of their business across multiple years. In understanding their seasonality, corporations can tell whether they are truly making progress or merely marking time.
Detailed and Summary Data.
Summary data is important to management because management ne
eds to see the larger picture before it can concentrate on the interesting details. In many regards, detailed data merely hides information that management will find interesting.
Two Warehouse Mutants
There are several notable mutant forms of a data warehouse. One form is the operational data store (ODS) -- a data warehouse in the operational environment. An ODS shares many characteristics of a data warehouse but has unique characteristics. For example, an ODS can be updated and can provide fast transaction response time, something a classic data warehouse cannot do.
Another mutant form of data warehouse is the data mart. A data mart is a departmentalized form of a data warehouse. They are similar to data warehouses except data marts:
- are customized for a single department;
- contain less historical data than a data warehouse;
- operate on technologies that are suitable for a much lower volume of data than a data warehouse;
- have many more indexes than a warehouse, so the optimal data structure for a data mart is the star join (for the more voluminous data warehouse, a more normalized structure is appropriate);
- have a fairly predictable usage pattern;
- contain much more summary data;
- look very different from department to department;
- hold very little, if any, detailed data.
A Brief History
Data warehousing rose from the origins of the notion that there should be a split between different types of databases. There are eight main reasons that it was a good idea to split processing across multiple databases.
- Operational databases require split-second response time, whereas data warehouses and decision support systems (DSSes) processing do not.
- The clerical community uses transaction-oriented databases; the managerial community uses data warehouses.
- Up-to-the-second decisions are made from operational systems; long-term decisions are made from data warehouses.
- Operational databases contain very current information; data warehouses contain historical information.
- Operational databases are unintegrated and are often application-specific; warehouses contain integrated data.
- Operational databases are designed for detailed data; warehouses are designed for detailed and summary data.
- In an operational environment, the requirements for processing are known before the system is built; in a data warehouse environment, the requirements for processing are discovered as part of the development process.
- Requirements for processing in an operational environment are static; requirements for processing in a data warehouse environment are heuristic and are discovered through iterative development.
From the notion that there should be a single database to serve the corporation's needs came the notion that separate and distinctly different database types were needed to serve the corporation's needs.
Creating the Warehouse
While the notion of data warehousing was enormously app
ealing, the first issue the corporation faced was that of creating the warehouse from the data found in the legacy, operational environment. At first, people thought the problem was as simple as moving data from an operational platform to a data warehouse platform. They thought that replication of data was all they needed. While data certainly needed to be moved, the data required integration and transformation during the moving process. Very quickly integration and transformation technology appeared. Sophisticated shops discovered that there was no need to have many programmers manually creating the code needed to integrate the data.
The next phase in
the evolution
of the data warehouse environment was the advent of the data mart. Data marts have always been a part of the DSS architecture. The earliest manifestation of the data warehouse/data mart architecture was in a form that can be called the independent data mart: The data mart is created directly from the legacy, operational
applications. There is no data warehouse where there are independent data marts. Independent data marts are appealing because they can be built cheaply, quickly, and simply.
For a while, independent data marts were very popular. But they reached a point where several major architectural flaws appeared. When a corporation built more than one independent data mart, there was massive redundancy of data (primarily detailed data) from one independent data mart to another. Also, the number of interface programs from the independent data marts back to the legacy, operational application environment grew exponentially. There was also no single corporate "source of truth"; as a consequence, different departments were saying quite different things about the same data based on analysis obtained from their independent data mart. Finally, the machine resources required for extracting legacy, operational data from the same application by each independent data mart grew intolerable.
In a word, organizations tha
t built a series of independent data marts simply didn't get their money's worth from data warehousing. Independent data marts are not the solution to the corporate information problem. After a short period, data architects perceived that
dependent
data marts were the proper architecture.
In a dependent data mart architecture, there is a central corporate data warehouse that feeds the dependent data marts. This architecture is sometimes called the
hub-and-spoke
architecture, where the data marts are the spokes and the data warehouse is the hub. The hub- and-spoke architecture has much to commend itself. For one, there is integration of data and reconcilability of data at the hub. In addition, there is autonomy of processing at the spoke and no necessary redundancy of data at the spokes. And there is a rich amount of history at the hub.
The general patterns of
database design
have mimicked the evolution and sophistication of the data warehouse/DSS environment
. In the early days, when there was a nascent data warehouse, classic data normalization was the basis for design. As independent data marts emerged, star joins and snowflake structures became the norm for design. And as the hub-and-spoke architecture evolved, normalized, data-model based designs for the hub and star join became the norm, as did snowflake designs for the spokes.
Data Mining
Once the warehouse and its architectural components are built, the next part involves exploiting the warehouse. Data mining is the next logical step in completing the circle of effective decision support. With data mining, you can discover important business patterns, examine relationships between obscure and otherwise unnoticed variables, and measure long-term trends. In short, data mining fulfills many of the expectations of data warehousing.
An interesting question that almost immediately arises is: Can data mining be done without building a data warehouse? Does a corporation really have to go throu
gh the effort and investment of building a warehouse in order to use data mining technology successfully?
The answer is that data mining can be done with no data warehouse or data marts at all. But just because data mining can be done does not mean that data mining can be done effectively. Can data mining be done
effectively
with no data warehouse/DSS infrastructure? The answer is that data warehousing is absolutely essential for effective data mining.
Why is a data warehouse/DSS infrastructure essential for corporations that are serious about data mining? Simply stated, warehouses prepare the raw data of the corporation for mining analysis in an optimal manner. This preparation before analysis shows up very beneficially in many ways.
One of the essences of a data warehouse is that data is integrated as it is placed in the warehouse. This means there must be uniformity and continuity to the understanding of common corporate objects, such as who is a customer, what is a transaction,
and so forth. By building the warehouse first, the data miner can dive into the analysis immediately and can start to achieve results immediately. But if the data miner does not have a data warehouse to operate from, then the miner must spend precious time gathering, cleansing, scrubbing, and integrating the data. It will be a long time until the miner is set to even start the analysis portion of data mining if there is no warehouse infrastructure.
The second reason the warehouse sets the stage for success in data mining is that it collects and organizes historical data. The data miner needs a wealth of historical data in order to find the patterns and relationships that are interesting. If there is no central collection of historical data like the one in the warehouse, then the data miner must go out and find the historical data. In some cases the data miner can find the historical data. But in other cases it simply does not exist. When there is a data warehouse, the miner can sit down and immediately
start to work on the historical data inside the warehouse. The data miner is a long way from any meaningful analysis when he has to first gather and assimilate the historical data.
The third reason data warehousing opens the door to effective data mining is that the warehouse contains both summary data and detailed data. Unquestionably, the miner needs the detailed data in order to do analysis. But the summary data is useful at the outset of analysis, when the miner is planning an approach and needs to quickly look over the entire collection of detailed data. When there is a representative sample of different types of summary data, the miner can quickly survey what is and what is not in the warehouse. The summary data can save the miner fruitless iterations of analysis.
Emerging Trends
What is on the horizon for the data warehouse, the data mart, and data mining?
Data Management.
One of the obvious trends is the need for the management of the warehouse environment. Data wareh
ouses and data marts tend to grow at an amazing rate. As they grow, the volume of data that finds its way into the data warehouse becomes an obstacle to success. With this growth comes a slowdown in performance and an increase in budget. Soon the organization comes to the realization that the data warehouse infrastructure needs to be managed.
One of the first discoveries the manager makes is that managing the DSS/data warehouse environment is nothing like managing the classic operational, transaction-oriented environment. The DSS/data warehouse infrastructure has its own unique set of needs and peculiarities.
One of those peculiarities is that of dormant data that creeps into the warehouse. Dormant data is data that is never used. In the early stages of a warehouse, there is little dormant data. But as time passes, the amount of dormant data increases to the point that there is much more dormant data in the warehouse than data that is actively being used. At this point, the dormant data needs to b
e archived in order to keep processing streamlined.
Other management issues include the need to constantly monitor and cleanse data as it enters the warehouse and as it resides in the warehouse.
Metadata.
As corporations mature in their understanding of the warehouse environment, one obvious technology that emerges as being very important is metadata. In its simplest form, metadata is data about data. But in a warehouse environment, you need a more sophisticated view of metadata. There are many types of metadata that are beneficial to the DSS/data warehouse infrastructure.
But metadata was not part of the first generation of data warehouses for a variety of reasons. People were so anxious to get their first-generation data warehouses started that they went after only the most obvious parts of the warehouse. These first efforts concentrated merely on getting the data into the warehouse. But we are starting to see much more sophisticated, second-generation data warehouses being built no
w that do include metadata as an integral part of the infrastructure. Also, there really wasn't any appropriate technology for capturing and managing metadata in the early days of data warehousing. It took imagination back then to see why metadata was so important. There is only so much imagination to go around. Today, after real experience with first-generation data warehouses, people base decisions on experience rather than on imagination. An experienced warehouse administrator knows just how important metadata is.
One of the major issues of the DSS/data warehouse infrastructure is that every technology seems to be found in the warehouse environment -- everything from HP, IBM, NCR, and DEC to Sun and Sequent. You'll find Oracle, Sybase, Teradata, DB2, Informix, and Red Brick. There is Information Advantage, Business Objects, Cognos, DSS Agent, and Brio. Trying to gain a consensus of opinion among these vendors is impossible. Yet in order to be successful, there must be shareability and manag
eability of metadata across these vendors and products. One of the biggest challenges facing the metadata manager in the warehouse environment is crossing the technological barriers found in the environment.
Of course, metadata in one form or another has been around for a long time, and there have been some limited successes. But there are some severe limitations with the repository approach to the management of metadata as it applies to data warehousing. The primary limitation is that it does not account for the need for autonomy of processing by the end user. When the end user is working away on a Saturday afternoon in Lotus 1-2-3, she is not about to let an administrator interfere with the flow of work and analysis. The very essence of much of end user processing is the freedom of the end user from IT control. End users simply are not going to stand for an administrator telling them what can and can't be done. And anyway, what are Lotus and the other spreadsheets but metadata?
But there are oth
er architectural issues relating to metadata management across the DSS environment. If you have no central control of metadata, you will never have any uniformity of definition of data. There will never be any consistency of processing across the organization.
There needs to be a balance between shareability of metadata and autonomy of metadata. The balance can be achieved only by a distributed metadata architecture, where the different nodes of the architecture have their own metadata. A distributed approach to metadata management is the only viable approach in the DSS/data warehouse environment.
Tools.
There are cube technologies, relational technologies, and powerful spreadsheets. As more data becomes truly available to the end user, the tools that cater to the particular needs of the end user will have even greater variations and capabilities than they have today.
Storage and Archival Technologies.
There is so much data pouring into the warehouse environment that standard
disk storage cannot possibly -- economically and technologically -- hold it all. Furthermore, with data warehousing there is no need to hold all the data in an on-line mode. Near-line software and hardware that will allow data to reside on a hierarchy of storage will soon emerge.
On a little bit longer horizon, archival technology will surely appear. There is much to the subject of archival technology, and if there is one Achilles' heel of data warehousing, this is it. Today's archival technology is crude compared to what will be coming tomorrow.
Metaprocess.
Metadata technology in one form or another has been around for a while. But metaprocess technology has an important place and will start to appear commercially in the next few years.
illustration_link (20 Kbytes)

illustration_link (58 Kbytes)

illustration_link (21 Kbytes)

Warehouse, mart, and snowflake structures apply to different parts of an enterprise.
Bill Inmon, the father of the data warehouse concept, is chief technology officer of Pine Cone Systems (Englewood, CO), a company that builds and markets software for managing the data warehouse/data mart environment. You can reach him c/o
editors@bix.com
.