). No matter what the cause, dirty data damages the credibility of an entire warehouse implementation.
Fortunately, data-cleansing tools can help. In some cases, you might be able to create an effective cleansing script yourself. Commercial tools, on the other hand, might be mandatory if your database is large or especially inaccurate and inconsistent. Deciding which tool to use is important, and not only for the sake of data integrity. If you choose the wrong approach, you could squander weeks of programming resources or hundreds of thousands of dollars in tool costs.
Data Cleaning
Cleaning up dirty data is a complex, multifaceted process that starts with an analysis of your corporate data to uncover inaccuracies, anomalies, and other problems. You then must transform the data to make sure it's accurate and consistently represented. Next comes the step of ensuring
referential integrity
, the ability of the data warehouse to correctly i
dentify every instance of every business object, such as a product, a customer, or an employee. Next, you validate the data using the data-warehouse application to perform test queries.
After you're confident about the validity of your data, you need to produce
metadata
, a description of the data type, format, and business meaning of each field. Last comes the crucial step of documenting the whole process so you can more easily expand, modify, and repair the data in the future. In practice, you might have to perform multiple steps as part of a single operation or when using a single tool. In particular, cleaning data and ensuring referential integrity are interdependent processes.
Commercial tools can help you with each of these steps. However, it's possible to write your own programs to do the job. For example, CompuCom Systems, a large Dallas-based systems integrator, implemented a 12-million-record, 10-GB warehouse for internal decision support and for customers to view order-status and pr
oduct information via the Web.
CompuCom implemented some data-enhancement routines in the fourth-generation language (4GL) associated with its Progress database, which runs on an HP 9000. Enhancement includes decoding values of columns into short English descriptions, or
mnemonics
. Data-scrubbing code, such as date and data-type conversions, was written in C. One advantage of this approach is that CompuCom now owns these routines and can use them in other applications.
Users helped define the data-scrubbing requirements. "They know the data the best," says Tony Birrittier, data-warehouse project manager at CompuCom. "They inform us about what type of dirty data is out there and how to clean it."
Birrittier says the company didn't go with a commercial cleansing tool because most of its data is in the same basic form; thus, the company can easily reuse the routines it writes. The main drawback has been the amount of development time -- typically about a week -- needed to create the routines
. "We're having a difficult time keeping up with demand," Birrittier explains. "We've been looking at some [commercial] software packages, but we haven't found anything on the market that's a better fit here at CompuCom."
Some users have found that in-house data scrubbing takes too long to be practical. Ohio Casualty Insurance (Hamilton, OH) experimented for two years with in-house cleansing, using COBOL programs, before turning to a commercial tool, the Integrity Data Reengineering Tool from Vality Technology.
The Ohio Casualty data warehouse combines records associated with about 1 million personal insurance policies, including auto and homeowner's policies. As a test run, the company started out with 3500 of its employees' policies.
"It's a total nightmare to try to program for all situations that you might run across," says Susan Parsons, Ohio Casualty's data administrator. Even after spending a year developing generic extract/transform/load programs, it took another year, using COBOL programm
ing and manual editing, to get the policy data usable for the warehouse.
The Vality Integrity Data Reengineering Tool helps attack the first set of customer data -- about 15,000 policies at the company's Denver business center. Although Ohio Casualty personnel still needed to comb through anomalies that the Vality product discovered, no programming or manual editing of the data was required. The data was ready for the warehouse in about six weeks.
Intel (Hillsboro, OR) is an example of a company that has successfully done some in-house data cleaning, though against its will. Intel originally intended to outsource its data cleansing to a service bureau, for a customer-contact warehouse of approximately 1 million records taken from five operational systems. The service bureau promised to identify relationships among various groups within customer companies. In addition, the bureau would provide industry information for customer organizations, such as number of employees, revenues, and growth, which woul
d be valuable to Intel's sales and marketing people. Unfortunately, the service bureau did not do a perfect job of identifying relationships among customers because some contact people became associated with the wrong companies.
Intel took the tape it got back from the service bureau and then ran the data through the SAS statistical-analysis package, from the SAS Institute, to identify and correct problems with the relationships in the top 10 groupings (i.e., those with the most companies in a single relationship hierarchy). The company then used Oracle database tools to do further analysis and cleansing. Since new data was coming in all the time, some of the Oracle cleansing routines were implemented as stored procedures so that they could automatically be run against new data.
Intel would still like to outsource the data-cleaning task. However, the company plans to keep it in-house until it finds an acceptable service bureau.
Many companies also turn to consultants for custom solutions. For inst
ance, CrediCard (São Paulo, Brazil), a large credit-card issuer in South America, got data-scrubbing and enhancement tools as part of a data-warehouse implementation by Market Knowledge, a subsidiary of Equifax. CrediCard's marketing personnel use approximately 200 custom routines to perform cleansing operations, such as removal of bad or useless data, correction of bad values, and standardization of varying formats. In addition, they can enhance data by performing such operations as correcting monetary amounts for inflation and devaluation, creating a virtual age field based on a person's birth date, and appending census data to incoming records.
These custom routines (e.g., inflation correction) are particularly suited to Brazilian requirements. They are also designed for use by nontechnical marketing personnel. The data-scrubbing routines, which are programmed as SQL commands, took only about three person-weeks to create -- a negligible portion of a 2-1/2-year project. (Data-enhancement tools, w
hich are more automated and more intelligent, represent about $120,000 of the total $840,000 job.)
Data-scrubbing scripts don't provide much intelligence, so firms need to manually provide the decision-making, based on top-line and data-audit reports. Each time a new set of data is loaded, data scrubbing usually constitutes about 25 percent of what is perhaps a four-week process.
Light Cleaning
Even if you decide not to program data-cleansing functions yourself or hire a consultant to do a custom job, you may not need to buy a tool specifically for the task: Your data-warehouse management software might do enough cleaning and validation for your purposes. Many data-warehouse projects rely on a product such as Warehouse Manager, from
Prism Solutions
, or Passport, from Carleton, for a range of data-warehouse management tasks, including extracting data from operational databases, preparing data for loading into a warehouse database, and managing metadata. These
products, which cost from $75,000 to more than $200,000, depending on the size and complexity of the project, can also do cleaning, transformation, and validation.
For instance, Emory University (Atlanta) does all the data cleansing for its 6-GB warehouse with COBOL programs generated by the Prism Warehouse Manager. In addition to having typical problems, such as multiple date formats, data often contains uninitialized fields that hold arbitrary values. Two staff members spend as much as 4 hours a day working on data-cleansing tasks. Emory has considered using specialized data-cleansing tools, but the school is eliminating dirty data well enough now that it doesn't see enough additional value in other commercial products to justify the cost, according to Barbara Germon, manager of the data-resources group.
However, there's a good chance that the aforementioned tools from Prism and Carleton won't do all the cleaning you need. They might be able to deal with common anomalies that can be handled through
simple lookup tables (e.g., recognizing that
Street
and
St.
represent the same information), but they might not deal successfully with more important and unpredictable irregularities. "It's worth doing some testing," says Wayne Eckerson, a senior consultant at the Patricia Seybold Group in Boston, Massachusetts, "but these tools are not designed to do 'heavy-lifting' types of cleaning."
If the data that requires cleaning consists predominantly of names (including company names) and addresses, companies such as Harte-Hanks Communications and Innovative Systems provide not only software tools but periodically updated data files to help match variants of company names, detect ZIP codes that don't match the addresses provided, and deal with similar anomalies. These tools might be appropriate where, for instance, fields other than those for names and addresses are either known to be correct (e.g., contract dollar amounts that have been validated by accounting) or contain free-form information
that will never be used as a key or index (e.g., salespeople's contact notes).
Name-and-address-oriented solutions can cost anywhere from $30,000 to more than $200,000, depending on the size of the data warehouse in question. In addition, an extract/transform/load (ETL) tool, such as Warehouse Manager or Passport, is also needed.
The Heavyweights
For extensive cleaning jobs, consider tools that are developed for that task. Two top contenders are
Enterprise/Integrator
, from Apertus Technologies, and Vality's Integrity Data Reengineering Tool.
Enterprise/Integrator takes a top-down approach, in which you have to provide the rules for cleaning the data. This is a straightforward strategy, where you impose your understanding about your business on the data. For instance, do you wish to treat a string of Martha's Fried Chicken franchises as a single customer with multiple addresses? Or, for the purposes of the data warehouse, does it make sense to substitute
a single central address for the different franchise addresses? Or would you like to treat the franchise locations as entirely different customers? This decision determines how you aggregate or consolidate these records and whether you treat differing addresses for Martha's Fried Chicken as anomalies.
Enterprise/Integrator offers not just data cleaning, but also extraction, transformation, data loading, replication, synchronization, and metadata management. It's fairly expensive ($130,000 to $250,000), but it can be a money-saver if it eliminates the need for other data-warehouse management tools.
The main disadvantage to Enterprise/Integrator's top-down approach is that you have to know, or be able to deduce, the data-cleansing and business rules. Apertus provides samples to work from, which deal with many common anomalies and business structures. Still, creating rules is time-consuming, and you're sure to find some unexpected anomalies. These can be handled manually through an exception-handling sys
tem -- but it's a time-consuming process.
Vality's Integrity Data Reengineering Tool takes a bottom-up approach. It analyzes the data character by character and automatically surfaces patterns and business rules. Integrity then provides a data "blueprint" to help standardize, condition, and consolidate the data. This approach tends to leave fewer exceptions to be handled manually, and the process tends to be less time-consuming.
Like Enterprise/Integrator, Integrity can take into account business relationships that aren't obvious from the data, such as mergers and acquisitions that have taken place since the data was created. But with either tool, these rules must be imposed in a top-down fashion.
Integrity focuses exclusively on data cleansing, starting from flat files. It does not extract data from operational databases, load data into the warehouse database, replicate and synchronize data, or manage metadata. Therefore, in addition to typically costing $250,000, Integrity might also require a t
ool like Warehouse Manager or Passport. But the simple extraction/load utilities available with the database might be sufficient for you.
Facing the Threat
Dirty data is a serious threat to the success of a data-warehouse project. Depending on the extent of the problem, it simply might not be possible to address it quickly, inexpensively, and well.
In-house programming takes time; serious tools are expensive. Warehouse project managers need to realistically assess the problem, and the in-house resources available to deal with it, and select the solution that fits the schedule and the budget of the project -- or modify the schedule and the budget to fit the problem.
Where to Find
Apertus Technologies, Inc.
Eden Prairie, MN
Phone: (800) 328-3998 or (612) 828-0300
Fax: (612) 828-0454
Internet:
http://www.apertus.com