Blasting loose those buried nuggets of information requires clean data, warehousing strategies, powerful parallel processors, and heaps of hard disk space.
Cheryl D. Krivda
Nothing loosens up that pesky nugget like a well-placed stick of dynamite. Similarly, there are ways in which you can significantly speed up and simplify your data-mining activities. Data-cleansing and data-fusion tools can transform bales of operational data into error-free, consistently formatted information. Data warehouses support storage and access on specialized servers. Parallel-processing techniques accelerate data mining's complex queries. And, when you have terabytes of data to stash, storage considerations are important, especially as the price of disk space plummets.
Why bother? Remember: The whole point of data mining is to reveal hidden information for prompt decision-making and action. And there's also the matter of the 1000 percent return on investment that some data-mining pioneers are enjoying.
Clean and Scrubbed
While preparing to initiate some data mining, one New York City merchant bank discovered that its databases contained up to 13 different representations of some customers' names, such as Andrew B. Jones/Mr. A. Jones. This is but one example of a common problem: Databases storing a company's lifeblood information about customers and transactions are commonly rife with errors, duplicate data (or worse), and information that would not necessarily be useful to data-mining applications. One bank has already saved $170,000 per month that it had been spending on duplicate mailings by cleaning up its customer data and then housing it with associated account information, according to Peter Kastner, an analyst with the Aberdeen Group, a Boston-based market-research and computer-industry consulting firm.
For data-mining applications to produce valid results, data has to be cleaned of errors and "scrubbed" to create consistent formats (e.g., 1s and 0s become male and female). The process can be a slow one. Although reasonably priced tools are available to reformat data, cleanse it, and prepare it for eventual migration to a data warehouse (explained later), information-systems (IS) management must also dedicate time to the effort of determining which format to use and how data should be represented before warehousing it.
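The scrubbing step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production cleansing tool: the code table (which digit means male and which means female), the list of titles, and the sample records are all assumptions made for the example, and grouping names by first initial plus surname is a deliberately crude rule that would also merge genuinely different people (say, an Alice Jones and an Andrew Jones).

```python
import re

# Assumed code table: the article says only that 1s and 0s become male and
# female; which digit maps to which label is a guess for illustration.
GENDER_CODES = {"1": "male", "0": "female", "m": "male", "f": "female"}

# Assumed list of courtesy titles to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "dr"}


def scrub_gender(raw):
    """Map a raw gender code to a consistent label, or None if unrecognized."""
    return GENDER_CODES.get(str(raw).strip().lower())


def name_key(raw):
    """Reduce a name to (first initial, surname) so variants can be grouped."""
    tokens = [t for t in re.split(r"[\s.,]+", raw.lower()) if t and t not in TITLES]
    if not tokens:
        return None
    return (tokens[0][0], tokens[-1])


# Hypothetical operational records with inconsistent formats.
records = [
    {"name": "Andrew B. Jones", "gender": 1},
    {"name": "Mr. A. Jones", "gender": "M"},
    {"name": "Mary Smith", "gender": 0},
]

clean = [
    {"key": name_key(r["name"]), "gender": scrub_gender(r["gender"])}
    for r in records
]
```

Here the two Jones records end up with the same key, which flags them as candidates for merging. Deciding whether they really are the same customer is the kind of format-and-representation decision that IS management has to make before warehousing.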
The payoffs can be dramatic. One telephone company was able to mine its cleaned and warehoused data to identify 10,000 supposedly "residential" customers who spent more than $1000 per month on their phone bills. Investigating more closely, the telephone company discovered that these customers were really small businesses trying to avoid paying business rates for their service.
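A query of this kind is simple once the data is clean and in one place. The sketch below assumes a hypothetical billing table with one row per customer per month; the column layout, customer IDs, and amounts are invented for illustration, and only the $1000 threshold comes from the example above.

```python
from collections import defaultdict

# Hypothetical billing rows: (customer_id, service_class, month, amount_usd)
bills = [
    ("C001", "residential", "1995-01", 1250.00),
    ("C001", "residential", "1995-02", 1410.50),
    ("C002", "residential", "1995-01", 42.10),
    ("C002", "residential", "1995-02", 39.95),
    ("C003", "business", "1995-01", 2300.00),
]

THRESHOLD_USD = 1000.00  # monthly spend cited in the article


def suspicious_residential(rows, threshold=THRESHOLD_USD):
    """Return IDs of residential customers whose average monthly bill exceeds threshold."""
    totals = defaultdict(lambda: [0.0, 0])  # customer_id -> [sum, number of bills]
    for customer_id, service_class, _month, amount in rows:
        if service_class == "residential":
            totals[customer_id][0] += amount
            totals[customer_id][1] += 1
    return sorted(cid for cid, (total, n) in totals.items() if total / n > threshold)


flagged = suspicious_residential(bills)
```

With the sample rows, only C001 is flagged; C003 spends more but is already billed as a business.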
A Place for Everything
The mainframe: A nice place to visit, but you wouldn't want to mine there. For one thing, production mainframes often support day-to-day business activities (e.g., airline reservations) that you probably do not want to impede with your killer queries. In addition, you probably want a well-behaved data server that you can hang your mining clients from. Besides, the data structures on your mainframe probably aren't the best for the data-mining activities you want to perform.
Enter the data warehouse, a server-based replication of a mainframe's data. The server receives updated information from the mainframe periodically -- monthly, weekly, or even daily, depending on needs. The database on the data-warehouse server then fields data-mining queries from the client machines independently of the mainframe.
A data warehouse provides an effective structure for data mining, explains Bill Inmon, cofounder of Prism Solutions (Sunnyvale, CA) and widely considered the originator of the data-warehousing concept. Without first warehousing its data, a company has lots of information that is not integrated and has little summary information or history. The effectiveness of mining such data, he says, is limited.
Think data warehousing sounds easy? In actuality, the difficulty of getting a data warehouse operational is one of the many reasons why sites building them typically engage outside expertise to facilitate the project. Although consultants with experience in designing and populating data warehouses are few and far between, and expensive when available, they are typically worth the investment, especially for high-profile projects.
Still, convinced that data cleansing and warehousing sound like jobs that their best analyst will be able to handle, some companies elect to go it alone. "We see a lot of cheap science projects," says Chuck Buffum, general manager of the decision-support business unit for Tandem Computers (Cupertino, CA).
One Fortune 100 company IS manager waited three years for his staff to complete a high-profile data warehouse, relates Dr. Kamran Parsaye, CEO of Information Discovery (Hermosa Beach, CA). When poor warehouse design prevented the company from successfully populating the warehouse with clean data, the IS manager directed his staff to mine the production data and deep-six the warehouse plans. "It was success by declaration," Parsaye says.
1000 Percent Return on Investment
For all the warnings about the dangers involved in building a useful data warehouse, the possibilities for competitive advantage and other market gain seem equally endless for those who build one successfully. Some retail companies have achieved a payoff of 10 to 70 times their initial investment of $350,000 to $750,000, notes David Gelardie, manager of commercial markets for IBM's RS/6000 Division (Somers, NY). At the high-priced end of the hardware spectrum, one customer who invested $20 million in a complete system achieved a payoff in just four months.
Kastner is not at all surprised by these phenomenal ROI figures. "The first thing you do is skim the cream off the top," he says. "And in this market, there's lots of cream."
The propensity to mine additional profits or a competitive advantage from well-designed warehouses has some advocates talking about the so-called One-Query Theory. Buffum states that "there exists in every shop one query that -- if you figure out what it is and implement the knowledge derived from it -- will pay for the entire data-warehousing and data-mining system."
Warehouse Sandwich
Parsaye is well known in the data-warehousing field for his Sandwich Paradigm, a philosophy for building data warehouses that encourages accepting up front that the first iteration of a data-warehousing effort will probably require substantial revision.
Instead of recommending that users buy hardware and software, build a data warehouse, and then load their data, the Sandwich Paradigm advises that they pre-mine the data to determine what formats and data are needed to support a data-mining application, build a prototype mini-data warehouse (the "meat" of the sandwich) with most of the features anticipated in the final product, revise the strategies as necessary, and finally build a full warehouse. In this way, problems with poor-quality data or ineffective designs can be rectified before thousands or even millions of dollars are wasted. Often users realize that their data structures are ill-designed, or they gain so much experience with the prototype that they realize that a second chance at building the system could add value that would justify the expense.
To support newly cleansed data, experts advocate maintaining a collection of information about the transformation process. This meta-data describes the contents of the data warehouse; where the warehoused data originated; and the translations, aggregations, table lookups, and other transformations that occurred in the warehousing process. For many data-warehousing users, this information becomes critical after the migration is completed, when more detail is needed about a particular group of data or when errors in the warehoused data become evident.
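One way to picture such meta-data is as a small catalog with one entry per warehouse column. The sketch below is only an illustration of the idea: the class name, the field names, and the sample systems and columns (billing_mainframe, CUST_SEX_CD, and so on) are hypothetical and do not describe any particular vendor's format.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnLineage:
    """Meta-data for one warehouse column: where it came from and how it changed."""
    warehouse_column: str
    source_system: str
    source_field: str
    transformations: list = field(default_factory=list)

    def describe(self):
        """Return a one-line, human-readable lineage summary."""
        steps = " -> ".join(self.transformations) or "copied unchanged"
        return f"{self.warehouse_column} <- {self.source_system}.{self.source_field} [{steps}]"


catalog = {}  # warehouse column name -> ColumnLineage


def register(entry):
    """Record a lineage entry so it can be looked up after migration."""
    catalog[entry.warehouse_column] = entry


# Hypothetical entries covering a table lookup, an aggregation, and a plain copy.
register(ColumnLineage("customer.gender", "billing_mainframe", "CUST_SEX_CD",
                       ["table lookup: 1/0 to male/female"]))
register(ColumnLineage("sales.monthly_total", "pos_system", "TXN_AMT",
                       ["aggregation: sum by month"]))
register(ColumnLineage("customer.zip", "billing_mainframe", "CUST_ZIP"))
```

When an error turns up in the warehouse, a call such as `catalog["customer.gender"].describe()` shows which source field and which transformation to inspect first.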
Specific information about successful data-warehousing efforts is sparse, partly because such an effort is complex and extensive, and partly because the companies that succeed are loath to divulge what they perceive as a competitive advantage. It is known, however, that successful data warehouses are usually only half as large as is necessary to support the number of users and applications that quickly become "critical."
When a data warehouse succeeds, more users than initially planned want to employ data mining to their advantage. Data-warehousing and data-mining applications are usually installed with a specific motive in mind, such as fraud detection or profit generation, says Kastner.
But once a data warehouse is installed, the number of users almost always increases with the success of the system. For instance, one department-store chain based in the U.K. implemented a data-mining system to better understand customer-buying trends. Within six months, one of the biggest users of the system was the chain's accounting department, which began tracking profit and loss leaders. Analysts discovered that the shoplifting of batteries, film, and midpriced pens cost the chain $60,000 a month. These products were moved to a more secure store location, saving the chain over $700,000 annually.
One offshoot of the fast-growing data-warehousing and data-mining market is that tool and database vendors are announcing partnerships in record numbers. But "they're not worth the paper they're printed on," warns Kevin Strange, research director for industry analyst the Gartner Group (Stamford, CT). With rare exceptions -- such as Platinum Technology (which is creating synergies by acquiring companies such as Trinzic) and Software AG (which is creating an initiative among several vendors) -- industry partnerships in this market do not benefit customers. But the good news here for buyers is that the efforts made by Software AG and Platinum will force other vendors to create alliances, Strange says, probably by the end of the year.
Parallel Power
Parallel processing speeds up the work of decision-support systems such as data mining by dividing a complex query into multiple parts and assigning each part to a separate processor. The processors work concurrently, unlike serial processors, which address one process after another.
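The divide-and-combine pattern can be sketched as follows. The example splits a hypothetical sales table into slices, computes a partial answer for each slice concurrently, and merges the partial answers. The data and function names are invented for illustration. One caveat: the sketch uses Python's `ThreadPoolExecutor` so that it runs anywhere, but threads in standard CPython do not speed up CPU-bound work; to put each slice on a separate processor, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (region, amount_usd)
SALES = [("east", 120.0), ("west", 80.0), ("east", 45.5), ("north", 10.0),
         ("west", 200.0), ("east", 30.0), ("north", 65.0), ("west", 15.5)]


def split(rows, parts):
    """Divide rows into `parts` roughly equal slices, one per worker."""
    return [rows[i::parts] for i in range(parts)]


def partial_totals(rows):
    """Answer the query (total sales per region) for one slice only."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def merge(partials):
    """Combine the per-slice answers into the final result."""
    result = {}
    for part in partials:
        for region, amount in part.items():
            result[region] = result.get(region, 0.0) + amount
    return result


def parallel_query(rows, workers=4):
    """Run the query on all slices concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(partial_totals, split(rows, workers)))


totals = parallel_query(SALES)
```

The result matches what a single serial pass over all the rows would produce; only the route to it differs.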
While data mining is unquestionably being driven by the affordability of highly powerful parallel-processing computers, interest in data mining is also driving parallel sales; it's a synergistic effect. Once the province of the scientific community and its deep-pocketed sponsors, parallel-processing machines have dropped in price by 30 percent to 40 percent annually for the past several years, bringing them into the realm of the affordable.
Parallel-processing systems are available in two distinct flavors: symmetric multiprocessing (SMP) and massively parallel processing (MPP). In SMP systems, the processors share a common memory. MPP systems are often called "shared nothing" or distributed-memory systems, because each processor has its own memory.
SMP systems are typically used for smaller data warehouses holding 100 GB of data or less, while MPP machines are necessary once a data warehouse hits 500 GB. The gray area in between is something of a battleground among vendors, who argue points of scalability, cost, and performance.
Clustered SMP systems are well suited to applications with lumpy data, which is heavily used for data-mining queries. Because the processors are clustered and memory is shared, no one processor is exclusively assigned to access the key data, thus preventing bottlenecks. In contrast, MPP vendors suggest that sites planning for many users or heavy data volumes use MPP systems rather than clustering multiple SMP systems.
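A toy model shows why lumpy data matters to this argument. It assumes, purely for illustration, that a shared-nothing system places each key on exactly one node by hashing it, while a shared-memory system lets any processor serve any request. The access pattern (90 of 100 requests hitting one hot account) is invented, and real systems use far more sophisticated placement and scheduling than this.

```python
import zlib


def owner(key, nodes):
    """Shared-nothing placement (assumed): each key lives on exactly one node."""
    return zlib.crc32(key.encode()) % nodes


def shared_nothing_load(accesses, nodes):
    """Count requests per node when only a key's owner can serve it."""
    load = [0] * nodes
    for key in accesses:
        load[owner(key, nodes)] += 1
    return load


def shared_memory_load(accesses, processors):
    """Count requests per processor when any processor can serve any key."""
    load = [0] * processors
    for i, _key in enumerate(accesses):
        load[i % processors] += 1  # simple round-robin dispatch
    return load


# Hypothetical lumpy workload: one hot account draws 90 of 100 requests.
accesses = ["acct-hot"] * 90 + [f"acct-{i}" for i in range(10)]

print("shared nothing:", shared_nothing_load(accesses, 4))
print("shared memory: ", shared_memory_load(accesses, 4))
```

In the shared-nothing case, whichever node owns the hot key handles at least 90 of the requests, while the shared-memory case spreads them evenly. The model ignores the costs that MPP vendors point to on the other side, such as contention for the shared memory itself as processors are added.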
Sales of SMP-based systems are projected to be larger than those of MPP systems for the foreseeable future, predominantly because of their lower cost and better scalability, explains Howard Richmond, a vice president of the Gartner Group. Yet the MPP market is also experiencing significant growth, he adds.
Storage: Not the Final Frontier
The last enabling data-mining technology is storage. Once a no-brainer, data storage is now so important that some sites actually keep their storage devices and data under the protection of armed guards.
With data warehouses storing gigabytes and terabytes of data, and projections calling for warehouses holding hundreds of terabytes within five years, affordable storage technology is key to serious data mining. Disk-storage prices, which had been falling an average of 30 percent to 40 percent annually, are falling even faster so far this year, and storage vendors are running hard to stay even.
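The cumulative effect of such declines is easy to underestimate. As a rough illustration, and assuming the rate stays constant from year to year (an assumption made here for the arithmetic, not a forecast), the fraction of the original price remaining after n years of an annual decline r is (1 - r)^n:

```python
def remaining_fraction(annual_decline, years):
    """Fraction of the starting price left after `years` of a constant annual decline."""
    return (1.0 - annual_decline) ** years


# The 30 and 40 percent rates come from the article; the time spans are arbitrary.
for rate in (0.30, 0.40):
    for years in (1, 3, 5):
        share = remaining_fraction(rate, years)
        print(f"{rate:.0%} per year for {years} yr: {share:.1%} of the original price")
```

At 30 percent a year, a price falls to about 17 percent of its starting level after five years; at 40 percent a year, to about 8 percent.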
Like processing, storage technologies such as RAID (an acronym for redundant arrays of independent disks) are becoming increasingly parallel. The concern of many industry watchers is the lack of a faster technology for reading data off a storage disk. There are no new storage technologies on the horizon as yet, which could make storage a bottleneck for data-mining applications as queries become more complex, storage volumes increase, and parallel machines get faster.
Pot of Gold
Every business will have a data warehouse in 10 years, predicts Kastner, whether it's the company's custom warehouse or one that's available through wideband networks connected to desktop systems. The benefits of knowing one's business and customers will become so critical that the technology will be positively pervasive, he adds.
Previously unrelated technologies are coming together to support data mining. That can only make data mining easier.
Data-Warehouse Construction Tips
-- Accept that your first try will require revision.
-- Examine your data: What formats and specific data are needed to support your application?
-- Bite the bullet and clean up your data before using it in the warehouse.
-- Build a prototype mini-data warehouse as a learning experience and then revise strategies as necessary.
-- Plan on more users' taking advantage of a successful data warehouse than you initially expected.
-- Keep storage options in mind: Your data is only going to grow.

Software AG's SourcePoint Administrator lets a user select a table from the warehouse and map columns to fields from the extract process.
Cheryl D. Krivda is a technical journalist based in Perkasie, Pennsylvania, specializing in information-technology topics. She can be contacted on the Internet at 5309513@mcimail.com or on BIX c/o "editors."