Smart data miners are cashing in on valuable information buried in private and public data sources
Sara Reese Hedberg
It's in there. The discovery, the fact, the one piece of the puzzle that will blow away the competition, propel your company to the top, and stick a "VP" after your name. It's right there, in your database. But you can't see it. Yet.
The amount of information stored in databases is exploding. From zillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. In today's fiercely competitive business environment, companies need to rapidly turn those terabytes of raw data into significant insights to guide their marketing, investment, and management strategies.
It would take many lifetimes for an analyst to pore over 2 million books -- the equivalent of a terabyte -- to glean important trends. But analysts have to. For instance, Wal-Mart, the chain of over 2000 retail stores, every day uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. At corporate headquarters, they want to know trends down to the last Q-Tip.
Luckily, computer techniques are now being developed to assist analysts in their work. Data mining (DM), or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data nuggets. DM is being used both to describe past trends and to predict future trends.
Mining and Refining Data
Experts involved in significant DM efforts agree that the DM process must begin with the business problem. Since DM is really providing a platform or workbench for the analyst, understanding the job of the analyst logically comes first. Once the DM system developer understands the analyst's job, the next step is to understand those data sources that the analyst uses and the experience and knowledge the analyst brings to the evaluation.
The DM process generally starts with collecting and cleaning information, then storing it, typically in some type of data warehouse or datamart (see the figure "Data-Mining Process"). But in some of the more advanced DM work, such as that at AT&T Bell Labs, advanced knowledge-representation tools can logically describe the contents of databases themselves, then use this mapping as a meta-layer to the data. Data sources are typically flat files of point-of-sale transactions and databases of all flavors. There are experiments underway in mining other data sources, such as IBM's project in Paris to analyze text straight off the newswires.
DM tools search for patterns in data. This search can be performed automatically by the system (a bottom-up dredging of raw facts to discover connections) or interactively with the analyst asking questions (a top-down search to test hypotheses). A range of computer tools -- such as neural networks, rule-based systems, case-based reasoning, machine learning, and statistical programs -- either alone or in combination can be applied to a problem.
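To make the two search styles concrete, here is a minimal Python sketch. It is an illustration only, not a description of any product named in this article: the basket data, the function names bottom_up and top_down, and the 50 percent support threshold are all assumptions chosen for the example. The bottom-up function dredges every pair of items for frequent co-occurrence; the top-down function checks one hypothesis supplied by the analyst.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets; the items are invented for illustration.
BASKETS = [
    {"diapers", "wipes", "cotton swabs"},
    {"diapers", "wipes"},
    {"cotton swabs", "shampoo"},
    {"diapers", "wipes", "shampoo"},
    {"shampoo"},
]


def bottom_up(baskets, min_support=0.5):
    """Dredge every item pair and keep those that co-occur often enough.

    Returns a dict mapping each frequent pair to the fraction of baskets
    that contain both items (its support).
    """
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering.
        counts.update(combinations(sorted(basket), 2))
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


def top_down(baskets, if_item, then_item):
    """Test one analyst hypothesis: buyers of if_item also buy then_item.

    Returns the fraction of baskets containing if_item that also contain
    then_item (the confidence of the rule), or 0.0 if if_item never appears.
    """
    with_if = [basket for basket in baskets if if_item in basket]
    if not with_if:
        return 0.0
    return sum(then_item in basket for basket in with_if) / len(with_if)


if __name__ == "__main__":
    print("Bottom-up discoveries:", bottom_up(BASKETS))
    print("Top-down test, diapers -> wipes:", top_down(BASKETS, "diapers", "wipes"))
```

In this toy data the bottom-up pass surfaces the diapers-and-wipes pair without being asked, while the top-down call answers only the question the analyst posed. Real tools would add statistical safeguards that this sketch omits.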
Typically with DM, the search process is iterative, so that as analysts review the output, they form a new set of questions to refine the search or elaborate on some aspect of the findings. Once the iterative search process is complete, the data-mining system generates report findings. It is then the job of humans to interpret the results of the mining process and to take action based on those findings.
AT&T, A.C. Nielsen, and American Express are among the growing ranks of companies implementing DM techniques for sales and marketing. These systems are crunching through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and promotional strategies. Why? To increase profitability, of course.
Similarly, financial analysts are plowing through vast sets of financial records, data feeds, and other information sources in order to make investment decisions. Health-care organizations are examining medical records in order to understand trends of the past; they hope this information can help reduce their costs in the future. Major corporations such as General Motors, GTE, Lockheed, Microsoft, and IBM all have R&D groups working on proprietary advanced DM techniques and applications.
Siftware
Hardware and software vendors are extolling the DM capabilities of their products -- whether they have true DM capabilities or not. This hype cloud is creating much confusion about data mining. In reality, data mining is the process of sifting through vast amounts of information in order to extract meaning and discover new knowledge.
It sounds simple, but the task of data mining has quickly overwhelmed traditional query-and-report methods of data analysis, creating the need for new tools to analyze databases and data warehouses intelligently. The products now offered for DM range from on-line analytical processing (OLAP) tools, such as Essbase (Arbor Software) and DSS Agent (MicroStrategy), to DM tools that include some AI techniques, such as IDIS (Information DIscovery System, from IntelligenceWare) and the Database Mining Workstation (HNC Software), to the new vertically targeted advanced DM tools, such as those from AT&T Global Information Solutions. (See the article "A Data Miner's Tools" for more information on DM products.)
Many people argue that the OLAP tools are not "true" mining tools; they're fancy query tools, they say. Since these programs perform sophisticated data access and analysis by rolling up numbers along multiple dimensions, some analysts still include them in the category of top-down mining tools. The market has yet to see much in the way of more-advanced mining tools, although the spigot is being turned on by application-specific DM tools from AT&T, Lockheed, and GTE.
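"Rolling up numbers along multiple dimensions" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Essbase or DSS Agent work internally: the sales facts, the three dimension names, and the roll_up function are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, units sold).
FACTS = [
    ("East", "swabs", "Q1", 120),
    ("East", "swabs", "Q2", 80),
    ("East", "soap", "Q1", 50),
    ("West", "swabs", "Q1", 200),
    ("West", "soap", "Q2", 70),
]

# Assumed dimension names, in the same order as the columns above.
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum units sold over every dimension that is not named in keep.

    keep is a sequence of dimension names to preserve; the result maps
    each combination of their values to the total units sold.
    """
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


if __name__ == "__main__":
    print("By region:", roll_up(FACTS, ["region"]))
    print("By product and quarter:", roll_up(FACTS, ["product", "quarter"]))
    print("Grand total:", roll_up(FACTS, []))
```

Each call answers a question the analyst has already framed (sales by region, sales by product and quarter), which is why such tools are usually placed on the top-down side of the mining spectrum.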
Let's Get Vertical
One major DM trend is the move toward powerful application-specific mining tools. "There is a trade-off in the generality of data-mining tools and ease of use," observes Gregory Piatetsky-Shapiro, principal investigator of the Knowledge Discovery in Databases Project at GTE Laboratories. "General tools are good for those who know how to use them, but they really require lots of knowledge to use them."
AT&T, for example, recently introduced Sales & Marketing Solution Packs to mine data warehouses. They're tailored to vertical markets in retail, financial, communications, consumer-goods manufacturing, transportation, and government. These programs provide about 70 percent of the solution, with final tailoring required to fit the individual client's needs, AT&T says. Complete with AT&T parallel hardware, software, and some services, Solution Packs start at around $250,000.
Both GTE and Lockheed Martin may shortly follow suit. GTE is already entertaining proposals to turn its Health-KEFIR (KEy FIndings Reporter) into a commercial product (see the "Health Care" sidebar). The Artificial Intelligence Research group at Lockheed Martin has been investigating and developing DM tools for the past 10 years. Recently, the Lockheed group built an internal application-development tool, called Recon, that generalizes their DM techniques, then applied it to application-specific problems. A beta version of the first vertical packages -- for finance and marketing -- will be available in 1996. The system has an open architecture, running on Unix platforms and massively parallel supercomputers. It interfaces with existing relational database management systems, financial databases, proprietary databases, data feeds, spreadsheets, and ASCII files.
In a similar vein, several neural network tools have been customized. Customer Insight Co., for instance, has built an interface to link its Analytix marketing software with HNC Software's neural network-based Database Mining Workstation, creating a marketing DM hybrid. HNC Software's Falcon detects credit-card fraud; according to HNC, the program is watching millions of charge accounts.
Invasion of the Data Snatchers
The need for DM tools is growing as fast as data stores swell. More-sophisticated DM products are beginning to appear that perform bottom-up as well as top-down mining. The day is probably not too far off when intelligent agent technology will be harnessed for the mining of vast public on-line sources, traversing the Internet, searching for information, and presenting it to the human user. Microelectronics and Computer Technology Corp. (MCC, Austin, TX) has been pioneering work in this area, developing a platform, called Carnot, for its consortium members. Carnot-based agents have been successfully applied to both top-down and bottom-up DM of distributed heterogeneous databases at Eastman Chemical.
"Data mining is evolving from answering questions about what has happened and why it happened," observes Mark Ahrens, director of custom software sales at A.C. Nielsen. "The next generation of DM is focusing on answering the question `How can I fix it?' and making very specific recommendations. That's our focus now -- our Holy Grail." Meanwhile, the gold rush is on.
MINING OTHER DATA-MINING RESOURCES
Information about DM research, applications, and tools can be found on the Knowledge Discovery Mine Website at http://info.gte.com/kdd/.
Advances in Knowledge Discovery & Data Mining, U. Fayyad, et al., editors; AAAI/MIT Press, 1995.
Proceedings of the First International Conference on Knowledge Discovery and Data Mining, U. Fayyad, et al., editors; AAAI Press, 1995.
PRODUCT INFORMATION
Analytix: $100,000 and up
Customer Insight Co.
Englewood, CO
(800) 262-5989
(303) 790-7002
fax: (303) 643-1535

Database Mining Workstation: $51,000 (for software and PC board, processing, and training)
HNC Software
San Diego, CA
(619) 546-8877
fax: (619) 452-6524
pdc@hnc.com

Falcon: $250,000-$1 million
HNC Software
(see above)

IDIS: PC, $1900; server price depends on number of records to be processed (e.g., 1 million records, $25,000)
IntelligenceWare
Torrance, CA
(310) 782-3340
fax: (310) 782-7565
datamine@ix.netcom.com

Prism: $400,000-$1 million
Nestor
Providence, RI
(401) 331-9640
fax: (401) 331-7319

Spotlight: site license
A.C. Nielsen
Schaumburg, IL
(708) 605-5000
Figure "Data-Mining Process": Data mining finds useful nuggets of information in existing data sources. DM tools search for patterns in data; this process can be automated, or it can involve an analyst asking questions.
Sara Reese Hedberg is a freelance writer who lives in Issaquah, Washington. She has written extensively about emerging computer technologies. She can be reached at editors@bix.com.