
The Data Gold Rush


October 1995 / State Of The Art / The Data Gold Rush

Smart data miners are cashing in on valuable information buried in private and public data sources

Sara Reese Hedberg

It's in there. The discovery, the fact, the one piece of the puzzle that will blow away the competition, propel your company to the top, and stick a "VP" after your name. It's right there, in your database. But you can't see it. Yet.

The amount of information stored in databases is exploding. From zillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. In today's fiercely competitive business environment, companies need to rapidly turn those terabytes of raw data into significant insights to guide their marketing, investment, and management strategies.

It would take many lifetimes for an analyst to pore over 2 million books -- the equivalent of a terabyte -- to glean important trends. But analysts have to. For instance, Wal-Mart, the chain of over 2000 retail stores, every day uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. At corporate headquarters, they want to know trends down to the last Q-Tip.

Luckily, computer techniques are now being developed to assist analysts in their work. Data mining (DM), or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data nuggets. DM is being used both to describe past trends and to predict future trends.

Mining and Refining Data

Experts involved in significant DM efforts agree that the DM process must begin with the business problem. Since DM is really providing a platform or workbench for the analyst, understanding the job of the analyst logically comes first. Once the DM system developer understands the analyst's job, the next step is to understand those data sources that the analyst uses and the experience and knowledge the analyst brings to the evaluation.

The DM process generally starts with collecting and cleaning information, then storing it, typically in some type of data warehouse or datamart (see the figure "Data-Mining Process"). But in some of the more advanced DM work, such as that at AT&T Bell Labs, advanced knowledge-representation tools can logically describe the contents of databases themselves, then use this mapping as a meta-layer to the data. Data sources are typically flat files of point-of-sale transactions and databases of all flavors. There are experiments underway in mining other data sources, such as IBM's project in Paris to analyze text straight off the newswires.

DM tools search for patterns in data. This search can be performed automatically by the system (a bottom-up dredging of raw facts to discover connections) or interactively with the analyst asking questions (a top-down search to test hypotheses). A range of computer tools -- such as neural networks, rule-based systems, case-based reasoning, machine learning, and statistical programs -- either alone or in combination can be applied to a problem.
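The two search styles can be contrasted in a few lines of Python. This is only a sketch under stated assumptions: the baskets, the item names, and the functions bottom_up and top_down are invented for illustration and do not come from any product named in this article. The bottom-up function counts every pair of items without being told what to look for; the top-down function checks one hypothesis that an analyst supplies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets; every item name here is made up.
BASKETS = [
    {"swabs", "soap", "chips"},
    {"swabs", "soap"},
    {"bread", "milk"},
    {"swabs", "soap", "milk"},
    {"bread", "chips"},
]

def bottom_up(baskets, min_support):
    """Dredge the raw data: count every item pair, keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

def top_down(baskets, if_item, then_item):
    """Test one hypothesis: what share of baskets with if_item also hold then_item?"""
    with_if = [b for b in baskets if if_item in b]
    if not with_if:
        return 0.0
    return sum(then_item in b for b in with_if) / len(with_if)

frequent_pairs = bottom_up(BASKETS, min_support=3)
confidence = top_down(BASKETS, "swabs", "soap")
print(frequent_pairs)
print(confidence)
```

Counting every pair is affordable only on toy data like this; the number of candidate patterns grows quickly with the number of distinct items, which is one reason bottom-up mining calls for more than a simple loop.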

Typically with DM, the search process is iterative, so that as analysts review the output, they form a new set of questions to refine the search or elaborate on some aspect of the findings. Once the iterative search process is complete, the data-mining system generates report findings. It is then the job of humans to interpret the results of the mining process and to take action based on those findings.

AT&T, A.C. Nielsen, and American Express are among the growing ranks of companies implementing DM techniques for sales and marketing. These systems are crunching through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and promotional strategies. Why? To increase profitability, of course.

Similarly, financial analysts are plowing through vast sets of financial records, data feeds, and other information sources in order to make investment decisions. Health-care organizations are examining medical records in order to understand trends of the past; they hope this information can help reduce their costs in the future. Major corporations such as General Motors, GTE, Lockheed, Microsoft, and IBM all have R&D groups working on proprietary advanced DM techniques and applications.

Siftware

Hardware and software vendors are extolling the DM capabilities of their products -- whether they have true DM capabilities or not. This hype cloud is creating much confusion about data mining. In reality, data mining is the process of sifting through vast amounts of information in order to extract meaning and discover new knowledge.

It sounds simple, but the task of data mining has quickly overwhelmed traditional query-and-report methods of data analysis, creating the need for new tools to analyze databases and data warehouses intelligently. The products now offered for DM range from on-line analytical processing (OLAP) tools, such as Essbase (Arbor Software) and DSS Agent (MicroStrategy), to DM tools that include some AI techniques, such as IDIS (Information DIscovery System, from IntelligenceWare) and the Database Mining Workstation (HNC Software), to the new vertically targeted advanced DM tools, such as those from AT&T Global Information Solutions. (See the article "A Data Miner's Tools" for more information on DM products.)

Many people argue that the OLAP tools are not "true" mining tools; they're fancy query tools, they say. Since these programs perform sophisticated data access and analysis by rolling up numbers along multiple dimensions, some analysts still include them in the category of top-down mining tools. The market has yet to see much in the way of more-advanced mining tools, although the spigot is being turned on by application-specific DM tools from AT&T, Lockheed, and GTE.
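"Rolling up numbers along multiple dimensions" is easier to picture with a small example. The Python sketch below is an illustration only, not how Essbase, DSS Agent, or any other product mentioned here works internally; the sales rows, the dimension names, and the roll_up function are all assumptions. It sums a unit count over whichever dimensions the analyst chooses to drop.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, month, units).
SALES = [
    ("East", "soap", "Jan", 10),
    ("East", "soap", "Feb", 5),
    ("East", "swabs", "Jan", 7),
    ("West", "soap", "Jan", 3),
    ("West", "swabs", "Feb", 8),
]
DIMENSIONS = ("region", "product", "month")

def roll_up(facts, keep):
    """Sum units over every dimension that is not named in keep."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)  # the dimensions that survive
        totals[key] += row[-1]            # units is the last field
    return dict(totals)

by_region = roll_up(SALES, ["region"])
by_region_product = roll_up(SALES, ["region", "product"])
grand_total = roll_up(SALES, [])
print(by_region)
print(by_region_product)
print(grand_total)
```

Each call answers a question the analyst already had in mind, such as "how many units per region?", which is why roll-ups of this kind sit on the top-down side of the mining spectrum.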

Let's Get Vertical

One major DM trend is the move toward powerful application-specific mining tools. "There is a trade-off in the generality of data-mining tools and ease of use," observes Gregory Piatetsky-Shapiro, principal investigator of the Knowledge Discovery in Databases Project at GTE Laboratories. "General tools are good for those who know how to use them, but they really require lots of knowledge to use them."

AT&T, for example, recently introduced Sales & Marketing Solution Packs to mine data warehouses. They're tailored to vertical markets in retail, financial, communications, consumer-goods manufacturing, transportation, and government. These programs provide about 70 percent of the solution, with final tailoring required to fit the individual client's needs, AT&T says. Complete with AT&T parallel hardware, software, and some services, Solution Packs start at around $250,000.

Both GTE and Lockheed Martin may shortly follow suit. GTE is already entertaining proposals to turn its Health-KEFIR (KEy FIndings Reporter) into a commercial product (see the "Health Care" sidebar). The Artificial Intelligence Research group at Lockheed Martin has been investigating and developing DM tools for the past 10 years. Recently, the Lockheed group built an internal application-development tool, called Recon, that generalizes their DM techniques, then applied it to application-specific problems. A beta version of the first vertical packages -- for finance and marketing -- will be available in 1996. The system has an open architecture, running on Unix platforms and massively parallel supercomputers. It interfaces with existing relational database management systems, financial databases, proprietary databases, data feeds, spreadsheets, and ASCII files.

In a similar vein, several neural network tools have been customized. Customer Insight Co., for instance, has built an interface to link its Analytix marketing software with HNC Software's neural network-based Database Mining Workstation, creating a marketing DM hybrid. HNC Software's Falcon detects credit-card fraud; according to HNC, the program is watching millions of charge accounts.

Invasion of the Data Snatchers

The need for DM tools is growing as fast as data stores swell. More-sophisticated DM products are beginning to appear that perform bottom-up as well as top-down mining. The day is probably not too far off when intelligent agent technology will be harnessed for the mining of vast public on-line sources, traversing the Internet, searching for information, and presenting it to the human user. Microelectronics and Computer Technology Corp. (MCC, Austin, TX) has been pioneering work in this area, developing a platform, called Carnot, for its consortium members. Carnot-based agents have been successfully applied to both top-down and bottom-up DM of distributed heterogeneous databases at Eastman Chemical.

"Data mining is evolving from answering questions about what has happened and why it happened," observes Mark Ahrens, director of custom software sales at A.C. Nielsen. "The next generation of DM is focusing on answering the question `How can I fix it?' and making very specific recommendations. That's our focus now -- our Holy Grail." Meanwhile, the gold rush is on.


MINING OTHER DATA-MINING RESOURCES

Information about DM research, applications, and tools can be found on the Knowledge Discovery Mine Website at http://info.gte.com/kdd/.

Advances in Knowledge Discovery & Data Mining, U. Fayyad, et al., editors; AAAI/MIT Press, 1995.

Proceedings of the First International Conference on Knowledge Discovery and Data Mining, U. Fayyad, et al., editors; AAAI Press, 1995.


PRODUCT INFORMATION

Analytix                                $100,000 and up
Customer Insight Co.
Englewood, CO
(800) 262-5989
(303) 790-7002
fax: (303) 643-1535

Database Mining Workstation             $51,000
(for software and PC board, processing, and training)
HNC Software
San Diego, CA
(619) 546-8877
fax: (619) 452-6524
pdc@hnc.com

Falcon                                  $250,000-$1 million
HNC Software
(see above)

IDIS PC                                 $1900
Server                                  depends on number of
                                        records to be processed
                                        (e.g., 1 million records,
                                        $25,000)
IntelligenceWare
Torrance, CA
(310) 782-3340
fax: (310) 782-7565
datamine@ix.netcom.com

Prism                                   $400,000-$1 million
Nestor
Providence, RI
(401) 331-9640
fax: (401) 331-7319


Spotlight                               site license
A.C. Nielsen
Schaumburg, IL
(708) 605-5000

Data-Mining Process

Data mining finds useful nuggets of information in existing data sources. DM tools search for patterns in data; this process can be automated, or it can involve an analyst asking questions.


Sara Reese Hedberg is a freelance writer who lives in Issaquah, Washington. She has written extensively about emerging computer technologies. She can be reached at editors@bix.com.
