A Data Miner's Tools


October 1995 / State Of The Art / A Data Miner's Tools

Intelligent agents, multidimensional analysis tools, and good old database queries all belong in the well-equipped data miner's toolbox

Karen Watterson

Put down that pickax, Eugene. Mining the information nuggets from your data requires specialized tools. Some are brand new, and some are borrowed from other sources; some are as dumb as bricks, and some seem to have minds of their own.

Even if your job description doesn't say anything about data mining (DM) -- yet -- the fact is that most information workers spend much of each day performing DM. The right tools can give your data miners that extra edge. As Don Keough, former president of Coca-Cola, might add to any discussion of DM, "Who[ever] has information fastest and uses it wins." So grab your helmet and check out what the well-equipped data miner is using.

Talk to My Agent

Remember Bill Gates's vision of "information at our fingertips"? Well, the truth is that most of us have information up to our eyeballs and probably receive more in a week than we could process in several lifetimes. One challenge of DM is to develop intelligent agents that can prioritize and/or filter the data bombarding us daily, including our overflowing E-mail. But DM is also about revealing new relationships and patterns and using software agents that will do the mining for us, often performing the screening and fetching functions of yesterday's secretaries and clerks.

The notion of harnessing computers to perform tedious chores is as old as science fiction. Have you ever tried to identify buying patterns of high-margin customers? Or look for patterns that may be indicators of fraud? Do you use an Internet clipping service to provide you with a personal news summary? Or alert you to wire-service announcements related to a key competitor? If the answer to any of these questions is yes, you're already dealing with the tasks of intelligent agents.

Several categories of intelligent agents are available. Some are launched manually to perform specific queries or to search for patterns in data. Others fire off automatically at predefined intervals, performing a task or monitoring a condition in the background and returning an alert as required. Most intelligent agents are simply short programs that say "if this happens, do that."
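The "if this happens, do that" pattern is easy to picture in code. What follows is only a minimal sketch, not any vendor's agent: the `Rule` class, the `run_agent` function, and the inventory and order thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Reading = Dict[str, float]


@dataclass
class Rule:
    """One agent rule: a condition to watch for and an action to take."""
    name: str
    condition: Callable[[Reading], bool]  # "if this happens"
    action: Callable[[Reading], str]      # "do that"


def run_agent(rules: List[Rule], reading: Reading) -> List[str]:
    """Apply every rule to one reading and collect the alerts that fire."""
    return [rule.action(reading) for rule in rules if rule.condition(reading)]


# Hypothetical rules; the thresholds are invented for this example.
rules = [
    Rule("low_stock",
         lambda r: r["inventory"] < 50,
         lambda r: f"ALERT low_stock: inventory={r['inventory']}"),
    Rule("big_order",
         lambda r: r["order_total"] > 10_000,
         lambda r: f"ALERT big_order: total={r['order_total']}"),
]

alerts = run_agent(rules, {"inventory": 20, "order_total": 500})
for alert in alerts:
    print(alert)
```

A manually launched agent would call `run_agent` once on demand; a background agent would call it at a predefined interval against fresh readings.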

A handful of DM tools are sometimes lumped together under the rubric information discovery or knowledge discovery. They often have a resemblance -- algorithmically speaking -- to expert systems or AI. Most of these autonomous tools are low-touch but high-tech.

As Adam Szladow, president of Reduct Systems, a firm that markets rule-generating software, says, "One wants to get strong, repetitive patterns, patterns that occur with some frequency." One Reduct customer discovered rules for predicting business creditworthiness based on a database that had only a few hundred cases.

Mining for Dollars

For some people, intelligent agents are the sine qua non of DM. Barry Mason, who is a principal with IBM's Consulting Group, defines DM as "discovery tools which take large amounts of detailed transaction-level data and apply mathematical techniques against it, `finding' or discovering insights into consumer behavior."

For Mason, DM is the first step in a series of activities that can lead to new actionable business intelligence. IBM has proprietary, patent-pending techniques to analyze gigantic data sets for cross-marketing or affinity marketing and look for patterns. These patterns may be so nonobvious as to appear almost nonsensical, such as that people who have bought scuba gear are good candidates for taking Australian vacations.
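IBM's techniques are proprietary, so the following is not a description of them. It is a generic sketch of how an affinity such as "scuba gear buyers also book Australian vacations" can be scored with two common measures: support (how often both items appear together) and confidence (how often the second item appears given the first). The function name `affinity_rules`, the thresholds, and the sample baskets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def affinity_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (a, b, support, confidence) for every rule "a implies b"
    that clears both thresholds, strongest confidence first."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        # Count each ordered pair once per basket that contains both items.
        pair_counts.update(permutations(sorted(items), 2))

    rules = []
    for (a, b), both in pair_counts.items():
        support = both / n
        confidence = both / item_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Invented purchase histories, one set per customer.
baskets = [
    {"scuba gear", "australia trip"},
    {"scuba gear", "australia trip", "sunscreen"},
    {"scuba gear", "sunscreen"},
    {"sunscreen"},
    {"australia trip", "scuba gear"},
]

rules = affinity_rules(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

On real transaction volumes the pair counting would need pruning and a database behind it; the point here is only what "finding" an affinity means numerically.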

Other products, such as the DataEngine from MIT GmbH, use fuzzy logic and neural-network algorithms to do DM that helps analyze and control real-time technical processes. DataEngine, a programmer's tool not intended for casual end users, includes a visualization component that can provide hints about process bottlenecks, for example.

Reduct's Data/Logic products also ferret out patterns, automatically generating rules that can be probed using varying degrees of boundary "roughness," a technique akin to fuzzy set analysis. Wall Street analyst Murray Riggiero Jr. used Data/Logic in conjunction with neural-network software to generate rules for his trading system.
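The article does not describe how Data/Logic computes "roughness." Assuming the term refers to rough-set-style approximation (an assumption on our part), the idea can be sketched as follows: cases that look identical on the chosen attributes are grouped into blocks, and a concept such as "creditworthy" is then bracketed by a lower approximation (blocks that are entirely creditworthy) and an upper approximation (blocks containing at least one creditworthy case). The records and attribute names below are invented.

```python
from collections import defaultdict


def approximations(records, attributes, target):
    """Return (lower, upper) approximations of `target`, a set of record ids.

    records: dict mapping record id -> dict of attribute values.
    Records with identical values on `attributes` are indistinguishable.
    """
    blocks = defaultdict(set)
    for rid, row in records.items():
        blocks[tuple(row[a] for a in attributes)].add(rid)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # every look-alike case is in the concept
            lower |= block
        if block & target:       # at least one look-alike case is in it
            upper |= block
    return lower, upper


# Hypothetical business records; ids 1-3 are known to be creditworthy.
records = {
    1: {"years": "5+", "late": "no"},
    2: {"years": "5+", "late": "no"},
    3: {"years": "<5", "late": "no"},
    4: {"years": "<5", "late": "no"},
    5: {"years": "<5", "late": "yes"},
}
creditworthy = {1, 2, 3}

lower, upper = approximations(records, ["years", "late"], creditworthy)
boundary = upper - lower            # cases the attributes cannot decide
accuracy = len(lower) / len(upper)  # 1.0 would mean a crisp concept
print(sorted(lower), sorted(upper), sorted(boundary), accuracy)
```

Rules drawn from the lower approximation are certain for the data at hand; rules that admit boundary cases are the "rougher," more inexact kind.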

IntelligenceWare's automatic information-discovery tool, IDIS, also looks for correlations. It forms, tests, and modifies its own hypotheses until classification rules or rules with intervals, or more inexact rules, emerge. IDIS has been used successfully in applications ranging from fraud detection to consumer loan analysis to optimizing production lines.

Most humans are better at detecting anomalies than inferring relationships from large data sets, and that's why information discovery can be so useful. Rather than relying on a human to come up with hypotheses that can be confirmed or rejected based on the evidence (i.e., data), good discovery tools will look at the data and essentially generate the hypotheses.

A Dimension of Mine

Do you use your spreadsheet's crosstab or pivoting features? Have you explored data using slice-and-dice techniques to examine it from different perspectives and in varying amounts of detail? If so, you've encountered another part of the DM toolbox: multidimensional analysis (MDA), or on-line analytical processing (OLAP), tools.

Bruce Love of the Gartner Group has described DM as "an intensive search for new information and new combinations pursuing defined paths of inquiry and allowing unexpected results to generate new lines of analysis and further exploration." Love is clearly thinking of iterative exploratory techniques of data surfing using MDA or OLAP tools. MDA represents data as n-dimensional matrices called hypercubes. OLAP and related hypercubes let users iteratively calculate metrics such as sales, revenue, market share, or inventory over a subset of available data, by exploring combinations of one or more dimensions of data.
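To make "calculating a metric over a subset of the data by combinations of dimensions" concrete, here is a minimal sketch in plain Python. It is not how any OLAP server stores or indexes a hypercube; the `roll_up` function, the dimension names, and the sales figures are all made up for illustration.

```python
from collections import defaultdict


def roll_up(facts, dims, measure="sales", where=None):
    """Sum `measure` grouped by the dimensions in `dims`.

    `where` optionally restricts the calculation to a slice of the data,
    e.g. {"color": "red"}.
    """
    where = where or {}
    totals = defaultdict(float)
    for row in facts:
        if all(row[key] == value for key, value in where.items()):
            totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


# Invented fact rows: three dimensions and one measure.
facts = [
    {"brand": "Acme", "color": "red", "store": "Boston", "sales": 120.0},
    {"brand": "Acme", "color": "blue", "store": "Boston", "sales": 80.0},
    {"brand": "Acme", "color": "red", "store": "Denver", "sales": 50.0},
    {"brand": "Zenith", "color": "red", "store": "Boston", "sales": 200.0},
]

by_brand = roll_up(facts, ["brand"])
red_by_brand_and_store = roll_up(facts, ["brand", "store"], where={"color": "red"})
print(by_brand)
print(red_by_brand_and_store)
```

Each call is one step of the iterative exploration: change `dims` to pivot, change `where` to slice, and look again.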

The idea is to load a multidimensional server with data that is likely to be combined. Imagine all the possible ways of analyzing clothing sales: by brand name, size, color, location, advertising, and so on. If you fill a multidimensional hypercube with this data, viewing it from any 2-D perspective -- an n-dimensional hypercube offers n*(n-1) such views, counting each choice of row dimension and column dimension separately -- will be easy and fast.
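The n*(n-1) figure can be checked by enumeration. The sketch below uses the five clothing dimensions named above and assumes a "view" means picking one dimension for the rows and a different one for the columns; if a view and its transpose are treated as the same thing, the count halves to n*(n-1)/2.

```python
from itertools import combinations, permutations

dimensions = ["brand", "size", "color", "location", "advertising"]
n = len(dimensions)

# (row dimension, column dimension) layouts: order matters.
ordered_views = list(permutations(dimensions, 2))

# Pairs of dimensions regardless of which one runs down the rows.
unordered_views = list(combinations(dimensions, 2))

print(len(ordered_views), "layouts;", len(unordered_views), "distinct pairs")
```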

That's the appeal of products from vendors such as Arbor Software (Essbase), Comshare (Commander OLAP), Oracle/IRI Software (Express EIS), and Pilot Software (Lightship). Karl Stephan, a senior financial manager in Sears's planning analysis section, remembers when it took hours to assemble the data he needed into a spreadsheet. Using Essbase, it takes just seconds. Delta Airlines has used OLAP to gain insights into its frequent-flier program. Using Planning Sciences' Gentium, the airline has consolidated data from a 100-GB Teradata database into six far more accessible multidimensional databases totaling a mere 6 GB.

OLAP servers are great for time-series analyses, recursive calculations (e.g., how to allocate overhead as a percent of revenue contribution by product line), and data with up to about 15 dimensions. Beyond that, most multidimensional servers fail under the sheer weight of their own indexes. Michael Saylor, president of OLAP vendor MicroStrategy, segments the DM market into three parts. He recommends spreadsheets and query tools for slice-and-dice data mining on databases of up to about 1 GB, departmental OLAP servers for up to about 20 GB, and enterprise warehouses for anything above that.

George Zagelow, program manager for data-warehousing solutions at IBM, concurs that most businesses need more than a single DM tool. "Multidimensional databases, OLAP products, DM, and traditional decision-support tools all belong in your toolbox right alongside standard relational databases."

For example, rather than use an OLAP or hypercube tool, you're better off creating a warehouse using a relational database if you have lots of data or are facing complex loading and consolidation from multiple data sources. Why? Because there's a mature utility market to support those activities. However, don't expect mining operations that represent joins across many multirow tables to be fast. That's where OLAP servers shine, providing blindingly fast results to queries along predefined dimensions.

Hypercubes vs. Killer Queries

Let's say you decide to mine your existing database to find the customers most likely to respond to a mail-order promotion. You might try using a query-and-reporting tool such as Information Builders' Focus Reporter or Software AG's Esperant to construct an SQL query that asks, "How many credit-card customers who made purchases of over $100 on sporting goods in August have at least $2000 of available credit?" If the number is too big, you might refine it: "Narrow it down to customers under 40 who live within 30 miles of a store in a coastal state."
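Here is roughly what such a question looks like once a tool has turned it into SQL. This is a sketch using Python's built-in `sqlite3` module and an invented two-table schema (`customers`, `purchases`); a real credit-card database will look different. Two assumptions are made: "purchases of over $100" is read as any single purchase above $100, and the refinement shows only the age test, since distance to a store and coastal-state membership would need geographic data the toy schema lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY, age INTEGER, available_credit REAL);
CREATE TABLE purchases (
    customer_id INTEGER REFERENCES customers(id),
    category TEXT, amount REAL, purchase_date TEXT);

-- Invented sample rows.
INSERT INTO customers VALUES (1, 34, 2500), (2, 52, 3000), (3, 29, 1500);
INSERT INTO purchases VALUES
    (1, 'sporting goods', 150, '1995-08-03'),
    (2, 'sporting goods', 220, '1995-08-17'),
    (3, 'sporting goods', 300, '1995-08-21'),
    (1, 'housewares',     400, '1995-08-09'),
    (2, 'sporting goods',  90, '1995-07-30');
""")

QUERY = """
SELECT COUNT(DISTINCT c.id)
FROM customers AS c
JOIN purchases AS p ON p.customer_id = c.id
WHERE p.category = 'sporting goods'
  AND p.amount > 100
  AND p.purchase_date BETWEEN '1995-08-01' AND '1995-08-31'
  AND c.available_credit >= 2000
"""

base_count = conn.execute(QUERY).fetchone()[0]
# Refinement: keep only customers under 40.
refined_count = conn.execute(QUERY + " AND c.age < 40").fetchone()[0]
print(base_count, refined_count)
```

Even in this toy form the query joins two tables and filters on four conditions, which hints at why ad hoc queries like it are a burden on a busy production system.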

Although you can construct queries such as these using query-and-reporting tools that work with relational databases, such unfettered querying can bring a production system to its knees. That's why DM operations are usually made against data that's been warehoused, either in a traditional relational database or consolidated into a multidimensional hypercube.

Although the relational/OLAP wars may continue for another year, chances are we'll see some convergence. Already, relational warehouses and virtual OLAP servers based on the relational model are adding support for star schemata. The idea is to mimic multidimensionality by creating special tables that contain roll-up data. For example, you might have a central fact table with sales data, surrounded by star tables with location, time, and product data. An innovation from Cross/Z International uses fractals to store warehouse data. The idea behind Fractal Database Mining System is to provide OLAP-style responses for huge data warehouses.
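The fact-and-star-table layout described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_location`, `dim_time`, `dim_product`) and the rows are invented for illustration; they are not taken from any product mentioned here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star (dimension) tables describe where, when, and what.
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, month TEXT, quarter TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The central fact table holds the sales measure plus one key per dimension.
CREATE TABLE fact_sales (
    location_id INTEGER REFERENCES dim_location(location_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    amount      REAL);

INSERT INTO dim_location VALUES (1, 'Boston', 'East'), (2, 'Denver', 'West');
INSERT INTO dim_time     VALUES (1, '1995-07', 'Q3'), (2, '1995-10', 'Q4');
INSERT INTO dim_product  VALUES (1, 'jacket', 'outerwear'), (2, 'scarf', 'accessories');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 100), (1, 1, 2, 20), (2, 1, 1, 80), (1, 2, 1, 150), (2, 2, 2, 30);
""")

# Roll sales up to region and quarter by joining the fact table to two stars.
rows = conn.execute("""
SELECT l.region, t.quarter, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_location AS l ON l.location_id = f.location_id
JOIN dim_time     AS t ON t.time_id = f.time_id
GROUP BY l.region, t.quarter
ORDER BY l.region, t.quarter
""").fetchall()

for region, quarter, total in rows:
    print(region, quarter, total)
```

A warehouse that wants OLAP-like response times would typically store the result of a roll-up like this in its own summary table rather than recomputing it for every request.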

Because adding intelligent-agent capability to software isn't that complicated, most of today's OLAP products and query-and-reporting tools (e.g., Brio Technology's BrioQuery, Comshare's Commander OLAP with Detect and Alert, Information Advantage's NewsLine 3.0, and Trinzic's Forest & Trees) have this sort of intelligent agent built in. Comshare's Detect and Alert "robots" monitor news feeds or even Lotus Notes databases for keywords and stock quotes from Dow Jones News/Retrieval for predefined values.

With Forest & Trees, an administrator sets up alarms for trigger values. Many warehouse and middleware products (e.g., Trinzic's InfoPump) include intelligent agents that schedule data transfer from production systems to decision-support databases.

Mining with Query Tools

If you don't have an OLAP server or an enterprise data warehouse, don't despair. Lots of mining can be done from the desktop using client/server generation query-and-reporting tools. With many products available (e.g., Business Objects, Powersoft's InfoMaker, and Crystal Services' Crystal Reports), it's hard to differentiate among them. They range from traditional spreadsheets to products from vendors such as IQ Software and Cognos that provide strong support for MIS oversight.

Most of these tools come with graphing components. Some even support a degree of multidimensionality, such as pivoting, intelligent drilling, crosstab reporting, and time-series analysis. A few are beginning to offer easy-to-use intelligent-agent support (versus alerts that can be established programmatically). If you need to select a new query-and-reporting tool and need to support a mixed environment of PCs and Macs, be sure to make that a feature on your checklist.

You should think of query-and-reporting tools as generic mining tools. They generally support direct access to source data and may offer cross-database joins, but their unbridled use can wreak havoc with production systems. And, given the challenges of performing joins across systems, it may be hard for end users to know if the answer they're getting is accurate.

Query tools can be used for interactive exploration, especially against relational data. Most query tools construct SQL queries for the data miner and can be slow if the source data is scattered among many tables -- especially large ones, on multiple databases, that are poorly indexed.

Tooling Up

DM is such a hot concept that it's showing up in nonspecialized tools. For example, high-end financial and statistical analysis, decision-support, and EIS vendors are adding DM capability (or, at least, labeling) to their products. Also, enterprise database vendors such as IBM and Hewlett-Packard are creating data-warehouse suites (IBM's Visual Warehouse) or virtual-warehouse frameworks (HP's Open Warehouse) that include -- you guessed it -- DM tools.

Converging from the desktop are spreadsheets and query-and-reporting tools associated with client/server applications. These tend to be high-touch tools, although many let users set up hands-off intelligent-agent alerts.

IntelligenceWare's Kamran Parsaye views DM and decision support as a set of fairly distinct spaces, each with its own set of algorithms. This includes an aggregation space containing precomputed OLAP data (to answer such questions as "What is the trend in Joe's sales by product and by month compared with average sales figures?"), an influence or discovery space (where relationships are discovered and refined), and a related variation space (for questions such as "How have weekly changes in prices varied over the last year?").
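A question from the variation space, such as how weekly price changes have varied, operates on differences between observations rather than on the observations themselves. The sketch below is not Parsaye's algorithm; the `weekly_changes` function and the price series are invented to show the kind of calculation involved.

```python
from statistics import mean, pstdev


def weekly_changes(prices):
    """Return the week-over-week change for a list of weekly prices."""
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


# Invented weekly prices for one product.
weekly_prices = [10.0, 10.5, 10.25, 11.0, 10.75]

changes = weekly_changes(weekly_prices)
average_change = mean(changes)
spread = pstdev(changes)  # how much the weekly changes themselves vary

print(changes, average_change, round(spread, 3))
```

An aggregation-space question would instead sum or average the prices; the distinction is in what gets analyzed, not in the difficulty of the arithmetic.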

Because information-discovery tools have only recently gained widespread attention as DM tools, they still tend to be rather technical and best suited for analysts with strong mathematical backgrounds. Look for explosive growth in this area of DM tools as better user interfaces make them easier for end users to harness. As for intelligent agents, especially agents as Internet gofers and E-mail filters, within a year, you'll wonder how you ever lived without them.

The popularity of DM shows that businesses are looking for new ways to let end users find the data they need to make decisions, serve customers, and gain a competitive advantage. If your workers aren't asking for better mining tools, you'd better ask why.


WHERE TO FIND

Arbor Software Corp.
Sunnyvale, CA
(800) 858-1666
(408) 727-5800
fax: (408) 727-7140

Brio Technology, Inc.
Mountain View, CA
(800) 486-2746
(415) 961-4110
fax: (415) 961-4572

Business Objects, Inc.
Cupertino, CA
(800) 703-1515
(408) 973-9300
fax: (408) 973-1057

Cognos, Inc.
Burlington, MA
(800) 426-4667
(617) 229-6600
fax: (617) 229-9828

Comshare
Ann Arbor, MI
(800) 922-7979
(313) 994-4800
fax: (313) 769-6943
info@comshare.com
http://www.comshare.com

Cross/Z International, Inc.
Uniondale, NY
(510) 522-4000
fax: (516) 228-8584
crossz@netcom.com

Crystal Services, Inc.
Vancouver, British Columbia, Canada
(800) 877-2340
(604) 681-3435
fax: (604) 681-2934

Hewlett-Packard Co.
Cupertino, CA
(800) 637-7740
(408) 725-8900
fax: (408) 447-4458
http://www.hp.com

IBM
Somers, NY
(800) 547-1283
http://www.ibm.com

Information Advantage
Edina, MN
(800) 959-6527
(612) 820-0702
fax: (612) 820-0712
marketing@infoadv.mn.org

Information Builders, Inc.
New York, NY
(800) 969-4636
(212) 736-4433
fax: (212) 564-1726
info@ibi.com
http://www.ibi.com

IntelligenceWare
Los Angeles, CA
(800) 888-2996
(310) 216-6177
fax: (310) 417-8897
datamine@ix.netcom.com

IQ Software Co.
Norcross, GA
(800) 458-0386
(404) 446-8880
fax: (404) 448-4088
sales@mhs.iqsc.com

MicroStrategy
Vienna, VA
(800) 927-1868
(703) 848-8600
fax: (703) 848-8610
info@strategy.com

MIT GmbH
Aachen, Germany
+49 2408 9458 11
fax: +49 2408 948 2
info@mitgmbH.de

Oracle/IRI Software
Waltham, MA
(800) 765-7227
(617) 890-1100
fax: (617) 890-4660
iri.software@infores.com

Pilot Software
Cambridge, MA
(800) 944-0094
(617) 374-9400
fax: (617) 374-1110

Planning Sciences, Inc.
Littleton, CO
(303) 794-8701
fax: (303) 794-8702

Powersoft Corp.
Concord, MA
(800) 395-3525
(508) 287-1500
fax: (508) 287-1600
http://www.powersoft.com

Reduct Systems, Inc.
Regina, Saskatchewan, Canada
(306) 586-9408
fax: (306) 586-9442

Software AG
Reston, VA
(800) 423-2227
(703) 860-5050
fax: (703) 391-6731
http://www.sagus.com

Trinzic Corp.
Redwood City, CA
(415) 591-8200
fax: (415) 594-8645

Selecting Data-Mining Tools


Intelligent agents require some expertise to set up but need little direction. Some work directly on text. Best used for turning up unsuspected relationships. Don't turn these things loose on production systems, though.

Multidimensional-analysis (MDA) tools can have simple graphical interfaces for nonexpert use. Work on databases but really zip with multidimensional hypercubes of data (special setup). Best use: iterative, interactive, hands-on exploration of data.

Query-and-reporting tools require close direction to frame queries (many simplify the process with graphical interfaces). Require a database structure. Best use: asking specific questions to verify hypotheses. Legendary for their ability to bog down production systems.

Intelligent Agents

Illustration: Intelligent agents can sift bales of point-of-sale records, hypothesize connections, and report discoveries.


Multidimensional Analysis

Illustration: Multidimensional analysis supports interactive scrutiny of data, refining the focus and testing ideas.


Karen Watterson is an independent San Diego-based writer and consultant specializing in client/server issues. She is the editor of two newsletters and the author of Visual Basic Database Programming and Client/Server Technology for Managers (both from Addison-Wesley). You can reach her at 1119390@mcimail.com.

Copyright © 2005 CMP Media LLC