Statistical-pattern recognition could be the underpinning for tomorrow's smarter voice processors and machine-vision systems
John L. Cuadrado
Dense fog shrouds your reconnaissance plane as it reaches the target area. You can't visually inspect the airfield below, but the SAR (synthetic aperture radar) system in your plane's cockpit reports four large aircraft parked on the tarmac. Unfortunately, the scattered radar signal can't tell you if the aircraft belie a heightening of tensions (fighters and bombers) or retrenchment (troop carriers). Your objective is to identify the planes on the ground or at least determine to which class they belong.
The key to answering these questions is statistical-pattern recognition, which can process raw radar data into feature sets, called vectors, that can help c
lassify the planes. This hypothetical problem is drawn from the military, a sector where much of the work in statistical-pattern recognition has taken place. But statistical-pattern recognition development is active today in the academic and the industrial worlds, where researchers from different disciplines--from document processing to engineering to medicine--are creating and applying new, more powerful techniques. As a result, the technology that will help make tomorrow's handwriting-recognition products, speech processors, or machine-vision systems more accurate will likely have roots in statistical-pattern recognition.
Today's activities are fueled by two catalysts: First, solutions to the recognition problems in these areas rest with the accurate classification of data, which is the forte of statistical-pattern recognition. Second, today's relatively inexpensive hardware brings the requisite processing horsepower to the desktop and factory floor. Here's how statistical-pattern recognition works.
Aircraft ID
Think of the radar data in the hypothetical example as a bit map that contains all the pictorial information about the airfield (see ``
Five Steps to Identifying an Aircraft
''). This type of image serves as input to various segmentation algorithms that partition the image into components. These components are further identified as possible candidates that represent objects of interest--subimages of aircraft in the above example.
The system extracts a set of measurements from each of the component subimages. For example, it might compute the centroid of the subimage or the largest and shortest axes. These measurements are referred to as the features of an object in the context of pattern recognition. The result of this processing is an n-dimensional vector of features (i.e., each detected subimage is characterized by one of these feature vectors). The number of features and what they represent varies with each problem.
You now have a fe
ature vector that might indicate the presence of a type of aircraft. This feature vector is data you input into the pattern-recognition system. The system's goal is to determine whether the feature vector belongs to one of several classifications that the system knows about. The final output of the system might be an image of the airfield with a tag next to each of the planes indicating their type.
The Basic Problem
The issues that your pattern-recognition system must address include how the system distinguishes plane types, how you train the system to identify different types of aircraft classes, how to make the classification system accurate, how computationally intense are the underlying algorithms, and how the system classifies planes in real time. If you abstract the SAR problem, you will see that the pattern-recognition challenge consists of partitioning a vector space F (feature vectors) into the various aircraft types. Each vector in F consists of a set of features that d
escribe characteristics of the objects.Two pattern-recognition approaches address these issues.
In one approach, you have a set of feature vectors that characterize some objects, and your goal is to find out if some partition of the space F groups these objects into any meaningful sets. This is known as the cluster problem. By contrast, this article deals with the second approach, where you are given a partition of the space F into a set of classes and a new vector. Called the classification problem, this approach determines to which of the given classes a new feature vector belongs.
To delve further into the classification problem, consider an example that was used in the 1930s by R. A. Fisher to develop his theory of discriminant functions. Here, the objects to be classified are sets of flowers called irises. There are three classes of irises: Iris Setosa, Iris Versicolor, and Iris Virginica. Botanists had studied these flowers and made many measurements on such things as petal and sepal lengt
h and width. Therefore, data exists to characterize 4-D vectors where the first two components correspond to petal length and width and the last two components correspond to sepal length and width. To simplify the problem even more, just consider the petal information. The final vector is 2-D: v = (x1, x2), in which the first component corresponds to petal length and the second to petal width.
For each vector, you know to which class of iris the vector measurements belong. Further, you estimate a probability function P and assign a number between 0 and 1 to each of the three iris classes. These functions are known as the prior probabilities. You can estimate their values based on sample data. For example, to assign a value to P(Versicolor), you count the number of flowers you are given in the Versicolor class and divide by the total number of flowers from all three classes. You need to devise a procedure that will let you determine the class of a new, unknown vector (see the figure ``
Iris
Classes
'').
Many techniques have been developed to solve problems like this one. These techniques can be broadly categorized as either parametric or nonparametric. In the parametric case, you assume that the underlying population from which the data comes has some known probability distribution that depends on some parameters. In this case, the goal is to determine, or estimate, these parameters from the sample data that you are given (i.e., the 2-D petal measurements on known examples of the three classes of irises). By contrast, nonparametric methods come into play when the data has no known limits.
Now, given a feature vector v, you can ask what the conditional probabilities P(Setosa vertical bar v), P(Versicolor vertical bar v), and P(Virginica vertical bar v) are. Think of these probabilities as saying, given a feature vector v, what is the probability that it comes from one of the three classes? If you know these conditional probabilities, you can solve the problem. The classification
rule is simple: If you are trying to classify a new vector v, you simply compute the conditional probabilities for each of the classes. The class that results in the largest number ``wins,'' and you classify v as belonging to that class. It turns out that this is the best you can do if the goal is to minimize the total error of classification. This classification rule is known as Bayes' rule, and it forms the starting point for most other approaches.
The problem with Bayes' rule is that you need to know the conditional probabilities P(Setosa vertical bar v), P(Versicolor vertical bar v), and P(Virginica vertical bar v), and these are almost never available. A much easier set of numbers to obtain is P(v vertical bar Setosa), P(v vertical bar Versicolor), and P(v vertical bar Virginica). These conditional probabilities represent the likelihood of having a certain set of features given that the vector v comes from the corresponding class. These conditional probabilities can be estimated by sampling the o
bjects from each of the classes. In practice, a large number of samples from each class would be examined to isolate characteristics of particular common features that are specific to each class.
In theory, you could compute tables for conditional probabilities P(v vertical bar Setosa), P(v vertical bar Versicolor), and P(v vertical bar Virginica). Unfortunately, what you want is P(Setosa vertical bar v), P(Versicolor vertical bar v), and P(Virginica vertical bar v). But not to worry; you can use a form of Bayes' rule that requires that you assign the new feature vector v to the class that produces the largest of the following three numbers:
P(v vertical bar Setosa) P(Setosa)
P(v vertical bar Versicolor) P(Versicolor)
P(v vertical bar Virginica) P(Virginica)
This is clean and simple, but in the real world, obtaining even these conditional probabilities can be difficult. In practice, the most common approach is to assume that the underlying probability dist
ribution is normal (i.e., the familiar bell-shaped curve). This distribution is determined by two parameters, the mean and the variance (or covariance matrix).
To estimate the mean and the variance of the iris population, you need large samples of data for each iris class. Many techniques exist for estimating these parameters. Assuming you're armed with the values of the estimated mean and covariance, you can construct the discriminant functions that will serve as classifiers for any new feature vector. In the example previously described, you are given a new iris, and you measure its petal's length and width. This produces a new vector that you input into your classifier. The result is an assignment of the new flower to one of the three classes of irises.
Many other techniques exist for solving the classification problem. Among the more common nonparametric techniques is the k-Nearest Neighbor method. The basic premise of this algorithm involves the use of centroids. Again, to keep things simpl
e, assume that you have computed the centroids of each of the three classes of irises. You have found three representative vectors--v1, v2, and v3--that are typical of each of the three classes (see the figure ``
Nearest Neighbor Classifier
''). Given a new flower, you again measure its petal's length and width and form a new vector v. You compute the distance from this new vector to each of the class centroid vectors v1, v2, and v3 (distance can be defined in many ways). You assign the new flower to the class with the smallest distance from one of your classification vectors. The simplest definition in the current case is the standard Euclidean distance, but if you want to take into account the probabilistic information available in the data, then there are much better choices.
Applications
In medicine, one of the main problems is to assign a set of signs and symptoms to a disease category. Here statistical-classification techniques are designed using a combi
nation of medical theory and empirical knowledge.
Another area where these techniques are used is in engineering diagnostics. Here, you have theory-based and empirical classifiers. For example, in many complex devices, the possible failure modes may run into the millions. A brute force enumeration of such modes and attending characterizations is clearly prohibitive. Thus, in such cases, the system is observed in operation under various typical conditions, and occurring failures and their causes are analyzed to develop failure classification data.
These failure modes are characterized by feature vectors along the line of the iris examples. Using this empirical data, classifiers can be developed so that when the device is out on the field and a new failure occurs the feature vector can be produced and quickly classified. In the future, algorithm development for handwriting recognition and other areas will propel classification techniques into more business and general-purpose applications.
BIBLIOGRAPHY
Fisher, R. A. ``The use of multiple measurements in taxonomic problems.'' Annals of Eugenics, vol. 7, part II, 1936, p. 179.
Huberty, Carl J. Applied Discriminant Analysis. New York: John Wiley, 1994.
Michie, D., D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural Networks and Statistical Classification. London: Ellis Horwood, 1994.
Panel on Discriminant Analysis, Classification, and Clustering. Discriminant Analysis and Clustering, Statistical Science, vol. 4, no.1, 1989, p. 34.
illustration_link (10 Kbytes)
In a hypothetical military application of statistical-pattern recognition, radar data informs the system that four aircraft of unidentified class sit on an airfield tarmac.
Step 2
illustrati
on_link (9 Kbytes)
The aircraft images become input to various segmentation algorithms that divide the image into components.
Step 3
illustration_link (22 Kbytes)
After taking measurements of subimage components (e.g., the largest and shortest axes), the recognition system develops a set of feature vectors, which are the keys to unlocking each aircraft's identity.
Step 4
illustration_link (15 Kbytes)
The system compares feature vectors of the unidentified aircraft with specifications of known aircraft held in the system's database.
Step 5
illustration_link (18 Kbytes)
After matching the feature vectors of known and unknown air
craft, the system classifies the images captured by radar.
illustration_link (31 Kbytes)
The classification approach to statistical-pattern recognition works to determine how a new value corresponds to known classes. For example, the stars, dots, and x's above correspond to the lengths and widths of three classes of iris.
illustration_link (31 Kbytes)
To determine which iris class a new flower belongs, compare its measurements to the centroid vectors in the three known iris classes. In this approach, you assign the new flower to the class with the shortest centroid distance.
John Cuadrado is an independent consultant in Maine. His primary areas of interest incl
ude distributed database systems, applied AI, and theories of parallel computation. You can reach him on the Internet or BIX c/o
editors@bix.com
.