t way to index unstructured data that would work on a global scale? Can we build a search engine like InfoSeek, Lycos, or AltaVista for images and sounds, for instance?
The answer to each question is yes, although there are significant limitations to search-engine technology, some inherent and some perhaps surmountable in time.
Manual Problems
Traditional database queries look for matches between search criteria and structured keys, the alpha, numeric, date, or time fields that point to a data record. The same approach works for unstructured data, such as images and sounds, but only if an operator enters one or more keys for each item. For instance, the operator might type in a catalog reference number, a caption, or a number of controlled-vocabulary keywords.
Manually entered keys are simple and effective for most databases, and they are still the dominant approach to indexing unstructured data. However, they do have drawbacks: Reviewing the data (e.g., looking at each image or listening to each sound) and then entering the key information is a labor-intensive and error-prone process. Depending on the nature of the data, you may need trained operators just to enter keys that are both correct and consistent. Finally, indexers have to know in advance which characteristics are important. The subject of the picture? The geographic location? The name of the photographer?
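As a minimal sketch of key-based indexing, the Python below stores operator-entered keywords against catalog numbers. The vocabulary, the catalog numbers, and the class name KeyIndex are all hypothetical, invented for illustration. It shows one way to enforce consistency: a keyword outside the controlled vocabulary is rejected instead of being silently stored.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary; a real catalog would define its own.
VOCABULARY = {"dog", "cat", "horse", "landscape", "portrait"}


class KeyIndex:
    """Maps manually entered keywords to catalog numbers."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.by_keyword = defaultdict(set)

    def add(self, catalog_no, keywords):
        # Normalize case and whitespace so "Dog " and "dog" are the same key.
        cleaned = {k.strip().lower() for k in keywords}
        unknown = cleaned - self.vocabulary
        if unknown:
            raise ValueError(f"not in vocabulary: {sorted(unknown)}")
        for k in cleaned:
            self.by_keyword[k].add(catalog_no)

    def search(self, *keywords):
        # Return only the items that carry every requested keyword.
        sets = [self.by_keyword.get(k.strip().lower(), set()) for k in keywords]
        return set.intersection(*sets) if sets else set()
```

The index can only answer questions about characteristics someone chose to record. If no operator entered the photographer's name, no query will find it.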
Content Queries
The alternative to the key-based query is the content-based query, which relies on the computer to examine the data object and report on its attributes, such as color, texture, or shape in an image or tonality or rhythm in an audio clip. Content-based queries have been used by NASA to scan images from satellites and by PhotoDisc to index its catalog of tens of thousands of stock digital images. NASA used IBM's Query by Image Content (QBIC) technology, while PhotoDisc used Visual Information Retrieval, from Virage (San Mateo, CA). Both are commercial products, although they may require customization or integration for individual applications.
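To make "examine the data object and report on its attributes" concrete, here is a minimal sketch of one such attribute, a coarse color histogram. This is an illustrative assumption, not a description of how QBIC or the Virage product works internally. The image is represented as a plain list of (r, g, b) tuples, and the function name and bin count are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram of quantized RGB colors.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    The result has bins_per_channel ** 3 entries that sum to 1.0
    (or all zeros for an empty image).
    """
    n = bins_per_channel
    counts = [0] * (n ** 3)
    total = 0
    for r, g, b in pixels:
        # Quantize each channel to 0..n-1, then combine into one bin index.
        rq, gq, bq = r * n // 256, g * n // 256, b * n // 256
        counts[(rq * n + gq) * n + bq] += 1
        total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

A vector like this records how much of each color an image contains and nothing about what the image depicts.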
Probably the most intractable problem with these products is their limited ability to tell you the things about data objects that you really want to know. For instance, none can tell you the subject of an image. If you want to find images of dogs, you'll need to search your image database by color, shape, and texture. Along with canines, your search is also likely to net a lot of cats, horses, cows, and perhaps a few chairs and automobiles.
Of course, key-based search engines have much the same problem: If you search on the word "knuckles," you'll find the term has one meaning to a human anatomist, another to a shipbuilder or architect, and still another to a butcher. With structured data, you may be able to formulate a query that eliminates the meanings you don't want (e.g., knuckles+human). If you're dealing only with unstructured data, it's not possible to impose limits based on the actual content of the picture because the retrieval engine deals only with mathematical patterns of color, shape, and texture, not with semantics (that is, the meaning of the content to a human observer).
Content-Based Work-Arounds
A number of techniques can make content-based querying useful despite its semantic ignorance. One approach involves multiple search cycles. The first time you search, you select an image that closely approximates what you're looking for. The search engine then compares subsequent images to that image.
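The multiple-cycle approach can be sketched as a query-by-example ranking. The sketch assumes each image has already been reduced to a feature vector (a list of floats) and uses a simple sum of absolute differences as the distance. Both choices, and the function names, are hypothetical and not taken from any product named in this article.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def query_by_example(example_id, features, top_n=3):
    """Rank stored images by similarity to one the user picked.

    features: dict mapping image id -> feature vector (list of floats).
    Returns up to top_n ids, nearest first, excluding the example itself.
    Ties are broken by id so the ordering is repeatable.
    """
    target = features[example_id]
    others = (i for i in features if i != example_id)
    ranked = sorted(others, key=lambda i: (l1_distance(features[i], target), i))
    return ranked[:top_n]
```

For a second cycle, the user picks the best of the returned images and passes its id back in as the new example.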
Another approach uses metadata, information gathered about the image when it was captured, such as file size, time and date, type of equipment used, and so on. If you're looking for a picture of a news event that took place in 1995, it's no use searching images captured before or after that date.
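A metadata filter of this kind might look like the following sketch. The record layout and field names are assumptions made for illustration; a real system would use whatever capture information it actually stores.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ImageRecord:
    # All field names here are hypothetical, chosen for illustration.
    image_id: str
    captured: date
    size_bytes: int
    device: str


def captured_between(records, start, end):
    """Keep only records whose capture date falls within [start, end]."""
    return [r for r in records if start <= r.captured <= end]
```

For the 1995 news event, calling captured_between(records, date(1995, 1, 1), date(1995, 12, 31)) discards everything captured in other years before any image content is examined.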
Mainly for military applications, Hughes Aircraft has developed algorithms for creating "smart metadata," which does deal with the semantics of images. Historically, these techniques have been applied primarily to very small sets of data. However, improvements in algorithms and new, faster, and lower-cost hardware could make these techniques increasingly relevant for mass screenings of database images, according to Hughes. So far, however, there are no commercial applications of smart metadata techniques.
For now, the most practical approach is one that combines metadata and what could be called "manual metadata." For example, to make complex data more accessible, you might automatically record the date, file size, and resolution for each object you add to your database. You can then search for relevant attributes, such as resolution. You can hone your search by typing in a few indexing keys, such as "dog," "cat," "horse." These steps will reduce the cost of manual indexing since you can count on the automated scan to distinguish brown dogs from black dogs and probably (perhaps with a multipass search) Siamese cats from tabbies. Automated retrieval methods work much better as you narrow the field of possibilities.
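The combined approach can be sketched as a two-stage search: cheap filters on automatic metadata and operator-entered keys first, then a content comparison on whatever remains. The record layout, the function name, and the distance measure are hypothetical and serve only to illustrate the ordering of the steps.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    # Hypothetical record combining the three kinds of information.
    item_id: str
    width: int                                    # recorded automatically
    height: int                                   # recorded automatically
    keywords: set = field(default_factory=set)    # typed in by an operator
    features: list = field(default_factory=list)  # computed from content


def search(items, keyword, min_width, example_features, top_n=5):
    """Narrow by metadata and keyword first, then rank by content."""
    # Step 1: cheap filters shrink the candidate set.
    candidates = [
        it for it in items
        if it.width >= min_width and keyword in it.keywords
    ]

    # Step 2: the costlier content comparison runs only on what is left.
    def distance(it):
        return sum(abs(x - y) for x, y in zip(it.features, example_features))

    return [it.item_id for it in sorted(candidates, key=distance)[:top_n]]
```

Because the keyword "dog" has already excluded the cats and horses, the content comparison only has to separate one dog from another, which is the kind of distinction it handles well.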
In addition to making queries more efficient and accurate, these techniques also address the problem of overwhelming the network or the querying computer with the results of the query. Another helpful technique is the creation of a "miniature" version of the data, such as a thumbnail of an image, which the operator can examine to further narrow the universe of search possibilities.
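As a rough sketch of how a thumbnail could be produced, the function below shrinks a grayscale image by averaging square blocks of pixels. Representing the image as a list of rows of integers, and the function name itself, are assumptions made to keep the example self-contained.

```python
def thumbnail(image, factor):
    """Shrink a grayscale image by averaging factor x factor blocks.

    image: list of equal-length rows of integer pixel values.
    Rows and columns that do not fill a complete block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    out_h = len(image) // factor
    out_w = len(image[0]) // factor if image else 0
    result = []
    for by in range(out_h):
        row = []
        for bx in range(out_w):
            # Collect every pixel in this block and store its integer mean.
            block = [
                image[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        result.append(row)
    return result
```

With a factor of 8, the thumbnail holds one sixty-fourth as many pixels as the original, so far less data has to cross the network for each result the operator reviews.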
Resource Expenses
Even with all these optimizations, queries of unstructured data tend to be expensive in terms of human effort, network resources, and processing power. "The average business manager will not do it," says Robert Rose, director of product strategy for Cognos, a maker of business intelligence tools (including PowerPlay, a multidimensional data exploration tool, and Impromptu, a high-end tool for complex queries). Cognos has been tracking query-by-content technology but has not incorporated it into its products. For the vast majority of business applications, says Rose, the best strategy is still to associate a particular image with a particular report or record. Users search for the topic of the report or the key to the record (such as an employee's name) and they get the unstructured data (such as a picture of the employee) thrown into the bargain.
Rose says he has not yet seen broad-based demand for query-by-content capabilities. However, he believes such a demand may emerge as users seek some practical means to search the vast data resources of the Web. "The Web is where supply-side technology lives," says Rose.
Query by Image Content (QBIC), developed by IBM, helps realtors, NASA, and others find data objects in image databases.
Virage's software looks for mathematical patterns to identify images.
Mike Hurwicz (mhurwicz@attmail.com) is a freelance writer based in New York City.