
Search Again


January 1997 / Web Project / Search Again

Move beyond the basics to take advantage of query by example, concept searching, and field indexing.

Jon Udell

In my September 1995 column ("Web Search"), I showed how to add basic indexing and search functions to a Web site (see http://www.byte.com/art/9509/sec9/art1.htm). The freely available tools I discussed there -- freeWAIS (Wide Area Information Servers) and Simple Web Indexing System for Humans (SWISH) -- have served the BYTE Site well and continue to support thousands of searches every day. However, the site's growth in size and complexity mandates a more sophisticated search capability than these basic tools can easily provide.

Therefore, I've been testing a number of indexing and search tools -- WebGlimpse, Verity's TopicSearch, InMagic's DB/Text WebServer, Digital Equipment's AltaVista Private Extensions, Excite for Web Servers (EWS), and the Microsoft Index Server (MSIS). So far, I've added search functions to the public BYTE Site using the last two engines. Here's how.

Implementing EWS

Excite (formerly Architext) makes EWS freely available for several flavors of Unix and Windows NT; I'm running it on NT 3.51. At its core are two stand-alone programs -- architextindex and architextsearch. You needn't touch them, though, because EWS comes with Perl wrappers for them (and a copy of Perl 5 to execute the wrappers). You needn't touch the Perl wrappers either, because they are in turn wrapped in a layer of Hypertext Markup Language (HTML) so that administration as well as use of EWS is Web-driven.

The Web-oriented administrative style of EWS and other second-generation tools (including the Verity and Microsoft products) has two main advantages -- ease of use and remote access. Ease of use is a slippery concept. It's certainly true that you can build your first index more easily when the tools needed to configure and run the indexer are embedded in HTML forms that explain how to use the tools. However, this interactive mode becomes a hindrance when you move from prototyping to production, because a URL-driven (uniform resource locator) tool is far more difficult to integrate into scheduled and scripted batch processes than is a command-line-driven tool.

With EWS, you can bypass the Web interface and use the Perl wrappers or underlying programs directly. This method isn't documented, but at least it's available. Other implementations foreclose that option entirely. Verity's freeware version of TopicSearch, for example, runs only as an Internet Server API (ISAPI) DLL attached to Microsoft's Internet Information Server (IIS). To automate use of its indexer, you would need to apply the technique I described in my November 1996 column ("On-Line Componentware") -- discover the URL-based API implicit in the Web interface and program to that API using a library that gives scripts the ability to "call" Common Gateway Interface (CGI) URLs.
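To make the idea of "calling" a CGI URL from a script concrete, here is a minimal Python sketch. The admin URL and the form-field names (`action`, `collection`) are hypothetical placeholders, not the interface of any product named here; a real tool's implicit API has to be discovered by reading its HTML admin forms.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical admin endpoint; substitute whatever URL the product's
# own HTML admin form actually submits to.
ADMIN_URL = "http://search.example.com/scripts/indexer-admin"


def build_admin_request(action, collection):
    """Return the GET URL that mimics submitting the admin form."""
    query = urlencode({"action": action, "collection": collection})
    return f"{ADMIN_URL}?{query}"


def run_admin_request(action, collection, timeout=60):
    """'Call' the CGI URL from a script and return the response body."""
    url = build_admin_request(action, collection)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("latin-1")
```

A scheduled job could then call `run_admin_request("reindex", "vpr")` and inspect the returned HTML for a success message, which is noticeably more work than a one-line crontab entry.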

Silly Claims

Considering all that rigmarole, the notion of just adding a command plus some arguments to your /etc/crontab file (or its NT equivalent) seems rather appealing. It also makes the ease-of-use claim that some vendors make on behalf of their Web-integrated solutions appear somewhat silly. Web-based control of indexers or other kinds of applications can be a useful adjunct to conventional methods, but it's not necessarily an appropriate replacement for them.

Advanced Indexing

If your site is even moderately complex, you probably won't want to just point an indexer at the Web server's root and let it rip. You'll likely have distinct subtrees that you want to index. Within those subtrees, you'll want to include some classes of files but not others. Some indexers (e.g., the Microsoft product) can't exclude files at all. If you want to restrict search results, you'll have to parse and filter them on the fly. That's doable, but it's complex and computationally expensive. If there are subdirectories and files you don't want people to find, it's best to leave them out of your index.

The first engine I implemented on the BYTE Site, freeWAIS, has only a weak exclusion mechanism. You can exclude wild-carded filenames but not wild-carded paths. Because I need to do the latter, my indexing scripts for freeWAIS are too verbose. They enumerate long lists of subtrees for inclusion, rather than short lists of subtrees to include along with short lists of patterns to exclude.

SWISH is more agile. I can use the rule PATHNAME CONTAINS IMG to exclude the dozens of directories in which I store the HTML wrappers for BYTE Site images. These wrappers, which would otherwise be included in the index, contain hardly any useful text and are best left out.

If you need additional flexibility, EWS's exclusion mechanism is even more agile. Unlike SWISH, EWS lets you describe excluded items not only with wild cards but also with full-blown regular expressions. Why would you need to do that?

Consider the BYTE Site's conferencing application I discussed in last month's column, "Dual-Mode Conferencing." It generates multiple versions of each message to support both frame-based and frameless viewing. However, you probably don't want to see two copies of each message in the search results list, so I decided to exclude the frame-based set. When I indexed the conferences with SWISH, I found that I couldn't differentiate between the two classes of files using simple wild cards. Fortunately, EWS's regular-expression capability solves this sort of problem neatly and quickly.
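As an illustration of regular-expression exclusion, the following Python sketch filters a list of candidate paths before indexing. The directory layout and the `_frame` file-naming scheme are assumptions made up for the example; the actual names used on the BYTE Site are not given here, and this is not EWS's own configuration syntax.

```python
import re

# Hypothetical exclusion rules, one compiled pattern per rule.
EXCLUDE_PATTERNS = [
    # Any path with an "img" directory component (image wrappers).
    re.compile(r"/img/", re.IGNORECASE),
    # Frame-based copies of conference messages (assumed naming scheme).
    re.compile(r"/conf/.+/msg\d+_frame\.html?$"),
]


def should_index(path):
    """Return True unless the path matches any exclusion pattern."""
    return not any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)


def filter_paths(paths):
    """Keep only the paths that survive every exclusion rule."""
    return [path for path in paths if should_index(path)]
```

The first pattern plays the role of a "pathname contains" rule; the second shows the kind of structural match (digits followed by a fixed suffix) that is easy to state as a regular expression.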

Best-Kept Secret

The odds are that if you've tried EWS on the BYTE Site or another of the sites where it runs, you've missed its best feature: query by example, or QBE (see the figure "Sophisticated Searches"). To try it for yourself, go to the BYTE Site and use EWS to search for Cyberdog. As you would expect, a list of clearly Web-related titles comes back.

What you might not expect, though, is that if you have used the default concept search setting, most of these articles won't include the word Cyberdog. They will, however, contain sets of terms (e.g., browser, HTTP, and OpenDoc) that correlate statistically with the few articles that mention Cyberdog. If you're interested in how OpenDoc relates to Cyberdog, or if you hadn't even known that it did, you will appreciate EWS automatically making that connection for you. That's what EWS means by the notion of concept search -- and it's not even the best feature that users often don't get.

Here's the best feature: Click on the red or black icon that introduces one of the documents in the Cyberdog result set whose title includes the term OpenDoc. This action says: "Find similar articles." The new result set will contain many more OpenDoc-related titles. If the example didn't contain Cyberdog, the new result set will have taken you on a quite different tack from the original query.

This refocusing mechanism can be incredibly useful when you're doing research. You naturally want to follow a branching path through conceptual space. After scanning a few of the articles in the OpenDoc result set, you might want to focus specifically on comparisons between OpenDoc and OLE.

If you click on the icon that goes with the likeliest candidate, EWS will return a third result set in which the OLE theme is more prominent. Every article on every search-results page is itself an implicit query -- a single-click accessor of a set of related articles.

Once you discover this principle, it transforms how you explore a document collection. You don't have to worry about forming exactly the right search expression. Just seed the process with words that get you conceptually near what you're looking for and let QBE automatically feed statistical profiles back into the searcher as you click your way through a series of refinements.
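How EWS computes its statistical profiles is not described here, so the following is only a generic sketch of the query-by-example idea, using TF-IDF term weights and cosine similarity as a stand-in technique. The heaviest terms of the clicked document become the query, and every other document is ranked against it. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_profiles(docs):
    """Map each document id to a TF-IDF weighted term vector."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n = len(docs)
    return {
        doc_id: {
            term: tf * math.log((1 + n) / (1 + doc_freq[term]))
            for term, tf in term_counts.items()
        }
        for doc_id, term_counts in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_terms(profile, k=15):
    """Keep the k heaviest terms: the automatically built long query."""
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def find_similar(example_id, profiles, k=15):
    """Rank all other documents against the example's top-k terms."""
    query = top_terms(profiles[example_id], k)
    scored = [
        (cosine(query, profile), doc_id)
        for doc_id, profile in profiles.items()
        if doc_id != example_id
    ]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

Each call to `find_similar` with a document from the previous result set corresponds to one click on an icon: the user never types the multi-term query that the profile stands in for.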

It's brilliant, it's effective, and it's trivial to operate, yet many users never discover QBE even though every search page says "Click on the red or black icons to search for similar articles." I know this because I hacked the Perl wrapper to log search terms. (I always do this because analyzing what people search for tells me a lot about what kinds of information we ought to be providing.) Fewer than 5 percent of the first several thousand EWS queries I logged were of the QBE flavor, and many of these were my own tests. Moreover, an informal poll of BYTE staffers showed that while many had encountered EWS in their travels on the Web, none of them had discovered QBE.

"It's a problem," agrees Graham Spencer, chief technology officer for Excite. "In academic information retrieval, the average search expression is 12 to 15 terms long; on the Web, it's 1.5 terms." EWS constructs those 15-term expressions for you automatically when you use QBE. That users often don't realize this is partly a failure of user-interface design -- the icons could be bigger, the instructions more prominent. But it's also a failure of expectations. Web surfers accustomed to more conventional search technology just don't expect EWS to do what it does.

Microsoft Index Server

As nifty as EWS is, it lacks four desirable features: a high-performance architecture, phrase and proximity search, field indexing and searching, and automatic indexing on demand. MSIS, though weak in some areas, offers these four features. To try it, you have to join a fairly exclusive club. MSIS doesn't just require NT and IIS; it demands NT 4.0 and IIS 2.0.

You're not prepared to obsolete a stable NT 3.51 production server just for this purpose? Neither was I, so I ran MSIS on a development server, pointed it at an HTML collection on the production server, and tweaked the result URLs to refer to actual files on the production server instead of nonexistent ones on the test machine.
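The URL tweak amounts to swapping the host name while leaving the path alone. A minimal Python sketch of that rewrite follows; the development host name in the usage note is a made-up placeholder, and the function is an illustration of the idea rather than the mechanism MSIS itself uses.

```python
from urllib.parse import urlsplit, urlunsplit

PRODUCTION_HOST = "www.byte.com"


def to_production_url(result_url, host=PRODUCTION_HOST):
    """Point a search-result URL at the production server.

    Only the host changes; scheme, path, query, and fragment are kept.
    """
    parts = urlsplit(result_url)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For example, `to_production_url("http://devbox.example.com/art/9509/sec9/art1.htm")` yields the same path on www.byte.com.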

Both EWS and MSIS are running this way -- as true distributed services off-loaded from the primary production server, linked to the site by means of URLs. It's easy to create this kind of distributed search capability with MSIS, because it automatically finds and offers to index any virtual directories mounted on IIS. If the MSIS/IIS machine and the document server live in the same NT domain, MSIS can index the remote document server.

There's one catch, though. IIS needs a user name and password to mount the virtual directory. After I supplied these credentials, browsers talking to IIS could read documents in and below that directory. But MSIS couldn't index them. Even though it runs as an ISAPI extension to IIS, it has its own notion of access credentials. I had to configure the IIS mount not just with a user name and password, but more specifically with a domain name\user name and password. Then it worked.

Implementing MSIS

For querying, MSIS uses the Internet Database Connector model again. In this scheme, an HTML form refers to a query-configuration (IDQ) file that names the index to search, enumerates which fields to return, and describes how to order those fields. The IDQ file also names an HTML template (HTX) file that will format the query results. To redirect the result URLs to the production server, I replaced occurrences of <%server_name%> in the HTX file with www.byte.com.

As with the Internet Database Connector, you can use other predefined variables with a simple IF ... THEN syntax to reformulate the result set (e.g., to chunk a long list of result URLs into a linked series of HTML pages). The HTX language is not powerful enough, however, to achieve the standard BYTE Site presentation of search results. For that presentation, I use Perl to capture document titles emitted by SWISH, parse out an issue-date field (e.g., February 1996), and sort in reverse chronological order.
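The parse-and-sort step can be sketched in a few lines. The version below is in Python rather than Perl, and it assumes each title carries a "Month YYYY" issue date somewhere in its text; the real title format and script are not shown here.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches an issue date such as "February 1996" anywhere in a title.
ISSUE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def issue_date(title):
    """Return the issue date found in a title, or None if there is none."""
    match = ISSUE_RE.search(title)
    if match is None:
        return None
    return date(int(match.group(2)), MONTHS[match.group(1)], 1)


def sort_newest_first(titles):
    """Sort titles in reverse chronological order; undated titles go last."""
    return sorted(titles, key=lambda t: issue_date(t) or date.min, reverse=True)
```

Because the sort key is derived from the title text alone, the same post-processing works on the output of any engine that returns document titles.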

Several areas of the BYTE Site cry out for field indexing. In the Virtual Press Room, you should be able to do a field (rather than full-text) search for company and product names. In the conferences, similarly, you should be able to search author and subject fields. If you use meta tags to create fields in your HTML document headers, MSIS automatically uses them to create field indexes. For example, a VPR document header looks like this:

<html><head>
<meta name="company" content="SunSoft">
<meta name="product" content="Java Workshop">

When I indexed the VPR collection, MSIS constructed company and product indexes. To use them, I had to add this incantation to my IDQ file:

MetaCompany(DBTYPE_STR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 company
MetaProduct(DBTYPE_STR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 product

Then, I could issue the query

@MetaCompany SunSoft

to find all SunSoft press releases. What a great idea! Field indexing adds a new dimension to the full-text indexing so common on the Web. It's rarely done for two reasons: Indexers often don't support it, and document collections often don't provide fielded content. Leveraging meta tags as MSIS does is the right way to advance the cause of field indexing. Other engines, including Netscape's Catalog Server and the high-end version of Verity's TopicSearch, can also exploit meta tags.
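To show what a field index built from meta tags amounts to, here is a small Python sketch using the standard library's HTML parser. It is a toy illustration of the general idea, not a description of how MSIS stores or queries its indexes; the function names are invented for the example, and it matches whole field values rather than individual words.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a document."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.fields[name.lower()] = content


def extract_fields(html):
    """Return the meta-tag fields of one HTML document as a dict."""
    parser = MetaFieldParser()
    parser.feed(html)
    return parser.fields


def build_field_index(docs):
    """Build {field: {lowercased value: set of URLs}} from {url: html}."""
    index = defaultdict(lambda: defaultdict(set))
    for url, html in docs.items():
        for field, value in extract_fields(html).items():
            index[field][value.lower()].add(url)
    return index


def field_search(index, field, value):
    """Return the sorted URLs whose given field equals the value."""
    return sorted(index.get(field, {}).get(value.lower(), set()))
```

With an index built this way, `field_search(index, "company", "SunSoft")` plays the role of the `@MetaCompany SunSoft` query, and because `extract_fields` keeps the original values, they remain available for display or sorting.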

Unfortunately, MSIS in its current form (version 1.1) can't read or manipulate the contents of these user-defined fields. So while you can search for SunSoft press releases, you can't write an HTX file that sorts the results by product name. And you can't even use the HTX file to print the values of the company and product fields.

The Ultimate Engine?

Basic though SWISH is, I continue to get a lot of mileage out of it. In my view, there's no perfect search engine. If you haven't indexed your site yet, don't get too hung up on choosing the ultimate do-everything tool. Focus instead on tagging your data in ways that let you organize search results in useful ways. Meta tags are a great way to instrument your content so that results returned from any search engine can be sorted by date or category. Once that's done, you can easily replace a basic search tool with a fancier one.


TOOLWATCH


Internet Music Kit...................$49

Wildcat Canyon Software
Internet: 
http://www.wildcat.com/

Embed MIDI files on your Web pages for streaming playback 
using a Netscape plug-in or an ActiveX control.


BOOKNOTE


CORBA: A Guide to the Common Object Request Broker Architecture
 
by Ron Ben-Natan

Price:     $45
A useful guide to interface definition language, object 
services, and object database management.


Where to Find

WebGlimpse                                
http://donkey.cs.arizona.edu/webglimpse/

Verity's TopicSearch
http://www.verity.com/

InMagic's DB/Text WebServer               
http://www.inmagic.com/

Digital's AltaVista Private Extensions    
http://altavista.software.digital.com/

Excite for Web Servers                    
http://www.excite.com/

Microsoft Index Server                    
http://www.microsoft.com/internet/




Indexing and Search Tools


Benefit

In general, the tools make it easier to create 
first-time indexes of your site.


Problem

The tools can become a hindrance when you go from 
prototyping to production.


Advice

Don't expect to find an ultimate do-everything tool.
Focus instead on tagging your data to organize search 
results in useful ways.


Products Tested at the BYTE Site

WebGlimpse
Verity's TopicSearch
InMagic's DB/Text WebServer
Digital Equipment's AltaVista Private Extensions
Excite for Web Servers
Microsoft Index Server




Sophisticated Searches

illustration_link (49 Kbytes)

Query by example is effective and trivial to operate. Unfortunately, many people never use it for more effective Web searches.


CORBA

photo_link (33 Kbytes)


Jon Udell ( judell@bix.com ) is BYTE's executive editor for new media.
