Server (MSIS). So far, I've added search functions to the public BYTE site using the latter two engines. Here's how.
Implementing EWS
Excite (formerly Architext) makes EWS freely available for several flavors of Unix and Windows NT; I'm running it on NT 3.51. At its core are two stand-alone programs -- architextindex and architextsearch. You needn't touch them, though, because EWS comes with Perl wrappers for them (and a copy of Perl 5 to execute the wrappers). You needn't touch the Perl wrappers either, because they are in turn wrapped in a layer of Hypertext Markup Language (HTML) so that administration as well as use of EWS is Web-driven.
The Web-oriented administrative style of EWS and other second-generation tools (including the Verity and Microsoft products) has two main advantages -- ease of use and remote access. Ease of use is a slippery concept. It's certainly true that you can build your first index more easily when the tools needed to configure and run the indexer are embedded in HTML forms that explain how to use the tools. However, this interactive mode becomes a hindrance when you move from prototyping to production, because a tool driven by URLs (uniform resource locators) is far more difficult to integrate into scheduled and scripted batch processes than is a command-line-driven tool.
With EWS, you can bypass the Web interface and use the Perl wrappers or underlying programs directly. This method isn't documented, but at least it's available. Other implementations foreclose that option entirely. Verity's freeware version of TopicSearch, for example, runs only as an Internet Server API (ISAPI) DLL attached to Microsoft's Internet Information Server (IIS). To automate use of its indexer, you would need to apply the technique I described in my November 1996 column ("On-Line Componentware") -- discover the URL-based API implicit in the Web interface and program to that API using a library that gives scripts the ability to "call" Common Gateway Interface (CGI) URLs.
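As a rough illustration of what programming to a URL-based API involves, here is a minimal Python sketch. The host name, script path, and field names are hypothetical placeholders, not the actual interface of TopicSearch or any other product; you would have to discover the real ones by reading the HTML forms of the Web interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_cgi_url(base_url, params):
    """Encode form fields into a GET-style CGI URL."""
    return base_url + "?" + urlencode(params)


def call_cgi(base_url, params, timeout=30):
    """'Call' the CGI URL and return the response body as text."""
    with urlopen(build_cgi_url(base_url, params), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical endpoint and field names, for illustration only.
url = build_cgi_url(
    "http://search.example.com/scripts/indexer.dll",
    {"action": "reindex", "collection": "press room"},
)
print(url)
```

A scheduled job could then invoke call_cgi() with the same arguments, which amounts to filling in the administrative form by hand.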
Silly Claims
Considering all that utter rigmarole, the notion of just adding a command plus some arguments to your /etc/crontab file (or its NT equivalent) seems rather appealing. Also, the ease-of-use claim that some vendors make on behalf of their Web-integrated solutions can appear somewhat silly. Web-based control of indexers or other kinds of applications can be a useful adjunct to conventional methods, but it's not necessarily an appropriate replacement for them.
Advanced Indexing
If your site is even moderately complex, you probably won't want to just point an indexer at the Web server's root and let it rip. You'll likely have distinct subtrees that you want to index. Within those subtrees, you'll want to include some classes of files but not others. Some indexers (e.g., the Microsoft product) can't exclude files at all. If you want to restrict search results, you'll have to parse and filter them on the fly. That's doable, but it's complex and computationally expensive. If there are subdirectories and files you don't want people to find, it's best to leave them out of your index.
The first engine I implemented on the BYTE Site, freeWAIS, has only a weak exclusion mechanism. You can exclude wild-carded filenames but not wild-carded paths. Because I need to do the latter, my indexing scripts for freeWAIS are too verbose. They enumerate long lists of subtrees for inclusion, rather than short lists of subtrees to include along with short lists of patterns to exclude.
SWISH is more agile. I can use the rule PATHNAME CONTAINS IMG to exclude the dozens of directories in which I store the HTML wrappers for BYTE Site images. These wrappers, which would otherwise be included in the index, contain hardly any useful text and are best left out.
If you need still more flexibility, EWS's exclusion mechanism is more agile yet. Unlike SWISH, EWS lets you describe excluded items not only with wild cards, but with full-blown regular expressions. Why would you need to do that?
Consider the BYTE Site's conferencing application I discussed in last month's column, "Dual-Mode Conferencing." It generates multiple versions of each message to support both frame-based and frameless viewing. However, you probably don't want to see two copies of each message in the search results list, so I decided to exclude the frame-based set. When I indexed the conferences with SWISH, I found that I couldn't differentiate between the two classes of files using simple wild cards. Fortunately, EWS's regular-expression capability solves this sort of problem neatly and quickly.
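To make the difference between wild cards and regular expressions concrete, here is a small Python sketch. The file-naming scheme is invented for the example (the actual names of the conference files aren't given here), and Python's fnmatch and re modules stand in for the pattern syntaxes of SWISH and EWS, which may differ in detail.

```python
import re
from fnmatch import fnmatch

# Hypothetical naming scheme: "123.html" is the frameless copy of a message,
# "f123.html" is the frame-based copy. "faq.html" is an unrelated page.
paths = [
    "conf/web/123.html",
    "conf/web/f123.html",
    "conf/web/124.html",
    "conf/web/f124.html",
    "conf/web/faq.html",
]

# A simple wildcard cannot say "f followed only by digits", so it also
# throws out faq.html.
wildcard_kept = [p for p in paths if not fnmatch(p, "*/f*.html")]

# A regular expression can state the rule exactly.
frame_copy = re.compile(r"/f\d+\.html$")
regex_kept = [p for p in paths if not frame_copy.search(p)]

print(wildcard_kept)
print(regex_kept)
```

Under these assumed names, the wildcard version drops the FAQ page along with the frame-based copies, while the regular expression removes only the duplicates.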
Best-Kept Secret
The odds are that if you've tried EWS on the BYTE Site or another of the sites where it runs, you've missed its best feature: query by example, or QBE (see the figure "Sophisticated Searches"). To try it for yourself, go to the BYTE Site and use EWS to search for Cyberdog. As you would expect, a list of clearly Web-related titles comes back.
What you might not expect, though, is that if you have used the default concept search setting, most of these articles won't include the word Cyberdog. They will, however, contain sets of terms (e.g., browser, HTTP, and OpenDoc) that correlate statistically with the few articles that mention Cyberdog. If you're interested in how OpenDoc relates to Cyberdog, or if you hadn't even known that it did, you will appreciate EWS automatically making that connection for you. That's what EWS means by concept search, and it's not even the best of the features that users tend to miss.
Here's the best feature: Click on the red or black icon that introduces one of the documents in the Cyberdog result set whose title includes the term OpenDoc. This action says: "Find similar articles." The new result set will contain many more OpenDoc-related titles. If the example didn't contain Cyberdog, the new result set will have taken you on a quite different tack from the original query.
This refocusing mechanism can be incredibly useful when you're doing research. You naturally want to follow a branching path through conceptual space. After scanning a few of the articles in the OpenDoc result set, you might want to focus specifically on comparisons between OpenDoc and OLE.
If you click on the icon that goes with the likeliest candidate, EWS will return a third result set in which the OLE theme is more prominent. Every article on every search-results page is itself an implicit query -- a single-click accessor of a set of related articles.
Once you discover this principle, it transforms how you explore a document collection. You don't have to worry about forming exactly the right search expression. Just seed the process with words that get you conceptually near what you're looking for and let QBE automatically feed statistical profiles back into the searcher as you click your way through a series of refinements.
It's brilliant, it's effective, and it's trivial to operate, yet many users never discover QBE even though every search page says "Click on the red or black icons to search for similar articles." I know this because I hacked the Perl wrapper to log search terms. (I always do this because analyzing what people search for tells me a lot about what kinds of information we ought to be providing.) Fewer than 5 percent of the first several thousand EWS queries I logged were of the QBE flavor, and many of these were my own tests. Moreover, an informal poll of BYTE staffers showed that while many had encountered EWS in their travels on the Web, none of them had discovered QBE.
"It's a problem," agrees Graham Spencer, chief technology officer for Excite. "In academic information retrieval, the average search expression is 12 to 15 terms long; on the Web, it's 1.5 terms." EWS constructs those 15-term expressions for you automatically when you use QBE. That users often don't realize this is partly a failure of user-interface design -- the icons could be bigger, the instructions more prominent. But it's also a failure of expectations. Web surfers accustomed to more conventional search technology just don't expect EWS to do what it does.
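How might a long search expression be derived from a single example document? The Python sketch below shows one common textbook approach, TF-IDF term weighting. It is only an assumption about the general idea; how EWS actually builds its statistical profiles isn't described here, and the function names and sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def example_query(example, collection, max_terms=15):
    """Return the terms that best characterize `example` within `collection`.

    Scoring is plain TF-IDF: a term scores high when it is frequent in the
    example document but rare across the collection.
    """
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(tokenize(doc)))

    term_freq = Counter(tokenize(example))
    n_docs = len(collection)
    scores = {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Sort by score, breaking ties alphabetically so the output is stable.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


# Invented sample documents, for illustration only.
docs = [
    "OpenDoc parts let a browser embed live components",
    "OLE and OpenDoc compete as component architectures",
    "The new server ships with a faster disk controller",
]
print(example_query(docs[0], docs, max_terms=3))
```

The returned terms would then be submitted as an ordinary query, so each click on a result produces a new multi-term expression without the user typing anything.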
Microsoft Index Server
As nifty as EWS is, it lacks four desirable features: a high-performance architecture, phrase and proximity search, field indexing and searching, and automatic indexing on demand. MSIS, though weak in some areas, offers these four features. To try it, you have to join a fairly exclusive club. MSIS doesn't just require NT and IIS; it demands NT 4.0 and IIS 2.0.
You're not prepared to obsolete a stable NT 3.51 production server just for this purpose? Neither was I, so I ran MSIS on a development server, pointed it at an HTML collection on the production server, and tweaked the result URLs to refer to actual files on the production server instead of nonexistent ones on the test machine.
Both EWS and MSIS are running this way -- as true distributed services off-loaded from the primary production server, linked to the site by means of URLs. It's easy to create this kind of distributed search capability with MSIS, because it automatically finds and offers to index any virtual directories mounted on IIS. If the MSIS/IIS machine and the document server live in the same NT domain, MSIS can index the remote document server.
There's one catch, though. IIS needs a user name and password to mount the virtual directory. After I supplied these credentials, browsers talking to IIS could read documents in and below that directory. But MSIS couldn't index them. Even though it runs as an ISAPI extension to IIS, it has its own notion of access credentials. I had to configure the IIS mount not just with a user name and password, but more specifically with a domain name\user name and password. Then it worked.
Implementing MSIS
For querying, MSIS again uses the Internet Database Connector model. In this scheme, an HTML form refers to a query-configuration (IDQ) file that names the index to search, enumerates which fields to return, and describes how to order those fields. The IDQ file also names an HTML template (HTX) file that will format the query results. To redirect the result URLs to the production server, I replaced occurrences of <%server_name%> in the HTX file with www.byte.com.
As with the Internet Database Connector, you can use other predefined variables with a simple IF ... THEN syntax to reformulate the result set (e.g., to chunk a long list of result URLs into a linked series of HTML pages). The HTX language is not powerful enough, however, to achieve the standard BYTE Site presentation of search results. I use Perl to capture document titles emitted by SWISH, parse out an issue-date field (e.g., February 1996), and sort in reverse chronological order.
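The date-sorting step can be sketched in a few lines. The version below is in Python rather than Perl, and it assumes that each title carries its issue date as a month name followed by a four-digit year; the exact layout of the real titles isn't shown here, so treat the sample titles and the function names as hypothetical.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
ISSUE_DATE = re.compile(r"\b(%s) (\d{4})\b" % "|".join(MONTHS))


def issue_key(title):
    """Return (year, month) for sorting, or (0, 0) if no date is found."""
    match = ISSUE_DATE.search(title)
    if not match:
        return (0, 0)
    return (int(match.group(2)), MONTHS.index(match.group(1)) + 1)


def newest_first(titles):
    """Sort titles in reverse chronological order; undated titles go last."""
    return sorted(titles, key=issue_key, reverse=True)


# Hypothetical titles; the real title layout may differ.
titles = [
    "On-Line Componentware - November 1996",
    "Some Undated Page",
    "A Sample Column - February 1996",
    "Another Sample Column - January 1997",
]
for title in newest_first(titles):
    print(title)
```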
Several areas of the BYTE Site cry out for field indexing. In the Virtual Press Room, you should be able to do a field (rather than full-text) search for company and product names. In the conferences, similarly, you should be able to search author and subject fields. If you use meta tags to create fields in your HTML document headers, MSIS automatically uses them to create field indexes. For example, a VPR document header looks like this:
<html><head>
<meta name="company" content="SunSoft">
<meta name="product" content="Java Workshop">
When I indexed the VPR collection, MSIS constructed company and product indexes. To use them, I had to add this incantation to my IDQ file:
MetaCompany(DBTYPE_STR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 company
MetaProduct(DBTYPE_STR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 product
Then, I could issue the query @MetaCompany SunSoft to find all SunSoft press releases. What a great idea! Field indexing adds a new dimension to the full-text indexing so common on the Web. It's rarely done for two reasons: Indexers often don't support it, and document collections often don't provide fielded content. Leveraging meta tags as MSIS does is the right way to advance the cause of field indexing. Other engines, including Netscape's Catalog Server and the high-end version of Verity's TopicSearch, can also exploit meta tags.
Unfortunately, MSIS in its current form (version 1.1) can't read or manipulate the contents of these user-defined fields. So while you can search for SunSoft press releases, you can't write an HTX file that sorts the results by product name. And you can't even use the HTX file to print the values of the company and product fields.
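One possible workaround is to read the meta tags straight from the documents in a post-processing script. The Python sketch below uses the standard html.parser module to do that for a header like the VPR example; the class and function names are invented for illustration, and nothing here is part of MSIS itself.

```python
from html.parser import HTMLParser


class MetaFields(HTMLParser):
    """Collect <meta name=... content=...> pairs into a dictionary."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.fields[name.lower()] = content


def meta_fields(html_text):
    """Return the meta fields found in a chunk of HTML."""
    parser = MetaFields()
    parser.feed(html_text)
    parser.close()
    return parser.fields


header = """<html><head>
<meta name="company" content="SunSoft">
<meta name="product" content="Java Workshop">
"""
print(meta_fields(header))
```

With the fields in a dictionary, a script that formats search results could sort or group them by product name, at the cost of fetching each result document.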
The Ultimate Engine?
Basic though SWISH is, I continue to get a lot of mileage out of it. In my view, there's no perfect search engine. If you haven't indexed your site yet, don't get too hung up on choosing the ultimate do-everything tool. Focus instead on tagging your data in ways that let you organize search results in useful ways. Meta tags are a great way to instrument your content so that results returned from any search engine can be sorted by date or category. Once that's done, you can easily replace a basic search tool with a fancier one.
TOOLWATCH
Internet Music Kit...................$49
Wildcat Canyon Software
Internet: http://www.wildcat.com/
Embed MIDI files on your Web pages for streaming playback using a Netscape plug-in or an ActiveX control.