and enriched my ongoing R&D project. This month, I'll pass along some of the lessons I've learned.
An HTML Generator Keeps Paying Dividends
In last July's inaugural column, I described a technique for generating a Hypertext Markup Language (HTML) archive from a collection of well-structured ASCII texts. This approach continues to pay handsome dividends. Every time I enhance the HTML generator (a program written in EEL--the C-like extension language of Lugaru's Epsilon text editor), new features propagate instantly throughout the entire BYTE on-line archive. Some of the improvements added this way include:
-- activation of URL-like character strings (automatic link creation)
-- links to a unique-per-page feedback form
-- links from thumbnails to full-sized images
-- new image-containing documents that wrap titles and captions around bare GIFs
-- rich table-of-contents pages that use indentation and data-type icons to display the structure of articles and the number and type of attachments
There's no end in sight. For example, a reader pointed out that pages with thumbnails would load faster if I added WIDTH and HEIGHT attributes to each thumbnail's IMG tag. And he showed me how to extract image dimensions from a GIF file: Bytes 7 and 8 are the width, 9 and 10 are the height. As soon as I write the EEL function to extract these values, tweak the function that emits the IMG tag for thumbnails, and rebuild the archive, the hundreds of pages containing thumbnails will improve.
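A minimal sketch of that extraction, written in Python rather than EEL. It assumes the two values are stored as little-endian 16-bit integers, as the GIF format specifies; the function names and the sample header are made up for illustration and are not the generator's actual code.

```python
import struct

def gif_dimensions(data):
    """Return (width, height) read from the first 10 bytes of a GIF file."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF header")
    # Bytes 7-8 (offsets 6-7) hold the width and bytes 9-10 (offsets 8-9)
    # hold the height, each as a little-endian unsigned 16-bit integer.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height

def thumbnail_img_tag(src, data):
    """Emit an IMG tag with WIDTH and HEIGHT filled in (hypothetical helper)."""
    width, height = gif_dimensions(data)
    return f'<IMG SRC="{src}" WIDTH={width} HEIGHT={height}>'

# A synthetic 10-byte header standing in for a real thumbnail file.
header = b"GIF89a" + struct.pack("<HH", 120, 90)
print(thumbnail_img_tag("thumb.gif", header))
# prints <IMG SRC="thumb.gif" WIDTH=120 HEIGHT=90>
```

With a real file, reading the first 10 bytes (`open(path, "rb").read(10)`) is enough; the rest of the image never has to be loaded.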
Where There Is Structure, You Can Add Value
I was lucky to inherit the collection of well-structured text files on which my HTML generator operates. Another source of structure widely exploited on the Net is the RFC 822 format to which e-mail and Usenet messages conform. Searchable and navigable Web archives of listserv and Usenet discussions are among the most useful information resources on the Web.
Why do these resources exist? The reasons are so obvious that, ironically, we tend to forget them. Messages have a regular structure--From:, Subject:, Date:. And messages are in ASCII and therefore open to a wealth of programming tools.
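To show how little code that regularity demands, here is a small sketch using Python's standard email module. The sample message and the `summarize` helper are invented for illustration.

```python
from email import message_from_string

# An invented RFC 822-style message: header lines, a blank line, then the body.
raw = """\
From: reader@example.com
Subject: WIDTH and HEIGHT on thumbnails
Date: Mon, 15 Jul 1996 09:30:00 -0400

Pages with thumbnails would load faster with image dimensions.
"""

def summarize(raw_message):
    """Pull the three fields an archive index would sort and search on."""
    msg = message_from_string(raw_message)
    return {name: msg[name] for name in ("From", "Subject", "Date")}

fields = summarize(raw)
print(fields["Subject"])
# prints WIDTH and HEIGHT on thumbnails
```

Because every message exposes the same fields, an archive builder can group by Subject or sort by Date without knowing anything else about the content.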
Where There Isn't Structure, Create It
What happens when you exhaust all your legacy sources of regularly structured ASCII texts? You can create new ones! That's just what VPR, our Virtual Press Room, does (for details, see two previous columns: "Global Groupware" and "Perl Magic," November and December 1995 BYTE). Every day, vendors and public-relations professionals contribute a half-dozen documents to this database. The Web application that receives these documents converts them into HTML and creates a header (using HTML META tags) to store company, product, date, and other fielded data.
I'm continually finding new ways to exploit that header. During Fall Comdex, for example, I added a field to track press releases for products that vendors were nominating for the BYTE Best of Comdex awards. In the spring, I used a different field to track CeBIT award nominees. Then it dawned on me that the VPR database had acquired a Lotus Notes-like flexibility. An individual "record" in this database has no canonical shape. Some records include Comdex fields, some include CeBIT fields, many include neither. A Perl script scans the headers and builds the views we need.
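The header-scanning idea can be sketched as follows, in Python rather than the Perl that VPR actually uses. The field names (`company`, `comdex`, `cebit`), the file names, and the sample documents are all assumptions made up for this example; the real VPR header layout is not shown in this column.

```python
from html.parser import HTMLParser

class MetaFields(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags into a dictionary."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs:
                self.fields[attrs["name"].lower()] = attrs.get("content", "")

def read_fields(html):
    parser = MetaFields()
    parser.feed(html)
    return parser.fields

def build_view(documents, field):
    """List the documents whose header carries the given field."""
    return sorted(name for name, html in documents.items()
                  if field in read_fields(html))

# Hypothetical records: no canonical shape, each carries only the fields it needs.
docs = {
    "a.html": '<HEAD><META NAME="company" CONTENT="Acme"><META NAME="comdex" CONTENT="yes"></HEAD>',
    "b.html": '<HEAD><META NAME="company" CONTENT="Globex"><META NAME="cebit" CONTENT="yes"></HEAD>',
    "c.html": '<HEAD><META NAME="company" CONTENT="Initech"></HEAD>',
}
print(build_view(docs, "comdex"))
# prints ['a.html']
```

Adding a new kind of record means adding a new META field to some documents; the existing views keep working because they only look for the fields they care about.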
Make the Most of Full-Text Search
In last September's "Web Search" column, I talked about two of the Web's freely available index-and-search tools: freeWAIS and SWISH. Both are still in play on the BYTE Site. At first I favored freeWAIS because it indexed and searched faster than SWISH. Now that the archive has grown to more than 5000 documents, I'm finding that freeWAIS's relevance-ranking feature gets in the way. If you're looking for ISDN, it thinks that a 50-word article containing that term is highly relevant because the term represents a relatively high proportion of the document's total content. So it prefers BYTE's What's New product announcements that mention ISDN to the meatier Features or Reviews that users more often want to find. To eliminate this bias, I'll probably convert our default search function from freeWAIS to SWISH.
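The bias is easy to reproduce with a toy score. The sketch below is a deliberately simplified stand-in, not freeWAIS's actual ranking algorithm: it scores a document by the fraction of its words that match the query, and the two sample documents are synthetic.

```python
def density_score(text, term):
    """Fraction of the document's words that equal the search term."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)

# A 50-word product announcement that mentions ISDN once...
announcement = "ISDN modem ships " + "filler " * 47
# ...and a 2000-word feature that mentions it 20 times.
feature = "ISDN " * 20 + "filler " * 1980

print(density_score(announcement, "ISDN"))  # prints 0.02
print(density_score(feature, "ISDN"))       # prints 0.01
```

Under this kind of scoring, the short announcement outranks the long feature even though the feature says 20 times as much about the topic.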
Because Web search tools typically return document titles (that is, what's between <title> and </title> in the HTML header), you should think carefully about how you construct titles. I got ours partly right, but partly wrong. Here's what I got right: Every title that my HTML generator emits encodes three important pieces of information. The issue (for example, February 1995) tells you the age of the article; the section (such as Reviews, What's New) tells you the type of the article; the title tells you about the article. This combination of clues made the results of my first search implementations much more useful than would have been the case had I used HTML document titles alone. It also enabled the refinement I introduced a few months later. I tweaked the search scripts to pick out the issue names, sort on them, and group the search results in reverse-chronological order.
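That sorting refinement might look like the following sketch. The " / " separator and the sample titles are assumptions for illustration; the column does not show the exact layout the generator uses.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def issue_key(title):
    """Turn the leading 'Month Year' part of a title into a sortable key."""
    issue = title.split(" / ", 1)[0]       # e.g. "February 1995"
    month, year = issue.split()
    return (int(year), MONTHS.index(month))

def newest_first(titles):
    """Order search hits in reverse-chronological order by issue."""
    return sorted(titles, key=issue_key, reverse=True)

# Hypothetical search hits in the assumed issue / section / title layout.
hits = [
    "February 1995 / Reviews / ISDN Adapters Compared",
    "March 1996 / What's New / ISDN Modem Ships",
    "November 1995 / Features / Wiring the Home Office",
]
for title in newest_first(hits):
    print(title)
```

Because the issue is encoded in the title itself, the search script can sort hits without reopening any of the documents.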
Here's what I got wrong: I forgot to include a fourth item, BYTE Magazine, in our titles. Why is that needed? On the BYTE Site, it isn't. Obviously, any search results gleaned from there refer to BYTE articles. But that's not so obvious when you view search results on Alta Vista or Open Text, where hits on the BYTE archive intermingle with hits from many other sources. The lesson here is subtle but profound. The Web punishes parochialism. You have to think globally. Your site isn't an island; it's part of a self-organizing federation of sites. Software components succeed by exposing clean interfaces to "the outside world." So, increasingly, will Web sites.
Use Full-Text Search with Fielded Search
Responding to my "Web Search" column, Ulrich Pfeifer alerted me to a derivative of freeWAIS, called freeWAIS-sf, which he maintains at http://ls6-www.informatik.uni-dortmund.de/projects/freeWAIS-sf/. With freeWAIS-sf, you can index fields within documents. I've prototyped a version of VPR, for example, in which you can search for occurrences of