
Lessons Learned


July 1996 / Web Project / Lessons Learned

Advice distilled from a year of site-building experience and reader feedback.

Jon Udell

A year ago, the BYTE Site served up its first document. By the time you read this, it will have delivered a million BYTE articles to a half-million visitors--an audience as large as our magazine's. Along the way, I've learned a lot about electronic publishing. And I've discovered how Internet-enabled applications can transform a Web site from a library of documents into a groupware platform that can energize the internal operations of a business as well as its relationships with partners and customers.

As a Web developer, I enjoy a unique advantage. The progress report that I file here each month is reviewed by a half-million computer professionals--you, the readers of BYTE. Your feedback has guided and enriched my ongoing R&D project. This month, I'll pass along some of the lessons I've learned.

An HTML Generator Keeps Paying Dividends

In last July's inaugural column, I described a technique for generating a Hypertext Markup Language (HTML) archive from a collection of well-structured ASCII texts. This approach continues to pay handsome dividends. Every time I enhance the HTML generator (a program written in EEL--the C-like extension language of Lugaru's Epsilon text editor), new features propagate instantly throughout the entire BYTE on-line archive. Some of the improvements added this way include:


-- activation of URL-like character strings (automatic link creation)
-- links to a unique-per-page feedback form
-- links from thumbnails to full-sized images
-- new image-containing documents that wrap titles and captions around bare GIFs
-- rich table-of-contents pages that use indentation and data-type icons to display the structure of articles and the number and type of attachments

There's no end in sight. For example, a reader pointed out that pages with thumbnails would load faster if I added WIDTH and HEIGHT attributes to each thumbnail's IMG tag. And he showed me how to extract image dimensions from a GIF file: Bytes 7 and 8 are the width, 9 and 10 are the height. As soon as I write the EEL function to extract these values, tweak the function that emits the IMG tag for thumbnails, and rebuild the archive, the hundreds of pages containing thumbnails will improve.
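
As a rough illustration of the reader's tip, here is a minimal sketch in Python rather than the EEL function the column anticipates. It assumes the standard GIF header layout, in which the two dimension fields are little-endian 16-bit integers; the function names and the sample file name are hypothetical.

```python
import struct

def gif_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the first 10 bytes of a GIF file."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF header")
    # Bytes 7-8 (offsets 6-7) hold the width and bytes 9-10 (offsets 8-9)
    # the height, each as a little-endian unsigned 16-bit integer.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height

def thumbnail_img_tag(src: str, data: bytes) -> str:
    """Emit an IMG tag that carries explicit WIDTH and HEIGHT attributes."""
    width, height = gif_dimensions(data)
    return f'<img src="{src}" width="{width}" height="{height}">'

# A hand-built header for a 120x90 image, standing in for a real file.
sample = b"GIF89a" + struct.pack("<HH", 120, 90)
print(thumbnail_img_tag("/art/thumb.gif", sample))
```

Only the first ten bytes of each file are needed, so a generator could read the dimensions without loading whole images.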

Where There Is Structure, You Can Add Value

I was lucky to inherit the collection of well-structured text files on which my HTML generator operates. Another source of structure widely exploited on the Net is the RFC 822 format to which e-mail and Usenet messages conform. Searchable and navigable Web archives of listserv and Usenet discussions are among the most useful information resources on the Web.

Why do these resources exist? The reasons are so obvious that, ironically, we tend to forget them. Messages have a regular structure--From:, Subject:, Date:. And messages are in ASCII, and therefore open to a wealth of programming tools.
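
To make the point concrete, the sketch below uses Python's standard email module to pull the three headers named above out of a message. The message itself is invented for illustration.

```python
from email import message_from_string

# A made-up message in RFC 822 layout: headers, blank line, body.
raw = """\
From: reader@example.com
Subject: GIF dimensions
Date: Mon, 3 Jun 1996 10:15:00 -0400

Bytes 7 and 8 are the width.
"""

msg = message_from_string(raw)
summary = {name: msg[name] for name in ("From", "Subject", "Date")}
print(summary)
```

Because every message exposes the same named fields, an archive builder can sort, thread, or index on them without guessing at the layout.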

Where There Isn't Structure, Create It

What happens when you exhaust all your legacy sources of regularly structured ASCII texts? You can create new ones! That's just what VPR, our Virtual Press Room, does (for details, see two previous columns: "Global Groupware" and "Perl Magic," November and December 1995 BYTE). Every day, vendors and public-relations professionals contribute a half-dozen documents to this database. The Web application that receives these documents converts them into HTML and creates a header (using HTML META tags) to store company, product, date, and other fielded data.

I'm continually finding new ways to exploit that header. During Fall Comdex, for example, I added a field to track press releases for products that vendors were nominating for the BYTE Best of Comdex awards. In the spring, I used a different field to track CeBIT award nominees. Then it dawned on me that the VPR database had acquired a Lotus Notes-like flexibility. An individual "record" in this database has no canonical shape. Some records include Comdex fields, some include CeBIT fields, many include neither. A Perl script scans the headers and builds the views we need.
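
The header-scanning idea can be sketched as follows. This is Python, not the Perl script the column mentions, and the field names, values, and file names are assumptions made up for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetaFields(HTMLParser):
    """Collect <meta name=... content=...> pairs from one document."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.fields[attr["name"].lower()] = attr["content"]

def read_fields(html: str) -> dict:
    parser = MetaFields()
    parser.feed(html)
    return parser.fields

# Hypothetical records: field names and values are invented for illustration.
docs = {
    "a.htm": '<head><meta name="company" content="Acme"><meta name="comdex" content="nominee"></head>',
    "b.htm": '<head><meta name="company" content="Bolt"><meta name="cebit" content="nominee"></head>',
    "c.htm": '<head><meta name="company" content="Core"></head>',
}

# Build one view per field: field name -> documents that carry it.
views = defaultdict(list)
for name, html in sorted(docs.items()):
    for field in read_fields(html):
        views[field].append(name)

print(dict(views))
```

Records without a given field simply do not appear in that field's view, which is the "no canonical shape" property described above.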

Make the Most of Full-Text Search

In last September's "Web Search" column, I talked about two of the Web's freely available index-and-search tools: freeWAIS and SWISH. Both are still in play on the BYTE Site. At first I favored freeWAIS because it indexed and searched faster than SWISH. Now that the archive has grown to more than 5000 documents, I'm finding that freeWAIS's relevance-ranking feature gets in the way. If you're looking for ISDN, it thinks that a 50-word article containing that term is highly relevant because the term represents a relatively high proportion of the document's total content. So it prefers BYTE's What's New product announcements that mention ISDN to the meatier Features or Reviews that users more often want to find. To eliminate this bias, I'll probably convert our default search function from freeWAIS to SWISH.
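
The bias is easy to reproduce with a toy score. The sketch below is not freeWAIS's actual ranking formula; it only assumes a score proportional to the share of a document's words that match the term, and both sample documents are invented.

```python
def density_score(text: str, term: str) -> float:
    """Share of the document's words that equal the search term."""
    words = text.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0

# Invented stand-ins: a short product blurb and a long feature article.
blurb = "New ISDN adapter ships " + "soon " * 46   # 50 words, 1 hit
feature = ("ISDN " * 20) + ("word " * 4980)        # 5000 words, 20 hits

print(density_score(blurb, "ISDN"))    # 1 hit in 50 words
print(density_score(feature, "ISDN"))  # 20 hits in 5000 words
```

The feature mentions the term twenty times as often, yet the blurb scores higher because it is a hundred times shorter.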

Because Web search tools typically return document titles (that is, what's between <title> and </title> in the HTML header), you should think carefully about how you construct titles. I got ours partly right, but partly wrong. Here's what I got right: Every title that my HTML generator emits encodes three important pieces of information. The issue (for example, February 1995) tells you the age of the article; the section (such as Reviews, What's New) tells you the type of the article; the title tells you about the article. This combination of clues made the results of my first search implementations much more useful than would have been the case had I used HTML document titles alone. It also enabled the refinement I introduced a few months later. I tweaked the search scripts to pick out the issue names, sort on them, and group the search results in reverse-chronological order.
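
A sketch of that refinement, in Python rather than the original search scripts. It assumes titles are laid out as "issue / section / title", which is a guess at the format, and the first hit below is a made-up title. Month names are parsed with the default English locale.

```python
from datetime import datetime

def issue_date(title: str) -> datetime:
    """Parse the leading 'Month YYYY' issue name out of a document title."""
    issue = title.split(" / ", 1)[0]
    return datetime.strptime(issue, "%B %Y")

# Hypothetical search hits; the ' / ' separator is an assumed title layout.
hits = [
    "February 1995 / Reviews / Example Review",
    "July 1996 / Web Project / Lessons Learned",
    "September 1995 / Web Project / Web Search",
]

# Newest issue first.
ordered = sorted(hits, key=issue_date, reverse=True)
for title in ordered:
    print(title)
```

Sorting on a parsed date rather than on the raw string matters here, since month names do not sort chronologically as text.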

Here's what I got wrong: I forgot to include a fourth item, BYTE Magazine, in our titles. Why is that needed? On the BYTE Site, it isn't. Obviously, any search results gleaned from there refer to BYTE articles. But that's not so obvious when you view search results on Alta Vista or Open Text, where hits on the BYTE archive intermingle with hits from many other sources. The lesson here is subtle but profound. The Web punishes parochialism. You have to think globally. Your site isn't an island; it's part of a self-organizing federation of sites. Software components succeed by exposing clean interfaces to "the outside world." So, increasingly, will Web sites.

Use Full-Text Search with Fielded Search

Responding to my "Web Search" column, Ulrich Pfeifer alerted me to a derivative of freeWAIS, called freeWAIS-sf, which he maintains at http://ls6-www.informatik.uni-dortmund.de/projects/freeWAIS-sf/. With freeWAIS-sf, you can index fields within documents. I've prototyped a version of VPR, for example, in which you can search for occurrences of Borland in the company field of each document--a much more precise search than for Borland anywhere in the text. Moreover, you can combine full-text and fielded search to answer questions like "Which VPR documents from Borland, Microsoft, or Symantec mention Java compilers?"
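
The combined query can be pictured with plain Python data structures. This sketch does not use freeWAIS-sf or its query syntax; the records and the search function are hypothetical and only show how a field constraint narrows a full-text match.

```python
# Hypothetical VPR records: a 'company' field plus free text.
records = [
    {"company": "Borland",  "text": "A new Java compiler for Windows."},
    {"company": "Borland",  "text": "Updated database tools."},
    {"company": "Symantec", "text": "Our Java compiler is now faster."},
    {"company": "Acme",     "text": "Borland-compatible Java compiler add-on."},
]

def search(records, companies, phrase):
    """Keep records whose company field is listed AND whose text has the phrase."""
    wanted = {c.lower() for c in companies}
    phrase = phrase.lower()
    return [r for r in records
            if r["company"].lower() in wanted and phrase in r["text"].lower()]

matches = search(records, ["Borland", "Microsoft", "Symantec"], "java compiler")
for r in matches:
    print(r["company"], "-", r["text"])
```

The Acme record mentions Borland in its text but is excluded, which is the precision gain a company field provides over searching the whole document.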

I admit I've struggled a bit with freeWAIS-sf. I never was able to compile it successfully on our BSD/OS system. However, I did get it working on an SGI Indy. And there's a prebuilt binary available for the Linux system I just added to our server farm. The method for defining field indexes is obscure, but the idea behind the method is brilliant. You use regular expressions to describe regions of documents for which indexes should be built (see "Defining and Using Field Indexes with freeWAIS-sf"). This is an incredibly powerful technology. I hope to get a lot of mileage out of freeWAIS-sf. I'm also on the lookout for commercial products that retain its flexibility but are easier to use, such as the forthcoming Web-enabled version of InMagic's DB/TextWorks, called DB/Text WebServer (http://www.inmagic.com).
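
The general idea of describing an indexable region with a regular expression can be sketched in a few lines of Python. The pattern below is illustrative only and is not freeWAIS-sf's configuration format; it assumes the company name sits in a META tag, and the documents are invented.

```python
import re
from collections import defaultdict

# Assumed layout: the company name is stored in a META tag.
# This pattern is illustrative, not freeWAIS-sf's own field definition syntax.
COMPANY_REGION = re.compile(r'<meta name="company" content="([^"]*)">', re.IGNORECASE)

docs = {
    "a.htm": '<meta name="company" content="Borland"> Borland ships a compiler.',
    "b.htm": '<meta name="company" content="Acme"> Works with Borland tools.',
}

# Index only the words found inside the region the pattern describes.
company_index = defaultdict(set)
for name, text in docs.items():
    for region in COMPANY_REGION.findall(text):
        for word in region.lower().split():
            company_index[word].add(name)

print(sorted(company_index["borland"]))
```

Changing the pattern changes what counts as a field, with no change to the documents themselves.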

A Perl Apprenticeship

"There's more than one way to do it," Perl hackers say. Because Perl is such a forgiving language, I was able to build useful Common Gateway Interface (CGI) applications almost immediately. But these early efforts were hardly models of Perl style. When I began to publish these scripts, Perl gurus kindly showed me how to improve them (see "Feedback from a Perl Master" ). Here are a few observations of my own:

1) Always localize variables. Failure to do this led to the worst Perl bug I've inflicted on myself.

2) Decompose regular expressions into named subexpressions. The regular-expression syntax that Perl shares with many other Unix-derived tools is as subtle as it is powerful. Learn to decompose patterns into subparts. Test these individually, then assemble composite patterns incrementally, testing at each stage. Name the subparts so that the final composite expression makes sense.

Failure to properly decompose patterns leads to ugly results. Consider the pattern I showed in the "Web Search" column; it recognizes URL-like expressions in order to wrap HTML link syntax around them. It was easy, I thought. Look for a protocol prefix (http://, ftp://) followed by any string of characters not illegal in a URL. But that wasn't quite right. What happens when a URL appears at the end of a sentence? The final period, which matches as part of the pattern, invalidates the link. I'll spare you the contortions I went through and the silly solution I came up with. Suffice it to say that Earl Hood's MHonArc (ehood@convex.com), an e-mail-to-Web converter, showed me the answer. There should really be three subparts: a protocol prefix, a string of characters legal in a URL, and (crucially) a single URL-terminating character from a set similar to the second subpattern but lacking the period.
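
The three-subpart structure might look like this in Python. It is a sketch, not MHonArc's pattern or the column's Perl: the character sets are assumptions chosen for the example, and the URL in the sample sentence is hypothetical.

```python
import re

# Named subparts, tested separately and then assembled.
PROTOCOL = r"(?:http|ftp)://"
URL_CHAR = r"[A-Za-z0-9\-._~/?#%&=+:@]"   # assumed set of URL-legal characters
URL_END = r"[A-Za-z0-9\-_~/?#%&=+:@]"     # the same set, minus the period

URL = re.compile(f"{PROTOCOL}{URL_CHAR}*{URL_END}")

def activate(text: str) -> str:
    """Wrap HTML link syntax around URL-like strings."""
    return URL.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)

print(activate("See http://www.byte.com/art/index.htm. Then read on."))
```

The greedy middle part first swallows the trailing period, but the final subpart cannot match it, so the matcher backs off by one character and the period stays outside the link.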

3) Don't be afraid of Perl 5. I was, at first. I found Perl 4's syntax daunting enough, and figured I should master it before tackling Perl 5's nested data structures and object orientation. That wasn't a bad idea in general, but it did lead to some unnecessary habits that I could have avoided. Most notably, I adopted a questionable technique for passing data among subroutines. The routine that parses VPR entries, for example, returns two lists. One list enumerates errors that the user must fix, the other warns of problems that the user may fix or ignore. But the routine couldn't return a list of these two lists. Why not? Perl flattens concatenated lists. Instead of a two-element list (an errors list and a warnings list), what came back was a single, undifferentiated list.

I solved the problem by injecting a sentinel character (I used the tilde) between the two lists. This approach, which I went on to use extensively, is workable but inelegant. You have to ensure that the sentinel never appears in any list element. And you need a different sentinel for each level of nesting. Far better to use Perl 5 to create references to lists (and also to associative arrays). That way, a function can return a true list of lists to a calling function, which can then unpack those lists in a standard way.
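
The contrast between the two approaches can be shown in Python, with the caveat that Python lists nest by default, so the flattening is imitated here with concatenation. In Perl 5 the nested version would use list references. The entry fields and messages are invented.

```python
SENTINEL = "~"  # the tilde separator described above

def check(entry):
    """Hypothetical validation: one possible error, one possible warning."""
    errors = [] if entry.get("company") else ["missing company"]
    warnings = [] if entry.get("url") else ["no product URL"]
    return errors, warnings

def parse_flat(entry):
    """Sentinel style: one flat list, split by a marker element."""
    errors, warnings = check(entry)
    return errors + [SENTINEL] + warnings

def parse_nested(entry):
    """Nested style: a true two-element list of lists."""
    return list(check(entry))

entry = {"product": "Widget"}

flat = parse_flat(entry)
split_at = flat.index(SENTINEL)   # wrong if "~" ever appears in a message
errors_a, warnings_a = flat[:split_at], flat[split_at + 1:]

errors_b, warnings_b = parse_nested(entry)   # unpacks directly
print(errors_a == errors_b, warnings_a == warnings_b)
```

Both calls recover the same data, but only the nested form survives a message that happens to contain the sentinel.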

Avoid These Mistakes!

It's worth repeating: The Web punishes parochialism. Here are three mistakes that all resulted from my failure to grasp the big picture:

1) Don't hard-code IP addresses. The VPR archive lives on an auxiliary server. When I wrote the script that builds views of that archive, I hard-coded that server's address. Avoiding a DNS lookup can save a little time, and since I regenerate the views daily, switching addresses wouldn't be a problem, right? Wrong. Next week, I'm moving the server to our new T-1 link, where it will have a different address. And, yes, I can and will rebuild the views. But I can't rebuild the Alta Vista index that refers to VPR URLs at a soon-to-be-invalid address. Oops.

2) Avoid NTisms. Christopher Wanko wrote to alert me to the use of backslashes in the URLs I was creating on our search-results pages. Why the backslashes? On NT, freeWAIS returns NT-style path names, which I was foolishly pasting onto "http://www.byte.com" to form URLs. This generally worked, because most NT Web servers treat forward and backward slashes identically. Since my server could handle the mixed format, there wasn't a problem, right? Wrong. Chris was accessing our site from behind a Unix proxy server that barfed when it saw the hybrid URL. Oops.

3) Quote all URLs. You're supposed to do this: <a href="/file.htm">. Instead, I got into the habit of doing this: <a href=/file.htm>. None of the browsers I've tried cares one way or another about the latter quoteless style, so it's no problem, right? Wrong. A reader wrote to tell me that his company's firewall couldn't handle the quoteless URLs. Oops again. Don't lose sight of the big picture.
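
Mistakes 2 and 3 can both be avoided at the point where links are emitted. The Python sketch below is one way to do that, not the site's actual script: the helper names and the sample path are hypothetical.

```python
from html import escape
from urllib.parse import quote

BASE = "http://www.byte.com"

def path_to_url(nt_path: str) -> str:
    """Turn an NT-style path into a URL that uses forward slashes only."""
    parts = [p for p in nt_path.replace("\\", "/").split("/") if p]
    return BASE + "/" + "/".join(quote(p) for p in parts)

def link(href: str, label: str) -> str:
    """Emit an anchor whose href is always double-quoted and escaped."""
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'

# A made-up path of the kind a search tool on NT might return.
result = link(path_to_url(r"\art\9607\sec5\art1.htm"), "Lessons Learned")
print(result)
```

Routing every generated link through one function means a fix like this reaches all pages on the next rebuild.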

As the BYTE Site continues to evolve in its second year, I'll build on the lessons I learned in the first year, make new, and perhaps even more-enlightening, mistakes in the future, and continue to learn together with BYTE readers.


TOOLWATCH


iodbc  (freeware)

FFE Software
Phone:    (510) 232-6800


BOOKNOTE

Java in a Nutshell by David Flanagan

O'Reilly and Associates
Internet: http://www.ora.com/

Price:    $19.95

O'Reilly is famous for definitive technical references. Here's another reason why.


Defining and Using Field Indexes with freeWAIS-sf (illustration, 8 Kbytes)

Feedback from a Perl Master (illustration, 6 Kbytes)

Java in a Nutshell (photo, 16 Kbytes)


Jon Udell is BYTE's executive editor for new media. You can contact him by sending e-mail to judell@bix.com.

Copyright © 2005 CMP Media LLC