uctured and semistructured. These data types require different tools and techniques.
For instance, if the question is "What percentage of European respondents uses ISDN both at work and at home?," then a SQL query will give you the answer. But if the question is "What do respondents say about the experience of buying and installing ISDN equipment?," then there's no quantifiable answer. You just need to turn a bunch of anecdotal information (i.e., text) into a navigable and searchable archive (i.e., a textbase) that you can conveniently review.
Last October's column left this part of the process as an exercise for the reader. It's now time to fill in the blanks.
Form2db Revisited
I continue to get a good deal of mileage out of this handy script. To recap briefly, form2db reads slightly modified HTML fo
rms. If the normal coding for a text-input field is
<input name=email type=text size=60>
then the modified version will look like this:
<input name=email type=text dbtype=char(60)>
These fake
dbtype
tags serve two purposes. First, they tell form2db to add processing code for this field to the CGI script that it creates. (This is the script that receives the form's data when you click on Submit and that stores the data to a structured text file.) Second, these tags tell form2db to add a data definition statement to the SQL script that it creates. (This is the script you run to create the database table that will store the form's structured data.)
HTML forms refer to the scripts that process them in this way:
<form action="/cgi-bin/isdn.pl">
The form2db script uses the name of the script in three ways: 1) To write the actual file isdn.pl, which contains the form's handler; 2) To instruct isdn.pl to store record
s in a subdirectory called isdn; and 3) To instruct isdn.sql to create a SQL table called isdn.
In the isdn subdirectory, isdn.pl stores a series of files with names like 000001.REC. Inside each of these files is an ASCII-ized representation of a Perl associative array (aka hash, Smalltalk dictionary, or Java hashtable). Here's an example:
%record = (
'name','Jon Udell',
'email','jon_u@dev5.byte.com',
'uses_isdn','no')
Text in this format has an interesting property: It's also Perl code. If in a subsequent Perl script you evaluate the expression
do '000001.REC';
then the result will be an in-memory hash, called
%record
, containing the keys and values shown above. From there, you can go in either direction -- it's easy to process a collection of these hashes into a SQL database, or into a textbase, or both.
There's a crucial security issue lurking under the surface here. You've got to make sure that the values of the hash -- what use
rs type into the fields of your form -- will not be evaluated as Perl expressions. I got this wrong initially, and I was very grateful that alert reader Dov Grobgeld pointed out my error (see the sidebar "A Subtle Perl Security Pitfall").
Building the Textbase
So, how can you transform those hunks of Perl
code into
a textbase? I wrote the textbase.pl program to perform this job. You point it at a collection of .REC files and an HTML template. As is true of the form that kicks off this whole process, the template can be as plain or as fancy as you like. You can create it with either a text editor or a WYSIWYG HTML editor. What links the template to the data? Wherever you want a field to appear, you insert a placeholder like this:
name="email"
When you run textbase.pl, it first absorbs the template and then (like a mail-merge program) runs through each of the .REC files in turn, substituting values for the placeholder. Thus, 000001.REC becomes 000001
.HTM, a Web page that reflects whatever styling you've applied to the template.
What if you want a new style? Just alter the template. The next time that textbase.pl runs, it upgrades the entire set of Web pages to the new style. When does textbase.pl run? That depends on how "hot" your users require the textbase to be. Some applications require nothing more than a scheduled daily update. Others require near-real-time feedback, in which case I spawn textbase.pl in a background process whenever somebody submits a new record.
This approach can't really be called dynamic HTML. Nor is it static. It's really a hybrid of those two methods, with some of the advantages of each. Because a process dynamically regenerates the textbase, you can propagate style changes uniformly across the data set. Because users fetch static pages, there's none of the run-time overhead or complexity that comes with on-demand page generation.
Note that the template can act like a filter. You don't need to include all the
fields from the form in the template. In fact, I typically send one subset of the form's fields over to a SQL database for relational analysis, and another subset to a textbase for interactive review.
Indexing the Textbase
Now you've got a pile of .HTM files, one per record. How to present them? Textbase.pl will also generate one or more index pages. For the ISDN survey, let's say you want to index by name and by country. A piece of the name index might look like this:
<br><a href="000002.htm">
Germany......Mauth, Rainer</a>
<br><a href="000001.htm">
US...........Udell, Jon</a>
As it processes records, textbase.pl builds a hash for each index you require. The keys of each index hash are just strings that begin with the primary key for that index (say, country), and then pad to some fixed width, and then tack on whatever other fields should appear on that index page. The values of each index hash are the generated HTML
filenames. After running through all the records, textbase.pl sorts the keys of the index hashes (a trivial operation in Perl) and then emits lines of HTML, as shown earlier.
This is data management at its most primal. Where is the database? There isn't one. Databases are great; don't get me wrong. But they can be overkill for simple applications that just need to collect and present a few thousand records. And when those records are text-heavy, conventional databases often aren't the best tool for the job.
Printing the Textbase
The May column's Toolwatch featured Canon's WebRecord, one answer to the surprisingly thorny problem of printing a collection of Web pages. In textbase.pl I provide another kind of answer. I decided that I would like to able to turn a Web textbase into a Microsoft Word document, complete with page numbers and a table of contents.
I thought I would create a target Word document, export it as Rich Text Format (RTF), and then teach textbase.pl to emit records
in that format. This was a dismal failure. RTF is nasty stuff -- hard to read, hard to write.
Here's the solution I came up with. I wrote a Visual Basic for Applications (VBA) module with the following routines:
DoTable
- insert a table into a document
DoCell
- insert a value into a table
DoText
- insert a chunk of free text
DoPage
- build a page using a series of
DoTable and DoText calls
DoPages
- call DoPage many times
Here's a sample VBA declaration of
DoPage
:
Sub DoPage(name, email)
And here's a sample use of
DoPage
:
Call DoPage(
"Jon Udell", "jon@byte.com")
This VBA code is another kind of template. To adapt it to a particular textbase, you declare a version of
DoPage
with a list of arguments corresponding to the fields of that textbase. Then you write an implementation of
DoPage
that emits the values of those fields as a series of tables or
text chunks.
Of course, if your textbase has 1000 records, each containing 37 fields, this is not a VBA program that any sane person would ever want to write. Fortunately, there's no need to. Perl can write it for you. That's just what textbase.pl does. Actually, it only does the hard part -- the declaration of
DoPage
, plus a bunch of parameterized calls (one per textbase record) to that routine. I incorporate this output into a VBA skeleton for my textbase, run the VBA macro, and voilà! Word builds a catalog automatically.
This technique underscores yet again the protean power of Perl. As USENIX president Andrew Hume likes to say, "Programs that write programs are the happiest programs in the world." Perl's a great language in which to write programs that write programs.
I've also learned a few things about the limitations of VBA. On my first try, textbase.pl simply wrote out
DoPage
calls once per line. That was fine for simple text-bases. But when I tried it on a
big textbase with many dozens of fields per record, I ran smack into two hard limits. First, you can't pass more than 63 arguments on a VBA call. Second, you can't hand the VBA interpreter a line that's longer than 1023 characters.
Oops. So much for taking the simple approach. Perl doesn't suffer from either of these limitations -- you can pass ridiculous numbers of arguments to functions and create insanely large strings -- but to accommodate VBA, I had to decompose everything.
The final solution that I came up with passes the data to VBA in files. Not just a file per record, either. I had to go all the way to file-per-field granularity. It's ugly, but it works.
Textbase.pl isn't (yet) a user-friendly, turnkey solution. But it's an inexpensive, portable, lightweight, effective, and extensible way to manage text data on the Web or your intranet.
TOOLWATCH
RegClean
Internet:
http://www.microsoft.com/kb/softlib/mslfiles/RegCln41.exe