Helpful techniques for Web-based data collection and analysis.
Jon Udell
The BYTE Site is, among other things, a giant survey application. Each of its 6000+ archive pages presents a link to a feedback form. Or rather, as I explained last October (
http://www.byte.com/art/9510/sec9/art1.htm
), to a script that generates a form that's customized for each article. Recently, I began harvesting this data to answer questions like "How highly do site visitors rate State of the Art articles?" and "How often do site visitors say they read BYTE magazine?" We've also run Web surveys to ask visitors about their expe
riences with ISDN, assessment of uninterruptible power supplies (UPSes), and OS preferences.
) or an equivalent library, you're given an in-memory structure of this type. Unfortunately, Perl lacks a primitive function to ASCII-ize such structures, but it's not difficult:
foreach $f (keys %record) {
s/\"/\\"/g;
print RECORD
"'".$f."\',\"".$in{$f}."\",\n";}
As long as you take care to convert double-quote to backslash-double-quote, as shown here, this technique handles multiline text fields nicely.
If the file handle
RECORD
maps to the file
0001.REC
, a subsequent Perl script can parse and reconstitute the
%record
array with the single statement
do '0001.REC';
What's the point of all this? It's now trivial to write Perl scripts that transform collections of such files into a variety of database import formats or directly to an HTML textbase. For relational analysis, sqlload.pl (
http://www.byte.com/art/download/textbase.zip
) produces a SQL load file containing a bunch of
INSERT INTO
statements. For textual analysis, I use variants of another script that builds a simple, navigable Web archive.
The Tool
Tools that marry Web forms to databases tend to assume, reasonably enough, that you're acquiring data directly into a database. They typically use templatized HTML forms containing triggers that read or write database fields. For a survey, however, it's convenient to separate data colle
ction from database import -- particularly when you're building both relational and textual databases from a common data set. Thus, there's no reason to use a template-oriented tool such as Cold Fusion or the Microsoft Internet Information Server (IIS) Internet Database Connector (IDC).
What's more, I've found that these products don't simplify database bookkeeping to the degree I'd like. They assume that you'll build both an HTML template and a corresponding database schema. You end up with two sets of field names that you have to maintain in sync. It's not a big deal, but life's short and I'm lazy, so I wrote form2db.pl (
http://www.byte.com/art/download/textbase.zip
) to generate a database schema automatically from an HTML form. This Perl script relies on the fact that browsers will qui
te happily ignore user-defined HTML attributes. For example, browsers render the input text box described by the following code:
The
dbtype
attribute, which I simply invented, means nothing to browsers. However, when I write forms using this attribute, they can double as database schemata. Form2db.pl parses these enhance forms and emits a
SQL CREATE
statement. Do you have to abandon your HTML editor if you go this route? Not if it's a smart one that knows how to preserve user-defined HTML. Adobe, for example, says PageMill 2.0 will do this.
There's still more mileage to be gotten out of form2db.pl. I've said that I store each record initially as an ASCII-ized Perl associative array. That implies a CGI script, wired to the form, that writes the ASCII file. It's a simple CGI script -- so simple, in fact, that form2db.pl can create it
automatically. When I wrote the form for our OS survey, for example, I pretended that the script os.pl already existed:
<form action="os.pl">
When form2db.pl reads this form, it makes the imaginary script real. The single occurrence of the name os in the form tag drives several related processes. It becomes the name of the table created by form2db.pl's SQL CREATE output, the name of the file containing that output, and the name of the subdirectory in which os.pl deposits records. The figure
"One Form, Many Uses"
summarizes these interactions.
The Analysis
In July's ToolWatch, I mentioned iodbc (
ftp://ftp.digex.net/pub/access/psii/iodbc.zip
), a command-line interface to the Open Database Connectivity (ODBC) subsystem on Windows 95 and NT. It's the thinnest-imaginable ODBC wrapper, and therein lies its strength. To load the OS survey da
ta into a database, I used ODBC Administrator to create a new data source called OS. (The driver was MS Jet 3.0 and the format was .mdb, but I could have used any database supported by ODBC.) I then issued one command to execute the
SQL CREATE
code written by form2db.pl:
iodbc -S OS < os.sql
and another to execute the
SQL INSERT INTO
code written by sqlload.pl:
iodbc -S OS < os.lod
Then I launched iodbc in interactive mode to begin exploring the data set:
iodbc -S os
1> select count(*) from os
This approach makes me a knuckle-scraping Neanderthal or an avant-garde minimalist, depending on your perspective. I see it the latter way, because I find that operating a full-blown wizard-equipped GUI database can take more time and effort than just typing the small bits of SQL you need. What's more, as you write those SQL statements, you discover patterns -- that is, opportunities to parameterize and automate SQL queries.
Perl coupled with iodbc is one way to exploit those opportunities, but it's awkward. Perl has to write SQL statements to a file, invoke iodbc on that file, and then parse iodbc's output. A better solution is odbc.pm, a
Perl 5 module
that makes ODBC SqlExecute and SqlFetch calls directly available to Perl programs. This has two major advantages: The SQL code doesn't have to take a trip through the file system, and its output comes back neatly chunked by row and column.
These two methods are complementary. I use iodbc when first exploring a data set and odbc.pm to codify the repeatable patterns that emerge from that exploration. Today, all this happens on my NT systems only because, while ODBC itself is available for Unix, iodbc and odbc.pm are not yet available. But the Perl scripts I'm distributing with this article will work fine on Unix, as will the generic SQL code they produce. If you're in need of a lightweight Unix SQL engine to use in conjunction with these, try msql (
http://www.bunyip.com/
).
The Methodology
What did our OS survey reveal? Nothing of value, I'm afraid. I got so absorbed in the mechanics of Web-based data collection and analysis that I ignored the most fundamental survey precept. BYTE senior editor Tom R. Halfhill puts it succinctly: "You can't let the studied population select itself." Team OS/2, an international band of OS/2 enthusiasts, drove that point home with a vengeance.
Two days into the survey, analysis showed that usage of the Mac OS, OS/2, Unix, and various Windows flavors -- on desktops and servers -- was comparable to what many other sources have reported. A few days later, Team OS/2 struck, and OS/2's numbers soared. An Alta Vista search of the Usenet uncovered one cause of the surge -- a posting to comp.os.os2.advocacy, whic
h contained the uniform resource locator (URL) of the survey page.
We reported the results in last month's Bits section on page 32. However, please don't quote the three pie charts, which show OS/2's dominance, without also quoting the portion of text showing that an OS survey response was 12 times more likely to come from the Internet domains ibm.com and ibm.net than was a typical BYTE Site visit.
We'll continue to run surveys on The BYTE Site. The anecdotal information we've gathered, for example from the ISDN survey, seems valuable as an indicator of trends and opinions. Quantitative data may have some limited value as well, on subjects charged with less religious fervor. However, we won't put much stock in the numbers until we can invite true random samples of participants, probably from a (yet-to-be-developed) site-registration database. For that valuable lesson learned, we have Team OS/2 to thank.
BYTE Digest editors every month analyze and evaluate the best articles from Information Week, EE Times, Dr. Dobb's Journal, Network Computing, Sys Admin,
and dozens of other CMP publications—bringing
you critical news and information about wireless communication,
computer security, software development, embedded systems,
and more!