A special file tells Web catalogs and search engines to exclude specific pages on a Web site.
Tonya Engst
A visit to a Web catalog, such as Yahoo, or a search engine, such as Digital Equipment's AltaVista, makes you wonder how these sites track such enormous collections of Web pages. As a Web administrator, you might be surprised at the number of pages from your own Web server that are referenced by such sites. Although catalog sites often employ humans to verify and classify Web pages, many of them also harvest and maintain vast quantities of information through the use of automated programs called robots.
Robots typically start with a page of links and recursively follow all the links from that initial page. The robot itself doesn't traverse the Web; it merely requests pages from sites pointed to by the links. A robot's common starting points include sites that catalog the most popular sites within a topic, lists of links resulting from scans of Usenet postings or mailing-list archives, or lists of URLs submitted manually.
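The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SITE_LINKS table is a hypothetical in-memory stand-in for pages that a real robot would request over the network, and the traversal uses a queue rather than literal recursion, which visits the same set of pages.

```python
from collections import deque

# Hypothetical in-memory "Web": each page maps to the links it contains.
# A real robot would request each page over HTTP and parse out its links.
SITE_LINKS = {
    "/index.html": ["/about.html", "/clubs/index.html"],
    "/about.html": ["/index.html"],
    "/clubs/index.html": ["/clubs/chess.html", "/about.html"],
    "/clubs/chess.html": [],
}

def crawl(start, links):
    """Return pages in the order a breadth-first robot would request them."""
    seen = {start}          # pages already queued, so loops are not followed twice
    queue = deque([start])  # pages waiting to be requested
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

visited = crawl("/index.html", SITE_LINKS)
```

Because every page is requested at most once, the link from about.html back to index.html does not trap the robot in a loop.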
For each requested page, the robot records salient information. For example, it might scan for keywords in a META tag or look at the page's title. At other times it might record the first paragraph or so on a page or parse an entire page for keywords.
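As a rough sketch of this kind of record keeping, the following Python uses the standard html.parser module to pull the title and any META keywords out of a page. The PageSummary class and the sample page are hypothetical, and actual robots differ in what they choose to record.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collects a page's title and META keywords (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Made-up page used only for illustration.
SAMPLE_PAGE = """<HTML><HEAD>
<META NAME="keywords" CONTENT="chess, clubs, committees">
<TITLE>Chess Club Home</TITLE>
</HEAD><BODY>Welcome.</BODY></HTML>"""

summary = PageSummary()
summary.feed(SAMPLE_PAGE)
summary.close()
```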
Although robots serve the useful purpose of adding sites to Web-search sites, they can also overwhelm a server's resources by barraging a site with multiple requests. Further, they might record Web pages that you don't want to appear in Web-search sites: Such pages may not be quite ready for the public eye (perhaps they're still under construction) or may constitute illogical site-entry points. In other cases, pages might contain information that's not private enough to place behind a password system yet not public enough to make readily available to Jane or Joe Websurfer. For example, clubs might place committee contact information on-line, yet pages of this type don't necessarily belong in generic worldwide search sites.
A Standard to the Rescue
Fortunately, robots can be steered away from certain pages through the use of the Standard for Robots Exclusion (SRE). Initiated by Martijn Koster in 1994, this informal standard makes it easy for Web administrators to exclude certain portions of a site from a robot's examination. Although many robots respect the SRE, it's not a formal standard, nor is it enforced.
Koster has created an Internet Draft of the SRE and plans to submit it to the Internet Engineering Task Force (IETF) for further discussion and standardization. Further information about robots and the SRE is available from Koster's The Web Robots Pages, at http://info.webcrawler.com/mak/projects/robots/robots.html.
Implementing the SRE requires nothing more than the creation of a text file called robots.txt. (Neophyte Web administrators might be puzzled by failed requests for this file in their logs; polite robots request it first to get their directions.) The robots.txt file acts as a guide to your site and highlights areas that robot visitors should avoid, as shown in the figure "How the Standard for Robots Exclusion (SRE) Works". You set up the file with return-delimited records (that is, records separated by blank lines), each containing one User-agent field and at least one Disallow field. (User-agent is jargon for a program, such as a robot, that handles networking tasks.) The SRE is flexible about end-of-line characters, so you need not worry about carriage returns, linefeeds, and the like; simply use whatever is convenient.
The easiest robots.txt file to use is an empty one. A blank robots.txt file stops errors from appearing in your log and tells robots they are free to traverse the entire site. The second-easiest robots.txt file to use contains two lines and displays a no-trespassing sign for all robots:
User-agent: *
Disallow: /
The asterisk in the first line serves as a token to indicate all robots; the second line disallows the entire site. Although the asterisk in the User-agent field acts like a wild card for all robots, you cannot use wild-card characters in any other way within robots.txt.
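To see how a robot might interpret these two simplest files, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The robot name examplebot and the load_rules helper are hypothetical; the parser is one implementation of the SRE, not a description of how any particular robot works.

```python
from urllib.robotparser import RobotFileParser

def load_rules(text):
    """Parse robots.txt content supplied as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# An empty file: every robot may visit every page.
open_site = load_rules("")

# The two-line no-trespassing file: every robot is excluded from everything.
closed_site = load_rules("User-agent: *\nDisallow: /\n")

open_ok = open_site.can_fetch("examplebot", "/index.html")
closed_ok = closed_site.can_fetch("examplebot", "/index.html")
```

With these inputs, open_ok is True and closed_ok is False.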
The listing "Deluxe robots.txt File" uses multiple records to accomplish several tasks. First, it asks a particular program, roguebot, to stay out of the entire site. You might use a record like this if a program has a habit of rapidly hitting multiple pages on your site, effectively shutting down the site during that time. Second, it gives another program, in this case helpbot, complete site access.
Some real-world sample robot names include ibm, for IBM_Planetwide, which indexes and mirrors IBM-owned domains; and webcrawler, which builds the database search service owned by America Online. Identities of a variety of robots can be found at http://info.webcrawler.com/mak/projects/robots/active.html.
Finally, the third record in the listing uses a series of Disallow fields to restrict all other robots from certain portions of the site. A Disallow field indicates that robots should avoid all relative URLs that begin with a specified character string. In this example, robots are excluded from all pages and directories that sit on the main level and have names beginning with the word private. For instance, all pages and subdirectories in directories called private and private1 would be off-limits, as would a file named private.html. Additionally, personal.html in the sharon directory is off-limits, as are all pages that are located in the sam directory.
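The per-robot records and the begins-with matching described above can be checked with a short sketch. It feeds the three records to Python's standard urllib.robotparser module; the robot name otherbot and the sample paths are hypothetical, chosen only to exercise each rule.

```python
from urllib.robotparser import RobotFileParser

# The three records discussed above, separated by blank lines.
RULES = """\
User-agent: roguebot
Disallow: /

User-agent: helpbot
Disallow:

User-agent: *
Disallow: /private
Disallow: /~sharon/personal.html
Disallow: /~sam/ #marks sam with a no-trespassing sign
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(agent, path):
    """True if the named robot may request the given relative URL."""
    return parser.can_fetch(agent, path)

results = {
    "roguebot anywhere": allowed("roguebot", "/index.html"),
    "helpbot private": allowed("helpbot", "/private/minutes.html"),
    "other private dir": allowed("otherbot", "/private1/notes.html"),
    "other private file": allowed("otherbot", "/private.html"),
    "other sharon personal": allowed("otherbot", "/~sharon/personal.html"),
    "other sharon index": allowed("otherbot", "/~sharon/index.html"),
    "other sam": allowed("otherbot", "/~sam/photos.html"),
    "other public": allowed("otherbot", "/public/index.html"),
}
```

Note that /private matches /private1/notes.html and /private.html alike, because the comparison is a plain prefix test on the URL, not a directory lookup.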
To make a comment, such as the one in the last line of the listing, precede it with a # character. Comments can also go on lines by themselves. You must place the finished robots.txt file in the root, or top-level, directory of your Web site; robots ignore robots.txt files located elsewhere. In the future, as ideas in the SRE's Internet Draft are solidified, robots might recognize an Allow field (which would act as an opposite to Disallow) within robots.txt records.
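Because the file must sit at the top level, a robot can work out where to look from any page address on the site. The following sketch shows one way to do that with Python's standard urllib.parse module; the function name and the example.com address are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the top-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

location = robots_txt_url("http://www.example.com/~sam/notes/index.html")
```

Here location is http://www.example.com/robots.txt; a robots.txt file placed inside the ~sam directory would never be requested.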
If you cannot modify the robots.txt file for your Web server and can't persuade anyone to modify robots.txt for you, you must resort to stating privacy requests in the META tags of individual Web pages. Such META-tag requests are not as commonly honored by robots, but they're still worth implementing.
As you are probably already aware, a META tag goes in the head portion of an HTML document, as shown in the listing "META Tags Control Access". In this listing, the META tag includes the attribute NAME="ROBOTS", as well as a CONTENT attribute. In addition, CONTENT has the comma-delimited values NOINDEX and NOFOLLOW. The first value tells robots not to record information about the page; the second indicates that robots should not follow links on the page.
For instance, given a site where you want search sites to include the home page but not any subpages, you would use the META tag <META NAME="ROBOTS" CONTENT="NOFOLLOW"> on the home page. On other pages, you could go for a full-fledged <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, but <META NAME="ROBOTS" CONTENT="NOINDEX"> would get the job done.
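A robot that honors these tags has to read them before deciding what to do with a page. The sketch below, built on Python's standard html.parser module, shows one plausible way to turn the CONTENT values into two yes-or-no decisions. The RobotsMeta class and page_policy function are hypothetical names, not part of any robot's actual code.

```python
from html.parser import HTMLParser

class RobotsMeta(HTMLParser):
    """Collects the values of any <META NAME="ROBOTS"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Values are comma-delimited; compare them without regard to case.
            self.directives.update(
                v.strip().lower() for v in content.split(",") if v.strip()
            )

def page_policy(html):
    """Return (may_index, may_follow) for a page's HTML source."""
    parser = RobotsMeta()
    parser.feed(html)
    parser.close()
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)

home = page_policy(
    '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOFOLLOW"></HEAD><BODY></BODY></HTML>')
subpage = page_policy(
    '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></HEAD><BODY></BODY></HTML>')
```

For the home page the result is (True, False): the page may be recorded, but its links are not followed. For the subpage it is (False, False).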
Security Issues
How secure is robots.txt? Most above-board robots use robots.txt files and respect their commands. Unsavory or poorly mannered robots might ignore robots.txt altogether. Furthermore, for robots seeking sensitive material, robots.txt can actually point them to where it's located on a Web site. Truly secret pages should be protected by traditional security, such as a password system or a firewall.
In the end, of course, data is most secure from robots if it resides on a system not attached to a network. In addition, products such as the $749 DynaMorph, from Morph Technologies (for the Macintosh and Windows), and the $195 NetCloak, from Maxum Development (for Macs), employ server-side extensions to HTML that enable Web administrators to limit access to pages based on an IP address or a User-agent.
The SRE certainly isn't a panacea for securing sensitive information. However, it uses a common-sense technique that keeps inappropriate pages out of many Web-search sites. The SRE can also improve access to your Web site by reducing the traffic generated by large numbers of robot requests. If you haven't yet made a robots.txt file for your Web server, perhaps now would be a good time to start.
Where to Find
Maxum Development Corp.
Streamwood, IL
Phone: 630-830-1113
Fax: 630-830-1262
E-mail: info@maxum.com
Internet: http://www.maxum.com/
Morph Technologies, Inc.
Reston, VA
Phone: 703-716-0677
Fax: 703-716-0691
E-mail: MorphInfo@morphtech.com
Internet: http://www.morphtech.com/
Deluxe robots.txt File
User-agent: roguebot
Disallow: /

User-agent: helpbot
Disallow:

User-agent: *
Disallow: /private
Disallow: /~sharon/personal.html
Disallow: /~sam/ #marks sam with a no-trespassing sign
META Tags Control Access
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>Private Musings and Private Links</TITLE>
</HEAD>
<BODY>
...
</BODY>
</HTML>
Figure: How the Standard for Robots Exclusion (SRE) Works
The SRE steers robot programs away from a Web site's private material.
Tonya Engst works as senior editor for TidBITS, a seven-year-old electronic newsletter focusing on Macintosh and Internet topics. She writes frequently about Web- and HTML-related topics and has written HTML chapters for several editions of Internet Starter Kit (Hayden Books). You can contact her at http://www.tidbits.com/tonya/.