
The Internet Robot's Guide to a Web Site


May 1997 / Core Technologies / The Internet Robot's Guide to a Web Site

A special file tells Web catalogs and search engines to exclude specific pages on a Web site.

Tonya Engst

A visit to a Web catalog, such as Yahoo, or a search engine, such as Digital Equipment's AltaVista, makes you wonder how these sites track such enormous collections of Web pages. As a Web administrator, you might be surprised at the number of pages from your own Web server that are referenced by such sites. Although catalog sites often employ humans to verify and classify Web pages, many of them also harvest and maintain vast quantities of information through the use of automated programs called robots.

Robots typically start with a page of links and recursively follow all the links from that initial page. The robot itself doesn't traverse the Web; it merely requests pages from sites pointed to by the links. A robot's common starting points include sites that catalog the most popular sites within a topic, lists of links resulting from scans of Usenet postings or mailing-list archives, or lists of URLs submitted manually.
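
The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration, not the code of any real robot: the PAGES table is a made-up stand-in for a Web site, and the sketch uses a queue rather than literal recursion so that each page is requested only once.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
PAGES = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}

def crawl(start, fetch_links):
    """Return every page reachable from start, in the order it was requested."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            # Skip pages already queued, so link loops do not cause repeat requests.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/index.html", lambda url: PAGES.get(url, []))
```

A real robot would replace the lambda with a function that requests the page over HTTP and extracts its links.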

For each requested page, the robot records salient information. For example, it might scan for keywords in a META tag or look at the page's title. At other times it might record the first paragraph or so on a page or parse an entire page for keywords.
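
As a rough sketch of that recording step, the following Python uses the standard html.parser module to pull out a page's title and the keywords in a META tag. The class name, the helper function, and the sample page are all invented for this example; real robots differ in what they record.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the TITLE text and the META keywords of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def summarize(html):
    parser = PageSummary()
    parser.feed(html)
    return parser.title.strip(), parser.keywords

# Hypothetical page used only to exercise the sketch.
SAMPLE = (
    "<HTML><HEAD><TITLE>Club Home</TITLE>"
    '<META NAME="keywords" CONTENT="chess, club, newsletter">'
    "</HEAD><BODY>Welcome.</BODY></HTML>"
)
title, keywords = summarize(SAMPLE)
```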

Although robots serve the useful purpose of adding sites to Web-search sites, they can also overwhelm a server's resources by barraging a site with multiple requests. Further, they might record Web pages that you don't want to appear in Web-search sites: Such pages may not be quite ready for the public eye (perhaps they're still under construction) or may constitute illogical site-entry points. In other cases, pages might contain information that's not private enough to place behind a password system but not sufficiently public to make readily available to Jane or Joe Websurfer. For example, clubs might place committee contact information on-line, yet pages of this type don't necessarily belong in generic worldwide search sites.

A Standard to the Rescue

Fortunately, robots can be steered away from certain pages through the use of the Standard for Robots Exclusion (SRE). Initiated by Martijn Koster in 1994, this informal standard makes it easy for Web administrators to exclude certain portions of a site from a robot's examination. Although many robots respect the SRE, it's not a formal standard, nor is it enforced.

Koster has created an Internet Draft of the SRE and plans to submit it to the Internet Engineering Task Force (IETF) for further discussion and standardization. Further information about robots and the SRE is available from Koster's The Web Robots Pages, at http://info.webcrawler.com/mak/projects/robots/robots.html.

Implementing the SRE requires nothing more than the creation of a text file called robots.txt. (Neophyte Web administrators might be puzzled by failed requests for this file, since polite robots attempt to request directions from it.) The robots.txt file acts as a guide to your site and highlights areas that robot visitors should avoid, as shown in the figure "How the Standard for Robots Exclusion (SRE) Works". You set up the file as a series of records separated by blank lines, each containing one User-agent field and at least one Disallow field, with one field per line. (User-agent is jargon for a program, such as a robot, that handles networking tasks.) The SRE is flexible about end-of-line characters, so you need not worry about carriage returns, linefeeds, and the like; simply use whatever is convenient.

The easiest robots.txt file to use is an empty one. A blank robots.txt file stops errors from appearing in your log and tells robots they are free to traverse the entire site. The second-easiest robots.txt file to use contains two lines and displays a no-trespassing sign for all robots:

User-agent: *
Disallow: /

The asterisk in the first line serves as a token to indicate all robots; the second line disallows the entire site. Although the asterisk in the User-agent field acts like a wild card for all robots, you cannot use wild-card characters in any other way within robots.txt.
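
To see how a robot might read these two simplest files, here is a small sketch that uses Python's standard urllib.robotparser module on text held in memory, so nothing is fetched from a network. The robot name anybot and the load_rules helper are invented for the illustration.

```python
from urllib.robotparser import RobotFileParser

def load_rules(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    rules = RobotFileParser()
    rules.parse(text.splitlines())
    return rules

empty_file = load_rules("")
no_trespassing = load_rules("User-agent: *\nDisallow: /\n")

# "anybot" is a made-up robot name used only for this illustration.
open_site = empty_file.can_fetch("anybot", "/index.html")        # empty file: allowed
closed_site = no_trespassing.can_fetch("anybot", "/index.html")  # two-line file: refused
```

With the empty file the parser reports that the page may be fetched; with the two-line file it reports that it may not.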

The listing "Deluxe robots.txt File" uses multiple records to accomplish several tasks. First, it asks a particular program, roguebot, to stay out of the entire site. You might use a record like this if any program has a habit of rapidly hitting multiple pages on your site, effectively shutting down the site during that time. Second, it gives another program, in this case helpbot, complete site access.

Some real-world sample robot names include ibm, for IBM_Planetwide, which indexes and mirrors IBM-owned domains; and webcrawler, which builds the database search service owned by America Online. Identities of a variety of robots can be found at http://info.webcrawler.com/mak/projects/robots/active.html.

Finally, the third record in the listing uses a series of Disallow fields to restrict all other robots from certain portions of the site. A Disallow field indicates that robots should avoid all relative URLs that begin with a specified character string. In this example, robots are excluded from all pages and directories that sit on the main level and have names beginning with the word private. For instance, all pages and subdirectories in directories called private and private1 would be off-limits, as would a file named private.html. Additionally, personal.html in the sharon directory is off-limits, as are all pages that are located in the sam directory.
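
The walk-through of the listing can be replayed with Python's standard urllib.robotparser module. In this sketch the three records are written with a blank line between them; otherbot and the page names are invented to stand for "any other robot" and for typical files on the site.

```python
from urllib.robotparser import RobotFileParser

# The three records from the listing, separated by blank lines.
DELUXE = """\
User-agent: roguebot
Disallow: /

User-agent: helpbot
Disallow:

User-agent: *
Disallow: /private
Disallow: /~sharon/personal.html
Disallow: /~sam/      #marks sam with a no-trespassing sign
"""

rules = RobotFileParser()
rules.parse(DELUXE.splitlines())

def allowed(agent, path):
    """True if the named robot may request the given relative URL."""
    return rules.can_fetch(agent, path)

# Hypothetical requests: (robot name, relative URL).
checks = {
    ("roguebot", "/index.html"): allowed("roguebot", "/index.html"),
    ("helpbot", "/private/plans.html"): allowed("helpbot", "/private/plans.html"),
    ("otherbot", "/private1/notes.html"): allowed("otherbot", "/private1/notes.html"),
    ("otherbot", "/~sharon/index.html"): allowed("otherbot", "/~sharon/index.html"),
}
```

Because Disallow matches on the start of the URL, /private1/notes.html is refused along with everything under /private, while /~sharon/index.html stays available because only personal.html is named in that directory.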

To make a comment, such as the one in the last line of the listing, precede it with a # character. Comments can also go on lines by themselves. You must place the finished robots.txt file in the root, or top-level, directory of your Web site; robots ignore robots.txt files located elsewhere. In the future, as ideas in the SRE's Internet Draft are solidified, robots might recognize an Allow field (which would act as an opposite to Disallow) within robots.txt records.
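
Because the file is looked for only at the top level, a robot can work out where to find it from any page address on the site. The sketch below shows one way to do that with Python's standard urllib.parse module; the function name and the www.example.com address are placeholders, not part of the SRE.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the address of robots.txt for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# www.example.com is a placeholder host used only for illustration.
location = robots_txt_url("http://www.example.com/~sam/notes/index.html?page=2")
```

A robots.txt file stored inside the ~sam directory would never be requested, which is why users without access to the root directory need another approach.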

If you cannot modify the robots.txt file for your Web server and can't persuade anyone to modify robots.txt for you, you must resort to stating privacy requests in the META tags of individual Web pages. Such META-tag requests are not as commonly honored by robots, but they're still worth implementing.

As you are probably already aware, a META tag goes in the head portion of an HTML document, as shown in the listing "META Tags Control Access". In this listing, the META tag includes the attribute NAME="ROBOTS", as well as a CONTENT attribute. In addition, CONTENT has the comma-delimited values NOINDEX and NOFOLLOW. The first value tells robots not to record information about the page; the second indicates that robots should not follow links on the page.

For instance, given a site where you want search sites to include the home page, but not any subpages, on the home page you would use the META tag <META NAME="ROBOTS" CONTENT="NOFOLLOW">. On other pages, you could go for a full-fledged <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, but <META NAME="ROBOTS" CONTENT="NOINDEX"> would get the job done.
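
A robot that honors these tags has to read them before deciding what to do with a page. The following sketch, built on Python's standard html.parser module, shows one possible way to turn the tag into two yes-or-no answers; the class and function names are invented, and the default of "index and follow" when no tag is present is an assumption of this example.

```python
from html.parser import HTMLParser

class RobotsMeta(HTMLParser):
    """Read <META NAME="ROBOTS" ...> and record whether to index and follow."""

    def __init__(self):
        super().__init__()
        self.index = True    # assumed default when no tag is present
        self.follow = True   # assumed default when no tag is present

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() != "robots":
            return
        # CONTENT holds comma-delimited values; compare them case-insensitively.
        values = {v.strip().upper() for v in (attrs.get("content") or "").split(",")}
        if "NOINDEX" in values:
            self.index = False
        if "NOFOLLOW" in values:
            self.follow = False

def robots_policy(html):
    """Return (may_index, may_follow) for one page of HTML."""
    parser = RobotsMeta()
    parser.feed(html)
    return parser.index, parser.follow

home_page = robots_policy('<HEAD><META NAME="ROBOTS" CONTENT="NOFOLLOW"></HEAD>')
sub_page = robots_policy('<HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></HEAD>')
```

For the home page in the example above, the result says the page may be recorded but its links should not be followed; for the subpage, neither is permitted.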

Security Issues

How secure is robots.txt? Most above-board robots use robots.txt files and respect their commands. Unsavory or poorly mannered robots might ignore robots.txt altogether. Furthermore, for robots seeking sensitive material, robots.txt can actually point them to where it's located on a Web site. Truly secret pages should be protected by traditional security, such as a password system or a firewall.

In the end, of course, data is most secure from robots if it resides on a system not attached to a network. In addition, products such as the $749 DynaMorph, from Morph Technologies (for the Macintosh and Windows), and the $195 NetCloak, from Maxum Development (for Macs), employ server-side extensions to HTML that enable Web administrators to limit access to pages based on an IP address or a User-agent.

The SRE certainly isn't a panacea for securing sensitive information. However, it uses a common-sense technique that keeps inappropriate pages out of many Web-search sites. The SRE can also improve access to your Web site by reducing the traffic generated by large numbers of robot requests. If you haven't yet made a robots.txt file for your Web server, perhaps now would be a good time to start.


Where to Find


Maxum Development Corp.

Streamwood, IL
Phone:    630-830-1113
Fax:      630-830-1262
E-mail:   info@maxum.com
Internet: http://www.maxum.com/


Morph Technologies, Inc.

Reston, VA
Phone:    703-716-0677
Fax:      703-716-0691
E-mail:   MorphInfo@morphtech.com
Internet: http://www.morphtech.com/




Deluxe robots.txt File

User-agent: roguebot
Disallow: /

User-agent: helpbot
Disallow:

User-agent: *
Disallow: /private
Disallow: /~sharon/personal.html
Disallow: /~sam/      #marks sam with a no-trespassing sign




META Tags Control Access

<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>Private Musings and Private Links</TITLE>
</HEAD>
<BODY>
...
</BODY>
</HTML>




How the Standard for Robots Exclusion (SRE) Works


The SRE steers robot programs away from a Web site's private material.


Tonya Engst works as senior editor for Tid-BITS, a seven-year-old electronic newsletter focusing on Macintosh and Internet topics. She writes frequently about Web- and HTML-related topics and has written HTML chapters for several editions of Internet Starter Kit (Hayden Books). You can contact her at http://www.tidbits.com/tonya/.
