If you are a webmaster, you are probably already familiar with robots.txt, or have at least heard of it. It is a plain-text file that tells search engine robots which content on your site they should not spider.
A Brief History
Robots (also called wanderers or spiders) are programs that traverse many pages on the World Wide Web by recursively retrieving linked pages.
In 1993 and 1994 there were occasions where robots visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot-specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents showed the need for an established mechanism by which WWW servers can tell robots which parts of the server should not be accessed. The robots exclusion standard addresses this need with an operational solution.
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL “/robots.txt”.
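Because the file always lives at the fixed local URL “/robots.txt”, a robot can work out where to look from any page address on the server. A minimal Python sketch of that lookup follows; the function name robots_url and the host www.example.com are illustrative assumptions, not part of the standard.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the server that hosts page_url.

    robots_url is a hypothetical helper name used only for this sketch.
    """
    # The leading slash makes urljoin drop the page's own path and query,
    # keeping only the scheme and host of the original URL.
    return urljoin(page_url, "/robots.txt")


# www.example.com is a placeholder host.
print(robots_url("http://www.example.com/cyberworld/map/index.html"))
```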
Example of a robots.txt file:
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /foo.html
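To see how a well-behaved robot might interpret these rules, here is a short sketch using the urllib.robotparser module from the Python standard library. The helper is_allowed, the robot name “ExampleBot”, and the test paths are assumptions made for illustration; real robots may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, embedded so that no network access is needed.
RULES = """\
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /foo.html
"""


def is_allowed(rules_text: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules.

    is_allowed is a hypothetical helper written for this sketch.
    """
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file and ignores '#' comments.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


# "ExampleBot" is a made-up robot name; the "*" record applies to it.
for path in ("/foo.html", "/cyberworld/map/index.html", "/index.html"):
    print(path, is_allowed(RULES, "ExampleBot", path))
```

Under these rules the first two paths are refused and /index.html is permitted. Keep in mind that robots.txt is only a request: it relies on the robot choosing to honour it.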
More information can be found at http://www.robotstxt.org, or you can use the Robot Control Code Generation Tool to prepare your own robots.txt.