Search Spiders

Most search engines crawl the web themselves to populate their search databases. To gather information, they send out programs called Robots to scan the web. Robots (also called Spiders, Scooters, Worms, Web Crawlers, and Web Wanderers) automatically gather and index information and then store it in the search engine's database.

Robots basically work like this: someone submits their site to a search engine, and the search engine sends out a Robot to index the submitted site. When the Robot finds a page with links, it follows those links as a source of new URLs to index. Users can then construct queries to search these databases and find information on the submitted site, or on sites that were linked to from that site. Maybe even yours.
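The link-following idea above can be sketched in a few lines of Python. This is only a rough illustration, not how any particular search engine works: the fetch argument and the PAGES dictionary are hypothetical stand-ins for real HTTP requests, and a real robot would also check robots.txt and pause between requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit start_url, then every page reachable by following links.

    fetch is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# A tiny in-memory "web" standing in for real pages (made up for this sketch).
PAGES = {
    "http://mydomain.com/": (
        '<a href="/about.htm">About</a> '
        '<a href="http://other.example/">Other</a>'
    ),
    "http://mydomain.com/about.htm": '<a href="/">Home</a>',
    "http://other.example/": "<p>No links here.</p>",
}

visited = crawl("http://mydomain.com/", PAGES.get)
print(visited)
```

Note how the robot ends up on http://other.example/ even though nobody submitted that site. It got there only because the submitted site linked to it.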

Certain robot implementations can overload networks and servers. This happens especially with robots written by people who are just starting out, or run by hackers who are intentionally trying to cause havoc. At the same time, the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

There may be files or directories on your site that you don't want indexed by a robot. In these cases, you can ask robots to stay out of any directory by creating a robots.txt file for your website. Keep in mind that the file is a request, not a lock: well-behaved robots honor it, but nothing forces a badly behaved one to. The complete Standard for Robot Exclusion can be found on RobotsTxt.org. The following is a basic design for these files.


The robots.txt file must be placed in the root directory of your site, and it must be named exactly robots.txt. Once you've moved the file to your site, test it by requesting it in your browser! example: http://mydomain.com/robots.txt.
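Because the file always lives in the root directory, its address can be worked out from any page on the site. Here is a small Python sketch of that rule. The function name robots_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://mydomain.com/products/dresses/index.htm"))
```

A robots.txt file placed anywhere else, for example in /products/, is simply never requested by a robot.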

A robots.txt file consists of 3 components: the User-agent directive (which names a search engine's robot), the Disallow directive, and comments. You can define multiple sets of directives in your robots.txt file, giving different instructions to different robots. For instance, you might want to allow Google to search a certain directory, but exclude Yahoo from that directory.

User-agent Directive:
The User-agent line identifies the robot that the Disallow lines following it apply to. You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders. If you don't know how to check your logs, or don't want to add a user-agent to your list each time a new one appears, you can also use a wildcard and catch them all.
example:
This line identifies the Google Robot. Use this type of directive to specify different criteria for different search robots.
User-agent: googlebot
This line will identify all robots. Any disallow directive you place after this agent will be applied to all search robots that try to spider your site.
User-agent: *
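If you do want to check your logs, here is a rough Python sketch that counts which user agents asked for robots.txt. It assumes your server writes the common "combined" access log format. The sample lines and bot names below are made up, so adjust the pattern to whatever your own server writes.

```python
import re
from collections import Counter

# Assumed layout: ... "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count requests for /robots.txt per user-agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.11 - - [01/Jan/2024:00:00:02 +0000] "GET /index.htm HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2024:00:05:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.12 - - [01/Jan/2024:00:06:00 +0000] "HEAD /robots.txt HTTP/1.1" 200 0 "-" "OtherBot/2.3"',
]

for agent, hits in robots_txt_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```

In a robots.txt file you would normally use only the short name at the start of the string (the part before the slash), not the full user-agent text.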

Disallow Directive:
These lines specify files and/or directories that the search robot should not spider or download. Each path is written relative to the root of your site and begins with a slash. At least one Disallow line must be present for each User-agent directive defined in your robots.txt file.
example:
This line will exclude the file 'bookmarks.htm' from the search robot:
Disallow: /bookmarks.htm
You can specify a directory and block spiders from its entire contents. This is especially a good idea for executable programs kept in your cgi-bin directory, and/or your images directory:
Disallow: /cgi-bin/
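You can try out Disallow lines before uploading them. The sketch below uses Python's standard urllib.robotparser module. The rules are the two examples above, and "AnyBot" is a made-up robot name. Keep in mind this shows how Python's parser reads the rules, and other robots may differ in small details.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /bookmarks.htm
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url):
    """True if a robot obeying RULES may fetch url."""
    return parser.can_fetch("AnyBot", url)


for url in (
    "http://mydomain.com/index.htm",
    "http://mydomain.com/bookmarks.htm",
    "http://mydomain.com/cgi-bin/search.cgi",
):
    print(url, "allowed" if allowed(url) else "blocked")
```

A Disallow value is matched against the start of the requested path, which is why /cgi-bin/ covers every file inside that directory.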
 
Comments & White Spaces:
A line in robots.txt that begins with a # sign is treated as a comment. Comments are a good way to organize your directives and remind yourself why you excluded a file or directory. The standard also allows comments at the end of directive lines, but I don't recommend this: some spiders cannot interpret such a line correctly and will treat the whole line, comment included, as the path to block.
example:
Use:
# Block images:
Disallow: /images/
 
Instead of:
Disallow: /images/      # Block images:
A robot might try to interpret this as 1 complete file name: /images/#Blockimages:
 
A blank space at the beginning of a line is allowed, but not recommended.
    Disallow: /bob #comment
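To see what a robot that follows the standard is supposed to do with comments and leading spaces, here is a minimal Python sketch. The function parse_line is made up for this example and is not taken from any real spider.

```python
def parse_line(line):
    """Return (field, value) for a directive line, or None for a
    blank or comment-only line."""
    # Everything from the first # onward is a comment.
    body = line.split("#", 1)[0].strip()
    if not body or ":" not in body:
        return None
    field, value = body.split(":", 1)
    return field.strip().lower(), value.strip()


print(parse_line("# Block images:"))
print(parse_line("Disallow: /images/      # Block images:"))
print(parse_line("    Disallow: /bob #comment"))
```

A spider written this way handles all three lines correctly. The advice above is for the ones that skip the comment-stripping step.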
 
A few basic examples:
Allow all robots to visit all files on your site:
User-agent: *
Disallow:
 
Block all robots from your site:
User-agent: *
Disallow: /
 
Block all robots from the cgi-bin and images directories of your site:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
 
Block all robots from the cgi-bin and images directories of your site, and from the sub-directories of the products directory. This allows a robot to scan the menu page of your products, but blocks it from the individual pages kept in the dresses and pants directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /products/dresses/
Disallow: /products/pants/
 
Block the Googlebot from your entire site:
User-agent: googlebot
Disallow: /
 
Block the Googlebot from the bookmarks.htm file:
User-agent: googlebot
Disallow: /bookmarks.htm
 
Block the Googlebot from the bookmarks.htm file, and the Roverdog bot from the cgi-bin directory, and use a comment line to separate each directive set:
User-agent: googlebot
Disallow: /bookmarks.htm
#
User-agent: Roverdog
Disallow: /cgi-bin/
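A file with several directive sets is worth testing once per robot. Here is a sketch using Python's standard urllib.robotparser again, with the paths written with a leading slash. "OtherBot" is a made-up name for a robot that is not mentioned in the file at all.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: googlebot
Disallow: /bookmarks.htm
#
User-agent: Roverdog
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

checks = [
    ("Googlebot", "http://mydomain.com/bookmarks.htm"),
    ("Googlebot", "http://mydomain.com/cgi-bin/search.cgi"),
    ("Roverdog", "http://mydomain.com/bookmarks.htm"),
    ("Roverdog", "http://mydomain.com/cgi-bin/search.cgi"),
    ("OtherBot", "http://mydomain.com/cgi-bin/search.cgi"),
]

for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(agent, url, verdict)
```

Each robot obeys only its own set of directives, and a robot with no matching set (and no User-agent: * set to fall back on) is allowed everywhere.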
