|
Most search engines run their own crawl of the web
to populate their search databases. To gather information, programs
called Robots are sent out to scan the web. Robots (also called
Spiders, Scooters, Worms, Web Crawlers, and Web Wanderers) automatically
gather and index information and then put that information into
their databases.
Robots basically work like this: someone submits their site
to a search engine, and the search engine sends out a Robot to index the
submitted site. When the Robot finds a page with links, it
then follows the links on that page as a source for new URLs
to index. Users can then construct queries to search these databases
and find information on the submitted site, or on sites that
were linked to from that site, perhaps even yours.
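The link-following behavior described above can be sketched as a breadth-first traversal. This is only an illustration, not how any particular search engine works: the function names crawl and links_from_site, the max_pages limit, and the SITE dictionary (a made-up link graph standing in for real pages) are all assumptions introduced for the example.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Index the seed URL, then follow links to discover new URLs."""
    indexed = []           # pages in the order they were indexed
    seen = {seed}          # URLs already queued, so none is visited twice
    queue = deque([seed])
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        indexed.append(url)
        # Every link on the page is a candidate for indexing.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed

# Hypothetical link graph: each URL maps to the links found on that page.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.org/"],
    "http://example.com/a": ["http://example.com/"],
    "http://example.org/": [],
}

def links_from_site(url):
    """Stand-in for downloading a page and extracting its links."""
    return SITE.get(url, [])

pages = crawl("http://example.com/", links_from_site)
```

Here example.org is never submitted, but it still ends up in the index because the submitted site links to it.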
Certain robot implementations can overload networks and
servers. This happens especially with robots written by people who are just starting
out, or by attackers who are intentionally trying
to cause havoc. At the same time, the majority of robots are
well designed, professionally operated, cause no problems, and
provide a valuable service in the absence of widely deployed
better solutions.
There may be files or directories on your site that you don't
want indexed by a robot. In these cases, you can ask robots to skip any
directory by creating a robots.txt
file for your website. The complete Standard for Robot Exclusion
can be found on RobotsTxt.org.
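As a rough sketch of how a well-behaved robot might honor such a file, the example below uses Python's standard urllib.robotparser module. The rules, the robot name ExampleBot, and the example.com URLs are made up for illustration; a real robot would download the file from the site rather than use a hard-coded list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all robots are asked to stay
# out of the /private/ directory.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly instead of fetching a URL

# Ask whether a robot named "ExampleBot" (an assumed name) may fetch each URL.
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/page.html")
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
```

With these rules, blocked is False and allowed is True. The file is only a request: it keeps out robots that choose to check it, not ones that ignore it.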
The following is a basic design for these files.
|