Use the robots.txt Disallow directive to forbid spiders and search engine robots

by Yang Yang on July 31, 2009

Just like .htaccess, robots.txt resides at the document root of your domain. It’s a plain-text configuration file containing directives, or rules, that any well-behaved web spider or search engine robot should respect. While you can use .htaccess to forcibly block all visits (including those of human visitors) to a certain part of your site, robots.txt deals only with automated web page spiders such as GoogleBot.

To forbid all robot spiders from accessing and indexing the /includes/ and /search/ directories of your site, simply create a robots.txt file containing the following rules:

User-agent: *
Disallow: /includes/
Disallow: /search/

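The asterisk * stands for any robot. Under these rules, no robot spider should access or index /includes/ and /search/. This is a good way to stop search engines from indexing certain parts of your site. Keep in mind that robots.txt is publicly readable and purely advisory, so to truly protect sensitive data you should use a forcible mechanism such as .htaccess instead.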
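
As a rough way to check rules like these before deploying them, the sketch below feeds them to Python's standard urllib.robotparser module. The bot name AnyBot and the example.com URLs are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from above, as they would appear in robots.txt.
RULES = """\
User-agent: *
Disallow: /includes/
Disallow: /search/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# "AnyBot" is a hypothetical crawler name; it falls under the * group.
for path in ("/includes/config.php", "/search/results.html", "/index.html"):
    url = "http://example.com" + path
    verdict = "allowed" if parser.can_fetch("AnyBot", url) else "blocked"
    print(path, verdict)
```

With this parser, the two paths under /includes/ and /search/ are reported as blocked and /index.html as allowed.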

Similarly, you can write rules targeted at a specific search engine:

  1. GoogleBot – Google
  2. Slurp – Yahoo
  3. MSNBot – Bing

(Note that these search engine providers most likely run more than one bot; the bots listed are just the most common ones at present.)

For example, to prohibit Google from accessing and indexing the /ihategoogle/ directory and any web documents under it, use the rule:

User-agent: GoogleBot
Disallow: /ihategoogle/
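
A small sketch of how a bot-specific rule behaves, again using Python's standard urllib.robotparser as a stand-in for a real crawler (an assumption; each search engine has its own matcher). The example.com host and the page names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: GoogleBot",
    "Disallow: /ihategoogle/",
])

# (agent, path) pairs to test; the paths are invented for illustration.
checks = [
    ("GoogleBot", "/ihategoogle/page.html"),  # under the disallowed directory
    ("GoogleBot", "/ihategoogle"),            # no trailing slash, so the prefix does not match
    ("Slurp", "/ihategoogle/page.html"),      # a bot with no rules of its own
]

results = {
    (agent, path): parser.can_fetch(agent, "http://example.com" + path)
    for agent, path in checks
}

for (agent, path), allowed in results.items():
    print(agent, path, "allowed" if allowed else "blocked")
```

With this parser only the first check is refused. Rules are matched as path prefixes, so the trailing slash in the Disallow value matters, and bots that no rule group names are left unrestricted.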

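There is no Allow directive in the original robots.txt standard, although some crawlers such as GoogleBot accept one as an extension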

Therefore, to allow a spider to access your entire site, either write no rules for it or leave the Disallow value empty:

User-agent: GoogleBot
Disallow: 

To give a single bot such as GoogleBot exclusive access to your entire site:

User-agent: GoogleBot
Disallow: 

User-agent: *
Disallow: /
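
To see the combined effect of an empty Disallow for one bot and a blanket Disallow for everyone else, here is a minimal sketch using Python's standard urllib.robotparser. The helper allowed_agents and the example.com URL are hypothetical, and actual crawlers may resolve rule groups differently.

```python
from urllib.robotparser import RobotFileParser

EXCLUSIVE_RULES = """\
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(EXCLUSIVE_RULES.splitlines())


def allowed_agents(agents, url):
    """Return the subset of agents that this parser would let fetch url."""
    return [agent for agent in agents if parser.can_fetch(agent, url)]


# Hypothetical page URL; the bot names come from the list earlier in the post.
survivors = allowed_agents(
    ["GoogleBot", "Slurp", "MSNBot"],
    "http://example.com/any/page.html",
)
print(survivors)
```

Only GoogleBot remains in the printed list, because its own group with the empty Disallow takes precedence over the catch-all group.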
