Robots.txt - SEO Articles





Robots.txt

Robots.txt is a plain text file that tells search engine spiders which pages of a website they may crawl and which they may not. It is considered one of the prominent factors in optimizing website pages. Most spiders read the robots.txt file before crawling a site and follow its rules when deciding which pages to fetch.

Basics of robots.txt

The robots.txt file is placed in the root directory of a website. For example, to block Google from crawling the web page “privacy.htm”, use the following code inside robots.txt:

User-agent: Googlebot
Disallow: /privacy.htm

Here the “User-agent” field specifies the name of the spider or crawler, and the “Disallow” field specifies the part of the website to be blocked.
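The effect of a rule like this can be checked locally before the file goes live. The following is a minimal sketch using Python's standard urllib.robotparser module. The domain www.example.com and the helper name make_parser are placeholders chosen for illustration, and the rule uses the hyphenated User-agent field name that parsers expect. Python's parser is only one implementation, so real spiders may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# A rule like the one above: keep Googlebot away from privacy.htm.
RULES = """\
User-agent: Googlebot
Disallow: /privacy.htm
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text, with no network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(RULES)
SITE = "https://www.example.com"  # placeholder domain (assumption)

# The named spider is blocked from the listed page only.
blocked = not parser.can_fetch("Googlebot", SITE + "/privacy.htm")
other_page_ok = parser.can_fetch("Googlebot", SITE + "/index.htm")
# A spider that is not named in the file is unaffected by the rule.
other_spider_ok = parser.can_fetch("Scooter", SITE + "/privacy.htm")

print(blocked, other_page_ok, other_spider_ok)
```

Parsing the text directly, rather than fetching it from a server, lets a draft file be tested before it is uploaded.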

For example, to block the whole website from Google's spider, the following code can be used:

User-agent: Googlebot
Disallow: /

A single forward slash “/” therefore disallows crawling of every page of the website.

Using “*” in the User-agent field applies the rule to all search engine spiders. To prevent the directory “mydir” from being crawled by any spider:

User-agent: *
Disallow: /mydir/
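To see that the wildcard really covers every spider, the rule can be tried against several spider names at once. This is a small sketch using Python's urllib.robotparser. The domain and the page names page.htm and index.htm are made up for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /mydir/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "https://www.example.com"  # placeholder domain (assumption)
SPIDERS = ["Googlebot", "Scooter", "Lycos_Spider"]

# For each spider: (may crawl inside mydir?, may crawl elsewhere?)
results = {
    spider: (
        parser.can_fetch(spider, SITE + "/mydir/page.htm"),
        parser.can_fetch(spider, SITE + "/index.htm"),
    )
    for spider in SPIDERS
}

for spider, (in_mydir, elsewhere) in results.items():
    print(f"{spider}: mydir={in_mydir} elsewhere={elsewhere}")
```

Every spider in the list is refused inside the directory and still allowed on the rest of the site.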

To allow a specific search engine to crawl a specific web page, “terms.htm”, the following code can be used:

User-agent: Googlebot
Allow: /terms.htm
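An Allow line makes a visible difference mainly when it carves an exception out of a broader Disallow. The sketch below assumes such a pairing: the Disallow: / line is added purely for illustration and is not part of the example above. Python's urllib.robotparser applies the first matching line, so Allow is listed first here. Other parsers may resolve conflicts differently, for instance by the most specific path.

```python
from urllib.robotparser import RobotFileParser

# Assumed pairing: block the whole site, except terms.htm.
RULES = """\
User-agent: Googlebot
Allow: /terms.htm
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "https://www.example.com"  # placeholder domain (assumption)

terms_ok = parser.can_fetch("Googlebot", SITE + "/terms.htm")
privacy_ok = parser.can_fetch("Googlebot", SITE + "/privacy.htm")

print("terms.htm allowed:", terms_ok)
print("privacy.htm allowed:", privacy_ok)
```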

Cautions

It is generally better to have a robots.txt file on a website, so that only the pages with quality content are crawled and indexed as required. The file has to be created carefully, however, since a single misplaced slash can block spiders from crawling the whole website. If you are not confident about writing the file, it is safer to leave it out than to get it wrong. Better still, have an expert write it: a well-designed file steers spiders toward the pages you want to rank.
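One way to guard against the stray-slash mistake is to test a draft file against a short list of pages that must stay crawlable. The sketch below is one possible check, not a complete validator. The function name blocked_urls, the domain, and the page list are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the draft robots.txt would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


SITE = "https://www.example.com"  # placeholder domain (assumption)
MUST_BE_CRAWLABLE = [SITE + "/", SITE + "/index.htm", SITE + "/terms.htm"]

intended = "User-agent: *\nDisallow: /mydir/\n"
mistaken = "User-agent: *\nDisallow: /\n"  # directory name left out

for label, draft in [("intended", intended), ("mistaken", mistaken)]:
    problems = blocked_urls(draft, "Googlebot", MUST_BE_CRAWLABLE)
    status = "OK" if not problems else f"BLOCKS {len(problems)} page(s)"
    print(f"{label}: {status}")
```

Running such a check before uploading the file catches the case where every important page would be blocked at once.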

It is a must

On a live site, the hosting statistics usually show a fair percentage of hits to the robots.txt file. Since ordinary internet users do not visit this file, practically all of these requests come from search engine spiders. If no robots.txt file is present, crawlers may visit every page in every accessible folder. Some search engines themselves recommend having a robots.txt file, which confirms that their spiders look for it.
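Where raw access logs are available, the same observation can be made by counting which user agents request /robots.txt. The following is a rough sketch that assumes the widely used "combined" log format. The log lines, IP addresses, and user-agent strings are invented for the example, and real logs may need a different pattern.

```python
import re
from collections import Counter

# Invented sample lines in the "combined" access-log format (assumption).
SAMPLE_LOG = """\
192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Googlebot/2.1"
192.0.2.10 - - [01/Jan/2024:06:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Googlebot/2.1"
198.51.100.7 - - [01/Jan/2024:07:30:12 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ia_archiver"
203.0.113.5 - - [01/Jan/2024:08:15:40 +0000] "GET /index.htm HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
"""

# Captures the requested path and the user-agent string of each request.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_hits(log_text):
    """Count requests for /robots.txt, grouped by user agent."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("path") == "/robots.txt":
            hits[match.group("agent")] += 1
    return hits


hits = robots_txt_hits(SAMPLE_LOG)
for agent, count in hits.most_common():
    print(f"{agent}: {count}")
```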

Following are the bots or spiders of some search engines.

Search Engine     Spider
Google            Googlebot
Alexa             ia_archiver
LookSmart         MantraAgent
Lycos             Lycos_Spider
AltaVista         Scooter
Wisenut           Zyborg
Atomz             Atomz
WhatUSeek         Winona
AlltheWeb         FAST-WebCrawler
Scrub The Web     Scrubby