Digital Security Reports: Battle The (Google) Bots
Back before there was Google, the big new search engine was AltaVista. To show off its power, the AltaVista team at Digital decided to crawl and index the entire web, which was a new concept at the time. Many site owners didn't like the idea of a "robot" program accessing every page on their web sites, because it would put extra load on their web servers and increase their bandwidth costs. To address these growing concerns, the Robots Exclusion Standard was created in 1996.
Using a simple text file called robots.txt you can instruct search engines to stay out of certain directories. Here is a very simple robots.txt which disallows all search engines (User-agents) access to the /images directory.
User-agent: *
Disallow: /images
Rules are matched by prefix. By disallowing /images, you are also implicitly disallowing all subdirectories under /images, such as /images/logos, and any files whose path begins with /images, such as /images.html.
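The prefix behavior described above can be checked with Python's standard urllib.robotparser module. This is only a minimal sketch: the example.com host and the "AnyBot" agent name are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
rules = [
    "User-agent: *",
    "Disallow: /images",
]

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory lines instead of fetching a URL


def allowed(path):
    """Return True if a generic crawler may fetch the given path."""
    # example.com and "AnyBot" are placeholders, not real endpoints.
    return parser.can_fetch("AnyBot", "http://example.com" + path)


for path in ["/index.html", "/images/logos/a.png", "/images.html"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Only /index.html should come back as allowed; the other two paths both start with /images.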
The first draft of the standard did not include an "Allow" directive. It was added later, but there is no guarantee that all search engines support it. Anything that was not specifically disallowed was considered fair game to web crawlers.
If you choose to disallow access to your entire web site, you can use a robots.txt like this:
User-agent: *
Disallow: /
When the User-agent is *, the lines that follow apply to every search robot. By naming a specific web crawler's signature as the User-agent instead, you can give instructions to that robot alone.
User-agent: Googlebot
Disallow: /google-secrets
Since the initial specification was issued, some search engines have extended the protocol. One example is support for wildcards.
User-agent: Slurp
Disallow: /*.gif$
Here * matches any sequence of characters and $ anchors the match to the end of the URL. As a result, Yahoo!'s web crawler (named Slurp) will not index any files on your site that end in .gif. You do need to preface these lines with the appropriate User-agent line, since not every search engine currently supports wildcard matching.
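To make the wildcard rule more concrete, here is a rough sketch of how a crawler could evaluate such a pattern by translating it into a regular expression. The function name rule_matches is hypothetical, and this is not Yahoo!'s actual implementation; it only assumes the common reading where * matches any run of characters and a trailing $ anchors the end of the path.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics (the common wildcard extension):
      *  matches any run of characters, including none
      $  at the end of the pattern anchors it to the end of the path
    Without a trailing $, the pattern only has to match a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with ".*" where each "*" was.
    regex = re.compile(".*".join(re.escape(piece) for piece in pattern.split("*")))
    if anchored:
        return regex.fullmatch(path) is not None
    return regex.match(path) is not None


print(rule_matches("/*.gif$", "/photos/cat.gif"))       # True
print(rule_matches("/*.gif$", "/photos/cat.gif.html"))  # False
print(rule_matches("/images", "/images.html"))          # True
```

The last call shows that a pattern without wildcards falls back to the plain prefix matching of the original standard.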
You can combine several of these techniques in a single robots.txt file. Here is an example.
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
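If you want to see how a combined file like the one above plays out for different crawlers, you can feed it to Python's standard urllib.robotparser module. Treat this as a sketch under a few assumptions: example.com and "OtherBot" are placeholders, and as far as I know this module uses plain prefix matching, so it does not interpret the wildcard line the way Google or Yahoo! would. That does not change the outcome here, because the final Disallow: / already blocks Googlebot from everything except /foo.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network


def may_fetch(agent, path):
    """Return True if the named crawler may fetch the path."""
    # example.com is a placeholder host used only to build a full URL.
    return parser.can_fetch(agent, "http://example.com" + path)


for agent in ("Googlebot", "OtherBot"):
    for path in ("/foo/page.html", "/bar/page.html", "/index.html"):
        verdict = "allowed" if may_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:16} {verdict}")
```

Googlebot should be allowed only under /foo, while every other crawler is blocked only under /bar.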
Computer programs are pretty good at following instructions like these. But for a human brain it can quickly get overwhelming, so I highly encourage you to keep it simple.
Google's Webmaster Tools include a robots.txt analysis tool, which I highly recommend. For more information on the Robots Exclusion Standard, point your browser to www.robotstxt.org.
Today, when companies are spending a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective, there are many valid reasons for limiting what a search engine indexes on your site. See my Digital Security Report for more information.
Nick Dalton's blog is TipsTricksToolsTechniques.com, where he regularly shares tips on Internet security. Also worth checking out is his latest report, The Digital Security Report, which has essential advice for Internet business owners selling products online.
Published November 8th, 2007
Filed in Ecommerce, Search Engine, Web Design
