E-Myth Blog


Website Tips: Block the Bots

Sep 4, 2008 in Home Page News, Marketing, Lead Generation

By Ross MacLeod

Our article Simple Search Engine Optimization discussed how to increase the 'findability' of your web pages. This article discusses a file called 'robots.txt' which can keep search engines away from certain public pages and which can direct search engines to your public sitemap.

Robby the Robot in 'Forbidden Planet' (MGM)

Block the Bots

Many websites will have public pages that should not be visible in search results, such as cobranded promotions with discount prices, expired offers, code, style sheets and broken pages. Truly sensitive material should always be served from behind login/password challenges, but promotional pages can inadvertently be indexed, and discount codes can be distributed more widely than was originally intended. How can you keep pages away from the major search engines? By telling them what they can and cannot inspect using a file named robots.txt.

robots.txt

'robots.txt' is an optional text file that can be placed in your website's document root directory. If it exists, the robots.txt file of any website can be viewed simply by typing the domain URL plus '/robots.txt', e.g., http://www.nytimes.com/robots.txt. Software crawlers (or robots) from many search engines (including Google's googlebot, Yahoo's slurp and MSN's msnbot) will read robots.txt and use the contents as instructions for indexing a website's contents. There are no rules forcing crawlers to use robots.txt, so there are no guarantees that indicated files will stay hidden, but the major search engines make use of robots.txt.
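As a small illustration of the "domain URL plus '/robots.txt'" rule, here is a sketch in Python that derives the robots.txt address from any page address on the same site. The function name robots_txt_url and the sample page path are made up for this example; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical; the host comes from the article.
print(robots_txt_url("http://www.nytimes.com/pages/index.html"))
```

Pasting the resulting address into a browser shows the file, if the site has one.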

At any given time there may be thousands of active search crawlers, finding and saving information deemed valuable. A software robot that can deeply inspect a list of websites is relatively easy for a programmer to write, and searching websites en masse is commonly done, e.g., spammers will search for email addresses and copycat vendors will collect prices and republish product data. The take-home lesson is that there is no privacy of information for your public web pages, but a well-crafted robots.txt file can keep selected pages off the big results lists.

Examples

Here are examples of robots.txt instructions. Individual search robots (user-agents) can be specified or all agents can be indicated with an asterisk (*). The first example allows all search robots (user-agents) to inspect and index the whole site. This is equivalent to having no robots.txt file:

# (Comments are prefaced with #)
# Allow all user agents to index all site pages.

User-agent: *
Disallow:

The next example instructs all search robots to ignore the whole site.

# Disallow all user agents from indexing any site pages

User-agent: *
Disallow: /
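One way to see how a rule set might be read is to feed it to the robots.txt parser in Python's standard library. The sketch below checks the two examples above, plus a third rule set that is not from this article and is included only to show a rule aimed at a single user-agent. The helper name allowed and the page /index.html are assumptions for illustration, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = """\
User-agent: *
Disallow:
"""

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical rule set: block only msnbot, allow every other crawler.
BLOCK_ONE_AGENT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for name, rules in [("allow all", ALLOW_ALL),
                    ("block all", BLOCK_ALL),
                    ("block msnbot only", BLOCK_ONE_AGENT)]:
    for agent in ("googlebot", "msnbot"):
        print(name, agent, allowed(rules, agent, "/index.html"))
```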

Let's say your website has product discounts for your partner companies on the pages below; one discount for company Alpha and another for company Beta. Here's a way to keep those discount pages out of major search results listings:

# Disallow all user agents from discount pages

User-agent: *
Disallow: /partners/Alpha/discount.html
Disallow: /partners/Beta/discount.html

Hiding a whole directory is easier than listing every page. The next example covers all Alpha and all Beta pages, along with everything else in the /partners directory:

# Disallow all user agents from partner pages

User-agent: *
Disallow: /partners/
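The practical difference between page rules and a directory rule shows up when a new partner page is added later. The sketch below uses Python's standard robots.txt parser to compare the two approaches. The Gamma page and the helper name build_parser are hypothetical, and Python's parser is only one reading of the rules, not a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

PAGE_RULES = """\
User-agent: *
Disallow: /partners/Alpha/discount.html
Disallow: /partners/Beta/discount.html
"""

DIRECTORY_RULES = """\
User-agent: *
Disallow: /partners/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


by_page = build_parser(PAGE_RULES)
by_directory = build_parser(DIRECTORY_RULES)

# Hypothetical page added after the rules were written.
new_page = "/partners/Gamma/discount.html"

print("page rules block Alpha:", not by_page.can_fetch("googlebot", "/partners/Alpha/discount.html"))
print("page rules block Gamma:", not by_page.can_fetch("googlebot", new_page))
print("directory rule blocks Gamma:", not by_directory.can_fetch("googlebot", new_page))
```

With page rules, each new discount page needs its own line; the directory rule picks it up without any change.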

Also useful is this HTML element:

  <META NAME="robots" CONTENT="NOINDEX, NOFOLLOW, NOARCHIVE">

This meta tag can be placed in the head of individual pages and can provide another layer of protection. Not all web crawlers recognize it, but Google, Yahoo and MSN honor it and will not index the page, follow its links or cache it. A crawler that obeys a Disallow rule in robots.txt never fetches the page and so never reads the tag, which makes the tag redundant for pages already covered by robots.txt. It becomes useful if robots.txt is temporarily missing, contains typos, or falls out of date when directory structures change.
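To check that a page really carries the tag, a short script can read the HTML and list the directives it finds. This is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser, the function name robots_directives and the sample page are assumptions made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the lowercase robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that uses the tag from the article.
page = ('<html><head><META NAME="robots" CONTENT="NOINDEX, NOFOLLOW, NOARCHIVE">'
        '</head><body>Partner discount</body></html>')
directives = robots_directives(page)
print(sorted(directives))
```

An empty result for a page that should be hidden is a sign the tag is missing or misspelled.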

Sitemaps

A representation of your website can be built in a special 'sitemap' XML format and referenced in your robots.txt file with a 'Sitemap:' line. A sitemap presented to a search engine crawler can help it index your site's content quickly and accurately. CNN's robots.txt shows use of sitemap references.
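A sitemap reference is a single line in robots.txt. The sketch below builds a sample file and reads the reference back with Python's standard parser. The address www.example.com and the sitemap file name are placeholders, not taken from any real site, and the site_maps() method assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining a directory rule with a sitemap reference.
ROBOTS_TXT = """\
User-agent: *
Disallow: /partners/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap lines.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print("sitemap:", url)
```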

Further Reading

Robot Exclusion Standard

Robots.txt at Wikipedia

Google's Webmaster Guide

Sitemaps

Site Examples

The New York Times

The Wall Street Journal

The White House

McDonald's Restaurants


Comments

  1. Shane O. says:

    Thank you for the info. Great information, yet again you open up my eyes to see the future.

    Submitted Sep 5, 2008 2:04 AM

  2. Goldson E. says:

    This information comes in so helpful to me now. I am off to discuss this with my web consultant. My subscription and time is worth every bit of the effort.

    Submitted Sep 8, 2008 9:22 AM

  3. HEPBURN W. says:

    Wow!! I hadn't a clue! (like many). Super useful info. Thanks.

    "Reassessment time"

    Submitted Sep 17, 2008 1:54 AM

  4. F. S. says:

    As a webmaster, you definitely should use user-agent info to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

    I wrote more about this here:

    Webmaster Tips: Blocking Selected User-Agents
    http://faseidl.com/public/item/213126

    Submitted Sep 20, 2008 7:58 PM


Copyright © 2006-2008 E-Myth Worldwide, Inc.