This document is available on the Internet at:  http://urbanmainframe.com/folders/blog/20040811/

Google: Don't Index This!

Date:  11th August, 2004

I have recently realised that I have been doing some visitors to the Urban Mainframe a disservice.

I've confused them, I've wasted their time and I've lost their trust.

I'm not referring to my regular visitors; I'm talking about new visitors - more specifically, those who have come to the Urban Mainframe via a link from a search engine.

My crime? Allowing the search engines to index content that changes frequently...

Like many bloggers, I occasionally take a peek at my referrer logs (say, once every two minutes or so). Every now and then I'll click one of the links in one of those logs for an ego-gratifying trip to the referring page.

The vast majority of these referrer links are from Google, or one of the other search engines, and I had been analysing my referrer logs for two years (not continuously!) before I realised that I had a problem.

The Problem

I noticed that many of the search engine matches linked back to the home-page of my weblog. The problem, as I quickly discovered when I followed some of these links back to my website, is that this page changes frequently - sometimes several times per day (most noticeably in the case of the "Fresh Meat" linkblog). Thus, many of the search engines' matches were for items that had previously been on the weblog front-page, but had since been shuffled off into the archives. So, visitors were entering my website via an explicit search engine match, yet were unable to find the corresponding content - because the page they were visiting had changed after the search engine had indexed it!

Remember: Search engines don't re-index dynamic pages every time they change - at least not yet.

Therefore, based on my own browsing habits, I suspect that the majority of those visitors, on being unable to find what they were looking for, immediately clicked their browser's "Back" button in order to select the next result from the search engine (and hopefully find what they were looking for). Most of those users will probably never return to my website.

Naturally, I want the search engines to index my content, every last word of it. But I also want my website to be user-friendly and "honest" - I don't want to confuse new visitors, each of whom is a potential return visitor, with inaccurate search engine matches. So I want the search engines to index my permanent content, not my indices. However, the search engine needs to scan the indices in order to find that permanent content!

Controlling the Robots

The solution is remarkably simple: a well-behaved search engine robot (also known as a "crawler", or "spider") can be instructed how it should navigate a website and what content it should index on its travels.

We can direct the behaviour of the robot in one of two ways:

  • We can deploy a small text file on our web-server called "robots.txt", containing a rule-set for our entire website, or
  • We can use a Robots META tag to direct the robot on a per-page basis.

It's also possible to use a combination of both - a "robots.txt" file for global directives along with page-specific META tags for individual pages.
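
For illustration, here's what a minimal "robots.txt" might look like (the directory names here are hypothetical):

# "*" means the rules below apply to all robots
User-agent: *
# keep robots out of these directories
Disallow: /cgi-bin/
Disallow: /drafts/

Note that "robots.txt" can only say "don't fetch these URLs at all" - it cannot express "read this page, but don't index it". For that we need the Robots META tag.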

Fixing the Urban Mainframe's indexing problem was simple: I added the following markup to my indices:

<meta name="robots" content="noindex, follow" />

To the search engine robot, this simple markup says, "please don't index this page, but follow the links on it because they could lead to content that should be indexed."

I didn't need to add any markup to the pages containing my permanent content (the pages that should be indexed).
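
For context, here's roughly how that directive sits in the head of one of my index pages (the title is illustrative):

<head>
  <title>Urban Mainframe: Weblog</title>
  <!-- don't index this ever-changing page, but do follow its links -->
  <meta name="robots" content="noindex, follow" />
</head>

A robot that finds no Robots META tag on a page assumes "index, follow" by default - which is exactly the behaviour I want for my permanent pages.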

Other Websites

This little exercise made me wonder which of my favourite websites were using robot directives on their indices and which weren't. I chose to examine a completely random selection of websites from my blogroll in order to find out. The results may surprise you:

  • Binary Bonsai: good robot control via META tags
  • Gadgetopia: no robot directives
  • MacBlog: no robot directives
  • The Daily WTF: no robot directives
  • 456 Berea Street: robot control via META tag - however, the directives actually request indexing of the dynamic home page!
  • Signal vs. Noise: no robot directives
  • Acts of Volition: no META directives, has a "robots.txt" file but it does not prevent indexing of the dynamic home page
  • SimpleBits: no META directives, has a "robots.txt" file but it does not prevent indexing of the dynamic home page
  • sidesh0w: robot control via META tag and extensive "robots.txt" - "robots.txt" excludes indexing of the dynamic front page, but the META tag on that page requests indexing - I'm not sure how this conflict will affect the robots
  • Dunstan Orchard (1976 Design): robot control via META tag and "robots.txt" file - indexing of dynamic pages excluded. It's no surprise to find Dunstan doing it right!
  • noscope: robot control via META tag - however, the directives actually request indexing of the dynamic home page
  • A Whole Lotta Nothing: no robot directives
  • Asterisk (D. Keith Robinson's weblog): robot control via META tag - however, the directives actually request indexing of the dynamic home page
  • Digital Web Magazine: no robot directives
  • A List Apart: robot control via META tag - however, the directives actually request indexing of the dynamic home page
  • Cameron Moll's Authentic Boredom: no robot directives

I found many more examples of websites with bad or non-existent robot control, but I think there are enough listed here to illustrate that the problem is quite widespread.

I suspect that an incredible number of users are following search engine results through to pages where the content has little or no correlation to the initial search query, simply because the dynamic indices of the target websites have changed after the search engines indexed them. This equates to a vast number of frustrated web users - and the tragedy is that the problem is entirely, and easily, avoidable.

It's time webmasters paid more attention to robot directives.

Adding Selectivity

Brad Choate has devised a clever way to control indexing with more granularity than either META tags or "robots.txt" provides. However, this might contravene a restriction that Google has on "cloaking".
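
I won't attempt to reproduce his technique here, but the general idea behind this kind of granularity - and what follows is my own sketch, with hypothetical names throughout, not necessarily Choate's exact implementation - is to recognise a robot by its User-Agent header and serve it a copy of the page with the transient sections stripped out:

import os

# substrings that identify common robots (illustrative, not exhaustive)
ROBOT_SIGNATURES = ("googlebot", "slurp", "msnbot", "crawler", "spider")

def is_robot(user_agent):
    # crude test: does the User-Agent string look like a robot's?
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in ROBOT_SIGNATURES)

def render_page(permanent_html, transient_html, user_agent):
    # people get the whole page; robots get only the permanent content,
    # so the ever-changing sections never reach a search engine's index
    if is_robot(user_agent):
        return permanent_html
    return permanent_html + transient_html

# hypothetical page fragments
article = "<div id='article'>The permanent entry...</div>"
linkblog = "<div id='linkblog'>Today's Fresh Meat links...</div>"

# in a CGI environment, the User-Agent arrives via HTTP_USER_AGENT
print(render_page(article, linkblog, os.environ.get("HTTP_USER_AGENT")))

Because the robot sees different markup than a human visitor does, this is precisely the kind of content variation that Google's cloaking rules are aimed at - hence the caveat above.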