This document is available on the Internet at: http://urbanmainframe.com/folders/blog/20040811/folders/blog/20040811/
I have recently realised that I have been doing some visitors to the Urban Mainframe a disservice.
I've confused them, I've wasted their time and I've lost their trust.
I'm not referring to my regular visitors; I'm talking about new visitors - more specifically, those who have come to the Urban Mainframe via a link from a search engine.
My crime? Allowing the search engines to index content that changes frequently...
Like many bloggers, I occasionally take a peek at my referrer logs (say, once every two minutes or so). Every now and then I'll click one of the links in one of those logs for an ego-gratifying trip to the referring page.
The vast majority of these referrer links are from Google, or one of the other search engines, and I had been analysing my referrer logs for two years (not continuously!) before I realised that I had a problem.
I noticed that many of the search engine matches linked back to the home-page of my weblog. The problem, as I quickly discovered when I followed some of these links back to my website, is that this page changes frequently - sometimes several times per day (most noticeably in the case of the "Fresh Meat" linkblog). Thus, many of the search engines' matches were for items that had previously been on the weblog front-page, but had since been shuffled off into the archives. So, visitors were entering my website via an explicit search engine match, yet were unable to find the corresponding content - because the page they were visiting had changed after the search engine had indexed it!
Remember: Search engines don't re-index dynamic pages every time they change - at least not yet.
Therefore, based on my own browsing habits, I suspect that the majority of those visitors, on being unable to find what they were looking for, immediately clicked their browser's "Back" button in order to select the next result from the search engine (and hopefully find what they were looking for). Most of those users will probably never return to my website.
Naturally, I want the search engines to index my content, every last word of it. But I also want my website to be user-friendly and "honest" - I don't want to confuse new visitors, each of whom is a potential return visitor, with inaccurate search engine matches. So I want the search engines to index my permanent content, not my indices. However, the search engine needs to scan the indices in order to find that permanent content!
The solution is remarkably simple: a well-behaved search engine robot (also known as a "crawler", or "spider") can be instructed as to how it should navigate a website and what content it should index on its travels.
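We can direct the behaviour of the robot in one of two ways: with a "robots.txt" file placed at the root of the website, or with a robots META tag in the markup of an individual page.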
It's also possible to use a combination of both - a "robots.txt" file for global directives along with page-specific META tags for individual pages.
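To make the "robots.txt" half of that concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the paths (`/cgi-bin/`, `/private/`), the robot name `ExampleBot` and the `example.com` URLs are all made up for illustration; they are not the Urban Mainframe's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. The paths are assumptions,
# not rules taken from any real website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A well-behaved robot asks before it fetches each URL.
print(rules.can_fetch("ExampleBot", "http://example.com/private/notes.html"))  # False
print(rules.can_fetch("ExampleBot", "http://example.com/folders/blog/"))       # True
```

Note that a "Disallow" rule asks the robot not to fetch the page at all, so it will not see the links on that page either. That makes "robots.txt" a blunt tool for index pages whose links should still be followed.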
Fixing the Urban Mainframe's indexing problem was simple: I added the following markup to my indices:
<meta name="robots" content="noindex, follow" />
To the search engine robot, this simple markup says, "please don't index this page, but follow the links on it because they could lead to content that should be indexed."
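As a rough illustration of how a robot might read that tag, the sketch below pulls the directives out of a page with Python's standard `html.parser` module. The class and function names (`RobotsMetaParser`, `robots_directives`, `should_index`, `should_follow`) are invented for this example, and real search engines will have their own, more elaborate handling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def should_index(directives):
    """Indexing is allowed unless the page says "noindex" (or "none")."""
    return "noindex" not in directives and "none" not in directives


def should_follow(directives):
    """Links may be followed unless the page says "nofollow" (or "none")."""
    return "nofollow" not in directives and "none" not in directives


page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
found = robots_directives(page)
print(should_index(found), should_follow(found))
```

A page with no robots META tag yields an empty set, which this sketch treats as "index and follow". That matches the point below: pages of permanent content need no extra markup.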
I didn't need to add any markup to the pages containing my permanent content (the pages that should be indexed).
This little exercise made me wonder which of my favourite websites were using robot directives on their indices and which weren't. I chose to examine a completely random selection of websites from my blogroll in order to find out. The results may surprise you:
I found many more examples of websites with bad or non-existent robot control, but I think there are enough listed here to illustrate that the problem is quite widespread.
I suspect that an incredible number of users are following search engine results through to pages where the content has little or no correlation to the initial search query, simply because the dynamic indices of the target websites have changed after the search engines have indexed them. This equates to a vast number of frustrated web users, and the tragedy is that the problem is entirely and easily avoidable.
It's time webmasters paid more attention to robot directives.
Brad Choate has devised a clever way to control indexing with more granularity than either META tags or "robots.txt" provides. However, this might contravene a restriction that Google has on "cloaking".