This document is available on the Internet at:  http://urbanmainframe.com/folders/blog/20040323/folders/blog/20040323/

Defending Against Comment Spam

Date:  23rd March, 2004

One of the problems with running a website that allows users to post content (whether "comments", forum posts or even just a "contact us" form) is that the systems that provide the interactivity are also a channel that spammers can use to pass on their anti-social messages.

The problem of "comment spam" is now well documented. I have never suffered such spam via the Urban Mainframe, as neither of my visitors would post such rubbish. Nevertheless, I have implemented a couple of upgrades to protect against such attacks in the future...

It's just so damn easy for the spammer to attack a website: Enter garbage into an input form, "click submit", hit the browser's "back" button, "click submit" - repeat the last two steps as many times as required to get your "message" across. Automate this process where possible (i.e. in almost every possible case).

How can a webmaster defend against this?

On the previous incarnation of the Urban Mainframe's comment system, spamming was all too easy. The system is deliberately open for anyone to post to, in order to encourage participation. Only the comments themselves are required input and there is no validation of email address or URL (if they have been submitted) beyond a check for structural correctness... welcome to spam city.

Despite not having a comment spam problem myself, I thought I'd better prepare some defences - for when I get my third visitor.

I had long been familiar with the CAPTCHA Project. A CAPTCHA is a "program that can generate and grade tests that most humans can pass [and that] current computer programs can't pass." That is exactly what is needed to protect against automated spamming tools, so I decided to implement a CAPTCHA clone on the Urban Mainframe.

The concept is simple, yet effective. The input forms include an embedded image within which is a word, number or other string of characters - usually distorted in some way. When submitting a CAPTCHA-protected form, the user has to enter the string displayed in the image along with his regular input. On the server side, an application tests the user's input against the graphical string and, if they match, the post is accepted. Otherwise it is declined.

Spamming software will not be able to successfully post since the software is almost certain to be unable to process the graphical text. It's conceivable that a high-tech and very elaborate spammer might run an OCR program in conjunction with his other software, but this is extremely unlikely. If such a problem did manifest itself, it would be easy to degrade the CAPTCHA image to the point where a human would still be able to decipher the embedded text but an OCR program, however advanced, would not.

[Image: The CAPTCHA, in all its glory!]

The Plan

I wanted my CAPTCHA system to be unobtrusive, legible (for humans), small and efficient. I decided that I wouldn't distort the graphical text, since I want it to be usable by as many people as possible. I figured that simply rotating the text by a few degrees and employing low contrast would frustrate any attempts to use OCR against the image.

In order to make the system as user-friendly as possible I set the following rules:

  • the only characters that are used are "a" - "z" (no punctuation or numerical digits)
  • only lower-case characters are displayed
  • the user's input will be tested without case sensitivity (many users work with [Caps Lock] permanently on)
  • rather than use random sequences of characters, I would use dictionary words as these would be cognitively easier to process - which would help the user considerably if I ever adjust the system to distort the images
  • I would use no word longer than four characters

The first three items are all easy to achieve. For the fourth and fifth, I took the "linux.words" file and, with a little Perl magic, extracted all the words with a length of four characters or fewer to produce a "captcha.dict" file. This file simply consists of a single word per line, which would make selecting a word easy.
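
The extraction script itself isn't shown here, so the following is only a minimal sketch of how such a filter might look. The function name `short_words` and the sample data are made up for illustration; in practice the lines would be read from the word list and the result written to "captcha.dict".

```perl
use strict;
use warnings;

# Return the lower-cased, de-duplicated words of at most $max_len letters.
# Entries containing anything other than a-z (after lower-casing) are skipped.
sub short_words {
    my ($max_len, @lines) = @_;
    my (%seen, @out);
    for my $line (@lines) {
        chomp(my $word = lc $line);
        next unless $word =~ /^[a-z]{1,$max_len}$/;
        push @out, $word unless $seen{$word}++;
    }
    return @out;
}

# Example with an in-memory list standing in for the word list file.
my @sample = ("Able\n", "zebra\n", "it's\n", "cat\n", "able\n");
print "$_\n" for short_words(4, @sample);
```

Filtering on `[a-z]` at this stage also takes care of the "no punctuation or digits" rule, so the dictionary never contains a word that would break it.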

Putting it Together

I began by creating a few background images ("backgrounds.zip"), which would hopefully help in "muddying the waters" against any OCR-based attack. I would select a background image at random, for each CAPTCHA, with the following (Perl) code:

# Seed the random number generator...
my (%temp);
srand ( time() ^ ($$ + ($$ << 15)) );

# Get a list of available background images...
opendir (my $bg_dir, '/path/to/background/images') or die "Cannot open background directory: $!";
    my @base_images = grep /^captcha_bg_/, readdir($bg_dir);
closedir ($bg_dir);
die "No background images found" unless @base_images;

# Choose an image...
$temp{'base_image'} = '/path/to/background/images/' . $base_images[rand @base_images];

I then choose a random word from the dictionary ("captcha.dict") and convert it to lower-case characters:

# Choose CAPTCHA word...
open (my $dictionary, '<', '/path/to/dictionary/captcha.dict') or die "Cannot open dictionary: $!";

    # Keep line N with probability 1/N, so every line is equally likely to be chosen
    rand($.) < 1 && ($temp{'word'} = $_) while <$dictionary>;
    chomp $temp{'word'};

close $dictionary;
$temp{'word'} = lc $temp{'word'};
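
The one-liner in the middle of that snippet is quite terse, so here is the same single-pass selection spelled out as a function. This is an illustrative sketch only: `random_line` is a name I've made up, and the demonstration reads from an in-memory string rather than the real dictionary file.

```perl
use strict;
use warnings;

# Pick one line uniformly at random from an open filehandle in a single pass.
# At line number n (Perl's $.), the current line replaces the choice with
# probability 1/n, so every line ends up equally likely to be kept.
sub random_line {
    my ($fh) = @_;
    my $choice;
    while (my $line = <$fh>) {
        $choice = $line if rand($.) < 1;
    }
    chomp $choice if defined $choice;
    return $choice;
}

# Demonstration using an in-memory "file" instead of captcha.dict.
my $words = "able\ncat\ndog\nfish\n";
open(my $fh, '<', \$words) or die "Cannot open in-memory file: $!";
print random_line($fh), "\n";
close $fh;
```

Because only the current line and the current choice are held at any time, memory use stays constant however large the dictionary grows.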

Now I combine the two using PerlMagick (an object-oriented Perl interface to ImageMagick):

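# Create the "CAPTCHA"...
use Image::Magick();
$temp{'CAPTCHA'} = '/path/to/temp/image/directory/token_' . int (rand 9999) . '_' . time() . '.gif';
my $q = Image::Magick->new;
# PerlMagick methods return an error string on failure, so check it
my $status = $q->Read($temp{'base_image'});
die "Cannot read background image: $status" if "$status";
$q->Annotate(gravity=>'center', pointsize=>18, font=>'helvetica', rotate=>-20, fill=>'#c0c0c0', text=>$temp{'word'}, antialias=>1);
$status = $q->Write($temp{'CAPTCHA'});
die "Cannot write CAPTCHA image: $status" if "$status";
undef $q;
chmod 0666, $temp{'CAPTCHA'};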

At this point I can now display the CAPTCHA and the input field. But, I need to be able to compare the input word with the CAPTCHA word, thus I need to know what the CAPTCHA word is when the user submits the form - in the stateless environment of the web, I need to maintain state.

The very first attempts to mimic a stateful web used the URL to pass session information between the client and server. Long and complex URLs were (and still are) widely used all over the web.

One of the most popular techniques involves the use of hidden form fields. Essentially, this involves encoding the session information in the HTML code of web pages and delivering it back to the server with a form submission.

Netscape then introduced the concept of "cookies" in 1994, in the earliest releases of its Navigator browser. Cookies are small pieces of text that are stored locally by the client browser and sent back to the web-server with each subsequent request to the site that set them. The value of cookies was immediately apparent to those developers who needed some form of state maintenance.

In many cases, the cookie is the ideal medium for state maintenance. Unfortunately, in the early days of cookies, there was a large amount of FUD surrounding security and privacy concerns. This was followed with reports of web advertisers using cookies to track web users. Thus cookies have become tarnished in the eyes of the web user - many disable the functionality in their browsers and large IT departments, who really should know better, block cookies at the gateway or firewall, denying all their users this functionality. Thus, it is not safe to rely solely on cookies for carrying state information.

Modern web applications tend to use the more secure and sophisticated principle of "server-side sessions", where the state information is stored in a database on the server along with a unique client key. The client can send its key to the server as part of the URL or in a hidden form field, or the key can be stored in a cookie, which the browser returns with its requests. The server uses the unique key to access the corresponding information in the sessions database and thus the illusion of a stateful web can be delivered.

I always avoid using the URL to pass state information to the server, except on special web pages - like search results, where it is useful if such a URL can be bookmarked or emailed. My main reason for not using the URL method is that I believe that long, cryptic URLs are extremely inconvenient to the user. I would also avoid using cookies for essential state information as one cannot assume that the facility is available on the client side (I could use JavaScript to detect cookie support, but that's unwieldy).

I decided to go with server-side sessions. As each CAPTCHA is created, I store the word used, a timestamp (which will be used to "expire" sessions) and a session key in an SQL database on the server. I then use a hidden form field to tie the CAPTCHA and key together.

The user enters the CAPTCHA word and completes the rest of the input form before clicking "submit". The server then takes the session key (which was sent in the hidden field), retrieves the CAPTCHA word from the database and compares it with the word the user has entered. If it matches, the form submission is successful. If it doesn't, the user is given a message to that effect and offered the chance to retry.
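
To make the flow concrete, here is a rough sketch of the create-and-verify cycle. It is not the production code: a plain hash stands in for the SQL session table so the example runs on its own, the names `create_session` and `check_captcha` are invented for illustration, and the key is built with Perl's `rand`, which is fine for a demonstration but not a cryptographically strong source.

```perl
use strict;
use warnings;

my %sessions;   # stand-in for the SQL "session" table: key => { word, created }

# Record a new CAPTCHA word and return the key for the hidden form field.
sub create_session {
    my ($word, $now) = @_;
    my $key = join '', map { sprintf '%08x', int rand 0xFFFFFFFF } 1 .. 4;
    $sessions{$key} = { word => lc $word, created => $now };
    return $key;
}

# Compare the submitted word with the stored one, ignoring case.
# A matching submission destroys the session so it cannot be replayed.
sub check_captcha {
    my ($key, $input, $now, $max_age) = @_;
    my $session = $sessions{ $key // '' } or return 0;
    if ($now - $session->{created} > $max_age) {
        delete $sessions{$key};     # expired: treat as unknown
        return 0;
    }
    return 0 unless lc($input // '') eq $session->{word};
    delete $sessions{$key};
    return 1;
}

my $key = create_session('frog', time);
print check_captcha($key, 'FROG', time, 6 * 3600) ? "accepted\n" : "rejected\n";
```

In this sketch a wrong answer leaves the session in place so the same key can be retried; a stricter variant would discard it and issue a fresh CAPTCHA after every failed attempt.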

Cleaning Up

A new CAPTCHA is created (along with its corresponding session record) every single time a protected form is served. Once a successful submission is made against any given CAPTCHA, then both the image and the corresponding session information in the database are destroyed. These two mechanisms help us defend against some programmatic attacks.

However, this means that if the user visits a protected form but doesn't submit an entry, we are left with an image and database record that are immediately obsolete. It is obvious that this system, left unchecked, is rapidly going to consume resources. The system needs to clean up after itself. Fortunately, this is easy to accomplish.

Let's give each CAPTCHA image a time-to-live (TTL) of 60 seconds - which is long enough to ensure that it is served to even the slowest client. Therefore, we can delete any CAPTCHA image that is older than 60 seconds:

# Delete any CAPTCHAs that are older than 60 seconds...
my $token_dir = '/path/to/temp/image/directory';
opendir (my $dir, $token_dir) or die "Cannot open $token_dir: $!";
    my @tokens = grep /^token_/, readdir($dir);
closedir ($dir);
foreach my $cleaner (@tokens) {
    # Only touch files whose names match the generated pattern exactly
    next unless $cleaner =~ /^(token_\d+_\d+\.gif)$/;
    $cleaner = $1;
    $temp{'age'} = (time - (stat("$token_dir/$cleaner"))[9]);
    if ($temp{'age'} > 60) { unlink "$token_dir/$cleaner"; }
}

We need to maintain the session information longer than the image, for the simple reason that a user might take a considerable time to complete a given form. I decided that 6 hours should be more than sufficient:

# Destroy any sessions that are older than 6 hours...
my $rows_deleted = $dbh->do(qq{DELETE FROM session WHERE timestamp < DATE_SUB(NOW(), INTERVAL 6 HOUR)});

These cleanup tasks are run periodically.

Preventing Duplicate Posts

Each comment is checked against the existing comments for a given page. If a comment is posted in a thread that exactly matches an existing comment, within one hour of the timestamp of the existing comment, then it is rejected. This serves two purposes: it hinders the comment spammer and it protects against the user who clicks the submit button more than once.
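
As an illustration of the rule, the check could be sketched like this. The function name `is_duplicate` and the in-memory list of comments are assumptions made for the example; the real system would query the comments already stored for the page.

```perl
use strict;
use warnings;

# Return true if $text exactly matches a comment already posted in the same
# thread no more than $window seconds before $now.
sub is_duplicate {
    my ($existing, $text, $now, $window) = @_;
    for my $comment (@$existing) {
        next unless $comment->{text} eq $text;
        return 1 if $now - $comment->{posted} <= $window;
    }
    return 0;
}

# A thread with one comment posted two minutes ago.
my @thread = ({ text => 'Nice article!', posted => time - 120 });
print is_duplicate(\@thread, 'Nice article!', time, 3600) ? "rejected\n" : "accepted\n";
```

The comparison is deliberately exact, so a spammer who varies each message slightly will get past it; that is where the other defences come in.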

Preventing the Manual Spammer

The CAPTCHA is a great device for hampering automated posting tools. But some "marketeers" are quite happy to post their messages manually and the CAPTCHA offers no defence in these cases. As the sole purpose of the comment spammer is to draw traffic to a website, we can render their efforts useless by "blacklisting" the URLs they are trying to promote.

All we need to do then is to check all incoming posts. If a blacklisted URL is found within any of the submitted fields, then we reject the entire post.
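
A minimal sketch of such a check follows. The blacklist entries, the field names and the function name `blacklisted_url` are all hypothetical, and a simple case-insensitive substring match is assumed; a real list would be loaded from a file or database table.

```perl
use strict;
use warnings;

# Hypothetical blacklist entries, for illustration only.
my @blacklist = ('spam-pills.example', 'cheap-loans.example/offer');

# Return the first blacklisted string found in any submitted field
# (case-insensitive substring match), or undef if the post is clean.
sub blacklisted_url {
    my (%fields) = @_;
    for my $value (values %fields) {
        next unless defined $value;
        for my $entry (@blacklist) {
            return $entry if index(lc $value, lc $entry) >= 0;
        }
    }
    return;
}

my %post = (name => 'Bob', url => 'http://Spam-Pills.example/', comment => 'Hi');
my $hit = blacklisted_url(%post);
print defined $hit ? "rejected: $hit\n" : "accepted\n";
```

Checking every field rather than just the URL field matters, since spammers are just as likely to paste their links into the comment body or the name.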

Unfortunately the maintenance of the blacklist is a manual task, but the effort is trivial when one considers the benefits.

Accessibility

So why doesn't every website use CAPTCHAs to protect its forms? The biggest problem with the system is that, being a visual gadget, it is completely inaccessible to visually-impaired users. Sadly, there is no easy or obvious way around this.

Fortunately, on the Urban Mainframe at least, there is a solution. The comments system and the forum each have two operating modes: one for registered users and one for non-registered users. Since the registration process is hostile to casual registrations, I can safely assume that our registered users are "friendly". Therefore, registered users (who are logged-in) never see the CAPTCHA device. Thus I have an accessible channel to the comments system for visually-impaired users. All I need to do is add a few guidance notes to the CAPTCHA (or to an accessibility statement) to that effect and that should ensure that accessibility isn't compromised.

Conclusion

So I now have a veritable arsenal of weapons to defend against the growing problem of comment spam. Hopefully, the principles I have described here will be useful to others too.

If there are any obvious flaws in these systems, or issues that I should consider and which are not discussed here, then please leave a comment and let me know.

At the time of writing, these new features don't seem to hinder genuine posters significantly but, again, I welcome feedback both positive and negative.

Afterword

I seem to have been on a crusade against spam in recent weeks.

Update (26-Mar-2004): I have changed the algorithm that chooses the random word following the advice kindly offered by Dan. The system no longer loads the whole word dictionary into RAM. Instead, it randomly selects a word directly from the file, thus improving performance marginally. Thanks very much for the tip, Dan.

Update (7-Feb-2005): I might have to tweak the CAPTCHA soon, or even replace it with something entirely new. CAPTCHA-decoding software is getting better all the time.

See Also

Editing Blog Comments