URBAN
Mainframe

User Comments

(for: Defending Against Comment Spam)
1 | Posted by: Marie S. (Registered User) | ~ 2 years, 9 months ago |

Great - some more code. Cheers Jon!

Now I’m going to log out and try the captcha (hope I spelt that correctly).

2 | Posted by: Marie S. (Guest) | ~ 2 years, 9 months ago |

Looks good so far. By the way, the comments page no longer validates as XHTML since you added the captcha. You might want to look into that.

3 | Posted by: Marie S. (Guest) | ~ 2 years, 9 months ago |

Hey it works well. That’s pretty damn cool mate. Good work. This will be really handy on my guestbook program - if I ever finish it.
Thanks again.

4 | Posted by: ==awesum== (Guest) | ~ 2 years, 9 months ago |

This is good, but ImageMagick is tricky to install. If it’s already installed, though, then your idea is very clever. I have seen this on other websites too; I think it is on Yahoo.

5 | Posted by: DarkBlue (Registered User) | ~ 2 years, 9 months ago |

Marie: Glad you like it; I’m delighted that you have found a use for it. Thanks for bringing that validation issue to my attention. I’ll sort that out in the next day or two.

==awesum==: I have never had any problems installing ImageMagick but I am aware that some people find it troublesome.

I don’t install it directly though. I use Perl’s CPAN (http://www.cpan.org/) to install the PerlMagick package and this package installs ImageMagick itself.

So, if you need ImageMagick but are having trouble installing it, use CPAN from the *nix shell:

perl -MCPAN -e shell
install Image::Magick

This will fetch all the relevant files, “make”, “make test” and “install” everything you need.

6 | Posted by: Jennifer Grucza (Guest) | ~ 2 years, 9 months ago |

Hey, great article - you really covered all the bases! I kept thinking about the accessibility issue for blind users, and right there at the end you’ve got it addressed. Great job.

I don’t know if it’s possible given the software you’re using, but you might want to try playing around with streaming the image directly to the browser instead of temporarily saving it to the filesystem. I’m doing that in the charting portion of my company’s software using the jCharts charting package, which nicely provides a method for doing so.

7 | Posted by: DarkBlue (Registered User) | ~ 2 years, 9 months ago |

Jennifer, you’re right - streaming the image is exactly what I need to do. That would make the system more secure and would eliminate one of the clean-up tasks.

This was my original plan and I spent hours and hours trying to “make it so”, but I was ultimately unsuccessful and so I chose to write the image to the file-system and access it from there.

If anyone can offer me any information on how to stream an image I would be eternally grateful…
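One possible shape for this, sketched in Python rather than the Perl used on this site: build the image bytes in memory and write them straight into the response with an `image/png` content type, so nothing is saved to the file-system. Everything named here is hypothetical; `make_png` just produces a plain grey placeholder and stands in for whatever actually renders the Captcha text.

```python
import io
import struct
import zlib


def make_png(width, height, grey=200):
    """Build a small 8-bit greyscale PNG entirely in memory (placeholder image)."""
    def chunk(kind, data):
        body = kind + data
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, colour type 0 (greyscale), no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by one byte per pixel.
    rows = b"".join(b"\x00" + bytes([grey]) * width for _ in range(height))
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", header)
            + chunk(b"IDAT", zlib.compress(rows))
            + chunk(b"IEND", b""))


def stream_image(png_bytes, out):
    """Write a CGI-style response: headers, a blank line, then the raw image."""
    out.write(b"Content-Type: image/png\r\n")
    out.write(b"Content-Length: " + str(len(png_bytes)).encode("ascii") + b"\r\n")
    out.write(b"Cache-Control: no-store\r\n\r\n")
    out.write(png_bytes)


# In a real CGI script `out` would be sys.stdout.buffer; BytesIO stands in here.
response = io.BytesIO()
stream_image(make_png(120, 40), response)
```

The page would then point its `<img>` tag at the script's URL instead of at a saved file, which is also what removes the clean-up task.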

8 | Posted by: Dan (Guest) | ~ 2 years, 9 months ago |

Nicely done. Don’t know if you want a hint for your perl code though - no need to read in the entire dictionary just to pick a random word - see ‘perldoc -q “random line”’.
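The single-pass idea behind that hint is, as far as I know, what is usually called reservoir sampling: keep one candidate line and replace it with the n-th line with probability 1/n. A minimal sketch in Python (not the site's Perl); the function name and the four-word stand-in dictionary are made up for illustration.

```python
import io
import random


def random_line(lines, rng=random):
    """Pick one line uniformly at random in a single pass over `lines`."""
    chosen = None
    for count, line in enumerate(lines, start=1):
        # Replace the current choice with probability 1/count.
        if rng.randrange(count) == 0:
            chosen = line
    return None if chosen is None else chosen.rstrip("\n")


# A small in-memory stand-in for the dictionary file.
words = io.StringIO("apple\nbridge\ncactus\ndolphin\n")
word = random_line(words)
```

Only one line is held in memory at a time, so the size of the dictionary file no longer matters.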

9 | Posted by: DarkBlue (Registered User) | ~ 2 years, 9 months ago |

That’s worth knowing Dan. Thanks for the hint, I’ll make that change right away.

Cheers.

10 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

As a developer I understand the comment spam problem and everything that goes with it. Despite this I still find your attitude to disabled people disappointing.

“Thus I have an accessible channel to the comments system for visually-impaired users. All I need to do is add a few guidance notes to the CAPTCHA (or to an accessibility statement) to that effect and that should ensure that accessibility isn’t compromised.”

The very reason people suffer from disablement is because things are made harder for them to do.

Yes, it is possible for a disabled person to post comments - but you are actively disabling them further by making them jump through unnecessary hoops.

In either case you are also cutting out anyone who cannot or chooses not to display images.

I do not believe that any visual CAPTCHA system is satisfactory. Any system must be fully usable from Lynx, in my opinion.

11 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

Noah you are absolutely right. I am not claiming that my system is perfect and I acknowledge that some users will be terribly inconvenienced by the Captcha.

I wrote in Defending Against Comment Spam, “I have never suffered [comment] spam via the Urban Mainframe.”

Why then, with no spam problem, did I implement the Captcha? There were two reasons:

  • As a programming exercise
  • To test its practicality

In practical terms, yours is the first serious criticism I have received of the system. Now, to be perfectly honest, I don’t know how significant that is. I don’t know if I have any disabled readers. I don’t know if the Captcha is preventing non-disabled readers from commenting. I have no metrics, no empirical data.

I welcome feedback on all aspects of this website: functionality, design, implementation, UI, architecture, navigation, content, etc. Without that feedback, I cannot make any sound decisions as to what works and what doesn’t.

From a functional perspective, I designed the Captcha to be switchable from the start. Thus, if I receive enough feedback indicating that it is a problem, I can simply switch it off.

“you are actively disabling them further by making them jump through unnecessary hoops”

I don’t like this any more than you do Noah. But this is a difficult call for me - I have little enough time to invest in the Urban Mainframe as it is, without having to moderate comments.

I wish there were a perfect, secure, accessible comments handling mechanism - but I haven’t found one yet.

What do you suggest?

12 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

Thank you for taking my comments seriously.

The problem of comment spam and countermeasures has been plaguing me recently and I simply can’t stop thinking about it.

There are many, many methods people have come up with, including visual CAPTCHAs through to a whole registration process.

These all have their drawbacks and I have been spending a lot of time thinking of a solution that would enable a casual surfer to easily post a comment without any inconvenience and yet stop robots in their tracks.

Another requirement in my thought experiment was that the system was fully usable under the Lynx browser.

In addition to all of this I realised that although a lot of my solutions would work as standalone solutions on one of my own personal blogs, I needed to think of something which would scale up, so to speak. Something that would work as an MT plugin, or where a vast majority of the blogging community adopted the method. Please note that this would also have to work perfectly well on just one blog - many methods I have seen proposed would require the entire weblogging community to participate, which in effect renders them useless.

Then today, on a boring train journey home, it hit me what the solution was!

I realised what the actual problem with implementing this system was.

A guy named Alan Turing wrote a lot about Artificial Intelligence (A.I.) and devised a method to test such machine intelligence. He called his test the Turing Test. If you are not familiar with the test I suggest you do some googling as it is too large a topic to go into here.

Basically, he stated that machines were intelligent when they could trick a human - via some sort of remote computer connection - into thinking that they were also human.

Now what lies at the heart of the comment spam solution problem is also in essence a variation of the Turing Test. We are challenging a human to prove their “human-ness” so to speak.

That in itself is not a hard experiment. Imagine yourself in a chat room talking to other people. After a few minutes talking to any one of the other “people” in the chat room I am positive you could determine which, if any, were just cleverly programmed “bots”.

However, the problem in this case is: Getting a computer to judge a Turing Test carried out on a human.

This is in essence the reverse of the problem Turing devised.

While thinking about this it occurred to me that as a culture our programming knowledge is not yet sufficient to build a computer that will reliably pass the Turing Test - so how could we expect to program one that could conduct it?

So what we need is a way for a site’s author to personally conduct his own Turing Test on all people who want to comment…

Stupid, ludicrous, impossible I hear you say! Well perhaps not…

Consider this: Every time the site’s author posts a new entry to the site, the content management system or blogging software asks the author to specify a question. The question would be a very simple question that perhaps the youngest of humans could answer. Some examples include:

 * What colour do you make by mixing red and blue?
 * Can birds fly?
 * How many words are in this sentence?
 * What is three plus four?
 * In this post do I talk about comment spam or quantum physics?

The system then asks the author for a set of accepted answers such as:

 * purple, magenta, mauve
 * yes, sometimes, definitely
 * 7, seven
 * 7, seven
 * comment spam

Now, once this simple process has been completed the entry is posted to the site. When a user tries to comment they are challenged, on the same page, with a simple and unique question to answer. Provided the question is simple enough, the user should have no trouble providing a typed text answer.

The system would then check the answer against the list of accepted answers accounting for spelling mistakes etc. If the user did not pass the test the comment is simply held back for moderation.
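The check described above might look something like the following sketch, assuming Python and its standard `difflib` module for the “accounting for spelling mistakes” part. The function names and the 0.75 similarity cutoff are arbitrary choices for illustration, not part of any existing comment system.

```python
import difflib


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def answer_passes(given, accepted, cutoff=0.75):
    """Return True if `given` is close enough to any accepted answer."""
    candidate = normalise(given)
    options = [normalise(answer) for answer in accepted]
    if candidate in options:
        return True
    # Tolerate small spelling slips; the cutoff value is an assumption.
    return bool(difflib.get_close_matches(candidate, options, n=1, cutoff=cutoff))


def handle_comment(given, accepted):
    """Publish on a pass; otherwise hold the comment back for moderation."""
    return "publish" if answer_passes(given, accepted) else "hold for moderation"


accepted_answers = ["purple", "magenta", "mauve"]
result = handle_comment("purpel", accepted_answers)
```

A failed answer never rejects the comment outright; it only routes it to the moderation queue, which keeps the test usable from a text-only browser.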

The site’s author can then view all comments awaiting moderation. If the comment is spam, the author could then simply click a button to blacklist the URLs that are contained within the spam.

This system would not work if there was a long list of preset questions because spammers would get hold of the source code and configure and adapt their bots accordingly.

Also, on top of this, a system could be implemented before a user is even allowed to comment.

Firstly, make sure the referrer of the comment’s page is what it should be. Yes, I know this can be hacked, but it’s worthwhile anyway.

Secondly, make sure the user is not trying to post a comment only seconds after initially loading the post. Any normal user would wait at least a few seconds between requests; a spam bot may not.

Thirdly, make sure that no one can post a duplicate comment within an hour of another one. If there are successive attempts keep pushing back the time limit until a duplicate can be posted.
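One reading of the third check, as a rough Python sketch: fingerprint each comment body, refuse a repeat inside the waiting period, and double that period on every refused attempt. The dictionary stands in for a database table, and the doubling rule is only my interpretation of “pushing back the time limit”.

```python
import hashlib

BASE_WINDOW = 3600  # one hour, in seconds

# fingerprint -> (time of last attempt, current window); stand-in for a table
_seen = {}


def fingerprint(text):
    """Hash the normalised comment body so duplicates share one key."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def allow_comment(text, now):
    """Return True if the comment may be posted at time `now` (in seconds)."""
    key = fingerprint(text)
    if key in _seen:
        last_attempt, window = _seen[key]
        if now - last_attempt < window:
            # Repeat attempt inside the window: refuse and double the wait.
            _seen[key] = (now, window * 2)
            return False
    _seen[key] = (now, BASE_WINDOW)
    return True
```

Passing the current time in as `now` keeps the sketch easy to test; a live handler would use the server clock.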

So there you have it… it’s only a rough outline of my ideas.

Let me know what you think.

13 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

“Thank you for taking my comments seriously.”

Is there any other way?

“In addition to all of this I realised that although a lot of my solutions would work as standalone solutions on one of my own personal blogs, I needed to think of something which would scale up, so to speak. Something that would work as an MT plugin, or where a vast majority of the blogging community adopted the method. Please note that this would also have to work perfectly well on just one blog - many methods I have seen proposed would require the entire weblogging community to participate, which in effect renders them useless.”

I agree. Solutions that rely on registration on a master-server don’t really appeal to me either. If my website is being targeted by the spammers then I would prefer to address the problem locally.

“He called his test the Turing Test. If you are not familiar with the test I suggest you do some googling as it is too large a topic to go into here.”

I’m familiar with it. Have you met Gina? :-)

“We are challenging a human to prove their ‘human-ness’ so to speak.”

Which is exactly what we are doing with the Captcha of course. However, I do appreciate that my implementation is a visual one and that visitors with vision impairments are likely to be unable to proceed past the device.

“Consider this: Every time the site’s author posts a new entry to the site, the content management system or blogging software asks the author to specify a question.”

While it’s a great idea Noah, it does introduce problems of its own:

  • How do we deal with multiple languages?
  • How do we handle misspelt answers? “Accounting for spelling mistakes” is easier said than done.
  • Not autonomous - the website operator would have to maintain a database of questions and answers (possibly with a list of variations for languages and spelling variations) and rotate them periodically to prevent pattern recognition.

“If the user did not pass the test the comment is simply held back for moderation.”

It’s amazing how sometimes the simplest things can have the biggest impact. Noah, you’ve hit on something right here.

I could retain my Captcha system if I were to make it non-compulsory. Consider the following process:

  1. User enters his comment into the appropriate form (on which there is a non-compulsory Captcha)
  2. If the user enters the Captcha text and that text matches the system’s record, then the comment is accepted
  3. If the user submits his input without entering the Captcha text, then his comment is still accepted but it is not immediately published - it is held back for moderation
  4. If the user enters the Captcha text but it doesn’t match the system’s record, then he is invited to try again until he either achieves a match or leaves the comment for the moderator

I think this workflow enjoys the benefit of the Captcha without impacting usability/accessibility. What do you think?
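The four steps above reduce to a small routing decision. A minimal sketch in Python, with hypothetical names throughout; comparing the text case-insensitively is my assumption, not something the steps specify.

```python
from enum import Enum


class Outcome(Enum):
    PUBLISH = "publish immediately"
    MODERATE = "hold for moderation"
    RETRY = "invite the user to try again"


def route_comment(entered_text, expected_text):
    """Decide what happens to a comment under a non-compulsory Captcha."""
    entered = (entered_text or "").strip()
    if not entered:
        # Step 3: Captcha left blank, so accept but hold for the moderator.
        return Outcome.MODERATE
    if entered.lower() == expected_text.lower():
        # Step 2: the text matches the system's record.
        return Outcome.PUBLISH
    # Step 4: wrong text; the user may retry or resubmit with it left blank.
    return Outcome.RETRY
```

A user who cannot read the image simply leaves the field empty and lands in the moderation queue rather than being blocked.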

“make sure the referrer of the comment’s page is what it should be”

I concur. Indeed, any web-form processor should check for a valid referrer as a first line of defence.
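A referrer check can be as small as the sketch below (Python, for illustration only). The host `www.example.com` and the path `/comments` are placeholders, and since the header can be forged or stripped by privacy tools, this is better treated as one signal among several than as a gate.

```python
from urllib.parse import urlsplit


def referrer_ok(referer_header, expected_host, expected_path):
    """True if the Referer header points at the page that hosts the form."""
    if not referer_header:
        return False
    parts = urlsplit(referer_header)
    return (parts.scheme in ("http", "https")
            and parts.hostname == expected_host.lower()
            and parts.path == expected_path)


ok = referrer_ok("http://www.example.com/comments?id=42",
                 "www.example.com", "/comments")
```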

“make sure the user is not trying to post a comment only seconds after initially loading the post”

Tricky to achieve neatly on a stateless web!

“make sure that no one can post a duplicate comment within an hour of another one”

This is in fact a feature of the comment handler on this website.

Thanks for your interesting ideas and suggestions Noah. Believe me, I am taking them seriously. You will probably see a few changes to the comments handler here within the next few weeks.

14 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

I keep getting:

Database Error

I have encountered the following error while performing database operations. Consult with your System Administrator or ISP.

Return to the previous page, or use the “Back” button.

15 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

This is a test:

`’“”

When these characters are in a post they are causing the comments handler to barf - and these characters will often appear since I use SmartyPants to generate them in posts that are copied-and-pasted here!

Hopefully this is fixed now…

16 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

“I’m familiar with it. Have you met Gina? :-)”

Yes… :)

“While it’s a great idea Noah, it does introduce problems of its own:

  • How do we deal with multiple languages?
  • How do we handle misspelt answers? ‘Accounting for spelling mistakes’ is easier said than done.
  • Not autonomous - the website operator would have to maintain a database of questions and answers (possibly with a list of variations for languages and spelling variations) and rotate them periodically to prevent pattern recognition.”

  • How do you deal with multiple languages for your posts? Are not most posts blogosphere-wide mono-language? Where is the problem in thinking that any user reading your post will be able to comprehend a question in the same language?
  • PHP’s or MySQL’s SOUNDEX function would be a start; there are many more inbuilt algorithms that can match misspelt words.
  • I think you might have misunderstood me. Each question is unique to that specific post and is written in the same language. Whatever system handles the posting can also easily maintain a list of questions and answers that pertain to specific posts - this is no mean feat by anybody’s standards. Oh, and BTW about the autonomous nature - posts are not autonomous, you write them. Is it that much effort to specify a simple question/answer to each one as well?
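To show why a Soundex-style code helps with misspelt answers, here is a hand-written Python sketch of the classic American Soundex rules. It is an illustration only, not the PHP or MySQL built-in, and edge cases may differ from those implementations.

```python
# Letter groups that share a Soundex digit.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the four-character Soundex code for `word` ('' if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")  # vowels have no code and reset `last`
        if code and code != last:
            result += code
            if len(result) == 4:
                break
        last = code
    return result.ljust(4, "0")


# A misspelt answer passes if its code matches that of an accepted answer.
matches = soundex("purpel") == soundex("purple")
```

Because the code keeps the first letter and mostly ignores vowels, it forgives swapped or dropped vowels but not a wrong initial letter.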

“It’s amazing how sometimes the simplest things can have the biggest impact. Noah, you’ve hit on something right here.”

heh :)

“I think this workflow enjoys the benefit of the Captcha without impacting usability/accessibility. What do you think?”

Although not ideal, I do agree that this would be better than at present.

“Tricky to achieve neatly on a stateless web!”

PHP sessions would solve this, though I am not qualified to know if this is possible in other server-side languages.

17 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

“How do you deal with multiple languages for your posts? Are not most posts blogosphere-wide mono-language? Where is the problem in thinking that any user reading your post will be able to comprehend a question in the same language?”

I don’t publish any content (at the time of writing) in any language other than English. However, the CMS I use is a commercial product, used by customers who do publish in multiple languages. Any system I implement has to be able to be used in that environment.

“PHP’s or MySQL’s SOUNDEX function would be a start; there are many more inbuilt algorithms that can match misspelt words”

That is major-league cool stuff! Thanks for pointing this out Noah. I’m going to have to investigate further since I could really make use of the “soundex” function.

“I think you might have misunderstood me. Each question is unique to that specific post and is written in the same language. Whatever system handles the posting can also easily maintain a list of questions and answers that pertain to specific posts…”

I did misunderstand. I thought you were talking about a site-wide Q/A system rather than one that was post-specific.

“Oh, and BTW about the autonomous nature - posts are not autonomous, you write them.”

That’s true. But my “human test” (Captcha) is autonomous, whereas a Q/A system is not. That is what I meant.

“I think this workflow enjoys the benefit of the Captcha without impacting usability/accessibility. What do you think?”

“Although not ideal, I do agree that this would be better than at present.”

Me too. I will investigate further…

“PHP sessions would solve this, though I am not qualified to know if this is possible in other server-side languages.”

I don’t use PHP. Nor do many other sites. My backend application is written in Perl and C. Whilst sessions are possible with this combination (and relatively easy to implement) the stateful session, to my knowledge, is not.

However, that probably wouldn’t matter if the other defences are properly implemented.

18 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

“I don’t publish any content (at the time of writing) in any language other than English. However, the CMS I use is a commercial product, used by customers who do publish in multiple languages. Any system I implement has to be able to be used in that environment.”

From a background of CMS development I can tell you that if a CMS can handle posts in multiple languages - it could handle questions/answers in multiple languages. Depending on the licence of the CMS either a hack/module needs developing or a request to the publishers. I don’t know much about MT as I’ve never used it, but I have heard a lot about the vast array of plug-ins you can get. Surely this idea could be implemented in MT via a plug-in?

“That is major-league cool stuff! Thanks for pointing this out Noah. I’m going to have to investigate further since I could really make use of the ‘soundex’ function.”

That’s just the surface! :)

Check out the Metaphone algorithm, developed by Lawrence Philips. This algorithm is available in PHP as “string metaphone ( string str)”

Also, you might want to look at the Levenshtein algorithm. This one is great. It calculates the “Levenshtein-Distance” between two strings.
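For readers who have not met it, the Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. A minimal Python sketch of the standard dynamic-programming version follows; PHP provides this as a built-in, and the code here is only an illustration of the idea.

```python
def levenshtein(a, b):
    """Edit distance between strings `a` and `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


distance = levenshtein("seven", "sevn")
```

An answer could then be accepted when its distance to an accepted answer is at most 1 or 2, with the threshold chosen to suit the length of the expected word.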

“I don’t use PHP. Nor do many other sites.”

Hehe… I had to pick you up on this one! While Perl (which is rather long in the tooth), a la MT, is in widespread use amongst bloggers, you will find that PHP is in use by over 16,251,453 domains and, with a 52.65% market share last year, it’s hard to question the dominance of our favorite language. :)

See:
http://www.php.net/usage.php
http://www.phpfreaks.com/articles/172/0.php
http://www.sitepoint.com/blog-post-view.php?id=170246

“My backend application is written in Perl and C. Whilst sessions are possible with this combination (and relatively easy to implement) the stateful session, to my knowledge, is not.”

http://www.w3j.com/6/s3.stein.html

There doesn’t appear to be much info on stateful sessions in Perl but I am sure it is possible WITHOUT COOKIES. There is no point using cookies as spam bots will just ignore them. Instead, generate a unique id for each post request. Put this id in a hidden form element. When the user or spam bot submits the form you can reference the request time against the time the unique id was set - simple.

19 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

“I can tell you that if a CMS can handle posts in multiple languages - it could handle questions/answers in multiple languages.”

I know. I don’t know what I was thinking of. I blame the lack of sleep! :-)

“Surely this idea could be implemented in MT via a plug-in?”

I’m sure it could, if it hasn’t already. I’m only familiar with the Markdown and SmartyPants plug-ins. I’ve never deployed MT, so I have no knowledge of what else is out there.

“Check out the Metaphone algorithm, developed by Lawrence Philips. This algorithm is available in PHP as ‘string metaphone ( string str)’

“Also, you might want to look at the Levenshtein algorithm. This one is great. It calculates the ‘Levenshtein-Distance’ between two strings.”

Wow, this is great stuff. Thanks for these pointers Noah. I’m not going to sleep for weeks now! ;-)

“I had to pick you up on this one!”

There’s no way I’m going to get involved in a discussion about languages. In my opinion, it’s a toolkit and one simply chooses whichever tool is required for the job in hand.

I like Perl, it’s that simple.

“There doesn’t appear to be much info on stateful sessions in Perl but I am sure it is possible WITHOUT COOKIES.”

I’m not sure it is. Obviously sessions are possible with Perl, but maintaining a stateful connection to a web-server (or the illusion of the same) is not something that I’ve ever come across in the Perl world. To be fair, I’ve never had this as a requirement anyway.

20 | Posted by: Noah (Guest) | ~ 2 years, 5 months ago |

“Wow, this is great stuff. Thanks for these pointers Noah. I’m not going to sleep for weeks now! ;-)”

Glad to know I helped! :)

“There’s no way I’m going to get involved in a discussion about languages. In my opinion, it’s a toolkit and one simply chooses whichever tool is required for the job in hand.

“I like Perl, it’s that simple.”

Oooh… My bad. Now that was a flame war just waiting to happen! heh :) Yeah, you’re absolutely right though, just different tool kits for the same job.

Oh! About sessions: You wouldn’t need a session if you simply stored a temporary id in a table against a time stamp. When the user submits the comment form, with the id hidden in a form element, the script compares timestamps. If the form does not have an id, or the id is incorrect (i.e. non-existent) then the comment is treated as failed and held back for moderation. The only time this would ever happen is if the form had been modified by the end user… i.e. spam bots.
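A rough sketch of that id-and-timestamp idea, in Python for illustration: hand out a random id when the form is served, remember when, and compare on submission. The dictionary stands in for the database table, and the five-second minimum is an assumed figure, not one taken from this discussion.

```python
import secrets

MIN_SECONDS = 5  # assumed minimum time a human needs to write a comment

# id -> time the form was served; stand-in for the database table
_issued = {}


def issue_token(now):
    """Create an id to embed in a hidden form field when the page is served."""
    token = secrets.token_hex(16)
    _issued[token] = now
    return token


def check_submission(token, now):
    """Return 'publish' or 'moderate' for a submitted form."""
    served_at = _issued.pop(token, None)  # each id can be used only once
    if served_at is None:
        return "moderate"  # missing or unknown id
    if now - served_at < MIN_SECONDS:
        return "moderate"  # submitted implausibly quickly
    return "publish"
```

Ids that are issued but never submitted stay in the table, so some periodic clean-up of old rows would still be needed.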

21 | Posted by: DarkBlue (Registered User) | ~ 2 years, 5 months ago |

“Now that was a flame war just waiting to happen!”

Skillfully defused by yours truly! :-D

“You wouldn’t need a session if you simply stored a temporary id in a table against a time stamp…”

True. I use a similar mechanism for tracking the Captchas themselves. The only real drawback here is that there is some clean-up overhead. I appreciate that this is a small price to pay for the very obvious benefits offered.

22 | Posted by: Nick Clark (Guest) | ~ 2 years, 1 month ago |

Thank you for posting the dictionary file you used. The only problem we found was that the words of four characters or fewer include “sex,” “shit,” etc. This becomes an issue (that can easily be fixed) in family-friendly uses.

-Nick Clark

23 | Posted by: DarkBlue (Registered User) | ~ 2 years, 1 month ago |

Thanks for advising me of this problem Nick. I’ll scan through the dictionary when I get a minute to make it more “family-friendly”.

24 | Posted by: Electrician (Guest) | ~ 1 year, 10 months ago |

I just read on Wiki that there is a way to circumvent and defeat captchas by fooling humans into doing the reverse Turing work for you on a different website, disguised as something else, pulling your image and presenting it to the user there. With the accessibility issues already widely reported, I really don’t know whether captchas are the right panacea for the problem.

25 | Posted by: Noah Slater (Guest) | ~ 1 year, 10 months ago |

You have a point - but that method is mainly used to circumvent CAPTCHAs used by the likes of Yahoo and Hotmail so the spammers can get thousands of free email accounts.

When it comes to blogs, I doubt any such system (which must be fairly complex) would ever be used.

The only reason anyone would want to get past a CAPTCHA on a blog would be to crop-dust thousands upon thousands of blogs with spam. For this reason, when spammers hit a few CAPTCHAs here and there, I cannot imagine them having the motivation to persist. It would simply be uneconomical for them.

Putting aside the accessibility issues, blog comment forms seem quite a reasonable place for CAPTCHAs… until a generally more elegant solution is found.

26 | Posted by: DarkBlue (Registered User) | ~ 1 year, 10 months ago |

Personally I hate having the Captcha on my comments handler. I believe it’s a real barrier to some users. I also know that some Urban Mainframe readers won’t post comments here, because they don’t approve of the Captcha system.

Captchas aren’t perfect, I agree. But what is?

I am prepared to accept the inconveniences of the Captcha because I simply don’t want to have to handle comment spam. Since I introduced the system, I have had only 2 spam comments - something of a record for a weblog.

I have alternatives available. I could switch the Captcha off today if any of the alternatives were any better:

  • Restrict comments to registered users only - I’d hate to do this, I think this is an even bigger barrier than the Captcha. No-one’s going to register just to post a comment, hence no-one will comment.
  • Moderated comments - Means I’d then have to deal with spam. But the biggest problem with moderated comments centres on the delay that moderation introduces. If I post a comment on a website, I want to see that comment right away. Comment threads lose their immediacy when they are moderated.
  • No comments - I could disable comments completely. This is obviously the least attractive option.

I hate having to justify myself like this. However, this is my weblog. I have to maintain it. I don’t have time to deal with comment spam, so the Captcha system is going to remain in place, at least until I find a better alternative.

27 | Posted by: Claude (Guest) | ~ 1 year, 5 months ago |

I wish a Captcha type system were not necessary. Unfortunately it is. Recently I was hit by several thousand comment spam postings from the same outfit. I did a little research and found over 100,000 blogs, forums, and other comment type pages clogged with their garbage.

I have many websites for which I’m responsible.

On those sites for which the comments were not necessary, I used option 3. “* No comments - I could disable comments completely.”

On those sites for which the comments were a necessary part of the website, I monitor the comments by deleting errant postings after the fact. This worked until recently.

On one site that was prone to abuse, I monitored the comments pre-postings. This has the unsatisfying delay you mentioned.

I tried restricting to registered users. Unfortunately, the new breed of content spammers have register bots (or humans) that pave the way for spamming by pre-registering. I believe the tactic is to massively spam then run, hoping that the job of deleting the spam comments is so onerous that the comments will be left. Judging by 100,000 pages surviving long enough to be googled I suspect this strategy works for them.

So that leaves me with either turning off all my forums or using a Captcha system. Which brings me to your page.

Thanks for posting the Perl you are using. This cuts some development time from my end. Much appreciated.
