Anybody who develops or maintains blog software is likely to be all too familiar with the problem of comment spamming.
I recently became the victim of spamming through a ‘contact’ form. The results of the contact form are emailed privately to a member of the site and are not displayed on the website. However, this didn’t deter the spammer, who was using an automated bot to send hundreds of messages.
The CAPTCHA is a method for preventing automated form spamming which requires the user to complete a task that a computer would have difficulty doing – usually recognising letters or shapes in an image. However, not only does the CAPTCHA suffer from some relatively obvious flaws, but it also impedes usability and accessibility, which will reduce the number of responses you get. Another flaw of the CAPTCHA is that it implies that you can trust human users while you cannot trust computers – a flawed assumption, given the number of human users with a lot of time on their hands compared to the number of people clearing up their spam. It breaks the ‘don’t trust any user input’ principle.
Like Bayesian spam filtering for email, filtering based on content is a more direct solution. It deals with the actual problem – the fact that the content is spam – and doesn’t result in an accessibility or usability problem for the end user.
To solve my own comment spam problem I implemented a very simple filtering solution, simpler than (and not nearly as smart as) a Bayesian spam filtering system where each word has a score and a rank. The spam affecting me in particular was all related to online gambling, so I drew up an array of words that should trigger the filter:
$badwords = array(
    'poker', 'online-poker', 'onlinepoker', 'holdem', 'casinos',
    'online-casinos', 'casino', 'online-casino', 'baccarat', 'craps',
    'blackjack', 'slots', 'roulette', 'keno', 'wsop'
);
If anybody submits any of these words to my contact form, their comment will be read by nobody (although I ran a test for a few weeks to look for false positives).
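As a rough illustration of how such a word list might be applied, here is a minimal sketch. The function name `contains_bad_word` and the tokenising rule are my own assumptions, not the code that runs on the site, and the example uses a shortened list. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original form handler): returns true
// when the message contains any word from the block list as a whole token.
function contains_bad_word(string $message, array $badwords): bool
{
    // Lowercase the message, then split on anything that is not a letter,
    // digit or hyphen, so that 'online-poker' stays a single token.
    $tokens = preg_split('/[^a-z0-9-]+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);

    // The message is flagged if at least one token is on the list.
    return count(array_intersect($tokens, $badwords)) > 0;
}

// Shortened list, for illustration only.
$badwords = array('poker', 'online-poker', 'casino', 'roulette');

var_dump(contains_bad_word('Play ONLINE-POKER now!', $badwords));            // bool(true)
var_dump(contains_bad_word('I liked the post on pokerfaced cats', $badwords)); // bool(false)
```

Matching whole tokens rather than substrings keeps a word like ‘pokerfaced’ from tripping the filter, at the cost of needing each hyphenated or run-together variant in the list.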
This automated form spamming is (currently) easy to block – because it is automated and it is all sent by one bot, it doesn’t tend to vary much. However, as the automated form spamming problem grows and more people start setting up scripts to spam my forms, I foresee that this solution will start to break down. I’d have to start blocking more and more words, to the point where there would be far too many false positives.
Bayesian filtering calculates the probability that some content is spam, using an algorithm that assigns each word a spam probability and an importance rank. It is very effective at blocking spam, even though it is a little less effective than when it was first introduced (spammers are learning to vary the spelling of their words and add fluff words).
This article on dealing with comment spam in WordPress advocates a few methods for stopping comment spamming. For instance, it advocates rule-based filtering, such as counting the hyperlinks in the content, matching ‘spam words’ (just like the solution I implemented) and filtering based on IP addresses, along with many other hacks. It’s a good summary of the different ways to prevent comment spamming, and thus automated form spamming.
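To make the idea concrete, here is a minimal sketch of one common way to combine per-word probabilities into a single score. Everything in it is an assumption for illustration: the probabilities are invented rather than learned from real messages, the function name `spam_probability` is hypothetical, and ‘importance’ is taken to mean distance from a neutral 0.5. It assumes PHP 7 or later.

```php
<?php
// Per-word spam probabilities. These numbers are invented for illustration;
// a real filter would learn them from messages already marked spam or not.
// Each value must be strictly between 0 and 1.
$wordSpamProbability = array(
    'poker'   => 0.99,
    'casino'  => 0.98,
    'free'    => 0.80,
    'article' => 0.10,
    'thanks'  => 0.05,
);

// Hypothetical helper: combines the probabilities of the most "interesting"
// known words (those furthest from 0.5) into one score between 0 and 1.
function spam_probability(string $message, array $probabilities, int $maxWords = 15): float
{
    $tokens = preg_split('/[^a-z0-9-]+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);

    // Collect the probability of every distinct word we know about.
    $found = array();
    foreach (array_unique($tokens) as $token) {
        if (isset($probabilities[$token])) {
            $found[] = $probabilities[$token];
        }
    }
    if (count($found) === 0) {
        return 0.5; // no known words, so no evidence either way
    }

    // Rank by distance from 0.5 and keep only the strongest signals.
    usort($found, function ($a, $b) {
        return abs($b - 0.5) <=> abs($a - 0.5);
    });
    $found = array_slice($found, 0, $maxWords);

    // Combine: product of p over (product of p + product of (1 - p)).
    $spam = 1.0;
    $ham  = 1.0;
    foreach ($found as $p) {
        $spam *= $p;
        $ham  *= (1.0 - $p);
    }
    return $spam / ($spam + $ham);
}

echo spam_probability('Free poker at our casino', $wordSpamProbability), "\n"; // close to 1
echo spam_probability('Thanks for the article', $wordSpamProbability), "\n";   // close to 0
```

Unlike a fixed word list, a score like this lets a few innocent words outweigh one suspicious one, which is what keeps false positives down.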
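The hyperlink rule is simple enough to sketch. This is my own rough illustration rather than code from that article: the function names and the threshold of two links are assumptions, and only `http://` and `https://` URLs are counted. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: counts http:// and https:// URLs in a message.
function count_links(string $message): int
{
    // preg_match_all() returns the number of matches (or false on error,
    // which the cast turns into 0).
    return (int) preg_match_all('~https?://~i', $message);
}

// Hypothetical rule: flag a message with more links than the threshold.
// The default of 2 is an arbitrary choice for this sketch.
function too_many_links(string $message, int $maxLinks = 2): bool
{
    return count_links($message) > $maxLinks;
}

var_dump(too_many_links('See http://a.example for details'));                      // bool(false)
var_dump(too_many_links('http://a.example http://b.example HTTPS://c.example'));   // bool(true)
```

A rule like this works best alongside the word filter rather than instead of it, since a genuine message may well contain a link or two.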
More and more people are turning off comment submission or other features in frustration at the amount of spam they have to deal with.
Is the growing popularity of comment spamming and other form spamming (including trackback spamming and contact form spamming) simply pointing out our own naivety in letting the public post comments and feedback in the first place? Allowing your visitors to contact somebody or submit a comment adds communication and community to a website, at the expense of increasing the opportunities to spam you. It is a paradox that we want this functionality on our websites, even though we know that we can’t trust all our users. We’re going to have to focus more and more on how we deal with the spam from those who break our trust, while continuing to let users with something to contribute use our forms.