Dealing With Automated Form Spamming

Anybody who develops or maintains blog software is likely to be all too familiar with the problem of comment spamming.

I recently became the victim of spamming through a ‘contact’ form. The results of the contact form are emailed privately to a member of the site, and not displayed on the Website. However, this didn’t seem to deter the spammer – they were using an automated bot to send hundreds of submissions.

The CAPTCHA is a method for preventing automated form spamming that requires the user to complete a task a computer would have difficulty with – usually recognising letters or shapes in an image. However, not only does the CAPTCHA suffer from some relatively obvious flaws, but it impedes usability and accessibility, which will reduce the number of responses you get. Another flaw of the CAPTCHA is that it implies you can trust human users while you cannot trust computers – a flawed assumption, given the number of human users with a lot of time on their hands who are willing to do the spamming, compared to the number of people cleaning up after them. It breaks the ‘don’t trust any user input’ principle.

Like Bayesian spam filtering for email, filtering based on content is a more direct solution. It deals with the actual problem – the fact that the content is spam – and doesn’t result in an accessibility or usability problem for the end user.

To solve my own comment spam problem I implemented a very simple filtering solution, simpler than (and not nearly as smart as) a Bayesian spam filtering system where each word has a score and a rank. The spam affecting me in particular was all related to online gambling, so I drew up an array of words that should trigger the filter:

$badwords = array(
'poker', 'online-poker', 'onlinepoker', 'holdem', 'casinos',
'online-casinos', 'casino', 'online-casino', 'baccarat', 'craps',
'blackjack', 'slots', 'roulette', 'keno', 'wsop'
);

If anybody submits a message containing any of these words to my contact form, it is simply discarded and read by nobody (although I ran a test for a few weeks first to check for false positives).
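The check itself need be nothing more than a case-insensitive substring scan. Here is a minimal sketch of the idea – the function name and the exit-on-match behaviour are illustrative rather than my exact code:

// Return true if the submitted text contains any of the trigger words.
function is_spam(array $badwords, $message)
{
    $message = strtolower($message);
    foreach ($badwords as $word) {
        if (strpos($message, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Discard the message silently instead of emailing it on.
if (is_spam($badwords, $_POST['message'])) {
    exit;
}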

This automated form spamming is (currently) easy to block – because it is automated and all sent by one bot, it doesn’t tend to vary much. However, as the automated form spamming problem grows and more people start setting up scripts to spam my forms, I foresee that this solution will start to break down. I’d have to block more and more words, to the point where there would be far too many false positives.

Bayesian filtering calculates the spam probability of some content using an algorithm that allocates a spam probability to each word, along with an importance rank for that word. It is very effective at blocking spam, even though it is a little less effective than when it was first introduced (spammers are learning to vary the spelling of their words and to add fluff words).
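For the curious, the combining step at the heart of such a filter is compact. Below is a rough PHP sketch of the combining formula popularised by Paul Graham’s ‘A Plan for Spam’; the $wordProbabilities table (word => probability that the word indicates spam) is assumed to have been trained beforehand on known spam and known legitimate messages:

// Combine per-word spam probabilities into one overall score between 0 and 1.
function spam_probability(array $wordProbabilities, $message)
{
    $spamProduct = 1.0;
    $hamProduct  = 1.0;
    $words = preg_split('/\W+/', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($wordProbabilities[$word])) {
            continue; // this simple sketch ignores words it has never seen
        }
        $p = $wordProbabilities[$word];
        $spamProduct *= $p;
        $hamProduct  *= (1.0 - $p);
    }
    if ($spamProduct + $hamProduct == 0.0) {
        return 0.5; // no known words: no evidence either way
    }
    return $spamProduct / ($spamProduct + $hamProduct);
}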

This article on dealing with comment spam in WordPress advocates a few methods for stopping comment spamming: rule-based filtering, such as looking at the number of hyperlinks in the content or at ‘spam words’ (just like the solution I implemented), filtering based on IP addresses, and many other hacks. It’s a good summary of the different ways to prevent comment spamming, and thus automated form spamming.
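The hyperlink-count rule, for example, fits in a couple of lines; the threshold of three links is an arbitrary choice:

// $message holds the submitted comment text.
// A crude rule-based check: reject anything containing more than three links.
if (preg_match_all('/https?:\/\//i', $message, $matches) > 3) {
    exit;
}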

More and more people are turning off comment submission or other features in frustration at the amount of spam they have to deal with.

Is the growing popularity of comment spamming and other form spamming (including trackback spamming and contact form spamming) simply pointing out our own naivety in letting the public post comments and feedback in the first place? Allowing your visitors to contact somebody or submit a comment adds communication and community to a Web site, at the expense of increasing the opportunities to spam you. It is a paradox that we want this functionality on our websites even though we know we can’t trust all our users. We’re going to have to focus more and more on how we deal with the spam from those who break our trust, while continuing to allow users with something to contribute to use our forms.

  • Greg

    CAPTCHAs work within reason, but there is the possibility of proxying them. It’s a well-known trick that spammers use free photo sites of an adult nature and get users to solve proxied CAPTCHAs for them in order to gain access.

    As demonstrated in some of the linked articles, as the CAPTCHAs become more difficult for ever-more-sophisticated robots to solve, they become more difficult for humans to solve. That’s the wrong way to go as you want certain humans to post or register, and a CAPTCHA that pushes them away defeats its own purpose.

    I think the best solution is a game that involves motion, logic, and has a timestamp component to it (i.e. you have to click a moving target and the time at which you click is as important as where you click). The latency of proxying would cause the timing of the click to be out of whack.

    - Greg

  • http://www.deanclatworthy.com Dean C

    A good idea I had last week was to create a central repository of questions that anyone with a brain could answer – such as ‘What is the capital of France?’ – and just offer them these questions in a dropdown, with a field to place the answer in. I haven’t seen this technique used yet and I think it’d work nicely (a rough sketch follows below).

    Another idea I had: if you have some form of article or news system on your site, then when generating the registration form, pull a random article and pick – say, for argument’s sake – one word from the first sentence. Then ask the question ‘What is the 5th word in the first sentence of this (link) article?’.

    Just a few ideas.
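
    A rough sketch of how the first idea might be wired up – the $questions table here is purely hypothetical:

    // Hypothetical repository of question => acceptable answer.
    $questions = array(
        'What is the capital of France?' => 'paris',
        'How many days are in a week?'   => '7',
    );

    // When rendering the form: pick a random question and remember its answer.
    session_start();
    $question = array_rand($questions);
    $_SESSION['expected'] = $questions[$question];
    echo '<label>' . htmlspecialchars($question) . ' <input name="answer"></label>';

    // When processing the submission: compare against the stored answer.
    if (strtolower(trim($_POST['answer'])) !== $_SESSION['expected']) {
        exit('Wrong answer');
    }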

  • http://boyohazard.net Octal

    Though not a direct solution to automated form spamming, Simon has the right idea for an anti-comment-spam measure.

  • http://www.homeorchardsociety.org SRTech

    I found this PHP version of a Bayesian filter a while back, and am thinking about using it to help stop spam.

    http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14

  • meddle

    What I’ve found blocks the most spammers is using a blacklist. I got MT’s file and converted it into an array of regexps. Then I include the blacklist in the file retrieving the form’s data. If the comment contains a link, I check whether the message matches any of the forbidden words/expressions, and if so, I block the user (exit the PHP script). Since then, I’ve stopped 99.999% of spammers (14,007 attempts, only 20 spam messages got through). A sketch of the check follows below.

    Sergi
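
    A rough sketch of the kind of check Sergi describes – the $blacklist array of patterns would come from the converted MT file:

    // $blacklist is an array of regular expressions built from MT's blacklist file.
    $message = $_POST['message'];

    // Only messages that contain a link are checked.
    if (preg_match('#https?://#i', $message)) {
        foreach ($blacklist as $pattern) {
            if (preg_match($pattern, $message)) {
                exit; // block the user: stop the script before the form is processed
            }
        }
    }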

  • ejustice

    I more or less solved this problem on my site by blacklisting IP addresses that spammed the ‘contact us’ form.

    If the bot is coming from one of these IPs, the form is automatically hidden from them using a simple if statement: if the IP is on the blacklist, don’t show the form (see the sketch below).

    Furthermore, when a user gets an email from the form, a ‘Block this IP’ link is appended that will add the sender’s IP to the blacklist table at the click of a button.

    This way it’s easy to stop repeat offenders because they do tend to come from a single IP.
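
    A minimal sketch of that if statement, assuming the blocked IPs have already been loaded from the blacklist table into an array:

    // $blockedIps is loaded from the blacklist table.
    if (in_array($_SERVER['REMOTE_ADDR'], $blockedIps)) {
        // Blacklisted: don't render the form at all.
    } else {
        show_contact_form(); // hypothetical function that renders the form
    }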

  • http://www.realityedge.com.au mrsmiley

    I wonder if Bayesian filtering theory can be applied to, say, a knowledge base or search engine to locate useful results? We tend to think of these technologies as 100% related to spam, but the question of what is good or bad, relevant or irrelevant, applies to several domains.

    If you wanted to be ultra tricky you’d only allow comments that are related to the post/article/blog in question. Not quite sure how you would train such a beasty, but this would future-proof your systems as well and require little maintenance once trained.

  • rushiku

    Unless there’s an easy workaround, I’m with ejustice.

    It seems that rather than dealing with the effect, treating the cause would be more effective.

    While I haven’t tried this, here’s the idea: implement black, gray and white lists of IPs. When comments are submitted, blacklisted IPs never go through, whitelisted IPs always go through, and unlisted IPs go through the graylist procedure.

    In my graylist world, an IP could comment, say, once every 60 seconds, up to 5 times in 60 minutes. Repeated submissions (obvious automation levels) result in the IP being moved to the blacklist.
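
    A sketch of those graylist rules, assuming each IP’s recent submission times (as Unix timestamps) can be pulled from storage into $timestamps:

    // Count this IP's submissions in the last minute and in the last hour.
    $now = time();
    $lastMinute = 0;
    $lastHour   = 0;
    foreach ($timestamps as $t) {
        if ($now - $t < 60)   $lastMinute++;
        if ($now - $t < 3600) $lastHour++;
    }

    // One comment per 60 seconds, at most 5 per 60 minutes.
    if ($lastMinute >= 1 || $lastHour >= 5) {
        exit; // over the limit; repeat offenders get moved to the blacklist
    }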

  • http://www.igeek.info asp_funda

    Octal, Simon’s idea at http://simon.incutio.com/archive/2003/10/13/linkRedirects is quite ineffective. It denies the links any PageRank, but it doesn’t stop the spammers from spamming you. They’ll still continue to spam you, as they don’t read your blog to see that their links won’t get any PageRank! So you are still stuck with spam, whether your links pass on PageRank or not!

  • user_friendly

    Rushiku, I’m on a dynamic IP. If my IP was previously used by a spammer and you blacklisted their IP… then you end up blacklisting innocent me too.

  • http://www.whitford.org.au/ bobbymac

    Despite it being increasingly difficult to prevent comment spam in an automated way, the spam is always easy for us to identify manually – we can see past the misspellings and other masking techniques the spammers use. And that has nothing to do with whether it is sent once or a dozen times. If only that mental process could be reproduced!

  • anon

    You could just use Ajax; then there’d be no action page to process the data. Most of these bots just submit the form to the action page. If you eliminate it altogether and use Ajax to submit the data, you clear yourself of the burden of using CAPTCHAs. I’ve done so with Flash forms and ActionScript as well…

  • http://www.sitepoint.com/ mmj

    @anon:

    Relying on Ajax to submit a form sounds like a very bad idea to me. There are too many things that can go wrong, and too many ways it can simply annoy the user because it doesn’t work as expected. If the user’s browser isn’t capable of processing the script, they can’t post. That’s a huge accessibility problem, as you have people using your blog on lots of different browsers, even mobile phones, where at least a regular POST form would work.

    There’s also more maintenance required, because you have to keep ensuring that it won’t break in any new browser, whereas a POST form has been part of the HTML spec for years and is usable on any browser, even line-mode ones with no images or Javascript!

    You would be increasing security by decreasing accessibility, which is a mistake, because if a spammer were smart enough, they could simply look at your source code, figure out which request your AJAX script is sending, and continue to spam. It would be a bit more work on the part of the spammer, yes, but it is also likely to frustrate some of your legitimate users.

    It also doesn’t account for the case where the spammer is submitting their spam posts by hand, which is often the case. Most of the form spam I’ve got on one particular blog has been from people who copy-pasted some code into the form and added a short message to it that’s semi-relevant (agreeing with someone else’s post, for example).

  • anon

    mmj:

    I guess you’ve never really coded with AJAX. The only possibility is that the person doesn’t have Javascript turned on. And if they were to view source in any browser, all they’d see is the Javascript functions. If they tried to call them from an outside source it would violate security and be rejected by the server. AJAX can work if you know its inner workings and how to program with it.

  • http://www.sitepoint.com/ mmj

    AJAX works on the client side. It would be possible for a spammer to see what requests their browser is sending to the server and mimic them. They wouldn’t have to use Javascript to achieve the same – at the lowest level it’s just an HTTP request.
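
    To illustrate, replaying an ‘AJAX-only’ form takes only a few lines of plain PHP – the URL and field name here are made up:

    // Mimic the XMLHttpRequest POST with an ordinary HTTP request; no Javascript involved.
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query(array('message' => 'spam goes here')),
        ),
    ));
    file_get_contents('http://example.com/ajax-comment-handler.php', false, $context);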