10 Things to Check Before Using a CAPTCHA

All CAPTCHA systems are doomed to fail. Unfortunately, this has not prevented eager developers from using CAPTCHAs in even the most basic web-to-email forms.

No one likes CAPTCHAs. They are not fun. They cannot be used by everyone, such as those with impaired vision or without graphics enabled. They slow down the sign-up process and, ultimately, they will lead to fewer real registrations.

The worst problem with CAPTCHAs is that they put the onus on the user. Users do not care if you are receiving thousands of spam messages or bogus accounts: that’s your problem. CAPTCHAs should be the last barrier of defence – not the first.

The vast majority of hacking attempts and bots can be prevented without resorting to CAPTCHAs. If you make it moderately difficult, spammers will simply move on to the next easier target. Here are some basic techniques that will stop the majority of spoofing attempts.

1. Validate everything server-side

You need to validate every field using server-side code – even if you have strong client-side validation. Be especially careful with fields that are placed in email headers. Email addresses are probably the most important values to check: use a good regular expression and watch out for HTML tags, SQL injections, or return characters (\n and \r in PHP).
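The article mentions PHP, but the check is language-agnostic. Here is a minimal sketch in Python; the helper name and regex are illustrative and deliberately simple, not a full RFC-compliant validator:

```python
import re

# Simple format check; real-world email validation can be looser or stricter.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def is_safe_email(value):
    # Reject return characters that enable email header injection.
    if "\n" in value or "\r" in value:
        return False
    # Reject embedded HTML tags.
    if "<" in value or ">" in value:
        return False
    return bool(EMAIL_RE.match(value))
```

The header-injection check matters most: a value like `"user@example.com\nBcc: victim@example.com"` would otherwise let a bot turn your contact form into a spam relay.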

2. Check for spam-like content

Most spammers post links to websites. If that’s not something you are expecting, it could indicate a spam bot. A third-party tool such as Akismet could help.
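A crude version of this heuristic is easy to roll yourself before reaching for a third-party service. This sketch simply counts links; the threshold is an assumption you would tune per form:

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)

def looks_spammy(text, max_links=0):
    # Flag the submission when it contains more links than this form expects.
    return len(LINK_RE.findall(text)) > max_links
```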

3. Check for rogue POST and GET values

If your form expects three POSTed fields, the existence of a fourth could indicate a hacking attempt. Similarly, check that no additional GET values have been passed.
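This check is a one-liner once you compare field names as sets. The field names below are hypothetical; use whatever your form actually generates:

```python
# The exact set of field names the form is known to generate.
EXPECTED_POST = {"name", "email", "message"}

def has_rogue_fields(posted, expected=EXPECTED_POST):
    # Any extra or missing field name indicates tampering or a bot.
    return set(posted) != expected
```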

4. Check the HTTP header

Simpler spam bots will rarely set a user agent (HTTP_USER_AGENT) or a referring page (HTTP_REFERER). You should certainly ensure the referrer is the page where your form is located.
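How the headers reach your code is framework-specific; assuming they arrive as a plain dict, the checks look something like this:

```python
from urllib.parse import urlparse

def headers_look_human(headers, form_url):
    # Simpler bots rarely bother to set a user agent at all.
    if not headers.get("User-Agent"):
        return False
    # The referrer should point back at our own site.
    referer = headers.get("Referer", "")
    return urlparse(referer).netloc == urlparse(form_url).netloc
```

Note that legitimate browsers and privacy tools can suppress the referrer, so (as point 10 suggests) treat a failure here as grounds for extra scrutiny rather than outright rejection.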

5. Use a honeypot field

Spambots normally attempt to complete every form field so they pass basic validation. A honeypot field is one that is hidden from the user (CSS display set to none), so any value passed back is likely to come from a bot. The field should be labelled “Please leave this blank” or similar to account for those with CSS disabled or using custom stylesheets.
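The server-side half of the honeypot is trivial; the field name here (`website`) is a hypothetical choice, picked because bots are eager to fill fields that look like they want a URL:

```python
def honeypot_tripped(posted, field="website"):
    # Any non-empty value in the hidden field marks the post as bot traffic.
    return bool(posted.get(field, "").strip())
```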

6. Detect the presence of JavaScript

If your page can run JavaScript, you can be almost certain it has been loaded in a browser by a human user. A simple in-page dynamically generated JavaScript function could perform a simple calculation or create a checksum for the posted data. This can be passed back in a form value for verification.
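The server side of such a scheme might look like the sketch below: the page's generated JavaScript computes a hash over the form values plus a per-page token, and the server recomputes it on postback. The token and the sorted-field-name convention are assumptions for illustration:

```python
import hashlib

def expected_checksum(fields, token):
    # Hash the per-page token plus the field values in a fixed order.
    payload = token + "|".join(fields[k] for k in sorted(fields))
    return hashlib.sha256(payload.encode()).hexdigest()

def checksum_valid(fields, token, submitted):
    return submitted == expected_checksum(fields, token)
```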

An estimated 10% of people have JavaScript disabled, so further checks will be necessary in those situations.

7. Show a verification page or fail the first posting attempt

Bots have a tough time reacting to a server response. If you are in any doubt about the validity of a post, show an intermediary page asking the user to confirm their data and press submit again.
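One way to implement this is a one-time token: a suspicious first post gets a confirmation page carrying the token, and only a resubmission that returns it is accepted. The in-memory set is purely illustrative; a real site would use server-side session storage:

```python
import secrets

pending = set()  # illustration only; use session/database storage in practice

def first_post(suspicious):
    if not suspicious:
        return "accepted", None
    # Issue a one-time token and render a "please confirm" page with it.
    token = secrets.token_hex(8)
    pending.add(token)
    return "confirm", token

def confirm_post(token):
    # Accept only a resubmission carrying a token we issued, exactly once.
    if token in pending:
        pending.remove(token)
        return True
    return False
```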

8. Time the user response

Accounting for human behaviour is one of the best ways to spot the bots. Users will take a little time to complete forms whereas bots are almost instantaneous. I use the following method in many forms and it has been effective:

  1. The current server time is recorded when the form page is generated.
  2. The time value is encoded into a string. The actual encoding algorithm is up to you, but it must be one that is not obvious and allows decoding back to the original value. I would also recommend using unique user data, such as the IP address, as an encryption key.
  3. The encoded time is put in a hidden form value.
  4. When the form is posted back, the field is checked and decoded back to a time. This can now be compared with the current server time to ensure the response time falls within a specific window, e.g. between 20 seconds and 20 minutes.
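The four steps above can be sketched as follows. An HMAC signature keyed on the IP address stands in for the "not obvious" encoding; the secret and the 20-second/20-minute window are the assumptions named in the text:

```python
import base64
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical server-side secret

def encode_time(ip, now=None):
    # Steps 1-2: record the server time and encode it, keyed on the user's IP.
    now = int(time.time()) if now is None else now
    msg = str(now).encode()
    sig = hmac.new(SECRET + ip.encode(), msg, hashlib.sha256).hexdigest()[:16]
    return base64.urlsafe_b64encode(msg + b":" + sig.encode()).decode()

def decode_time(token, ip):
    # Step 4a: decode and verify the hidden field; None means tampering.
    try:
        raw, sig = base64.urlsafe_b64decode(token).rsplit(b":", 1)
        expected = hmac.new(SECRET + ip.encode(), raw,
                            hashlib.sha256).hexdigest()[:16]
        if hmac.compare_digest(sig.decode(), expected):
            return int(raw)
    except (ValueError, TypeError):
        pass
    return None

def within_window(token, ip, min_s=20, max_s=1200):
    # Step 4b: the elapsed time must fall inside the accepted window.
    t = decode_time(token, ip)
    return t is not None and min_s <= time.time() - t <= max_s
```

Step 3 is simply emitting the encoded string as a hidden form value when the page is generated.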

There are several benefits to this process: it does not rely on client-side technology, the time value must be in the returned data and, even if your form is spoofed, it limits the number of bogus submissions that can be sent.

9. Log everything

Keep a log of everything that occurs during a form submission process. This need not be an elegant solution; writing to a file will be adequate. The information you gather will be invaluable when spotting hacking attempts and implementing solutions.
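A flat append-only file really is adequate. This sketch writes one JSON line per submission attempt; the file path and record fields are hypothetical:

```python
import json
import time

def log_submission(ip, headers, fields, verdict, path="form_log.jsonl"):
    # Append one JSON record per attempt; grep/jq can mine it later.
    record = {
        "ts": time.time(),
        "ip": ip,
        "ua": headers.get("User-Agent", ""),
        "fields": fields,
        "verdict": verdict,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```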

10. Handle the extreme cases

Some of the techniques above will fail for legitimate users, e.g. checking for JavaScript or the HTTP header. These failures are only likely to affect a small number of users, so a CAPTCHA could be used in those circumstances.

Alternatively, if there is any doubt about the data validity for a small number of users, you could add human verification to your process. Ensure it is simple to operate, e.g. email an administrator and only accept the post once a reply is received.

CAPTCHAs can be essential for sites that could incur significant monetary loss or are obvious targets for illegal activities, such as online banking and webmail. However, they are overkill for most forms: a combination of techniques will stop the majority of bots without making sign-ups difficult for real users.


  • Tarh

    Email addresses are probably the most important values to check: use a good regular expression and watch out for HTML tags, SQL injections, or return characters (n and r in PHP).

     
    Just to add to that: make sure that the regex or other validation which you are using is whitelisting rather than blacklisting (i.e. only allow input which you know is safe rather than trying to reject input which you know is unsafe). Also keep your character sets in mind; don’t mix encodings or you may let evil bytes slip by.
     

    You should certainly ensure the referrer is the page where your form is located.

     
    Or at least make sure that the first part of it is your website (keep in mind if you have an optional www. in which case you have bigger problems). That way you don’t have to update the “trusted pages” list if you decide to link a second form to the script.
     

     
    OK so I lied; I just want you to do that so that us RefControl users don’t need to add another exception :-P

  • http://librex.us/ davideldridge

    Honeypots and JavaScripts:

    Both of these could present problems for users with disabilities (and obviously the JavaScript fix will pose problems to those with it disabled).

    Honeypots: If blind users use user agents that do not use the screen stylesheet, that honeypot will show, and they will fill it, “proving” that they are the bots you are protecting against.

    JavaScript: Often blind users’ user agents have JavaScript disabled because there are events that are hard to signal audibly. (Consider the IE click that lets you know a page has changed: user agents like Jaws don’t signify those events well, like a partial postback, because it is hard to indicate which part of the page was synchronized.)

    If that is not a serious consideration for you, that is fine. But it is worth considering that folks with disabilities, who often have JavaScript disabled, might be adversely affected by those two particular measures.

    Thanks so much for the help.

  • http://www.jasonbatten.com NetNerd85

    Try running some forums with the “advice” you have given. See what happens.

  • Anonymous

    @NetNerd85: Okay, you tell us — what happens?

  • http://green-beast.com Green Beast

    Honeypots are great, I use and recommend them myself. They should be offset (off-screen to the left), though, and not hidden with display:none. This will ensure the label (warning not to fill in the input) will be available to screen reader users so they don’t get caught in the trap.

    I also recommend validating any options found in select elements. In other words, if the option isn’t in the array (typical arrangement), the form will not submit.

    In addition to this, I also suggest limiting text inputs to a reasonable number of characters (validated server-side), as this, too, will catch bots in the act.

    I do offer more on this subject in this article: http://green-beast.com/blog/?p=220

    Mike

  • http://www.olsenportfolio.com/ nrg_alpha

    For email format validation, many people don’t get the regex quite right. I recommend using a well written email parser such as this one:

    http://www.iamcal.com/publish/articles/php/parsing_email/
    A bit of a heavy read. But towards the bottom of the page, there is a ‘simplified’ version.

    Most notably, a more thorough RFC 3696 Parser is found in the download link at the bottom of the page which ultimately leads to:
    http://code.iamcal.com/php/rfc822/rfc3696.phps

  • Stevie D

    davideldridge wrote:

    Honeypots: If blind users use user agents that do not use the screen stylesheet, that honeypot will show, and they will fill it, “proving” that they are the bots you are protecting against.

    So a blind user will come to a field where the label says “please leave this field blank” and fill it in … with what? And why?

    JavaScript: Often blind users’ user agents have JavaScript disabled because there are events that are hard to signal audibly. (Consider the IE click that lets you know a page has changed: user agents like Jaws don’t signify those events well, like a partial postback, because it is hard to indicate which part of the page was synchronized.)

    Fair point – I would envisage that a Javascript check, like a timestamp check, would lead to a referral rather than a failure. The post wouldn’t be rejected, but might then be passed to an intermediary page as in point 7, or go for manual moderation.

  • Stevie D

    Green Beast wrote:

    Honeypots are great, I use and recommend them myself. They should be offset (off-screen to the left), though, and not hidden with display:none. This will ensure the label (warning not to fill in the input) will be available to screen reader users so they don’t get caught in the trap.

    If the input field is included in the display:none;, surely any screen reader setup that noticed the field would also read the label? Or am I being naively optimistic about screen readers…?

  • http://www.optimalworks.net/ Craig Buckler

    As far as I’m aware, screen readers should ignore display:none but they will read out text-indent. I’m sure that won’t be the case for all readers, though.

    Besides that, a honeypot field cannot be any worse than a CAPTCHA for visually-impaired users!

  • http://librex.us/ davideldridge

    I am not saying do or don’t [use accessible technology, etc.] ultimately, just implement thoughtfully. People (usually at my end) get shrill about standards and accessibility. And though it is my bread and butter as a government webmaster, I still think we have to use our noggins for this.
    Stevie D wrote:

    So a blind user will come to a field where the label says “please leave this field blank” and fill it in … with what? And why?

    I don’t think they will fill that in, and that would be a fair use. But that assumes that developers use that kind of text. It is important that developers thoughtfully implement that kind of input. If they use display:none; and a question like “What’s your age?”, that would trip both bots and folks using screen readers.
    Craig Buckler said:

    Besides that, a honeypot field cannot be any worse than a CAPTCHA for visually-impaired users!

    I think many new CAPTCHA and CAPTCHA-like devices use an audio version that allows blind users to step around it. So, honestly, (re-)CAPTCHAs are relatively accessible; shopping around can help.
    I am still a fan of using Akismet (or other back-end validation), though, and I think most users (e.g. blind commenters in this case) who deal with these problems will appreciate your use of such technologies, or at least be less frustrated by it, since it is transparent to them. And while I haven’t gotten that many commenters on my blogs, I approve all comments from new users, and have had a good experience with Akismet’s accuracy; spam constitutes about 90% of the comments I receive.

  • http://green-beast.com Green Beast

    Regarding display:none, it all depends on screen reader version and the version/make of the browser it’s being used on. Some will handle display:none just fine, others won’t. It’s just like the legacy browser conundrum we face. Using an offset class via negative margin (or indent I suppose), as far as I can tell, is universally supported. Remember the solution must bear the label stating to leave the input blank for when there is no CSS support. Fortunately ‘bots are stupid about reading labels too.

    Akismet isn’t my cup of tea, at least as it concerns blog comments. Since it’s not flawless it has a holding queue, and that must be checked. I prefer something more passive that does its job without oversight. Bear in mind, please, that I haven’t used Akismet for anything other than comment moderation on a blog so I may be missing something. For spam control on my blog I have Akismet, Bad Behavior, and Mike Jolley’s WP Comment Spam Stopper (http://blue-anvil.com/archives/wordpress-comment-spam-stopper-plugin). I listed those from least favorite/effective to most favorite/effective – based solely on my personal experiences.

    Cheers.
    Mike

  • Phytoplankton

    This is exactly what I’ve been looking for.
    I really don’t want to implement a Captcha on my site as it’s based on people being able to add content without any hassle.

    Two things that I’m going to work on implementing this weekend are detecting Javascript and checking the time between the Page_Load and the POST.

  • Joakim Kejser

    8. Time the user response

    I really like this one. But why save the server time at form generation in a hidden field, and not in a session?

    Joakim

  • http://www.optimalworks.net/ Craig Buckler

    why save the server time at form generation in a hidden field, and not in a session

    You certainly can do that, but users must have cookies enabled for it to work. Also, the postback ensures the key was generated in the original form.

  • jacoetheron

    A while back I had to create a visitor’s book (a page where visitors can post comments). I used Javascript to insert the form on the page (to reduce chances of spam bots), strip all HTML from the post and limit the length to 255 chars (adding a “…” if exceeded) before it gets added into the database.

    You certainly can do that, but users must have cookies enabled for it to work.

    Using PHP’s sessions does not need cookies; it stores the data both in cookies and as a session file on the server (for backup). I agree that the key should be checked.

    Great post

  • http://www.optimalworks.net/ Craig Buckler

    @jacoetheron

    Using php’s session does not need cookies

    True – the session ID can be propagated in the URL. However, that could affect several of the other bot-checking techniques. It could also be posted in the form data, but that’s not much different to using an encrypted time value.
