Comment Spam Compiled and Interpreted

Following on from Automated Blog Comment Spam? and the feedback (many thanks), figured I’d compile (and interpret) some of it into something more ordered.

Gnomes or Robots?

The answer to who (or what) is posting comment spam seems to be both: sad gnomes with little life, and automated scripts / programs. That being the case, my conclusion remains that different approaches are required if we want to prevent human-submitted spam vs. script-submitted spam (emphasis on the prevent – see “Remove the Incentive” below).

Have yet to find any hard figures, but I also imagine the more serious problem is spam automation, based on anecdotal evidence of attacks on some of the well-known blogging apps, as well as the solutions people have adopted that dramatically reduced their spam. Obviously any automated process is capable of generating quantities vastly greater than anything possible via manual data entry.

No Bars to Legitimate Use

…or the “Accessibility Curse”. There seems to be general agreement that posting a comment on a blog must be easy for legitimate users. In fact the ideal scenario is that legitimate users should not be impacted at all by whatever spam-protection mechanisms are in place.

Some people are willing to require user sign-up / authentication and have found that’s already enough to discourage spammers. The risk, though, is discouraging legitimate use. Also, as sites like Hotmail have discovered, it’s quite possible to automate registration and login with scripts, although it’s a lot more work. That really suggests making your comment-posting API more complex is enough to discourage today’s breed of spammers (more on that shortly).

There was some talk about the use of captchas, to sift out the humans from the scripts. The key arguments against were focused on accessibility for legitimate users: are the images actually readable? What about the visually impaired? A couple of answers there – check out the ASCII-based captchas Wez uses on his blog – very readable but still requiring a PhD in Computer Science to analyse programmatically. Also check out Colin’s thoughts on Turing, With Audio.

Another question on captchas and ingenious ways to circumvent them was raised a while back by Christian here. People seem to have reacted to this like “The End of Captchas!”. In fact I expect this has only happened rarely and it’s also not difficult to stop anyway – either research hotlinking prevention or use Wez’s ASCII captchas which are, by nature, not hotlinkable.

Although it’s possible to implement captchas in a secure and accessible manner, they’re still an extra step for legitimate users, plus I believe they’re overkill for the problem. What’s required is not actually sifting out the human users but rather sifting out the legitimate user agents (web browsers) from the scripts…

Preventing Automation

For me there’s now enough anecdotal evidence to suggest that making your posting API a little more complex is enough to block scripts from posting spam automatically.

One comment mentioned Pete Bowyer’s simple but effective solution, which requires a single extra step by users with a web browser but would need more than just LWP::Simple to be scripted.

Elsewhere a WordPress user described the immediate effect on spam of simply renaming the POST URL. One of the comments following on from that was particularly interesting;

The renaming trick works for most of the spam robots – as long as you remember to delete wp-comments-post.php off your server too as somebody mentioned :p There are however, a few robots out there which seem to parse the entire index.php file to find what the comments file name is, I’ve also changed the comment form variables but still a few get through probably because the robot parses the comments form and gets the variable names too. So, as somebody mentioned, this is like the cold war where you have to adapt to constantly keep ahead of the spammers.

For those that go as far as parsing forms, there’s Spam Stopgap Extreme;

This prevents spammers from automatically scraping the form, because anyone wanting to submit a comment *must* execute the javascript md5.
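
In outline, that kind of check looks something like the following. This is purely a sketch of the shape of it, not the actual Spam Stopgap code; the md5.js include, the md5() JavaScript function and the field names are all assumptions (the real plugin bundles its own JavaScript MD5 implementation):

    <?php
    // Not the actual Spam Stopgap Extreme code – just a sketch of the same shape.
    // A challenge is written into the form; the browser has to run JavaScript to
    // MD5 it into a hidden field, and the server re-checks the result on POST.
    session_start();
    if (isset($_POST['comment'])) {
        if (!isset($_POST['digest']) || $_POST['digest'] !== md5($_SESSION['challenge'])) {
            die('Comment rejected: the form was not submitted by a JavaScript-capable browser.');
        }
        // ...store the comment...
    }
    $_SESSION['challenge'] = uniqid(rand(), true);
    ?>
    <!-- md5.js is assumed to provide a JavaScript md5() function; the real plugin ships one -->
    <script type="text/javascript" src="md5.js"></script>
    <form method="post" action="" onsubmit="this.digest.value = md5(this.challenge.value);">
    <textarea name="comment"></textarea>
    <input type="hidden" name="challenge" value="<?php echo $_SESSION['challenge']; ?>" />
    <input type="hidden" name="digest" value="" />
    <input type="submit" value="Post" />
    </form>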

That leaves spammers hunting for a JavaScript runtime they can use… Having suggested something similar myself, of course people pointed out that some users surf with JavaScript disabled. Another angle might be something like this;

…a POST URL containing a uniqueId, with a form in which the Name, Email and Comment fields are each duplicated and only one set is meant to be filled in (sketched below).

The knowledge of which form fields are actually meant to be filled in is contained in the CSS. If spammers get as far as parsing that, it could be made more difficult by relating styles to tags via CSS class selectors. The uniqueId in the POST URL identifies which set of fields contains the real data, while a script which parses the form could be fooled into submitting data in the wrong fields, thereby identifying itself. Anyway – it serves as yet another possible solution in the arms race…
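
Sketching that out – everything here (the field names, the session lookup standing in for the uniqueId, the CSS) is made up purely for illustration:

    <?php
    // A sketch only: which of the two field sets is "real" is decided per request
    // and recorded server-side (a session stands in for the uniqueId lookup here).
    session_start();
    $real     = $_SESSION['real_set'] = rand(1, 2);
    $decoy    = ($real == 1) ? 2 : 1;
    $uniqueId = md5(session_id() . $real); // opaque token for the POST URL
    ?>
    <style type="text/css">
    /* the stylesheet is the only place that records which set is the decoy */
    #name_<?php echo $decoy; ?>, #email_<?php echo $decoy; ?>,
    #comment_<?php echo $decoy; ?> { display: none; }
    </style>
    <form method="post" action="/comment-post.php?uniqueId=<?php echo $uniqueId; ?>">
    Name: <input type="text" name="name_1" id="name_1" />
          <input type="text" name="name_2" id="name_2" />
    Email: <input type="text" name="email_1" id="email_1" />
           <input type="text" name="email_2" id="email_2" />
    Comment: <textarea name="comment_1" id="comment_1"></textarea>
             <textarea name="comment_2" id="comment_2"></textarea>
    <input type="submit" value="Post" />
    </form>
    <?php
    // comment-post.php would look up which set is real (from the session / uniqueId),
    // read only name_N, email_N and comment_N from that set, and treat anything in
    // the decoy fields as a script blindly filling in every field it finds.
    ?>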

Blacklisting

Thanks to a tip-off from Amit, it turns out there is already a central service to help with blacklisting, described here. There’s also this WordPress plugin, which uses some of the RBL (Realtime Blackhole List) services that have evolved for dealing with email spam.

If we’re headed in that direction, I guess techniques that have been employed to combat email spam (e.g. Bayesian filters) are worth researching.

Regarding RBLs and blacklisting, this paper (the subject being email spam) highlights some of the problems. In fact, reading that, almost all of the problems being described, apart from “Collateral Damage and Legitimate Users”, relate to RBLs being centralized services.

Bearing that in mind, Marcus’s suggestion could well be the way to go;

RSS would provide a distributed solution.

Not just that, it attaches a name to the data, allowing “consumers” to pick who they trust for their blacklists, rather than a central service where data is provided anonymously.

There’s also a built-in mechanism for keeping the data fresh and managing bottlenecks. Each blogger keeps their own blacklist, which is periodically updated from other people’s feeds. There’s probably a Web Service-killing insight hidden in there as well – something like: “A distributed and scalable Web is not a normalized Web” – but that’s another story…
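
As a rough sketch of the consuming side (the feed URL is hypothetical, and the assumption that each item’s <link> carries the spamvertised URL is just one possible convention, not a standard):

    <?php
    // A sketch of the consuming side: merge the URLs reported by trusted "spam feeds"
    // into a local blacklist file. The feed address and file name are made up.
    $feeds     = array('http://example.org/spam-feed.xml');
    $blacklist = file_exists('blacklist.txt')
        ? file('blacklist.txt', FILE_IGNORE_NEW_LINES)
        : array();

    foreach ($feeds as $feed) {
        $rss = @simplexml_load_file($feed);
        if (!$rss) {
            continue; // feed unreachable – try again on the next run
        }
        foreach ($rss->channel->item as $item) {
            $url = trim((string) $item->link); // assuming <link> holds the spamvertised URL
            if ($url != '' && !in_array($url, $blacklist)) {
                $blacklist[] = $url;
            }
        }
    }

    // Rewrite the local list; ageing out old entries (say, after a week) would go here.
    file_put_contents('blacklist.txt', implode("\n", $blacklist) . "\n");
    ?>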

Remove the Incentive

Simon pointed out how he uses redirects to eliminate PageRank, basically preventing Googlebot from following and crediting the links posted in comments.
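
I don’t know the details of Simon’s implementation, but the general shape would be something like this hypothetical redirect.php, disallowed in robots.txt so Googlebot never follows (or credits) the links:

    <?php
    // redirect.php – a sketch of the general idea, not Simon's actual code.
    // Links in comments get rewritten to /redirect.php?url=..., and this script is
    // disallowed in robots.txt, so no PageRank flows to the target site.
    $url = isset($_GET['url']) ? $_GET['url'] : '/';
    if (!preg_match('!^https?://!i', $url)) {
        $url = '/'; // only forward to plain http(s) targets
    }
    header('Location: ' . $url);
    exit;
    ?>

    # robots.txt
    User-agent: *
    Disallow: /redirect.php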

Personally I still think that eliminating PageRank is the best solution simply because it battles the economics of comment spam. As e-mail spam has shown, as long as there’s an economic incentive spammers will take more and more advanced steps to avoid filters and counter-measures.

Simon’s approach seems to have been highly effective, judging from the lack of spam he gets. Technically I guess this violates the principle of “no bars to legitimate use” – what if you want legitimate users to be able to post links and have Google associate PageRank with them? It also assumes you’re dealing with “smart spammers” who realise what you’ve done – it doesn’t actually prevent spam, and a “dumb spammer” may post anyway.

Marcus made a similar remark;

There is a third party involved here that could do a lot to help. If we had a simple way of reporting the spam links to Google then the incentive could be destroyed at source. Google could drop any spam promoted website.

To an extent that’s already a possibility, as Simon described here.

Economics

Diana C. told the story of how she dealt with one comment spammer (at the end);

Within 24 hours, I got a response from a wholesale pill supplier, who explained that they received copies of the diet-pills web site’s emailed feedback, and they apologized for the spam, and told me that they were immediately discontinuing their wholesale relationship with the diet-pills web site because they have a strict anti-spam policy.

If that’s representative of comment spammers, they’re simply acting as (semi-authorized) middle-men in a marketing process. One non-technical approach may be to shift the pressure onto the suppliers with “naming and shaming” for those who fail to keep their own house in order.

Finally, an amusing economic spin, for those looking for opportunities, is Kitten.

  • http://www.deanclatworthy.com Dean C

    That CSS trick is a nice idea!

  • http://www.mission36teen.com M36Teen

    Thanks Harry, great job of sorting! ;)

  • http://www.phpnerds.com petesmc

    ASCII-based captchas Wez uses on his blog – very readable but still requiring a PhD in Computer Science to analyse programmatically.

    I believe this would be very easy to circumvent. He uses the same font style every time, hence you would just have to parse the source code and look for the specific string of characters, ignoring a majority of whitespace (not all, obviously – for letter shaping / etc.).

  • http://www.phppatterns.com HarryF

    Re the CSS trick, one other thing that occurs to me: to make it harder to extract which fields are hidden, you could simply place some dummy entries inside CSS comments, e.g.;
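
    Something along these lines (the field names are purely for illustration – the commented-out rules are there to mislead a naive regex):

        /* #comment_2 { display: none; } */
        #comment_1 { display: none; }
        /* #name_1 { display: none; } */
        #name_2 { display: none; }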

    I imagine the first thing someone's going to try, before resorting to a real CSS parser, is using regular expressions. The comment makes the regex significantly more complex, requiring "state-aware" parsing.

    I believe this would be very easy to circumvent. He uses the same font style every time, hence you would just have to parse the source code and look for the specific string of characters, ignoring a majority of whitespace (not all, obviously – for letter shaping / etc.).

    Maybe you should try it and post the experience. It would be interesting to find out what degree of code complexity is required to parse it.

    Ultimately no-one's writing books like "Design Patterns for Spammers"...

  • http://www.phppatterns.com HarryF

    requiring “state-aware” parsing.

    Actually, further thought says no – someone’s going to just strip the comments in a first pass. But the principle is there – make the CSS as hard as possible to parse with a regex (not the nicely formatted stuff I’ve posted, in other words).

    Where I think the CSS approach works is that it should be easy to modify without messing with the rest of the code. Any time someone “cracks” it, it’s just a matter of shuffling things around a bit.

    Bottom line is I think there’s too much hacker “mystique” around spam scripts – if we think like a spammer (or better, just a developer trying to solve a problem), effective solutions against automated spam become clearer.

  • Matt Mullenweg

    It’s actually pretty interesting that I get a fair amount of spam on forms on my site that have nothing to do with comments and don’t even go to the web – they go to my email (contact-type forms) or submit bug reports to my email. There are obviously generic bots that parse any comment-y form and spam it.

  • Ren

    Been looking at effective blog/post spam prevention methods for a while now.

    I’ve been toying with the CSS approach, attempting to obfuscate an otherwise plain-text captcha code.

    It basically outputs the code in a random order, with other random characters mixed in as well, but prevented from being viewed by several CSS methods.

    http://homepage.ntlworld.com/jared.williams/php5/captcha.txt

    I still think it’s a bit simplistic, and a more effective method would also incorporate random style sheet creation, and a few other ideas.

    Wez’s captcha isn’t that hard to crack, and certainly doesn’t need a PhD. (I don’t have one, but I do have 20+ years dev experience, and managed to code a cracker for Wez’s text in <50 lines of PHP code). If it used a proportional font that would make it more difficult, but the monospace is just a bit too easy.

    I think the rarity of such a method, and the hassle of developing something that deciphers the code, is what’s preventing Wez from getting blog spam.

    As for the “The End of Captchas!” article about spammers utilising incentives for 3rd parties to solve captchas for them, I would like to know the implementation details of the captcha system that had apparently been exploited – or in fact some evidence that this ever happened. If it did, then did the captcha system expire keys after a certain time limit, for example, thus narrowing the window of opportunity for the spammer to get a working reply from a 3rd party?

  • Travis S

    I guess we should toot our own horn a bit more, but b2evolution (www.b2evolution.net) has had a centralized blacklist system using RSS as the delivery mechanism since late spring / early summer. We maintain a list of IPs / domain names that users report as spam. Once someone checks a report out to determine that it really is spam, the source gets blacklisted. Users can request updates to their current list and retrieve the latest entries, or they can just maintain their own. When they blacklist someone, they’re given the option to report it or just handle it on their own.

    I haven’t had a spam problem on any of the blogs I use it on since we started using it. The only thing I occasionally have an issue with is referer-spam, but they get added to the blacklist and handled just the same.

    b2evolution is really worth a look. It evolved from the same b2/cafelog that WordPress did; we just did everything that they’re wanting to do before them – well, short of plug-ins.

  • http://mgaps.highsidecafe.com BDKR

    Nice job Harry. As I said the last time, it was a big help and so is this one. Most likely any one or combination of the above will make a huge difference on a site. I chose a combination of things:

    1) Turing test
    2) Check authors against a black list
    3) Check urls (entered into a form) against a blacklist
    4) A nightly cron job that will clean the database of those that may have gotten past the filters (2 and 3) listed above.

    Now to take a look at that RSS black-list.

    Thanx Harry! Keep it up.

  • http://www.lastcraft.com/ lastcraft

    Hi.

    I see two problems with a central blacklist, but one very important advantage. The first problem is that a central server can get polluted with legitimate IPs to destroy its credibility. The other problem is that as soon as it is effective it will come under DoS attacks. I think the pollution threat is greater.

    The advantage is that it allows ISPs to scan such servers for their own IPs to see if they have been blacklisted by one of their clients. A central server could even automatically generate abuse mails.

    I would like to see a blog plug-in that does the following…
    1) Emits IP lists via RSS.
    2) Sends abuse reports to ISPs.
    3) Sends the spam promoted URL to Google.
    4) Polls (throttled of course) affiliate IP lists before accepting a post.
    5) Has a back-propagation facility for the IPs so that an ISP can report their IP as legitimate and that message can be sent to the original IP holder for confirmation and action.
    6) Junks illegitimate posts after the abuse reports to avoid administrator intervention.

    There are four parties here and all must be taken into account: Blog/wiki administrators, blog/wiki commenters, ISPs and the search engines.

    Some spam hardening of blogs/wikis will also be needed long term, as you have summed up, but this will have to be optional as it will make the blog/wiki software harder to modify.

    At the moment it is round one to the spammers.

    yours, Marcus

  • http://diigital.com cranial-bore

    I’ve only been half following this discussion but one thought that came to mind was this.
    What if your comment form also required the user to answer a simple question. The question could be something like “what is four plus eight” or “what is the second letter of the english alphabet”.

    The questions could be drawn from a database for variety and to follow different formats.

    I think this would be a relatively minor inconvenience for human users and would require a fair amount of programming to automate the answering of all the possible question types. The questions would be plain text for accessibility.

    Obviously this wouldn’t help manual spam from people.

  • http://www.phppatterns.com HarryF

    It basically outputs the code in a random order, with other random characters mixed in as well, but prevented from being viewed by several CSS methods.

    http://homepage.ntlworld.com/jared.williams/php5/captcha.txt

    I still think it’s a bit simplistic, and a more effective method would also incorporate random style sheet creation, and a few other ideas.

    Nice idea. Looking at what you’ve got already, that’s probably enough of a deterrent for anyone but the obsessed.

    Wez’s captcha isn’t that hard to crack, and certainly doesn’t need a PhD. (I don’t have one, but I do have 20+ years dev experience, and managed to code a cracker for Wez’s text in <50 lines of PHP code). If it used a proportional font that would make it more difficult, but the monospace is just a bit too easy.

    Pondering that one overnight, I realized I dived in at the deep end there, with fuzzy logic etc. I can see a “brute force” solution now which wouldn’t be hard to implement. And “20+ years dev experience” may amount to more than a few PhDs ;)

    I think the rarity of such a method, and the hassle of developing something that deciphers the code, is what’s preventing Wez from getting blog spam.

    For me that’s a key point in general. If every blog had a different comment posting “API” and scripting posts to that API was more than a couple of hours’ work, comment spam would become uneconomic.

    As for the “The End of Captchas!” article about spammers utilising incentives for 3rd parties to solve captchas for them, I would like to know the implementation details of the captcha system that had apparently been exploited – or in fact some evidence that this ever happened. If it did, then did the captcha system expire keys after a certain time limit, for example, thus narrowing the window of opportunity for the spammer to get a working reply from a 3rd party?

    When you put it that way, it seems even less plausible. Even if someone did pull this off, how long did it take Hotmail / Yahoo to break it? I’d imagine it’s something like a game of cat and mouse where the mouse just happens to drive a Ferrari; meaning the effort to stay ahead of the spammers, development-wise, would be significantly less.

    b2evolution is really worth a look

    Will do so.

    The advantage is that it allows ISPs to scan such servers for their own IPs to see if they have been blacklisted by one of their clients. A central server could even automatically generate abuse mails.

    Guess I was thinking about a slightly different kind of blacklist. Rather than source IPs / domains, I’d considered something which lists the URLs spammers are POSTing (and perhaps some of the key words contained in the POST), which could be used to validate the contents of a comment.

    It’s worth noting that some ISPs, in particular AOL, use some kind of proxy where their customers are assigned a new IP address from a pool, per HTTP request.

    On the ageing of data front, I can also imagine that a URL used in a comment spam made two years ago no longer needs to be checked again. What sort of “lifecycles” spammers are working to would be interesting to know (which comes back to your suggestion of re-publishing spam via RSS – we need sample data) but, let’s say, at any given time, only a few target URLs are being marketed using spam and that, within a period of a week, those URLs are no longer being used. If I’m subscribed to your spam feed and you are able to republish spam more or less as it happens, I can use your data to update my local blacklist before a spammer gets round to working on my blog. After, say, a week I automatically discard old information, which is in my interest for performance – long blacklists mean more overhead in searching them.

    The basic principle would probably need to be that you only republish spam or attempted spam made against your own blog, making sure you only announce unique attempts (uniqueness defined by the contents of the spam itself).

    The critical points would then seem to be reducing the time between spam identification and republishing to the absolute minimum, and how you actually use RSS to republish, to make the data easy to extract.
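
    As a sketch of the publishing side (the format here is a guess, not an agreed standard – putting the spamvertised URL in each item’s <link> just seems the easiest thing to extract):

        <?php
        // spam-feed.php – a sketch of the republishing side (the format is a guess, not
        // a standard). Each unique spam attempt caught here becomes an RSS item, with
        // the spamvertised URL in <link> so consumers can extract it easily.
        header('Content-Type: text/xml');

        // In reality these would come from the blog's own log of rejected comments
        $attempts = array(
            array('url' => 'http://pills.example.com/', 'caught' => time() - 3600),
        );

        echo '<?xml version="1.0"?>' . "\n";
        echo "<rss version=\"2.0\">\n<channel>\n";
        echo "<title>Comment spam caught at example.org</title>\n";
        echo "<link>http://example.org/</link>\n";
        echo "<description>Unique spam attempts from the last 7 days</description>\n";
        foreach ($attempts as $a) {
            echo "<item>\n";
            echo '<title>' . htmlspecialchars($a['url']) . "</title>\n";
            echo '<link>' . htmlspecialchars($a['url']) . "</link>\n";
            echo '<pubDate>' . date('r', $a['caught']) . "</pubDate>\n";
            echo "</item>\n";
        }
        echo "</channel>\n</rss>\n";
        ?>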

    I’ve only been half following this discussion but one thought that came to mind was this. What if your comment form also required the user to answer a simple question. The question could be something like “what is four plus eight” or “what is the second letter of the english alphabet”.

    In theory, one problem is there will always be a finite number of questions and answers (so someone could “reverse engineer” your database), whereas captchas allow, effectively, an infinite number of combinations (and without needing a database).

    In practice I think it would work, at least for the near-term future, as it significantly raises the amount of effort required to script a solution.
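
    A sketch of how that might look (the questions, field names and session handling are all just for illustration):

        <?php
        // A sketch of the simple-question idea: pick a random question, remember the
        // answer in the session, and check it when the comment is POSTed.
        session_start();

        $questions = array(
            'What is four plus eight?'                           => '12',
            'What is the second letter of the english alphabet?' => 'b',
            'Type the word "orange" backwards'                   => 'egnaro',
        );

        if (isset($_POST['comment'])) {
            $given = strtolower(trim($_POST['challenge']));
            if (!isset($_SESSION['challenge_answer']) || $given != $_SESSION['challenge_answer']) {
                die('Sorry, wrong answer to the spam-check question.');
            }
            // ...store the comment as usual...
        }

        // Show the form with a freshly chosen question
        $question = array_rand($questions);
        $_SESSION['challenge_answer'] = strtolower($questions[$question]);
        ?>
        <form method="post" action="">
        <textarea name="comment"></textarea>
        <p><?php echo $question; ?> <input type="text" name="challenge" /></p>
        <input type="submit" value="Post" />
        </form>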

  • http://www.shaunhills.com hillsy

    Just thinking out loud…

    Would it be possible to distribute a centralised blacklist via BitTorrent or similar? Obviously still have the problem of innocent IPs getting onto it, but might resolve a few bandwidth and DDoS issues.

  • http://www.phppatterns.com HarryF

    Would it be possible to distribute a centralised blacklist via BitTorrent or similar?

    That’s an interesting angle. What Marcus suggests, using RSS, could amount to the same thing in a way, especially if you throw in blacklist aggregators.

    And there’s a whole new angle of hype and buzz – P2P web services! ;) Actually Google tells me that’s already happened.

    Obviously still have the problem of innocent IPs getting onto it

    Still think the way to go with blacklists is not IP addresses but rather the target URLs that spammers are trying to get PageRank for. URLs are also unique…

  • http://www.passivekid.com/ nathanj

    I’ve got an idea… What if you passed a random long string via GET, and selected two 5-character substrings from it as the names of the form fields (so that they can’t pass values to the fields without knowing what the names are each time)? Then on the submission page you just select the same points in the string and read the submitted data, or do whatever you want with it. :)

    I don’t know if this is a good idea. Although they could read the parsed HTML from the browser and get the field names. :(

  • Ren

    URLs would be the way to go imo too, or even perhaps a combination of the URL and the content it goes to.

    The problem comes when a blog spammer subscribes, and then knows which URLs not to attempt to post. You could hash them, but then would bloggers like comments being rejected without knowing exactly why?

    Vipul’s Razor whiplash signature scheme seems to just take URLs into account.

  • http://www.phppatterns.com HarryF

    Vipul’s Razor whiplash signature scheme seems to just take URLs into account.

    Interesting stuff. Looks like Razor has already got all of this worked out.

  • http://dougal.gunters.org/ EMCampbell3

    For those who are thinking about implementing IP blacklists (to block the spammer sources, as opposed to the destinations), I’d like to repeat a suggestion I’ve made before: Keep track of timestamps, and automatically expire (most) entries after a set amount of time.

    Otherwise, a faulty/ephemeral entry could hang around forever, causing problems for random users in dynamic IP pools. Of course, you’ll want facilities for “permanent” blocks in the case of certain repeat offenders, but you probably want to expire most entries after a few days go by without detecting further abuse from a particular address.

  • http://mgaps.highsidecafe.com BDKR

    I’ve noticed on b2-based blogs that the author and URL fields are both (normally) hit with terms (including URLs) that are easy to check against an existing list. When I took this approach plus the Turing test, my spam intake dropped to close to nothing.

  • http://www.lastcraft.com/ lastcraft

    Hi.

    There is a good reason for using IPs – performance. You can write an IP list straight into your .htaccess “deny from” section. Spam can hit DoS proportions at times if you have a highish PageRank, and you don’t want to be running off to databases or parsing content except for difficult cases. IPs make a good first line of defence.

    As for timeouts, I often find that Chinese IPs do recur. It looks like someone allows them to buy IP blocks for spammer farms. Judging by the number of US gambling sites being promoted in the spam, it looks like foreign ISPs are being used for anonymity.

    That said, blog/wiki authors should not alienate the ISPs or they won’t carry this kind of service. Generating abuse reports has to go hand in hand with banning.

    yours, Marcus

  • Ren

    Does anyone know where to find a load of blog comment spam examples, complete with the original blog posts they were added to?

  • http://www.phppatterns.com HarryF

    There is a good reason for using IPs – performance. You can write an IP list straight into your .htaccess “deny from” section. Spam can hit DoS proportions at times if you have a highish PageRank, and you don’t want to be running off to databases or parsing content except for difficult cases. IPs make a good first line of defence.

    Good point – that way the requests aren’t getting as far as PHP.

    At the same time, I think that should be implementable without needing blacklists, e.g. if you get more than X POSTs from a given IP address within some time period, block it temporarily (perhaps only for a few minutes) using a .htaccess file. Gonna need a decent .htaccess parser though… and something “cron-like”.
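
    Something like this, purely as a sketch – the file names, limits and the assumption that a cron-like job later strips the temporary deny lines are all made up:

        <?php
        // A very rough sketch: count POSTs per IP in a flat file and, past a threshold,
        // append a temporary "deny from" line to .htaccess. File names and limits are
        // made up, and something cron-like (not shown) would have to strip the
        // temporary lines again after a few minutes.
        $ip     = $_SERVER['REMOTE_ADDR'];
        $limit  = 5;   // max POSTs allowed...
        $window = 300; // ...within this many seconds

        $counts = @unserialize(@file_get_contents('post_counts.dat'));
        if (!is_array($counts)) {
            $counts = array();
        }

        // Keep only this IP's hits that fall inside the window, then record this one
        $recent = array();
        if (isset($counts[$ip])) {
            foreach ($counts[$ip] as $t) {
                if ($t > time() - $window) {
                    $recent[] = $t;
                }
            }
        }
        $recent[]    = time();
        $counts[$ip] = $recent;
        file_put_contents('post_counts.dat', serialize($counts));

        if (count($recent) > $limit) {
            // Over the limit: block the address at the Apache level
            // (assumes .htaccess already has an "order allow,deny / allow from all" section)
            $fp = fopen('.htaccess', 'a');
            fwrite($fp, "deny from $ip # temp block " . date('Y-m-d H:i:s') . "\n");
            fclose($fp);
            header('HTTP/1.0 403 Forbidden');
            exit;
        }
        ?>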

  • c. s.

    “What if your comment form also required the user to answer a simple question. The question could be something like ‘what is four plus eight’ or ‘what is the second letter of the english alphabet’.”

    The problem becomes “what do you do about multi-lingual communities?” I personally have come across at least 2 weblogs that comment and post in English and one other language (I think German for one, Swedish or Norwegian for the other). How will you accommodate them?

  • DMerriman

    I’ve long maintained that going after the spamvertisers is the ultimate solution; I’m wondering if anyone has considered writing something to visit the spamvertised site, locate a “contact us” email, and automagically fire off an email letting them know of the spam attempt (including details).

  • http://dotancohen.com dotancohen

    “What if your comment form also required the user to answer a simple question. The question could be something like ‘what is four plus eight’ or ‘what is the second letter of the english alphabet’.”

    The problem becomes “what do you do about multi-lingual communities?” I personally have come across at least 2 weblogs that comment and post in English and one other language (I think German for one, Swedish or Norwegian for the other). How will you accommodate them?

    On DotanCohen.com I have both Hebrew and English pages. In Hebrew pages I have Hebrew spam-questions, and on English pages I have the question in English. Not difficult.
