Comment Spam Compiled and Interpreted


Following on from Automated Blog Comment Spam? and the feedback (many thanks), figured I’d compile (and interpret) some of it into something more ordered.

Gnomes or Robots?

The answer to who (or what) is posting comment spam seems to be both: sad gnomes with little life and automated scripts / programs. That being the case, the conclusion I still have is that different approaches are required if we want to prevent human-submitted spam vs. script-submitted spam (emphasis on the prevent – see “Remove the Incentive” below).

Have yet to find any hard figures, but I also imagine the more serious problem is spam automation, based on anecdotal evidence of attacks on some of the well-known blogging apps, as well as the solutions people have adopted which had a dramatic effect on reducing spam. Obviously any automated process is capable of generating quantities vastly greater than anything possible via manual data entry.

No Bars to Legitimate Use

…or the “Accessibility Curse”. There seems to be general agreement that posting a comment on a blog must be easy for legitimate users. In fact the ideal scenario is that legitimate users are not impacted at all by whatever spam protection mechanisms are in place.

Some people are willing to require user sign-up / authentication and have found that’s already enough to discourage spammers. The risk, though, is discouraging legitimate use. Also, as sites like Hotmail have discovered, it’s quite possible to automate registration and login with scripts, although it’s a lot more work. Really I think it suggests that making your comment posting API more complex is enough to discourage today’s breed of spammers (more on that shortly).

There was some talk about the use of captchas, to sift out the humans from the scripts. The key arguments against were focused on accessibility for legitimate users: are the images actually readable? what about the visually impaired? A couple of answers there – check out the ASCII-based captchas Wez uses on his blog – very readable but still requiring a PhD in Computer Science to analyse programmatically. Also check out Colin’s thoughts on Turing, With Audio.

Another question on captchas and ingenious ways to circumvent them was raised a while back by Christian here. People seem to have reacted to this like “The End of Captchas!”. In fact I expect this has only happened rarely and it’s also not difficult to stop anyway – either research hotlinking prevention or use Wez’s ASCII captchas which are, by nature, not hotlinkable.
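For reference, hotlinking prevention amounts to something like this – a minimal sketch only (example.com stands in for your own domain, and all the filenames are illustrative), where the captcha script refuses to serve the image to pages on other sites and ties the answer to the visitor’s session;

```php
<?php
// captcha.php - a sketch only; 'example.com' stands in for your domain.
session_start();

// Refuse to serve the image to pages on other sites, defeating the
// "embed someone else's captcha and let strangers solve it" trick.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
if (parse_url($referer, PHP_URL_HOST) !== 'example.com') {
    header('HTTP/1.0 403 Forbidden');
    exit;
}

// Remember the answer in the session, so a correct solution is only
// valid for the visitor who was actually shown this image.
$answer = substr(md5(uniqid(rand(), true)), 0, 6);
$_SESSION['captcha_answer'] = $answer;

// Render the answer as a basic PNG with GD.
header('Content-Type: image/png');
$im = imagecreate(120, 40);
imagecolorallocate($im, 255, 255, 255);   // first allocation = background
imagestring($im, 5, 15, 12, $answer, imagecolorallocate($im, 0, 0, 0));
imagepng($im);
imagedestroy($im);
```

Bear in mind some browsers and proxies strip the Referer header, so a real implementation needs a friendlier fallback than a flat 403.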

Although it’s possible to implement captchas in a secure and accessible manner, they’re still an extra step for legitimate users, plus I believe they’re overkill for the problem. What’s required is not actually sifting out the human users but rather sifting the legitimate user agents (web browsers) from the scripts…

Preventing Automation

For me there’s now enough anecdotal evidence to suggest that making your posting API a little more complex is enough to block scripts posting spam automatically.

One comment mentioned Pete Bowyer’s simple but effective solution, which requires a single extra step by users with a web browser but would need more than just LWP::Simple to be scripted.

Elsewhere a WordPress user described the immediate effect on spam of simply renaming the POST URL. One of the comments following from that was particularly interesting;

The renaming trick works for most of the spam robots – as long as you remember to delete wp-comments-post.php off your server too, as somebody mentioned :p There are, however, a few robots out there which seem to parse the entire index.php file to find what the comments file name is. I’ve also changed the comment form variables but still a few get through, probably because the robot parses the comments form and gets the variable names too. So, as somebody mentioned, this is like the cold war where you have to adapt constantly to keep ahead of the spammers.
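As an aside, if you do rename the handler, the old URL can be put to work as a tripwire – a sketch (the log file path is illustrative) which assumes anything still POSTing to wp-comments-post.php is a robot working from a stale copy of the form;

```php
<?php
// wp-comments-post.php - after renaming the real comment handler,
// leave a tripwire at the old URL. Only robots working from a stale
// copy of the form should ever POST here, so log them for blacklisting.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $line = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . "\n";
    file_put_contents('/var/log/comment-spam-ips.txt', $line, FILE_APPEND);
}
header('HTTP/1.0 410 Gone');
```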

For those that go so far as parsing forms, there’s Spam Stopgap Extreme;

This prevents spammers from automatically scraping the form, because anyone wanting to submit a comment *must* execute the javascript md5.
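The shape of the trick is something like this – a sketch of the idea rather than the plugin’s actual code, and it assumes a client-side MD5 library along the lines of Paul Johnston’s md5.js (which provides hex_md5());

```php
<?php
// A sketch of the "must execute JavaScript" idea, not the plugin's code.
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // Only a client that actually ran the JavaScript can supply this.
    $expected = md5($_SESSION['challenge']);
    if (!isset($_POST['digest']) || $_POST['digest'] !== $expected) {
        die('Comment rejected - the posting client never ran the challenge.');
    }
    // ... save the comment as usual ...
}

// Issue a fresh challenge with each rendering of the form.
$_SESSION['challenge'] = md5(uniqid(rand(), true));
?>
<script type="text/javascript" src="md5.js"></script>
<form method="post" action=""
      onsubmit="this.digest.value = hex_md5('<?php echo $_SESSION['challenge']; ?>');">
  <textarea name="comment" rows="5" cols="40"></textarea>
  <input type="hidden" name="digest" value="">
  <input type="submit" value="Post Comment">
</form>
```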

That leaves spammers hunting for a JavaScript runtime they can use… Having suggested something similar myself, of course people pointed out that some users surf with JavaScript disabled. Another angle might be a form in which every field appears more than once, with the page’s CSS deciding which copies a human actually sees and fills in – see the sketch below.
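A sketch of that idea (every name here – the fields, the ids, comment.php – is illustrative);

```php
<?php
// Serve two sets of Name / Email fields; the CSS hides the decoy set,
// so a human only fills in the visible one. The uniqueId in the POST
// URL lets the server remember, per form, which set is live.
session_start();

$real     = rand(1, 2);                   // which field set is live
$decoy    = ($real === 1) ? 2 : 1;
$uniqueId = md5(uniqid(rand(), true));
$_SESSION['form_' . $uniqueId] = $real;   // remember server-side
?>
<style type="text/css">
  /* A parser just sees two identical sets of fields */
  #name<?php echo $decoy; ?>, #email<?php echo $decoy; ?> { display: none; }
</style>
<form method="post" action="/comment.php?id=<?php echo $uniqueId; ?>">
  Name: <input type="text" name="name1" id="name1">
        <input type="text" name="name2" id="name2"><br>
  Email: <input type="text" name="email1" id="email1">
         <input type="text" name="email2" id="email2"><br>
  Comment: <textarea name="comment" rows="5" cols="40"></textarea><br>
  <input type="submit" value="Post Comment">
</form>
```

On the server, comment.php would use the stored session value to decide which set of fields to trust.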
The knowledge of which form fields are actually meant to be filled in is contained in the CSS. If they get as far as parsing that, it could be made more difficult by relating styles to tags via CSS class selectors. The uniqueId in the POST URL identifies which set of fields contain the real data while a script which parses the form could be fooled into submitting data in the wrong fields, thereby identifying itself. Anyway – serves as yet another possible solution in the arms race…

Blacklisting

Thanks to a tip-off from Amit, it turns out there is already a central service to help with blacklisting, described here. There’s also this WordPress plugin which uses some of the RBL (Realtime Blackhole List) services that have evolved for dealing with email spam.
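For the record, the lookup those services perform is simple: reverse the octets of the IP and resolve it under the list’s zone. A minimal sketch (the zone name is just an example, and rejecting outright is deliberately crude);

```php
<?php
// A sketch of a DNSBL lookup: for IP a.b.c.d you resolve
// d.c.b.a.<zone>; if the name resolves at all, the IP is listed
// (most lists answer with an address in 127.0.0.x).
function ip_is_blacklisted($ip, $zone = 'sbl-xbl.spamhaus.org')
{
    $lookup = implode('.', array_reverse(explode('.', $ip))) . '.' . $zone;
    // gethostbyname() returns its input unchanged when the lookup fails
    return gethostbyname($lookup) !== $lookup;
}

if (ip_is_blacklisted($_SERVER['REMOTE_ADDR'])) {
    die('Comments from this address are currently blocked.');
}
```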

If we’re headed in that direction, I guess techniques that have been employed to combat email spam (e.g. Bayesian filters) are worth researching.

Regarding RBLs and blacklisting, this paper (the subject being email spam) highlights some of the problems. In fact, reading that, almost all of the problems being described, apart from “Collateral Damage and Legitimate Users”, relate to RBLs being centralized services.

Bearing that in mind, Marcus’s suggestion could well be the way to go;

RSS would provide a distributed solution.

Not just that, it attaches a name to the data, allowing “consumers” to pick who they trust for their blacklists, rather than a central service where data is provided anonymously.

There’s also a built-in mechanism for keeping the data fresh and managing bottlenecks. Each blogger keeps their own blacklist, which is periodically updated from other people’s feeds. There’s probably a Web Service-killing insight hidden in there as well – something like: “A distributed and scalable Web is not a normalized Web” – but that’s another story…
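A consumer of those feeds might look something like this – purely a sketch, with the feed URLs, the blacklist file and the one-URL-per-item format all made up for illustration;

```php
<?php
// Merge the blacklist feeds of people you trust into your own local
// list. The trust lives right here: you choose whose feeds to pull.
$trusted = array(
    'http://example.org/spam-blacklist.rss',
    'http://example.net/spam-blacklist.rss',
);

$blacklist = @file('blacklist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($blacklist === false) {
    $blacklist = array();
}

foreach ($trusted as $feedUrl) {
    $feed = @simplexml_load_file($feedUrl);
    if ($feed === false) {
        continue;                            // unreachable feed, retry next run
    }
    foreach ($feed->channel->item as $item) {
        $url = trim((string) $item->link);   // one blacklisted URL per item
        if ($url !== '' && !in_array($url, $blacklist)) {
            $blacklist[] = $url;
        }
    }
}

file_put_contents('blacklist.txt', implode("\n", $blacklist) . "\n");
```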

Remove the Incentive

Simon pointed out how he uses redirects to eliminate PageRank, basically preventing Googlebot from associating any ranking with links posted in comments.
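The general technique (not necessarily Simon’s exact implementation) is to rewrite any URL posted in a comment so it passes through a local redirect script, then disallow that script in robots.txt – Googlebot never follows the links, so no PageRank is passed. A sketch;

```php
<?php
// redirect.php - links in comments are rewritten to point here, and
// robots.txt carries "Disallow: /redirect.php", so search engine bots
// never follow them and no PageRank flows to the spammed sites.
$url = isset($_GET['url']) ? $_GET['url'] : '/';

// Only forward well-formed http(s) URLs, to avoid becoming an open
// redirector for anything worse.
if (!preg_match('#^https?://#i', $url)) {
    $url = '/';
}

header('Location: ' . $url, true, 302);
exit;
```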

Personally I still think that eliminating PageRank is the best solution, simply because it attacks the economics of comment spam. As email spam has shown, as long as there’s an economic incentive, spammers will take more and more advanced steps to avoid filters and counter-measures.

Simon’s approach seems to have been highly effective, judging from the lack of spam he gets. Technically I guess this violates the principle of “no bars to legitimate use” – what if you want legitimate users to be able to post links and have Google associate PageRank with them? It also assumes you’re dealing with “smart spammers” who realise what you’ve done – it doesn’t actually prevent spam, and a “dumb spammer” may post anyway.

Markus made a similar remark;

There is a third party involved here that could do a lot to help. If we had a simple way of reporting the spam links to Google then the incentive could be destroyed at source. Google could drop any spam promoted website.

To an extent that’s already a possibility, as Simon described here.

Economics

Diana C. told the story of how she dealt with one comment spammer (at the end);

Within 24 hours, I got a response from a wholesale pill supplier, who explained that they received copies of the diet-pills web site’s emailed feedback, and they apologized for the spam, and told me that they were immediately discontinuing their wholesale relationship with the diet-pills web site because they have a strict anti-spam policy.

If that’s representative of comment spammers, they’re simply acting as (semi-authorized) middle-men in a marketing process. One non-technical approach may be to shift the pressure onto the suppliers with “naming and shaming” for those who fail to keep their own house in order.

Finally, an amusing economic spin, for those looking for opportunities, is Kitten
