Programming - By Harry Fuecks

Automated Blog Comment Spam?

Via Simon: “MT Plus Comment Spam Equals Dead Site”. The subject of blog comment spam bothers me, not so much as a problem in itself but because there’s a lot of people talking about it (and suffering from it) while, at the same time, there’s little real technical analysis.

Have to say I don’t have first-hand experience of dealing with blog comment spam (SitePoint administer these blogs), so perhaps I’m the wrong person to make suggestions, but I’m going to do so anyway, from the standpoint of someone who knows the technologies involved. Shoot me if I’m wrong, preferably with technical reasons.

First, I have yet to fully answer for myself whether the problem is primarily human beings manually posting spam, or automated processes (scripts). I assume the answer is both, but that the bigger problem is the latter, given the volumes being generated in some places, to the point of denial of service against Movable Type.

In the former case, armies of sad gnomes paid to post links for PageRank, the only decent technical solution would seem to be “blacklisting”: maintaining lists of patterns (URLs / words) which should be blocked from posts.

From a quick scan of what people are doing so far, no one seems yet (correct me if I’m wrong) to have established some kind of live blacklisting service which is open to all to read and can be updated (automatically) by trusted bloggers. There’s already plenty of experience with XML-RPC based services in the blogosphere, so it shouldn’t be a giant leap. It should be possible to build a service where an attempted comment spam on a single blog results in a blacklisting which propagates almost immediately to all other blogs subscribed to the service. Under that kind of scheme it may also be possible to age out old entries which spammers have lost interest in (reducing the processing overhead of searching huge lists).
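To make the ageing idea concrete, here is a minimal sketch in Python. It is only an in-memory stand-in for such a service: the class name `SharedBlacklist`, its methods and the `max_age_seconds` parameter are all made up for illustration, and a real service would sit behind something like XML-RPC and only accept reports from trusted bloggers.

```python
import time


class SharedBlacklist:
    """In-memory stand-in for a shared blacklist service (hypothetical)."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._last_seen = {}  # pattern -> time of the most recent report

    def report(self, pattern, now=None):
        """A trusted blog reports a spam pattern (URL fragment or word)."""
        now = time.time() if now is None else now
        self._last_seen[pattern.lower()] = now

    def expire(self, now=None):
        """Age out patterns nobody has reported within max_age_seconds."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age_seconds
        self._last_seen = {
            p: t for p, t in self._last_seen.items() if t >= cutoff
        }

    def is_spam(self, comment_text, now=None):
        """True if the comment contains any pattern that is still live."""
        self.expire(now)
        text = comment_text.lower()
        return any(pattern in text for pattern in self._last_seen)
```

A pattern that keeps getting reported stays live, because each report refreshes its timestamp; one that spammers have abandoned drops off the list without anyone having to prune it by hand.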

In the latter case, automated spam via scripts, it strikes me there’s room to make life very difficult for spam script developers (to the point of it not being worth the effort) by considering the nature of the scripts themselves and what’s involved in writing them.

The most likely tools, from where I stand, for writing spam scripts are Perl plus LWP::UserAgent (or similar, like LWP::Simple), PHP plus PEAR::HTTP_Request (or possibly Snoopy) and Python plus httplib. Perhaps Internet Explorer via COM is being used?

To be able to write a spam script using these tools requires at least some knowledge of programming and the HTTP protocol plus time to write it. Sure you don’t have to be a genius and a simple script doesn’t take long to write but still it requires a little more talent and effort than “Hello World”.

I’d be reasonably willing to hazard a bet that the number of people actually using spam scripts is much higher than those writing them (what skilled developer wants to waste much time on this?). In other words someone writes the script then distributes it to a group who lack the skill to make significant modifications to it.

It’s also worth noting that spammers are focusing on blogs running apps like Movable Type, which offer a standard HTTP API for posting comments. What that suggests to me is the spamming scripts are primitive, probably containing hard-coded form field names and perhaps hard-coded (relative) URLs to POST to. In other words, varying the URLs / form fields on the server will break the scripts.

So number one would be for blog app vendors to make the comment API unique to a given installation of their application (e.g. generated during the setup process).

Also, giving the server-side the ability to vary the form API on a per-request basis would present a moving target. For a browser which fetched a fresh copy of the comment form, this should be no problem but the basic script now needs modification.

The implementation could be as simple as having a list of different comment field sets, each set with a unique identifier (sent to the browser in a hidden form field), and making the list individual to the installation of a blogging app. Each time the form is displayed to a browser, the names of the form fields are different, selected by the server from the list. When the form is submitted, the unique identifier tells the server what form field names to expect.
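A rough sketch of that scheme in Python, with no web framework involved. The function names (`make_field_sets`, `render_form`, `parse_submission`), the hidden field name `set_id` and the logical field names `author` and `text` are all assumptions for illustration, not anything a blogging app actually ships.

```python
import secrets


def make_field_sets(count=5):
    """Run once at install time: random field names, unique to this blog."""
    sets = {}
    for _ in range(count):
        set_id = secrets.token_hex(4)
        sets[set_id] = {
            "author": "f_" + secrets.token_hex(4),
            "text": "f_" + secrets.token_hex(4),
        }
    return sets


def render_form(field_sets):
    """Pick a field set at random and build the comment form around it."""
    set_id = secrets.choice(list(field_sets))
    names = field_sets[set_id]
    html = (
        '<form method="post">'
        '<input type="hidden" name="set_id" value="%s">'
        '<input type="text" name="%s">'
        '<textarea name="%s"></textarea>'
        '</form>' % (set_id, names["author"], names["text"])
    )
    return set_id, html


def parse_submission(field_sets, posted):
    """Map posted fields back to logical names; None means reject the POST."""
    names = field_sets.get(posted.get("set_id"))
    if names is None:
        return None
    try:
        return {logical: posted[actual] for logical, actual in names.items()}
    except KeyError:
        return None
```

A script that POSTs hard-coded names such as `author` and `text` gets `None` back, while a browser that submits the form it was actually served goes through untouched.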

The spam script will now need to start parsing the web page to extract the field names, increasing its complexity by an order of magnitude.

And to catch out scripts which are parsing the page, some random dummy form fields, visually hidden to a browser using CSS, could be used to identify and block the scripts.
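The dummy field check might look something like the sketch below. The decoy names, the `extra` CSS class and both function names are invented here, and for brevity the names are fixed, where the idea above would have them randomised. It assumes the blog’s stylesheet contains a rule along the lines of `.extra { display: none }`.

```python
# Hypothetical decoy field names; a real setup would randomise these.
DECOY_FIELDS = ("website2", "email_confirm")


def decoy_markup(decoy_fields=DECOY_FIELDS):
    """Form fragment holding the decoys, hidden from people via CSS."""
    inputs = "".join(
        '<input type="text" name="%s" value="">' % name
        for name in decoy_fields
    )
    return '<div class="extra">%s</div>' % inputs


def looks_automated(posted, decoy_fields=DECOY_FIELDS):
    """A person never sees the decoys, so any value in one is suspicious."""
    return any(posted.get(name, "").strip() for name in decoy_fields)
```

A script that scrapes the form and fills in every input it finds gives itself away, while a visitor with a browser leaves the decoys empty without ever knowing they exist.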

The script now has to parse both HTML and CSS, then work out how the CSS relates to the HTML. That’s not something you can do with 5 minutes’ hacking.

There’s more that can be done by exploiting capabilities that a browser has but a script hasn’t, perhaps the first place to look being JavaScript. There basically isn’t a scripting language capable of fully interpreting JavaScript and providing all the native JavaScript objects a browser has. For starters, setting a cookie with JavaScript, which the server will require before allowing a POST, requires the script to both extract this information from the JavaScript and send the correct cookie header (more complexity required). And at the extreme end of the scale, XMLHttpRequest could be used, once the page has already loaded, to fetch some further critical pieces of information needed to post a comment.
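As a sketch of the cookie idea, the server side could hand out a signed token inside a snippet of JavaScript and refuse any POST that doesn’t bring it back as a cookie. Everything named here (`SECRET_KEY`, `issue_token`, `cookie_script`, `token_is_valid`, the `comment_token` cookie) is an assumption for illustration, using only Python’s standard `hmac`, `hashlib` and `secrets` modules.

```python
import hashlib
import hmac
import secrets

# Per-installation secret (assumption: generated once and kept server-side).
SECRET_KEY = secrets.token_bytes(32)


def _sign(nonce):
    return hmac.new(SECRET_KEY, nonce.encode(), hashlib.sha256).hexdigest()


def issue_token():
    """Create a random nonce plus a signature only this server can make."""
    nonce = secrets.token_hex(8)
    return nonce + "." + _sign(nonce)


def cookie_script(token):
    """JavaScript for the comment page; only a real browser will run it."""
    return (
        '<script>document.cookie = "comment_token=%s; path=/";</script>'
        % token
    )


def token_is_valid(cookie_value):
    """Check the cookie sent with the POST; a missing cookie fails."""
    nonce, sep, sig = (cookie_value or "").partition(".")
    if not sep:
        return False
    return hmac.compare_digest(sig.encode(), _sign(nonce).encode())
```

A determined script can still dig the token out of the page source, so this raises the cost rather than closing the door, which is the point of the exercise. A fuller version would presumably also expire tokens, which this sketch leaves out.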

That’s just some specific ideas. What seems to be the situation right now is we’re looking for unbreakable solutions. Unfortunately, coming up with something which is both unbreakable and user-friendly is unlikely to happen.

Seems to me the easier path is simply to get into a development arms race with spammers, one which will be “invisible” to a normal visitor with a browser. Take it to the point where so much development time and skill is required to write a spamming tool that it’s no longer worth the effort. If someone does manage to write a spamming tool for your blog, at least you’ll know they were one of the core Mozilla development team.

Anyway – that’s the view as I see it from afar. Say the word if it’s wrong.
