Bad Bots Code for Keeping Bad Bots Out Not Working

Sitepoint Members,
In AWStats I see Google bots and other wanted bots, but I also see what I assume are “bad bots”:

Unknown robot (identified by ‘spider’)
Unknown robot (identified by ‘bot*’)
Unknown robot (identified by ‘crawl’)
Unknown robot (identified by hit on ‘robots.txt’)
Unknown robot (identified by ‘robot’)

I used these different forms of .htaccess code to keep the bad bots out, but they have no effect on the bad bots:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^spider$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^crawl$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^robot$
RewriteRule ^(.*)$ - [F,L]

SetEnvIfNoCase User-Agent "^spider" bad_bot
Order Allow,Deny
Allow from All
Deny from env=bad_bot

SetEnvIfNoCase ^User-Agent$ .*(spider|bot|robot|crawl) HTTP_SAFE_BADBOT

Even if I could get some code that will work on some of the bad bots, that would be great. The harder bad bots to keep out might be the “bot*” ones (probably named after the football player Bart Starr) and the ones “identified by hit on ‘robots.txt’”.

Thanks,

Chris


I searched for the answer on SitePoint:
http://search.sitepoint.com/?q=bad+bots&submit=Search&refinements[forums]=1

and found
http://environmentalchemistry.com/badbots/gohere.html
which is obviously for http://environmentalchemistry.com

but I didn’t see any .htaccess or other code in the thread discussing it that would make it work.

Chris,

Your code should work, but ONLY if the bots identify themselves EXACTLY as you’ve specified. Get rid of the start and end anchors to get broad coverage.

Regards,

DK

Dklynn,
Would the anchors be the “^” and “$”? If not, please let me know. Which of these forms of code do you think works best?

Thanks Much,

Chris

Chris,

Yes, the start anchor (“starts with”) is the ^ which is NOT a character but denotes the start of a string; ditto the $ for the end (“ends with”) of the string. Using both means that the string must be exactly what is between the two anchors. Using neither means “contains.”
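To make the anchors concrete, here is a quick sketch using Python’s `re` module, whose syntax matches the PCRE-style regex mod_rewrite uses for patterns this simple. The user-agent string is a made-up example:

```python
import re

# Hypothetical user-agent string; real bad bots rarely send the bare word.
ua = "SuperSpider/1.2 (+http://example.com/spider.html)"

# Fully anchored: the whole string must be exactly "spider" -- no match here.
anchored = re.search(r"^spider$", ua, re.IGNORECASE)

# No anchors: "contains spider" -- matches anywhere in the string.
contains = re.search(r"spider", ua, re.IGNORECASE)

print(anchored is None, contains is not None)  # True True
```

Same idea as DK describes: both anchors mean “is exactly”, no anchors means “contains”.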

Regards,

DK

DKlynn,
Thanks for the help. Do you feel all 3 of these methods work equally well, or do you have a preference?

Also, it seems that according to the person who programmed http://environmentalchemistry.com/badbots/gohere.html
the htaccess methods aren’t enough. Do you know what methods he uses?

Thanks,

Chris

Chris,

Methods? No, they’re using the POWER of regular expressions to identify the 'bot as a 'bot. Since so many of them can (and DO) vary their “signature,” it’s important NOT to look for a single “signature” from them but look for the part that is always there, i.e., the “contains” version (without start or end anchor).

As for your link, it detected (likely using mod_rewrite’s {HTTP_REFERER}) that I had not visited their ‘link from’ page and redirected me to their “bad bot” script. A good example of mod_rewrite at work!

Regards,

DK

DK,
When I say “method” I’m referring to these three methods
1)RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^spider$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^crawl$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^robot$
RewriteRule ^(.*)$ - [F,L]

2)SetEnvIfNoCase User-Agent “^spider” bad_bot
Order Allow,Deny
Allow from All
Deny from env=bad_bot

3) SetEnvIfNoCase ^User-Agent$ .*(spider|bot|robot|crawl) HTTP_SAFE_BADBOT

Do you feel all 3 of these methods work equally well, or do you have a preference?

I definitely will be removing the ^ and $ from whichever of these methods I use.

I’m not sure I understand the last line:
“… it detected (likely using mod_rewrite’s {HTTP_REFERER}) that I had not visited their ‘link from’ page and redirected me to their ‘bad bot’ script.” Do you have any idea if the anti-bad-bot code he/she is using is better than the .htaccess code (methods) we were discussing above?

Thanks DK,

Chris

Chris,

Okay, gotcha!

BECAUSE you have to do SOMETHING to identify the 'bots, I prefer the first (in httpd.conf): the list is an ever-growing one, and the format of the first makes it very clear what you’re doing to identify each 'bot. #2 does the same thing with a lot more effort, as does #3. An advantage to #2 is that there are lists of 'bots in this format that you can modify to use in your own code (correcting the anchor problem).

I would REALLY advise that these be used in the server configuration file, though, as it’s a long list to read and parse on every request.

Because you have access to the server configuration file, though, I’d recommend a RewriteMap to identify the culprits to ban as (1) there is far less code and (2) it should be easier to maintain as a map (text file of pairings - argument and return value) rather than mod_rewrite code.
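One way to realize that RewriteMap idea is a `prg:`-type map: an external program that Apache starts once and keeps running, writing one lookup key per line to its stdin and reading one answer per line from its stdout. A minimal sketch (the fragment list, the file path, and the `bad`/`NULL` answer convention are my own illustrative choices, not from this thread — `NULL` is mod_rewrite’s “no result” value):

```python
#!/usr/bin/env python3
# Sketch of a prg:-type RewriteMap handler for bad-bot user agents.
import sys

# Substrings to ban -- the "contains" test DK recommends, done in code.
BAD_FRAGMENTS = ("spider", "bot", "crawl")

def classify(ua: str) -> str:
    """Return 'bad' if the UA contains any banned fragment, else 'NULL'."""
    ua = ua.lower()
    return "bad" if any(frag in ua for frag in BAD_FRAGMENTS) else "NULL"

if __name__ == "__main__":
    for line in sys.stdin:
        # Each answer must be flushed immediately or Apache will hang.
        print(classify(line.strip()), flush=True)
```

Hooked up with something like `RewriteMap badbot prg:/path/to/badbot.py` in the server config, then `RewriteCond ${badbot:%{HTTP_USER_AGENT}} =bad` followed by `RewriteRule .? - [F]` — those directives are real mod_rewrite syntax, but the map name and path here are placeholders.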

Regards,

DK

Dave,
So I would have this
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} spider [OR]
RewriteCond %{HTTP_USER_AGENT} bot* [OR]
RewriteCond %{HTTP_USER_AGENT} bot [OR]
RewriteCond %{HTTP_USER_AGENT} crawl [OR]
RewriteCond %{HTTP_USER_AGENT} robot
RewriteRule ^(.*)$ - [F,L]

What about that last line, would I take out ^ and $ there too?

Do you have any idea how the programmer of
this page and site
http://environmentalchemistry.com/badbots/gohere.html
http://environmentalchemistry.com

are keeping bad bots out?

Thanks,

Chris

Chris,

What do the ^ and $ do for you?

Okay, assuming you don’t know: the ^ is the start anchor and the $ is the end anchor, so using them to bracket the EVERYTHING atom (.*) is redundant. Worse yet is the fact that you’re capturing the string and then not doing a thing with it. What you need in place of ALL the regex in that RewriteRule is ‘.?’, i.e., one optional character. I consider .? a “placeholder” regex because you don’t care what you’re matching (in the {REQUEST_URI} string); you just want the RewriteRule to execute the redirection (which is constrained only by the RewriteCond statements in your block). Good question!

For you, what in the world are the *'s supposed to do in your RewriteCond statements? I know that bot* means ‘bo’ followed by zero or more t’s. What does * mean when there is no character for this metacharacter to operate on? Again, if you don’t know, it’s undefined and will (possibly) generate an error in the parser. Why not eliminate one of the bot lines and the * in the remaining one? All you want to see is that ‘bot’ is contained, isn’t it?

As for environmentalchemistry, the beauty of mod_rewrite is that it’s all done on the server side so there’s no way to tell HOW the magic is being performed. Since I’d taken my guess above …

Regards,

DK

Chris,

My apologies (sort of) for hammering you about the ‘bot*’ and ‘*bot’ because those ARE how some bots are identified! If you want to match a metacharacter (and * IS a metacharacter), you must escape it, i.e., \*. That said, all you really need is ‘bot’ (contains ‘bot’) as above, so the effect will be to combine the two statements without worrying about escaping “illegal” characters.

Because that would also cover your robot line and you’d want to get rid of the empty string for {HTTP_USER_AGENT}, too, I’d recommend using:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} spider [OR]
RewriteCond %{HTTP_USER_AGENT} bot [OR]
RewriteCond %{HTTP_USER_AGENT} crawl
RewriteRule .? - [F,L]
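Sketching those four conditions as a single Python predicate shows what they catch (note two assumptions of mine: I use case-insensitive matching, whereas the conditions above have no [NC] flag and are case-sensitive as written, and the user-agent strings are made up):

```python
import re

# The three "contains" fragments from the RewriteCond lines.
BAD = re.compile(r"spider|bot|crawl", re.IGNORECASE)

def banned(ua: str) -> bool:
    """Empty UA (the ^$ condition) or UA containing a bad fragment."""
    return ua == "" or BAD.search(ua) is not None

assert banned("")                           # the ^$ empty-UA condition
assert banned("NightCrawler/0.1")           # contains "crawl"
assert banned("Googlebot/2.1")              # contains "bot" -- collateral damage
assert not banned("Mozilla/5.0 (X11; Linux x86_64; rv:109.0)")
```

The Googlebot line is worth noticing: an unanchored ‘bot’ catches wanted bots too, which is exactly the concern raised later in this thread.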

However, I’m not sure that the Last flag is required (Apache.org doesn’t show it, but their examples may not be valid for .htaccess files with more RewriteRules); it shouldn’t hurt to leave it in.

[edit]WARNING: Unlike Chris, I left the [OR] flag on the last RewriteCond which enabled the Fail on EVERY request (every request matched .? as intended for the redirect). This proves that I make mistakes, too, but I did find the error quickly. Consider this an object lesson for everyone!

As for the Last flag mentioned above, it didn’t affect the result of my request either way so it can work with just the F but also works with the F and Last flag. Take your pick, I suppose - how unusual![/edit]
Regards,

DK

Dave,
I’ll put that code in and tell you the results later this week. You’re the only person I’ve talked to so far who knows much about rewrites and bad bots in .htaccess.

Thanks,

Chris

Dave,

You were saying that ^ before and $ after the name of a bot (e.g., car) constrain the match to that exact name, car, and won’t affect bots named, say, carry or vicar. If you remove the ^ and $, I would guess bots named carry and vicar will be affected. Of course, you were talking about signatures, and I’m not sure what a signature is in this context. So my worry is: will removing ^ and $ from ^bot$ adversely affect the important bot googlebot, which of course has “bot” in its name?

Thanks,

Chris

Chris,

Yes, ‘bot’ without the anchors WILL catch “bot” even within other words. If you want to allow a specific one, it’ll take some care to “punch a hole” in the list for it, or go back to escaping the metacharacters.
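One way to punch that hole is an extra negated condition. In mod_rewrite that would be an additional line such as `RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]` (the `!` prefix and `[NC]` flag are real mod_rewrite syntax; whitelisting googlebot is my example, not DK’s prescription). Sketched in Python:

```python
def banned(ua: str) -> bool:
    """Contains 'bot', but with a hole punched for googlebot."""
    ua = ua.lower()
    # The "not in" test plays the role of the negated RewriteCond.
    return "bot" in ua and "googlebot" not in ua

assert banned("EvilBot/0.9")
assert not banned("Googlebot/2.1 (+http://www.google.com/bot.html)")
assert not banned("Mozilla/5.0 (Windows NT 10.0)")
```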

Regards,

DK

Dave,
So this would be it?:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot$ [OR]
RewriteCond %{HTTP_USER_AGENT} crawl
RewriteRule .? - [F,L]

Thanks Much,

Chris

Dave,
I put in
RewriteCond %{HTTP_USER_AGENT} ^bot*$ [OR]

Chris

Chris,

ESCAPE THE * METACHARACTER!

I.e., \*bot and bot\* to match ‘*bot’ and ‘bot*’.
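To see what escaping the asterisk changes, a quick check with Python’s `re` module (same PCRE-style behavior as mod_rewrite):

```python
import re

# Unescaped, * is a quantifier: "bot*" means "bo" plus zero or more t's,
# so it even matches the plain "bo" inside an unrelated string.
assert re.search(r"bot*", "robo-cop") is not None

# Escaped, "bot\*" matches the literal four characters b, o, t, *.
assert re.search(r"bot\*", "Nasty bot* agent") is not None
assert re.search(r"bot\*", "googlebot") is None   # no literal asterisk
```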

Regards,

DK

Dave,
Woops. You had said, "If you want to match a metacharacter (and * IS a metacharacter), you must escape it, i.e., \*. " Too much thinking, not enough listening on my part.

I put in
RewriteCond %{HTTP_USER_AGENT} ^\*bot$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^bot\*$ [OR]

The backslash goes right before the asterisk.

Thanks Dave,

Chris

Chris,

Still not listening, my friend! Drop the start anchors (unless you really need them), and ditto the end anchors, so these will match the substrings ‘*bot’ or ‘bot*’ anywhere and NOT require an exact match against the entire bot id string.

Regards,

DK