What are my options to block visitors that are not real web users?

Hi,

When I check the Hosts section in my AWStats, I see a lot of IP addresses from many countries whose Pages and Hits counts are mostly equal. For example, one IP has 126 pages and 126 hits; another has 223 pages and 225 hits. Regular visitors seem to have something like 2 pages and 8 hits, or 3 pages and 14 hits.

Considering my pages should get 3+ hits each on average due to style, script and image files, I am assuming these visitors are not real visitors but scrapers, crawlers and the like. These visitors seem to be consuming the majority of my bandwidth, and I don’t want them to continue stealing it, assuming they are not Google or other search engine bots.

I know I can block IPs individually or in ranges in .htaccess; however, that won’t be a real solution, as most such operators change their IPs frequently. What are my options for blocking such visitors algorithmically?

One option that comes to mind is to check whether the visitor is accessing the site via a real browser. I’m not sure that will be easy, considering the large number of browsers, and I guess it’s possible to fake it anyway.

Another option that comes to mind is to block an IP algorithmically if it sends a certain number of hits within a short period. For example, if an IP sends 20 hits in 5 seconds, block it for a certain period of time. Is this possible?
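Something like this rough sketch is what I picture (hypothetical and untested; it assumes a PHP site with a writable temp directory for per-IP counters, and the thresholds are just the examples above):

```php
<?php
// Rough sketch: ban an IP that sends more than 20 hits in 5 seconds.
// File-based purely for illustration; a real version would need
// locking, cleanup of old files, and a whitelist for good bots.
$ip      = $_SERVER['REMOTE_ADDR'];
$hitFile = sys_get_temp_dir() . '/hits_' . md5($ip);
$banFile = sys_get_temp_dir() . '/ban_' . md5($ip);

// Already banned? Refuse for one hour.
if (file_exists($banFile) && time() - filemtime($banFile) < 3600) {
    header('HTTP/1.1 403 Forbidden');
    exit('Too many requests.');
}

// Record this hit and keep only timestamps from the last 5 seconds.
$hits   = file_exists($hitFile) ? unserialize(file_get_contents($hitFile)) : array();
$hits[] = time();
$hits   = array_filter($hits, function ($t) { return $t > time() - 5; });
file_put_contents($hitFile, serialize($hits));

// Over the threshold? Ban this IP.
if (count($hits) > 20) {
    touch($banFile);
    header('HTTP/1.1 403 Forbidden');
    exit('Too many requests.');
}
```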

Any other options that I can’t think of?

Thanks.

One method I have used is the “Blackhole” script discussed here. It detects crawlers that ignore the robots.txt and crawl where they shouldn’t, and bans their IP automatically.
I also added some script to pages which access databases; it detects URL tampering and forwards the offenders to the blackhole too.
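In rough outline, the trap works something like this (a much simplified sketch, not the actual script, and the file names here are made up):

```php
<?php
// blackhole.php - the trap page. It is linked from a hidden nofollow
// link and disallowed in robots.txt, so only misbehaving bots land
// here. (The real script also whitelists known good bots before
// banning anything.)
$banList = __DIR__ . '/banned-ips.txt';
file_put_contents($banList, $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
header('HTTP/1.1 403 Forbidden');
exit('You have been banned from this site.');
```

Then every normal page includes a check against the ban list:

```php
<?php
// Included at the top of every page: turn banned IPs away.
$banList = __DIR__ . '/banned-ips.txt';
$banned  = is_file($banList) ? file($banList, FILE_IGNORE_NEW_LINES) : array();
if (in_array($_SERVER['REMOTE_ADDR'], $banned, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
```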

I’m sure it’s possible. You need to be able to detect the bot’s behaviour, whatever that is, that differs from real users, then act upon it. So any script you create which detects such behaviour can forward the offender to the blackhole.
You may also find this article interesting.


I used an earlier version of Crawl Protect, which seemed quite effective. I haven’t used the current one, though.

You might also want to look into the Blackhole trap discussed in this thread: Blackhole trap for bad bots

(Edit: ninja’d!)


Just a word of warning about detecting bots: you need to be sure you don’t ban the “useful” bots, like Google, as that could affect ranking or indexing.
So you need a method that only picks out ‘bad’ bots, by detecting bad behaviour. So now I’m not so sure about timing the hits.
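If you want to be extra careful about Google specifically, one general technique (not part of the Blackhole script, just a common check) is a reverse-then-forward DNS lookup before banning anything that claims to be Googlebot:

```php
<?php
// General technique: verify a claimed Googlebot by reverse DNS,
// then confirm with a forward lookup, before deciding to ban.
function isGooglebot($ip)
{
    $host = gethostbyaddr($ip); // e.g. crawl-66-249-66-1.googlebot.com
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip; // forward lookup must match
}
```

Anything whose user agent says Googlebot but which fails that check is fair game.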

Thank you both for the tips and links. I will check them and see what I can do. I am basically looking for a simple solution that doesn’t require installing dozens of files or any manual interaction. The Blackhole trap approach seems interesting; I may go with that one.

I know it is not 100% possible to have a perfect prevention system, just trying to learn my options and come up with something that is easy to implement, will be effective to an extent and will not affect my site negatively.

[quote=“SamA74, post:4, topic:214242, full:true”]Just a word of warning about detecting bots: you need to be sure you don’t ban the “useful” bots, like Google, as that could affect ranking or indexing.
So you need a method that only picks out ‘bad’ bots, by detecting bad behaviour. So now I’m not so sure about timing the hits.[/quote]

Yes, unknowingly blocking legit search engine bots is my major concern. It seems the Blackhole approach you mentioned takes care of that nicely.

Another thing to ponder before you spend a lot of time working on this: why are you trying to block these hits? If you have a reason, that’s fine, but other than slightly higher resource usage, even hundreds or thousands of these hits don’t cost you much. If you have hard numbers indicating you’re using most of an allotment of bandwidth, then that’s different. Just saying, make sure you know it’s worth spending time on, and worth risking blocking unintended things, before you do so :smiley:

That all said, I’m a fan of the blackhole method; there are a lot of variations of it, I think, too.

[quote=“jeffreylees, post:6, topic:214242, full:true”]
why are you trying to block these hits?[/quote]

Well, bandwidth is not free, and I don’t want my site or bandwidth being used like that. It’s not on a grand scale at the moment, but I am building and growing multiple sites, and eventually it will become a big issue; even if not financially, it will keep my server and websites busy unnecessarily. I wish I didn’t need to spend time and energy on this, but we have this issue with no real solution, so I’m trying my best to take measures against it. Based on a recent article I read, for the last couple of years the amount of web traffic generated by such use (scrapers, bots etc.) has exceeded normal user traffic.

One concern about the Blackhole approach: it will work against bots that don’t check the robots.txt file, but what if a scraper checks the robots.txt file and simply skips any pages that are forbidden there? EDIT: Perhaps having /sample-page-a/ in the robots.txt file, having the hidden link point to /sample-page-b/, and redirecting via .htaccess, so that the bot can’t make a match?

They do suggest adding an invisible nofollow link to your site for this purpose.

It won’t guard against all bots; it specifically targets naughty bots that ignore nofollow and robots.txt.
But the script is fairly simple. If you have a basic knowledge of php, it can be modified to act on other suspicious behaviour, which is exactly what I did with it.

My point was that, since the approach has been public for a while, the scrapers are also educated about it, so they can still craft naughty bots that check robots.txt files, not to obey them but to adjust their scraping and skip the prohibited pages in order not to fall into the trap, while still scraping your site. That’s why I thought about having two different links (the trap link and the hidden link) with a redirection between them, so the bot will not know it is walking into a trap until it ends up there.

EDIT: The nofollow attribute on the hidden link will still prevent legit bots from following it, hence keeping them out of the trap.

I know PHP. Would it be possible for me to see your script, to get some ideas? If you’d rather not make it public, you could send it via private message. Thanks.

But how will you stop the legitimate bots falling into the trap?

As I said, there is, AFAIK, no 100% way to stop scrapers, but reducing them is better than nothing. If bots are not doing anything wrong, it’s hard to pick them out. A scraper only needs to read your HTML, just like any normal user or bot, so I don’t know how you stop that.
It could be some bots are wise to such traps, I made a point of not calling my blackhole directory “blackhole”, as it’s a bit obvious. Again, catching a few is better than none.

There isn’t much I changed in the actual blackhole script, mostly just differentiating between what was detected, then recording and reporting it.
The other bits are on the pages themselves: they detect suspicious activity, then do a simple header redirect to the blackhole directory.
The other thing I’m detecting is attempted SQL injection on pages with an id variable.
There is the usual preg_replace on the $_GET['id'] to strip out anything that’s not an integer (0-9), which gives the id to use if all is well. But there is also another which strips out everything except unwanted chars: letters, equals signs, quotes and suchlike. If that string is not empty, you get the boot. Basic stuff, but effective.
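Roughly like this (a simplified sketch of what I described, with a made-up directory name):

```php
<?php
// Simplified sketch of the id check described above.
// Keep only digits for the id we actually use...
$id = preg_replace('/[^0-9]/', '', $_GET['id']);

// ...and separately strip out the digits, leaving only the unwanted
// characters (letters, equals signs, quotes and suchlike).
$bad = preg_replace('/[0-9]/', '', $_GET['id']);

// If anything unwanted was present, off to the blackhole.
if ($bad !== '') {
    header('Location: /blackhole/'); // directory name made up here
    exit;
}
```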
I also put blackhole forwarders in my form processing scripts, in case you trip any security alerts with bot-like behaviour there.

Will the nofollow attribute not work for that? Say I have in robots.txt:

```
User-agent: *
Disallow: /blocked/
```

and I have on my page:

```html
<div style="display:none"><a href="/read-more/" rel="nofollow">Read more...</a></div>
```

and in .htaccess:

```
Redirect /read-more/ /blocked/
```

Going one step further, the display:none style could be applied via JS, so even the wiser bots that are programmed to skip links inside display:none elements will not notice it.

Step 4 describes how to set up the link trap.

Yes, it does. In my previous reply, I demonstrated how I would do it as a reply to your question, “But how will you stop the legitimate bots falling into the trap?”. So, I didn’t understand why you pointed me to Step 4.
