A few tens of MB is not that much bandwidth. It also depends on the size of the files included in the page content, such as .css and .js files, and images.
The bots can be from search engines, like Google, Alexa, and so on.
Anyway, try searching the net for "prevent hotlinking".
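As a starting point, here is a minimal .htaccess sketch for blocking hotlinked files (this assumes Apache with mod_rewrite enabled; `example.com` is a placeholder for your own domain):

```apache
RewriteEngine On
# Allow requests with an empty Referer (direct visits, some privacy proxies)
RewriteCond %{HTTP_REFERER} !^$
# Allow requests coming from your own site (replace example.com)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Refuse images, CSS and JS requested from anywhere else
RewriteRule \.(jpe?g|png|gif|css|js)$ - [F,NC]
```

Note the Referer header is sent by the client, so this only deters casual hotlinking; it won't stop a determined bot.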
This is normal… in most cases you want to encourage regular visits by search robots. @MarPlo was correct that you want to look at ways to block hotlinking. You also want to ensure that your hosting provider uses intrusion detection, most likely running Snort, to filter unwanted traffic. This is normally done at your ISP’s firewall level.
Encouraging visits by Google, Bing/Yahoo, and other search engines is one thing. But there are bots out there that you do not want sucking down your data transfer such as Brandwatch. (And for me, Yandex, Baidu, and other non-U.S. bots.) I’ve had bots suck down almost 2 GB of data in a single day. I block the ones I don’t want using htaccess.
There are other methods to attempt to stop scrapers from downloading your entire site. I don’t use them yet.
I block by user agent string and they haven’t changed. But then, these are “legitimate” bots (like Brandwatch). The kind that aren’t trying to hide what they are. The scrapers and other bots trying to harvest your content are not going to be stopped using htaccess unless you can block their IP range.
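For anyone wanting to do the same, a small .htaccess sketch for blocking by user-agent string (assumes Apache with mod_rewrite; `BadBot` and `ScraperBot` are placeholder substrings — substitute the strings the bots you actually see are sending):

```apache
RewriteEngine On
# Match any request whose User-Agent contains one of these substrings
# (placeholders — check your access logs for the real tokens)
RewriteCond %{HTTP_USER_AGENT} (BadBot|ScraperBot) [NC]
RewriteRule .* - [F]
```

As noted above, this only works against bots that identify themselves honestly; anything spoofing a browser user agent will sail straight through.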
About the only thing I can think of to stop a scraper bot is to store information about it in a database, such as its IP address, with a little logic to see how many pages it is accessing. I’ve had scrapers download upwards of 5 pages a second. While I’ve considered writing the code to prevent this, I haven’t yet done it. I don’t think it would be too hard, though. It would require a database access on every pageview, which is a small performance hit.
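The idea above can be sketched in a few lines of Python. This is illustrative only — the table name, window and threshold are assumptions, and a real site would call something like this from every pageview with the client’s IP:

```python
import sqlite3
import time

WINDOW_SECONDS = 10   # only count requests in the last 10 seconds
MAX_REQUESTS = 20     # more than this per window looks like a scraper

def is_scraper(db, ip, now=None):
    """Record this hit, then report whether `ip` exceeds the rate limit."""
    now = time.time() if now is None else now
    db.execute("CREATE TABLE IF NOT EXISTS hits (ip TEXT, ts REAL)")
    db.execute("INSERT INTO hits (ip, ts) VALUES (?, ?)", (ip, now))
    # Drop entries older than the window so the table stays small
    db.execute("DELETE FROM hits WHERE ts < ?", (now - WINDOW_SECONDS,))
    (count,) = db.execute(
        "SELECT COUNT(*) FROM hits WHERE ip = ?", (ip,)
    ).fetchone()
    db.commit()
    return count > MAX_REQUESTS
```

A client downloading 5 pages a second would trip `MAX_REQUESTS` within a few seconds, at which point you could return a 403 or hand the IP to the firewall.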
Thanks, and it is generally not desirable to block entire IP ranges, so .htaccess isn’t the most comprehensive solution.
I prefer using my firewall’s intrusion detection to block high request rates from a single host/user agent, and even DoS attacks. I also use Snort, and sure, sometimes I have to tune it or block some particularly troublesome IPs, but overall this is easiest for what I do.
If your target market is within your country, there’s no sense in allowing hackers from China, Romania, etc., to even visit your website. Therefore, I’d disagree with your statement about blocking IP addresses. Sure, it’s easy to proxy attacks around a block, but that forces yet another step on the hackers (bots).
That said, I agree that your firewall is the optimum solution. :tup:
You can’t (obviously). That’s where the “target market” comes in. Sitting off in NZ, many of my clients are marketing ONLY to NZ, so I can easily test for NZ IP addresses and block everyone else (if that’s what the client wants, of course). There aren’t many proxy servers in NZ, either, so that makes my task easier.
Back to your question, though: you can’t. It then becomes a trade-off for the client whether to block or allow bots. With the advantage going to the bots (changing user agent strings and using proxies), I don’t recommend blocking, ergo my support for your firewall.
Well, some large site operators do block entire IP ranges, and often all IP ranges associated with problem countries. This is often better done at the IP level using iptables or its equivalent. The main problem that large sites tend to have is from scrapers trying to download the site’s entire content. Many scrapers operate from hosting/VPN IP ranges, and consequently these ranges are often candidates to be blocked. With large sites, the standard operating procedure is to block on detection. The only thing that would concern the admin of a large site is whether it is more efficient to block with .htaccess or with a software firewall at the IP level.
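For reference, blocking a range with iptables looks like this (run as root; `203.0.113.0/24` is a documentation prefix standing in for the offending hosting range):

```shell
# Drop all traffic from an example hosting range
iptables -A INPUT -s 203.0.113.0/24 -j DROP

# For a long blocklist, an ipset is far more efficient than many rules
ipset create blocked hash:net
ipset add blocked 203.0.113.0/24
iptables -A INPUT -m set --match-set blocked src -j DROP
```

Unlike an .htaccess deny, this drops the connection before Apache ever spawns a process for it, which is why large sites prefer it.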
Polite crawlers will generally obey robots.txt, and this can be a good first line of defence for those that play by the rules. With blocking by .htaccess, if you are going to block a ‘legitimate’ search engine crawler, it may make some sense to exclude robots.txt itself from the block. Enterprise firewalls are good, but they can be expensive. Sometimes a good blocklist, iptables and perhaps mod_security can handle a lot of the problems a site will encounter.
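A minimal robots.txt along those lines (the `BadBot` token is a placeholder — use the name the crawler actually announces; note that `Crawl-delay` is not honoured by every crawler):

```
# Polite crawlers honour this file; scrapers generally ignore it
User-agent: BadBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
```

Since compliance is voluntary, treat this as the polite first layer, with .htaccess or the firewall behind it for bots that ignore it.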
Yes, for the last 7 years I have used pfSense, originally a fork of m0n0wall. pfSense is completely open source and has IP chaining at its core, plus intrusion detection, CARP (combining multiple WAN connections into one larger pipe, with failover), and virtual LANs (when you have switches that support this). You are quite correct, though, that iptables and mod_security can go quite far; still, I prefer pfSense.
I’ve tried that one on a couple of sites and it seems very effective. I’ve also used this, which blocks bad bots, but doesn’t automatically ban the IP. It does, however, record the IP and give you the option to ban it manually.