Page 1 of 2 (results 1 to 25 of 37)
  1. #1 Ellysdirectory (SitePoint Enthusiast)

    Bots and Bandwidth

    Hello guys, please take a look and give me some advice.

    Viewed traffic*
    Unique visitors: 2,229
    Visits: 5,069 (2.27 visits/visitor)
    Pages: 19,189 (3.78 pages/visit)
    Hits: 56,236 (11.09 hits/visit)
    Bandwidth: 413.92 MB (83.61 KB/visit)

    Not viewed traffic*
    Pages: 19,803
    Hits: 25,698
    Bandwidth: 142.91 MB

    And this:

    Robots/Spiders visitors (Top 25)
    16 different robots* Hits Bandwidth Last visit
    Unknown robot (identified by 'bot*') 3,162+383 36.68 MB 23 Jul 2012 - 07:29
    Unknown robot (identified by 'robot') 1,795+43 12.11 MB 23 Jul 2012 - 06:59
    Unknown robot (identified by '*bot') 917+489 24.00 MB 23 Jul 2012 - 01:50
    Googlebot 1,304+54 5.96 MB 23 Jul 2012 - 07:26
    Unknown robot (identified by 'spider') 961+84 6.03 MB 23 Jul 2012 - 05:11
    Unknown robot (identified by empty user agent string) 754+15 9.98 MB 23 Jul 2012 - 07:17
    Unknown robot (identified by 'crawl') 662+50 7.78 MB 23 Jul 2012 - 07:29
    Unknown robot (identified by 'checker') 425 11.65 MB 19 Jul 2012 - 01:18
    MSNBot 274+28 1.71 MB 23 Jul 2012 - 06:24
    Unknown robot (identified by hit on 'robots.txt') 0+263 80.57 KB 23 Jul 2012 - 06:38
    Alexa (IA Archiver) 144+46 3.23 MB 22 Jul 2012 - 21:06
    Yahoo Slurp 132+41 917.40 KB 23 Jul 2012 - 04:55
    Voyager 8 0 19 Jul 2012 - 07:48
    Voila 4+3 29.59 KB 20 Jul 2012 - 19:15
    MSNBot-media 1+4 5.79 KB 19 Jul 2012 - 19:07
    Netcraft 1 24.83 KB 06 Jul 2012 - 12:55
    * Robots shown here gave hits or traffic "not viewed" by visitors, so they are not included in other charts. Numbers after + are successful hits on "robots.txt" files.
    I think this is too much bandwidth, so how can I minimise the bandwidth used by bots? Thanks, guys.

  2. #2 MarPlo (SitePoint Addict)
    Hi,
    A few tens of MB is not much bandwidth. It also depends on the size of the files included in your page content, such as ".css" and ".js" files, and images.
    The bots can be from search engines, like Google, Alexa, etc.
    Anyway, try searching the web for "prevent hotlinking".
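
    For reference, the usual .htaccess hotlink-prevention pattern looks something like this. A sketch only; replace example.com with your own domain:

    Code:
    # Refuse image requests whose Referer is another site.
    # Empty Referers (direct visits, some proxies) are still allowed.
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    RewriteRule \.(gif|jpe?g|png)$ - [F,NC]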

  3. #3 Ellysdirectory (SitePoint Enthusiast)
    Really?? So this is normal...

    I thought it was not normal... heheh... sorry for bothering you.

  4. #4 ServerStorm (Foozle Reducer)
    Quote Originally Posted by Ellysdirectory View Post
    Really?? So this is normal...
    Hi,

    This is normal... in most cases you want to encourage regular visits by search robots. @MarPlo was correct that you want to look at ways to block hotlinking. You also want to ensure that your hosting provider uses intrusion detection, most likely Snort, to filter unwanted traffic. This is normally done at the ISP's firewall level.

    You can find one way to filter unwanted hotlinking on an Apache host in dklynn's mod_rewrite tutorial.

    Regards,
    Steve

  5. #5 cheesedude (SitePoint Wizard)
    Encouraging visits by Google, Bing/Yahoo, and other search engines is one thing, but there are bots out there that you do not want sucking down your data transfer, such as Brandwatch. (And, for me, Yandex, Baidu, and other non-U.S. bots.) I've had bots suck down almost 2 GB of data in a single day. I block the ones I don't want using .htaccess.

    http://www.thesitewizard.com/apache/...htaccess.shtml

    There are other methods of trying to stop scrapers from downloading your entire site; I don't use them as yet.
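
    For reference, blocking by user-agent string in .htaccess might look like this. A sketch only; the names below are illustrative, so check your logs for the exact strings each bot sends:

    Code:
    # Mark unwanted bots by their User-Agent, then deny them.
    # Apache 2.2 syntax; the names are examples, not a recommended list.
    SetEnvIfNoCase User-Agent "Yandex" bad_bot
    SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "Brandwatch" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot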

  6. #6 ServerStorm (Foozle Reducer)
    Quote Originally Posted by cheesedude View Post
    Encouraging visits by Google, Bing/Yahoo, and other search engines is one thing...
    How often do the unwanted bots that you try to block change their user-agent string or their IP?

  7. #7 cheesedude (SitePoint Wizard)
    Quote Originally Posted by ServerStorm View Post
    How often do the unwanted bots that you try to block change their user-agent string or their IP?
    I block by user agent string and they haven't changed. But then, these are "legitimate" bots (like Brandwatch), the kind that aren't trying to hide what they are. The scrapers and other bots trying to harvest your content are not going to be stopped using .htaccess unless you can block their IP range.

    About the only thing I can think of to stop a scraper bot is to store information about it in a database, such as its IP address, with a little logic to see how many pages it is accessing. I've had scrapers download upwards of 5 pages a second. While I've considered writing the code to prevent this, I haven't yet done it. I don't think it would be too hard, though it would require database access on every pageview, which is a small performance hit.
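
    A minimal sketch of that idea in PHP, assuming a MySQL table named bot_hits with ip (primary key), hits and window_start columns; the table, column names and credentials are made up for illustration:

    Code:
    <?php
    // Count pageviews per IP in a short window and refuse IPs that
    // request pages faster than a human plausibly could. Sketch only.
    $pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
    $ip     = $_SERVER['REMOTE_ADDR'];
    $limit  = 10;  // max pages per window
    $window = 5;   // window length in seconds

    $stmt = $pdo->prepare('SELECT hits, window_start FROM bot_hits WHERE ip = ?');
    $stmt->execute(array($ip));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row && time() - $row['window_start'] < $window) {
        if ($row['hits'] >= $limit) {
            header('HTTP/1.1 403 Forbidden');  // likely a scraper
            exit;
        }
        $pdo->prepare('UPDATE bot_hits SET hits = hits + 1 WHERE ip = ?')
            ->execute(array($ip));
    } else {
        // Start (or restart) this IP's counting window.
        $pdo->prepare('REPLACE INTO bot_hits (ip, hits, window_start) VALUES (?, 1, ?)')
            ->execute(array($ip, time()));
    }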

  8. #8 ServerStorm (Foozle Reducer)
    Quote Originally Posted by cheesedude View Post
    I block by user agent string and they haven't changed...
    Thanks. It is generally not desirable to block entire IP ranges, so .htaccess isn't the most comprehensive solution.

    Quote Originally Posted by cheesedude View Post
    About the only thing I can think of to stop a scraper bot is to store information about it in a database...
    I prefer using my firewall's intrusion detection to block high request rates from a single host/user agent, and even DoS attacks. I also use Snort, and sure, sometimes I have to tune it or block some particularly troublesome IPs, but overall this is easiest for what I do.

    Thanks,

    Steve

  9. #9 dklynn (Certified Ethical Hacker)
    SS,

    If your target market is within your country, there's no sense in allowing hackers from China, Romania, etc. to even visit your website. Therefore, I'd disagree with your statement about blocking IP addresses. It's too easy to proxy attacks, but blocking forces hackers (and bots) to do one more thing.

    That said, I agree that your firewall is the optimum solution.

    Regards,

    DK

  10. #10 ServerStorm (Foozle Reducer)
    Quote Originally Posted by dklynn View Post
    If your target market is within your country, there's no sense in allowing hackers from China, Romania, etc. to even visit your website...
    Good point! However, how do you best know which IPs/proxy IPs to block without blocking legitimate traffic?

  11. #11 dklynn (Certified Ethical Hacker)
    Steve,

    You can't (obviously). That's where the "target market" comes in. Sitting off in NZ, many of my clients are marketing ONLY to NZ, so I can easily test for NZ IP addresses and block everyone else (if that's what the client wants, of course). There are not many proxy servers in NZ, either, which makes my task easier.

    Back to your question, though: you can't. It then becomes a trade-off for the client whether to block or allow bots. With the advantage going to the bots (changing user-agent strings and using proxies), I don't recommend blocking, ergo my support for your firewall.

    Regards,

    DK

  12. #12 ServerStorm (Foozle Reducer)
    Quote Originally Posted by dklynn View Post
    You can't (obviously). That's where the "target market" comes in...
    Thanks DK!

  13. #13 jmccormac (SitePoint Enthusiast)
    Quote Originally Posted by ServerStorm View Post
    Thanks. It is generally not desirable to block entire IP ranges, so .htaccess isn't the most comprehensive solution.
    Well, some large site operators do block entire IP ranges, and often all IP ranges associated with problem countries. This is often better done at the IP level using iptables or its equivalent. The main problem that large sites tend to have is with scrapers trying to download the site's entire content. Many scrapers operate from hosting/VPN IP ranges, and consequently these ranges are often candidates to be blocked. With large sites, the standard operating procedure is to block on detection. The only thing that would concern the admin of a large site is whether it is more efficient to block with .htaccess or with a software firewall at the IP level.

    Regards...jmcc
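
    For reference, blocking a range at the IP level with iptables, as described above, is a one-liner per range (the CIDR shown is a documentation placeholder, not a real offender):

    Code:
    # Drop all traffic from an offending range (example range only).
    iptables -A INPUT -s 198.51.100.0/24 -j DROP
    # Verify the rule is in place.
    iptables -L INPUT -n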

  14. #14 ServerStorm (Foozle Reducer)
    Quote Originally Posted by jmccormac View Post
    Well, some large site operators do block entire IP ranges...
    For this I would definitely use an enterprise firewall; it gives far more control.

    Regards,
    Steve

  15. #15 jmccormac (SitePoint Enthusiast)
    Quote Originally Posted by ServerStorm View Post
    For this I would definitely use an enterprise firewall; it gives far more control.
    Polite crawlers will generally obey robots.txt, and this can be a good first line of defence for those that play by the rules. When blocking by .htaccess, if you are going to block a 'legitimate' search engine crawler, it may make sense to exclude robots.txt from the block. Enterprise firewalls are good, but they can be expensive. Sometimes a good blocklist, iptables, and perhaps mod_security can handle a lot of the problems a site will encounter.

    Regards...jmcc
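
    For the polite crawlers, that first line of defence is just a robots.txt file at the site root. A sketch; note that Crawl-delay is honoured by some crawlers but not all, and the path below is an example:

    Code:
    # robots.txt: obeyed only by well-behaved crawlers.
    User-agent: *
    Crawl-delay: 10
    Disallow: /images/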

  16. #16 ServerStorm (Foozle Reducer)
    Quote Originally Posted by jmccormac View Post
    Polite crawlers will generally obey robots.txt...
    Yes, for the last 7 years I have used pfSense, originally a fork of m0n0wall. pfSense is completely open source and has IP filtering at its core, plus intrusion detection, CARP for failover, multi-WAN load balancing, and virtual LANs (with switches that support them). You are quite correct, though, that iptables and mod_security can go quite far; still, I prefer pfSense.

  17. #17 Non-Member (United Kingdom)
    Couldn't you set up a trap for bots that don't obey robots.txt?

    You could write a disallow statement for a trap page (one which doesn't matter for SEO), and if a bot visits it anyway, log the IP and then block the IP and the user-agent string.
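
    A minimal sketch of such a trap, assuming a trap URL of /trap.php that is listed as "Disallow: /trap.php" in robots.txt (the file name and log path are made up for illustration):

    Code:
    <?php
    // trap.php: any client requesting this page has ignored robots.txt.
    // Log its IP and user agent for review or for an .htaccess deny list.
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '(empty)';
    $line = date('c') . ' ' . $ip . ' ' . $ua . "\n";
    file_put_contents('/var/www/logs/bad_bots.log', $line, FILE_APPEND | LOCK_EX);
    header('HTTP/1.1 403 Forbidden');
    exit;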

  18. #18
    Life is not a malfunction gold trophysilver trophybronze trophy
    TechnoBear's Avatar
    Join Date
    Jun 2011
    Location
    Argyll, Scotland
    Posts
    6,423
    Mentioned
    274 Post(s)
    Tagged
    5 Thread(s)
    You mean something like this? Although it just blocks the IP, not the user-agent string.

  19. #19 Non-Member (United Kingdom)
    Quote Originally Posted by TechnoBear View Post
    You mean something like this? Although it just blocks the IP, not the user-agent string.


    Has anyone had any experience with this method of protection?

  20. #20
    Life is not a malfunction gold trophysilver trophybronze trophy
    TechnoBear's Avatar
    Join Date
    Jun 2011
    Location
    Argyll, Scotland
    Posts
    6,423
    Mentioned
    274 Post(s)
    Tagged
    5 Thread(s)
    I've tried that one on a couple of sites and it seems very effective. I've also used this, which blocks bad bots, but doesn't automatically ban the IP. It does, however, record the IP and give you the option to ban it manually.

  21. #21 ServerStorm (Foozle Reducer)
    Quote Originally Posted by TechnoBear View Post
    You mean something like this? Although it just blocks the IP, not the user-agent string.
    Very nice recommendation, @TechnoBear. I reviewed the PHP and the approach, and it is pretty solid. This is great if you don't have a full-featured firewall, and it could also be used in conjunction with one.

  22. #22 Non-Member
    Quote Originally Posted by TechnoBear View Post
    I've tried that one on a couple of sites and it seems very effective. I've also used this, which blocks bad bots, but doesn't automatically ban the IP. It does, however, record the IP and give you the option to ban it manually.
    Wow. That is a solid option, and I was surprised it's free! The administrator's dashboard or "panel" is probably the most advanced I've seen; not that the smaller PHP options were much competition, though.

  23. #23 dklynn (Certified Ethical Hacker)
    MBAs are taught to target their audience; the same applies to websites. If you have a client who, by virtue of their product or service, has a very limited target market, then consider allowing only IP blocks which are in that targeted location. I just signed up for a free account at http://ipinfodb.com to use their country service (NZ is a small target market) to limit displaying a contact form to Kiwi-based IPs, and it works a treat!

    Regards,

    DK
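
    A sketch of that kind of country check in PHP, assuming ipinfodb's v3 ip-country endpoint (the endpoint, parameters and field names are from memory, so check their current documentation; the API key and included file are placeholders):

    Code:
    <?php
    // Show the contact form only to visitors whose IP geolocates to NZ.
    $ip  = $_SERVER['REMOTE_ADDR'];
    $url = 'http://api.ipinfodb.com/v3/ip-country/'
         . '?key=YOUR_API_KEY&ip=' . urlencode($ip) . '&format=json';
    $geo = json_decode(file_get_contents($url), true);

    if (isset($geo['countryCode']) && $geo['countryCode'] === 'NZ') {
        include 'contact_form.php';  // hypothetical form include
    } else {
        echo 'Sorry, this form is only available to New Zealand visitors.';
    }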

  24. #24 Non-Member (United Kingdom)
    Quote Originally Posted by dklynn View Post
    MBAs are taught to target their audience; the same applies to websites...
    What about SEO? Aren't you relying on search engines having servers in the same areas as your audience?

  25. #25 dklynn (Certified Ethical Hacker)
    bear,

    Search engines are worldwide, but some specialize in a specific locality. Good point, though: you'd need to "punch a hole" based on the search engines you want to invite in.

    Regards,

    DK
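
    For reference, "punching a hole" on top of a broad block might look like this in .htaccess. A sketch only: the allowed range is a placeholder for your target market's real CIDRs, and in practice you would verify crawler IPs as well, since user agents can be forged:

    Code:
    # Deny everyone except the target market's ranges and the
    # search engines we want to invite in (Apache 2.2 syntax).
    SetEnvIfNoCase User-Agent "Googlebot" good_bot
    SetEnvIfNoCase User-Agent "bingbot" good_bot
    Order Deny,Allow
    Deny from all
    Allow from 198.51.100.0/24
    Allow from env=good_bot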

