  1. #1
    SitePoint Enthusiast Shane Is My Name
    Join Date
    Oct 2009
    Location
    New York
    Posts
    65
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Bots are causing high memory usage and my host wants me to fix it

    I'm getting warnings from my shared host (Lunarpages) saying my CPU & memory usage is high. I'm guessing they want me off shared hosting, but it looks like a lot of the traffic isn't even from humans... how do you prevent bots from racking up 53,000 hits? lol. I know I can IP-deny individual bots, but I keep doing that, and then Lunarpages just sends me another list of the new top ten IPs... is this just something everyone deals with?


    CPU Usage - 7.53%
    MEM Usage - 1.23%
    Number of MySQL procs (average) - 0.14
    Top Process %CPU 66.00 [php]
    Top Process %CPU 65.00 [php]
    Top Process %CPU 31.50 [php]


    Top 10 of 88026 Total Sites By KBytes
    # Hits % Files % KBytes % Visits % Hostname
    1 53683 3.61% 0 0.00% 703394 2.79% 0 0.00% 113.128.7.109
    2 24799 1.67% 0 0.00% 421146 1.67% 0 0.00% 219.139.116.166
    3 30716 2.07% 0 0.00% 414266 1.64% 0 0.00% 113.128.31.117
    4 15159 1.02% 0 0.00% 270894 1.07% 0 0.00% 119.130.163.150
    5 15309 1.03% 1 0.00% 259043 1.03% 1 0.00% hn.kd.ny.adsl
    6 11284 0.76% 0 0.00% 204741 0.81% 0 0.00% 113.128.9.138
    7 11526 0.78% 0 0.00% 149724 0.59% 0 0.00% 121.29.126.70
    8 3686 0.25% 0 0.00% 139829 0.55% 0 0.00% 115.187.229.179
    9 7412 0.50% 0 0.00% 139187 0.55% 0 0.00% 115.218.107.247
    10 4103 0.28% 683 0.11% 134515 0.53% 107 0.16% spider-199-21-99-112.yandex.com
    Shane

  2. #2
    SitePoint Mentor
    Mikl
    Join Date
    Dec 2011
    Location
    Edinburgh, Scotland
    Posts
    1,538
    Mentioned
    63 Post(s)
    Tagged
    0 Thread(s)
    Are these legitimate bots, like search engines, etc? If so, then the obvious solution is to use robots.txt to block them.
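    For a well-behaved crawler, a couple of lines like these are all it takes (the user-agent name is just an example; use the names you actually see in your logs):

    User-agent: ExampleBot
    Disallow: /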

    But that won't help if they are some sort of malware. In that case, are they coming from any particular country? If so, you could consider blocking the entire range of IP addresses, but then you'd also be blocking legitimate visitors from that country.
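    On Apache, blocking a range can be done in .htaccess. For example, several of the heavy hitters in your log sit in 113.128.x.x, so something like this would cover them (the /16 range here is illustrative; check the real allocation with a whois lookup before using it):

    Order Allow,Deny
    Allow from all
    Deny from 113.128.0.0/16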

    Maybe somebody else will have a better suggestion.

    Mike

  3. #3
    Life is not a malfunction
    TechnoBear
    Join Date
    Jun 2011
    Location
    Argyll, Scotland
    Posts
    6,057
    Mentioned
    253 Post(s)
    Tagged
    5 Thread(s)
    You could try something like CrawlProtect.

  4. #4
    SitePoint Mentor
    John_Betong
    Join Date
    Aug 2005
    Location
    City of Angels
    Posts
    1,804
    Mentioned
    73 Post(s)
    Tagged
    6 Thread(s)
    Try this in your robots.txt file - it worked for me when I had Gigabytes of Russian bots:

    User-agent: *
    Crawl-delay: 10
    # Rule ignored by Googlebot

    Google accepts the rule but ignores it.

  5. #5
    SitePoint Mentor
    Mikl
    Join Date
    Dec 2011
    Location
    Edinburgh, Scotland
    Posts
    1,538
    Mentioned
    63 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by John_Betong View Post
    Try this in your robots.txt file - it worked for me when I had Gigabytes of Russian bots
    The problem with robots.txt (as I mentioned earlier) is that, if the bots are malignant, they won't take any notice of it.

    Mike

  6. #6
    SitePoint Mentor
    John_Betong
    Join Date
    Aug 2005
    Location
    City of Angels
    Posts
    1,804
    Mentioned
    73 Post(s)
    Tagged
    6 Thread(s)
    Quote Originally Posted by Mikl View Post
    The problem with robots.txt (as I mentioned earlier) is that, if the bots are malignant, they won't take any notice of it.

    Mike
    I forgot to mention that the "Gigabytes of robots" were hitting my site on a daily basis.

    Does this fall into the malignant category?

  7. #7
    SitePoint Mentor
    Mikl
    Join Date
    Dec 2011
    Location
    Edinburgh, Scotland
    Posts
    1,538
    Mentioned
    63 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by John_Betong View Post
    I forgot to mention that the "Gigabytes of robots" were hitting my site on a daily basis.

    Does this fall into the malignant category?
    I don't know what you mean by "Gigabytes of robots". In general, if the bot comes from a reputable company, like Google or Alexa, then it will respect robots.txt. These bots are generally well behaved and won't cause any problems with your hosting.

    But if the bot has some nefarious purpose, like harvesting email addresses, then it won't take any notice of robots.txt, and you'll have to find some other way of blocking it.

    Mike

  8. #8
    SitePoint Wizard
    Join Date
    Oct 2005
    Posts
    1,832
    Mentioned
    5 Post(s)
    Tagged
    1 Thread(s)
    More and more bots such as brandwatch.net are crawling sites looking for things said about clients. They can suck down a lot of data and use a lot of resources. robots.txt isn't going to block them. Your best bet is to use htaccess.

    There are a couple of ways you can do it. You can block the bots using their user agent string. I have had success with this but have not been able to block Baiduspider no matter how many different permutations I have tried in htaccess. I have blocked the rest of the bots, though. When they visit they get a 403 Forbidden error page. I put this in htaccess:

    #Block bots.
    SetEnvIfNoCase User-Agent "^baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "^baidu" bad_bot
    SetEnvIfNoCase User-Agent "^baidu*" bad_bot
    SetEnvIfNoCase User-Agent "^Baiduspider/2.0" bad_bot
    SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
    SetEnvIfNoCase User-Agent "^YandexBot" bad_bot
    SetEnvIfNoCase User-Agent "^magpie-crawler" bad_bot

    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

    magpie-crawler is brandwatch.net.

    As I said, I have not been successful at blocking Baidu using this method but have blocked everything else. I'm going to have to resort to using an IP address range to block Baidu because the user agent string isn't working.
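    For what it's worth, the likely reason those ^baiduspider patterns never match is that Baidu's full user agent string begins with "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)", so a pattern anchored to the start of the string can never hit it. Dropping the anchor should work - a sketch:

    SetEnvIfNoCase User-Agent "baiduspider" bad_bot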

    Another method you can use is to rewrite based on the user agent string as described here:

    http://www.spanishseo.org/block-spam-bots-scrapers

    I don't know how efficient that is.
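    For reference, the rewrite version of the same idea looks roughly like this (the bot names are illustrative):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (baiduspider|magpie-crawler) [NC]
    RewriteRule .* - [F,L]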

    Also see this for more ideas:

    http://www.askapache.com/htaccess/setenvif.html

  9. #9
    SitePoint Member
    Join Date
    Aug 2010
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Shane Is My Name View Post
    I'm getting warnings from my shared host (Lunarpages) saying my CPU & memory usage is high. [...] Is this just something everyone deals with?
    If your web server is Apache:
    Quote from http://www.uk-cheapest.co.uk/blog/2010/11/how-do-i-block-the-baiduspider-from-crawling-my-site/ :
    "You can easily disable the Baidu spider by placing the following in your .htaccess file:

    BrowserMatchNoCase Baiduspider bad_bot
    Deny from env=bad_bot

    Using this method saves you the trouble of having to find blocks of Baidu IP addresses and block them individually."

    However, since it seems you have PHP on your server and since Baidu ignores your robots.txt, how about a reverse DoS?

    I block Baidu differently on my web server. When Baidu requests my default web page, some PHP code delays the request for 999 seconds. This keeps one IP socket busy on both my server and the Baidu server until the default IP 'timeout error' occurs, which keeps Baidu from bothering other web servers for a minute or so. It is, in a way, a reverse DoS attack on Baidu. Zero (0) bytes are transferred.
    http://gelm.net/How-to-block-Baidu-with-PHP.htm
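    The core of the idea is only a few lines of PHP. This is a sketch of the approach, not the exact code from the link above, and it assumes a simple user-agent check is enough to single out Baidu:

    <?php
    // Hold Baidu's connection open without sending anything back.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($ua, 'baiduspider') !== false) {
        set_time_limit(0); // don't let max_execution_time kill the script
        sleep(999);        // tie up the socket until the client times out
        exit;              // zero bytes transferred
    }
    ?>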

  10. #10
    SitePoint Wizard TheRedDevil
    Join Date
    Sep 2004
    Location
    Norway
    Posts
    1,196
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by gelmce View Post
    I block Baidu differently on my web server. When Baidu requests my default web page, some PHP code delays the request for 999 seconds. [...]
    Be VERY careful with this. I would never recommend it as an option; instead, you should block the bot at the firewall level, though that is of course not so easy when you're on a shared server.

    The reason I don't recommend using sleep() in PHP (or any other language, for that matter) here is that if someone finds out you are doing this, it becomes very easy to take your server "down": each sleeping request ties up a web server thread, so an attacker can use up all of the available threads and deny access to real customers.

  11. #11
    ¬.¬ shoooo... logic_earth
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    If you ask me... Lunarpages' network is managed by idiots. If they had any idea what they were doing, they would block the traffic that is causing high loads on their servers. They could easily do this with any good firewall. For example, a real human would not open a dozen connections in the span of a few seconds, so you can drop clients that exceed a connection-rate threshold.
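    For example, with iptables' recent module you can drop any IP that opens a dozen new connections within ten seconds (the numbers are illustrative):

    iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update --seconds 10 --hitcount 12 -j DROP
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set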
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


