Results 1 to 18 of 18
  1. #1
    Photo Adventurer DebNCgal's Avatar
    Join Date
    Sep 2005
    Location
    Lewisville, NC (USA)
    Posts
    300
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Anyone Dealt with the Baiduspider Bot?

    A few days ago, I installed the WordPress Global Translator plugin. Since then I've noticed a lot of new spiders/bots coming in, which I suppose is normal.

    However, one bot in particular, the Baiduspider bot, is disregarding the robots.txt instructions by going where it should not go. The disallow instructions I placed in the robots.txt file for the bot don't work, either. The bot is also using a number of different IP addresses, so I'm not sure the IP address could be used to deny it access.

    I've read that Baiduspider is a search engine from China. One of the translations I set up with the plugin was the Simplified Chinese translation.

    I hope Baiduspider is not scraping my site, but it seems to be visiting every nook and cranny, including image folders. My site is a photo blog, so I'm concerned about my blog photos getting swiped on a large scale by this bot.

    Every time I check the "Latest Visitors" section of my CPanel, Baiduspider is either currently on my site or has recently been back. It's been coming and going a lot within the last few days.

    Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site? I'd like to think it's a harmless bot. But even if it is harmless, I still don't like the fact that it's disregarding the robots.txt file. Should I be concerned about this bot?

    Thanks for any assistance.

    Deb
    Deb Phillips
    The Photo Gal

  2. #2
    SitePoint Zealot zealus's Avatar
    Join Date
    Jan 2004
    Location
    NY
    Posts
    132
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It's a search bot operated by the Chinese search engine Baidu.

    Robot Name: BaiDuSpider
    Agent_String: Baiduspider+(+http://www.baidu.com/search/spider.htm)
    URL: http://www.baidu.com/search/spider.htm
    IP Addr: 220.181.32.11 220.181.32.16 220.181.32.22 220.181.32.49 220.181.32.51 220.181.32.64 220.181.32.68 220.181.32.98 220.181.50.207 220.181.50.220 61.135.168.131 61.135.168.14 61.135.168.173 61.135.168.39

    More information can be found here: http://www.useragentstring.com/pages/Baiduspider/

    You can ban IP addresses on your server/domain to prevent Baidu from indexing your web site. However, if you have no problem with Google indexing your pictures, it's hard to see why you would have a problem with Baidu.

  3. #3
    Photo Adventurer DebNCgal's Avatar
    Join Date
    Sep 2005
    Location
    Lewisville, NC (USA)
    Posts
    300
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thanks for all the specific info, zealus, especially for all the IP addresses.

    I actually don't allow Google to index my images. I don't mind being indexed by Baiduspider, but it's set to do whatever it wishes, with no regard to the robots.txt file. Google and a few other bots, on the other hand, at least abide by the robots.txt file.

    I guess it's just the typical battle-of-the-bots world: a love/hate relationship!

    Thank you!
    Deb Phillips
    The Photo Gal

  4. #4
    SitePoint Addict ameRie's Avatar
    Join Date
    Jul 2007
    Location
    currently in South East
    Posts
    284
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I hadn't heard about this before. It's odd that these Baidu bots disregard the robots.txt file. Am I right about that?

  5. #5
    Photo Adventurer DebNCgal's Avatar
    Join Date
    Sep 2005
    Location
    Lewisville, NC (USA)
    Posts
    300
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Well, let me back up a bit. Reviewing my robots.txt file, I see that Baiduspider doesn't actually appear to have disregarded the specific disallows I had in place when it first started coming by. However, a few days ago I inserted the following into the robots.txt file:

    Code:
    User-agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)
    Disallow: /
    I just checked the Latest Visitors panel of my CPanel, though, and there were several instances of "Agent: Baiduspider+(+http://www.baidu.com/search/spider.htm)" on the site within the last few hours. So unless I've entered the above Baiduspider disallow incorrectly in my robots.txt file, it looks to me as though the disallow statement completely banning the bot from the site is being disregarded, because it's clearly still coming by.

    I don't really think it's worth my time to attempt to block Baiduspider via my .htaccess file by trying to account for all the scores of IP addresses that are listed at http://www.useragentstring.com/Baiduspider_id_248.php for this specific version of the Baiduspider.

    But perhaps, out of a degree of ignorance, I'm making more of this than I should; I'm not sure. If I were to succeed at disallowing the bot, would my site not be indexed at all for China?

    However, I don't allow the Googlebot-Image bot to index the individual photos on my site, and my fear is that Baiduspider might be doing exactly that.

    I admit, this area is somewhat new territory for me, so any corrective thoughts are appreciated. The only thing I do know is that there is an incredible amount of activity going on by the Baiduspider bot, and it has raised some question marks for me.

    Thanks.
    Deb Phillips
    The Photo Gal

  6. #6
    SitePoint Zealot ~kev~'s Avatar
    Join Date
    Jul 2008
    Location
    East Texas
    Posts
    139
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by DebNCgal View Post
    I hope Baiduspider is not scraping my site, but it seems to be visiting every nook and cranny, including image folders. My site is a photo blog, so I'm concerned about my blog photos getting swiped on a large scale by this bot.
    Put a blank index.html file in the images folder. This will block public viewing of the folder's contents. In the head of that blank index.html file, put a robots noindex meta tag.
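    For what it's worth, a minimal version of that placeholder file could look like this (the meta robots tag is the standard way to say "don't index this page"; noindex, nofollow is my assumption about what you'd want):

    Code:
    <html>
    <head>
    <!-- Tell compliant crawlers not to index this placeholder page -->
    <meta name="robots" content="noindex, nofollow">
    <title></title>
    </head>
    <body></body>
    </html>

    Note that this only stops the folder's contents being listed in a browser; a crawler that already knows an image's direct URL can still fetch the image file itself.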

  7. #7
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    1
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    That sounds very good. But I want to get some Chinese translation software. Can you help me? Thanks a million.

  8. #8
    SitePoint Member
    Join Date
    Dec 2009
    Posts
    4
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)


    Hey, I know what you mean. Just this morning 72 freaking Baidu bots attacked my site. I don't mind the attention, but I don't think it's needed, because they all look through the same pages. I'm thinking about using a '*' as a wildcard to ban those rascals' IPs, e.g. 220.181.7*. I need to save bandwidth for actual people.
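    For what it's worth, Apache's Deny directive accepts a bare IP prefix, so you don't even need a literal '*'. Something like this in .htaccess (a sketch, using the 220.181 prefix mentioned above) should cover the whole range:

    Code:
    Order allow,deny
    # A partial address matches every IP beginning with these octets
    Deny from 220.181
    Allow from all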

  9. #9
    SitePoint Zealot
    Join Date
    May 2009
    Location
    Singapore
    Posts
    157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You will be facing plagiarism problems soon. People in mainland China often don't treat plagiarism as a big deal.

  10. #10
    SitePoint Member
    Join Date
    Jan 2010
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    DebNCgal

    I too have received many visits from baidu. Every day there are between twelve and fifteen hits; always in pairs, sometimes three at a time. Most IP addresses start with either 123 or 220; a few start with 119. One of the two main IPs always gets a 404 code while the other(s) get a 200 code.

    Some time back baidu seemed to be taking my photos (my site has hundreds of photographs). I turned on hotlink protection in CPanel. Since then, baidu only checks the existence of my site but does not crawl pages at all.
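    For reference, the hotlink protection that CPanel turns on is just mod_rewrite rules written into .htaccess, roughly like the following (a sketch; example.com stands in for your own domain):

    Code:
    RewriteEngine On
    # Allow empty referers and requests coming from this site itself
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    # Refuse image requests from anywhere else with a 403
    RewriteRule \.(gif|jpe?g|png)$ - [F,NC]

    Requests for images whose Referer header points at some other site get a 403 Forbidden instead of the image.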

    I am currently compiling my own list of IP addresses that baidu uses. I plan to block them all once there are no new numbers on the list.

  11. #11
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    I believe the user agent you should be using for this spider in robots.txt is simply "Baiduspider". Not the full user agent string. Give it a try.

    http://www.baidu.com/robots.txt

    It's no different from Google, which asks that you use "Googlebot", not "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)".
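    In other words, the robots.txt record would simply be:

    Code:
    User-agent: Baiduspider
    Disallow: /

    Crawlers match on that short token, not on the full agent string that shows up in your visitor logs.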

  12. #12
    SitePoint Member
    Join Date
    Jan 2010
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Dan,

    I appreciate the advice. However, baidu is no longer scanning any files on my site -- ever since I enabled Hotlink Protection. What it is doing is filling up my stats with visits to " / ". Plus, every second or third visit gets a 404, which skews the information I am working with for monitoring my website. As much trouble as it will be to block all of the various IP addresses, I think that is what I will do -- unless you can suggest some other alternative.
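    One alternative (my own suggestion, not something I've tested against Baidu specifically) is to match on the user-agent string instead of chasing IP addresses, since the agent string stays the same while the IPs keep changing. On Apache you could put this in .htaccess:

    Code:
    # Flag any client whose User-Agent contains "Baiduspider"
    BrowserMatchNoCase Baiduspider bad_bot
    Order allow,deny
    Deny from env=bad_bot
    Allow from all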

  13. #13
    SitePoint Member
    Join Date
    Aug 2010
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    "Can anyone offer some insight on the Baiduspider bot and what, if anything, can or should be done to deny it access to my site?"

    Baidu should be denied access to your server, and below is a suggestion of what you could do.

    If your web server is Apache, you can return a '403 Forbidden' error message by editing the .htaccess file in the root of your server path, e.g.:

    Code:
    Order allow,deny
    Deny from 119.63.192.0/21
    Deny from 123.122.0.0/20
    Deny from 220.181.0.0/16
    Allow from all

    Even better, if you have PHP on your web server, you can make Baidu wait up to 999 seconds for a page request.
    See: http://gelm.net/How-to-block-Baidu-with-PHP.htm

  14. #14
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    127
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I like the 999-second thing.
    I've been thinking about this A LOT lately; IMO, searching is technically content theft. A bot's adherence to your robots.txt file is the only thing that makes it even remotely acceptable in my eyes. So yeah, I say tie the little blighter up.

    Those little bot buggers eat up bandwidth, especially on sites run on an out-of-the-box CMS where something like 90% of the files are useless and never seen by real users (but I'm way too lazy to sift through them, or to put in a hideously long robots.txt file, because there is no "Allow" functionality).
    Patriotism is the virtue of the vicious.

  15. #15
    SitePoint Zealot
    Join Date
    Jul 2010
    Posts
    100
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Okay, the thing that must be done is to verify anything before downloading or installing it. Inform yourself first about the "secondary effects" of a program, so you won't have problems.

  16. #16
    SitePoint Enthusiast
    Join Date
    Aug 2009
    Posts
    98
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    China does things differently on the internet.

    Baidu will gather information about your site, and then it will decide for itself whether it's a site it will allow or disallow. Baidu is a search engine. It was the one competing with Google while Google was still in town, and now Baidu has taken over completely again. It was still #1 even when Google was there, anyway.

    Anyone know what Baidu means in Chinese? Does it mean "search", or is it another random made-up name like Google?

  17. #17
    SitePoint Member
    Join Date
    Oct 2008
    Posts
    20
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

  18. #18
    SitePoint Zealot
    Join Date
    Jan 2008
    Location
    Dublin, CA
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Can you show us what your robots.txt says? Maybe something was typed incorrectly?

