  1. #1
    SitePoint Member
    Join Date
    Nov 2013
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

Blocked bots still getting through?

I'm currently blocking bots from backlink-checker services such as Majestic and Ahrefs, yet my links are still appearing in their search data. Does anybody have a good, up-to-date list of bots to block so these services stop showing off your backlinks?

  2. #2
    SitePoint Zealot Kronomia
    Join Date
    Oct 2013
    Posts
    100
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)

    Whitelist of Bots - Complete Guide

    It is much better to whitelist the good bots than to block bad bots one by one. My personal list of good bots is below.

    Comprehensive list of good bots to whitelist:

    Code:
    User-agent: googlebot
    Disallow:
    
    User-agent: googlebot-mobile
    Disallow:
    
    User-agent: googlebot-image
    Disallow:
    
    User-agent: bingbot
    Disallow:
    
    User-agent: msnbot
    Disallow:
    
    User-agent: slurp
    Disallow:
    
    User-agent: Teoma
    Disallow:
    
    User-agent: yandex
    Disallow:
    
    User-agent: sogou
    Disallow:
    
    User-agent: baiduspider
    Disallow:
    
    User-agent: exabot
    Disallow:
    
    User-agent: gigabot
    Disallow:
    
    User-agent: facebookexternalhit
    Disallow:
    
    User-agent: twiceler
    Disallow:
    
    User-agent: scrubby
    Disallow:
    
    User-agent: robozilla
    Disallow:
    
    User-agent: nutch
    Disallow:
    
    User-agent: ia_archiver
    Disallow:
    
    User-agent: naverbot
    Disallow:
    
    User-agent: yeti
    Disallow:
    
    User-agent: yahoo-mmcrawler
    Disallow:
    
    User-agent: yahoo-blogs/v3.9
    Disallow:
    
    User-agent: psbot
    Disallow:
    
    User-agent: asterias
    Disallow:
    
    User-agent: java
    Disallow:
    
    User-agent: wget
    Disallow:
    
    User-agent: curl
    Disallow:
    
    User-agent: commons-httpclient
    Disallow:
    
    User-agent: python-urllib
    Disallow:
    
    User-agent: libwww
    Disallow:
    
    User-agent: httpunit
    Disallow:
    
    User-agent: phpcrawl
    Disallow:
    
    User-agent: *
    Disallow: /
    Disallow: /cgi-bin/
    There are also some WordPress plugins that block a large portion of those bad bots, but the most effective technique is to whitelist the decent ones.
    Remember that not all bots obey your robots.txt file, so you may need to block the rest at the server level before they reach your website.

    There is also a nice technique for identifying bad bots called a "bad bot blackhole": a script that traps the bots that disobey your robots.txt file. You can search for it online to find out more; a rough sketch of the idea follows.
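    The sketch below only illustrates the concept; the /blackhole/ path and the hidden link are example names, not part of any particular script. You disallow a trap URL in robots.txt and link to it invisibly from your pages. Compliant crawlers never request a disallowed URL, so any client that does is ignoring robots.txt, and the script behind that URL can log its IP address and ban it.

    Code:
    # robots.txt -- a trap URL that no compliant bot should ever fetch
    # (shown on its own; the whitelist above already blocks unknown bots site-wide)
    User-agent: *
    Disallow: /blackhole/

    <!-- hidden somewhere in your pages; humans never see or follow it -->
    <a href="/blackhole/" rel="nofollow" style="display:none">trap</a>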

    Hope that helps!
    Last edited by Kronomia; Nov 12, 2013 at 01:14. Reason: Correcting Typing Mistakes

  3. #3
    SitePoint Member
    Join Date
    Nov 2013
    Posts
    5
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Kronomia View Post
    It is much better to whitelist the good bots than to block bad bots one by one. ...
    Very nice, thank you. What about blocking Ahrefs and Majestic, though?

  4. #4
    SitePoint Zealot Kronomia
    Join Date
    Oct 2013
    Posts
    100
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by nuprofile View Post
    Very very nice thank you. What about blocking ahrefs and majestic though?
    We are whitelisting, not blacklisting, so they will get blocked automatically by the catch-all rule at the end.

    I am glad that was helpful.

    Good luck with your projects and thanks for your kind words.

  5. #5
    SitePoint Mentor Mikl
    Join Date
    Dec 2011
    Location
    Edinburgh, Scotland
    Posts
    1,607
    Mentioned
    66 Post(s)
    Tagged
    0 Thread(s)
    Unfortunately, if the bots you are trying to block are in any way malicious (which is what I understood from the original question), then robots.txt is not going to block them. Robots.txt is a voluntary protocol: it requests that a bot not visit the site, but it cannot prevent it from doing so. By definition, a malicious bot can and will ignore it.

    It's much better to use .htaccess to do the blocking, if your server supports it, or an equivalent method if it doesn't.
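    For example, on an Apache server with mod_setenvif available, a few lines in .htaccess can refuse those requests outright. This is only a sketch: AhrefsBot and MJ12bot are the user-agent strings that Ahrefs and Majestic publish for their crawlers, and you should test the rules on your own setup before relying on them.

    Code:
    # .htaccess -- deny requests whose User-Agent matches the Ahrefs or Majestic crawler
    SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
    SetEnvIfNoCase User-Agent "MJ12bot" bad_bot

    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    On Apache 2.4 you would instead use Require all granted plus Require not env bad_bot inside a <RequireAll> block. Bear in mind this only stops bots that identify themselves honestly; anything can fake its user-agent string.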

    Mike

