  1. #1
    SitePoint Member (iamopensource)

    robots.txt - exclude any URL that contains /node/



    How do I tell crawlers/bots not to index any URL that contains the /node/ pattern? The following rule has been in place since day one, but I've noticed that Google has still indexed a lot of URLs containing /node/, e.g. www.mywebsite.com/node/123/32

    Disallow: /node/

    Is there any directive that says "do not index any URL containing /node/"? Should I write something like the following instead: Disallow: /node/*

    Thanks
    Last edited by TechnoBear; Apr 13, 2012 at 05:48. Reason: Example URL delinkified

  2. #2
    SitePoint Member
    It's very simple; just follow this code:
    User-agent: *
    Disallow: /page-url

    Write the URL of the page you don't want Google to crawl after "Disallow:".
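
    For example, to block a couple of specific pages (just a sketch; the paths are only placeholders):
    Code:
    User-agent: *
    Disallow: /old-page.html
    Disallow: /private/report.html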

  3. #3
    SitePoint Member (iamopensource)
    Hi Steev,

    That's just not possible; there are far too many pages, so only a pattern-based restriction will do.

    Regards

  4. #4
    TechnoBear
    Disallow: /node/ is the correct syntax to disallow crawling of a directory called "node". Disallow: /node/* is incorrect. You can find full details of how to use a robots.txt file here.

    I have read somewhere - I no longer have any idea where - that Google likes to be addressed personally, so you write one version for Google and one for everybody else. e.g.
    Code:
    # For Googlebot
    User-agent: Googlebot
    Disallow: /cgi-bin/
    Disallow: /scripts/
    
    # For all bots
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /scripts/
    I've no idea how reliable that information is, but for the sake of a few bytes, it does no harm to include it. It works for me.

  5. #5
    Stevie D
    Quote Originally Posted by iamopensource View Post
    How do I tell crawlers/bots not to index any URL that contains the /node/ pattern? The following rule has been in place since day one, but I've noticed that Google has still indexed a lot of URLs containing /node/, e.g. www.mywebsite.com/node/123/32

    Disallow: /node/
    That will work for any URL of the form example.com/node/whatever, but it won't work for example.com/something/node/whatever ... is that a problem?

    Quote Originally Posted by TechnoBear View Post
    I have read somewhere - I no longer have any idea where - that Google likes to be addressed personally, so you write one version for Google and one for everybody else. e.g.
    Code:
    # For Googlebot
    User-agent: Googlebot
    Disallow: /cgi-bin/
    Disallow: /scripts/
    
    # For all bots
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /scripts/
    I've no idea how reliable that information is, but for the sake of a few bytes, it does no harm to include it. It works for me.
    I've not seen anything like that. What you can do is to give Googlebot (or any other bot) different restrictions by specifying those first and then doing a catch-all * for 'all others'.
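
    A rough sketch of that (the directory names here are just placeholders):
    Code:
    # Googlebot: general rules plus an extra restriction
    User-agent: Googlebot
    Disallow: /scripts/
    Disallow: /drafts/

    # Every other bot: general rules only
    User-agent: *
    Disallow: /scripts/
    As far as I know, a bot follows only the most specific User-agent section that matches it, so Googlebot would read its own section and ignore the * one.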

  6. #6
    SitePoint Member (iamopensource)
    Quote Originally Posted by Stevie D View Post
    That will work for any URL of the form example.com/node/whatever, but it won't work for example.com/something/node/whatever ... is that a problem?
    The real problem is that despite having:
    Disallow: /node/
    in robots.txt, Google has indexed pages under that path, e.g. www.mywebsite.com/node/123/32

    /node/ is not a physical directory; it's how Drupal 6 displays its content. I guess that's my problem: node isn't a directory, merely part of the URLs Drupal generates for the content.
    Last edited by TechnoBear; Apr 14, 2012 at 02:36. Reason: Example URL delinkified

  7. #7
    topgrade
    You can write the line using wildcards like the one below:

    Disallow: */node/*

  8. #8
    SitePoint Member (iamopensource)
    Thanks topgrade, but that's reported as a syntax error; please check:
    Disallow: */node/*
    at:
    http://www.searchenginepromotionhelp...ts-checker.php

  9. #9
    TechnoBear
    Quote Originally Posted by topgrade View Post
    You can write the line using wildcards like the one below:

    Disallow: */node/*
    You can, but don't expect it to work. If you follow the link I gave above, you'll find:
    Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

  10. #10
    Stevie D
    Quote Originally Posted by iamopensource View Post
    The real problem is that despite having:
    Disallow: /node/
    in robots.txt, Google has indexed pages under that path, e.g. www.mywebsite.com/node/123/32

    /node/ is not a physical directory; it's how Drupal 6 displays its content. I guess that's my problem: node isn't a directory, merely part of the URLs Drupal generates for the content.
    I'm still not seeing any problem. Google couldn't care less whether it's a physical directory or not; all it will do is pattern-match the URL, and if the URL matches the pattern then bingo, it will not send Googlebot down that road.
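
    To put it another way, something like this should be all that's needed (a sketch; the commented URLs are just examples of what the prefix would match):
    Code:
    User-agent: *
    # Blocks crawling of any URL whose path starts with /node/, e.g.
    #   /node/123
    #   /node/123/32
    # It makes no difference whether "node" is a real directory on the server.
    Disallow: /node/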

  11. #11
    SitePoint Member (iamopensource)
    Hi Stevie,

    I checked Google with:
    site:www.mywebsite.com inurl:node
    and it gives me hundreds of results,

    e.g. http://www.mywebsite.com/node/193

    Does this mean Google is not respecting robots.txt?

    My robots.txt has existed since day one.
    Last edited by TechnoBear; Apr 16, 2012 at 04:48. Reason: Example URLs delinkified

