Google’s Hidden Protocol



by Dan Thies

Google’s URL removal page contains a handy bit of information that isn’t found on their webmaster info pages, where it belongs.

Google supports the use of “wildcards” in robots.txt files. This isn’t part of the original 1994 robots.txt protocol and, as far as I know, isn’t supported by other search engines. To make it work, you need to add a separate section for Googlebot to your robots.txt file. For example:

User-agent: Googlebot
Disallow: /*sort=

This stops Googlebot from reading any URL that includes the string “sort=”, no matter where that string occurs in the URL.

So if you have a shopping cart, and use a variable called “sort” in some URLs, you can stop Googlebot from reading the sorted (but basically duplicate) content that your site produces for users.
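If you want to check which of your URLs a wildcard rule would catch before you publish it, here is a minimal Python sketch. It assumes that “*” stands for any run of characters and that a rule is matched from the start of the URL path; the function names and sample URLs are made up for illustration, and this is not Google’s actual matching code.

```python
import re


def pattern_to_regex(pattern):
    """Turn a Disallow pattern into a regex; '*' matches any run of characters."""
    # Escape each literal piece so characters like '?' or '.' are not
    # treated as regex syntax, then join the pieces with '.*'.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces))


def is_blocked(path, disallow_patterns):
    """Return True if any pattern matches the path, starting at its beginning."""
    # re.match anchors at the start of the string, which mirrors the
    # assumed prefix-style matching of robots.txt rules.
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


# Hypothetical shopping cart URLs, tested against the rule from the example.
rules = ["/*sort="]
for path in ["/catalog/widgets?sort=price", "/catalog/widgets", "/page?resort=1"]:
    print(path, is_blocked(path, rules))
```

Note that the last sample URL is caught too: the rule is a plain substring match, so a parameter named “resort” also contains “sort=”. Test a pattern like this against a list of your real URLs before relying on it.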

Every search engine should support this. It would make real life a lot easier for folks with dynamic sites, and artificial life a lot easier for spiders.

This post has 11 responses so far

  1. great find! thanx

     
  2. Thanks Dan

     
  3. useful information. thanks!

     
  4. Is this a must-add for users of carts like OSCommerce? Is the sort feature a real cause of duplicate content?

     
  5. It could be. Picture a category with one product - no matter how you sort it, it’s the same page. Even with a bunch of items, do you really need the search engines to have every possible order?

     
  6. I’ve been using wildcards in robots.txt for…well ever. I had no idea it wasn’t part of the original protocol and I certainly had no idea it was only Google that supports it. Thanks for the info.

     
  7. […] Dan Thies has found a neat hidden protocol that can be used on your robots.txt file: The Wildcard. User-agent: Googlebot Disallow: /*sort= […]

     
  8. […] I have been using wildcard in my robots.txt file for…ever (this is the definition of robots.txt, if you need it). Well, this morning, I happened upon Dan Thies’ Google’s Hidden Protocol post at SitePoint and I thought hmm…interesting. […]

     
  9. If any URL on my site contains the word ‘calender’ and I don’t want Google to index it, then I will just add

    User-agent: Googlebot
    Disallow: /*calender

    to my robots file,

    and it does not matter which directory the URL containing ‘calender’ is in; it might even be my cgi-bin directory.

    right ?

     
  10. […] SEO Book points to Dan Thies’s finding of a useful but non-standard robots.txt feature supported by the Google spider: […]

     
  11. That helps a lot. Thanks for the help!

     
