
Google’s Hidden Protocol

by Dan Thies

Google’s URL removal page contains a handy bit of information that isn’t on their webmaster info pages, where it belongs.

Google supports the use of “wildcards” in robots.txt files. This isn’t part of the original 1994 robots.txt protocol and, as far as I know, is not supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:

User-agent: Googlebot
Disallow: /*sort=

This stops Googlebot from reading any URL that includes the string “sort=”, no matter where that string occurs in the URL.

So if you have a shopping cart, and use a variable called “sort” in some URLs, you can stop Googlebot from reading the sorted (but basically duplicate) content that your site produces for users.
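To make the matching concrete, here is a minimal Python sketch of how a wildcard Disallow rule like the one above could be evaluated. It is an illustration only, not Googlebot’s actual implementation: the function names `pattern_to_regex` and `is_blocked` and the sample URL paths are hypothetical, and the sketch assumes that “*” stands for any run of characters and that a rule matches from the start of the URL path.

```python
import re
from typing import Iterable


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a wildcard Disallow pattern into a compiled regex.

    Assumption: '*' matches any sequence of characters (including none);
    every other character is treated literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def is_blocked(path: str, disallow_patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches the path from its first character."""
    # re.match anchors at the start only, so a rule acts as a prefix match.
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


if __name__ == "__main__":
    rules = ["/*sort="]  # the rule from the example above
    for url_path in [
        "/shoes?sort=price",                # hypothetical sorted listing
        "/catalog/shoes?page=2&sort=name",  # "sort=" deeper in the URL
        "/shoes",                           # unsorted page
    ]:
        print(url_path, "blocked" if is_blocked(url_path, rules) else "allowed")
```

Under these assumptions, the two paths containing “sort=” come out as blocked and the plain “/shoes” path stays allowed, which is the behavior the shopping cart example is after.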

Every search engine should support this. It would make real life a lot easier for folks with dynamic sites, and artificial life a lot easier for spiders.

This post has 12 responses so far

  1. great find! thanx

     
  2. Thanks Dan

     
  3. useful information. thanks!

     
  4. Is this a must-add for users running sites like OSCommerce? Is the Sort feature a real cause of duplicate content?

     
  5. It could be. Picture a category with one product - no matter how you sort it, it’s the same page. Even with a bunch of items, do you really need the search engines to have every possible order?

     
  6. I’ve been using wildcards in robots.txt for…well ever. I had no idea it wasn’t part of the original protocol and I certainly had no idea it was only Google that supports it. Thanks for the info.

     
  7. [...] Dan Thies has found a neat hidden protocol that can be used on your robots.txt file: The Wildcard. User-agent: Googlebot Disallow: /*sort= [...]

     
  8. [...] I have been using wildcard in my robots.txt file for…ever (this is the definition of robots.txt, if you need it). Well, this morning, I happened upon Dan Thies’ Google’s Hidden Protocol post at SitePoint and I thought hmm…interesting. [...]

     
  9. If any URL on my site contains the word ‘calender’ and I don’t want Google to index it, then I will just add

    User-agent: Googlebot
    Disallow: /*calender

    to my robots file,

    and it does not matter which directory the URL with the word ‘calender’ is under; it might be my cgi-bin directory.

    Right?

     
  10. [...] SEO Book points to Dan Thies’s finding of a useful but non-standard robots.txt feature supported by the Google spider: [...]

     
  11. That helps a lot. Thanks for the help!

     
  12. I did this once, and I got an error for it on the page removal request.

     
