Google’s Hidden Protocol

Google’s URL removal page contains a handy bit of information that isn’t found on their webmaster info pages, where it should be.

Google supports the use of “wildcards” in robots.txt files. This isn’t part of the original 1994 robots.txt protocol, and, as far as I know, it isn’t supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:

User-agent: Googlebot
Disallow: /*sort=

This would stop Googlebot from reading any URL that includes the string “sort=”, no matter where that string occurs in the URL.

So if you have a shopping cart and use a variable called “sort” in some URLs, you can stop Googlebot from reading the sorted (but basically duplicate) content that your site produces for users.
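
As a sketch, here’s how that rule plays out against a few hypothetical URLs (robots.txt allows # comments, so you can annotate the rule in the file itself):

User-agent: Googlebot
# Blocked: /products.php?cat=5&sort=price
# Blocked: /shop/list?sort=name&page=2
# Not blocked: /products.php?cat=5
Disallow: /*sort=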

Every search engine should support this. It would make real life a lot easier for folks with dynamic sites, and artificial life a lot easier for spiders.


  • http://www.designity.nl peach

    great find! thanx

  • http://www.e-calc.net Ogito

    Thanks Dan

  • http://www.imperu.net/ ronald_poi

    useful information. thanks!

  • Pura Vida

Is this a must-add for users of sites like OSCommerce? Is the Sort feature a real cause of duplicate content?

  • http://www.seoresearchlabs.com DanThies

    It could be. Picture a category with one product – no matter how you sort it, it’s the same page. Even with a bunch of items, do you really need the search engines to have every possible order?

  • http://boyohazard.net Octal

    I’ve been using wildcards in robots.txt for…well ever. I had no idea it wasn’t part of the original protocol and I certainly had no idea it was only Google that supports it. Thanks for the info.

  • Pingback: (EMP) E-Marketing Performance » : » Updating Your Robots.txt File

  • Pingback: GreatNexus Webmaster Blog » Blog Archive » Google supports wildcard in robots.txt

  • maxy22

If any URL on my site contains the word ‘calender’ and I don’t want Google to index it, then I will just add

    User-agent: Googlebot
    Disallow: /*calender

    to my robots file,

and it doesn’t matter which directory the URL containing ‘calender’ is under; it might even be my cgi-bin directory.

Right?

  • Pingback: SEO news » More Powerful Robots.txt Exclusion For Google

  • Anonymously

    That helps a lot. Thanks for the help!

  • http://www.rapidvectorseo.com rapidvectorseo

I did this once, and when I submitted a page removal request I got an error for it.