Google’s URL removal page contains a handy bit of information that’s missing from their webmaster info pages, where it belongs.
Google supports the use of “wildcards” in robots.txt files. This isn’t part of the original 1994 robots.txt protocol, and as far as I know, is not supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:
User-agent: Googlebot
Disallow: /*sort=
This would stop Googlebot from reading any URL that included the string “sort=” no matter where that string occurs in the URL.
So if you have a shopping cart, and use a variable called “sort” in some URLs, you can stop Googlebot from reading the sorted (but basically duplicate) content that your site produces for users.
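To make the matching concrete, here is a minimal Python sketch of how a rule like this could be evaluated. It assumes that `*` stands for any run of characters and that the rule is anchored at the start of the URL path. The function name and the sample URLs are made up for illustration, and Google's actual matcher may differ in its details.

```python
import re


def wildcard_rule_matches(rule: str, url_path: str) -> bool:
    """Return True if a Disallow rule containing '*' matches the URL path.

    Assumption: '*' means "any run of characters" and the rule must match
    from the start of the path, like an ordinary robots.txt prefix rule.
    """
    # Escape each literal piece of the rule, then join the pieces with '.*'.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.match(pattern, url_path) is not None


# Hypothetical shopping-cart URLs, for illustration only.
paths = [
    "/products.php?cat=5&sort=price",
    "/shop/list?sort=name",
    "/products.php?cat=5",
]

# Keep only the paths that the rule "/*sort=" would block.
blocked = [p for p in paths if wildcard_rule_matches("/*sort=", p)]
print(blocked)
```

Under these assumptions the two sorted URLs are blocked and the plain category page is not. Note that the rule is a simple substring-style match, so it would also catch a parameter such as "resort=".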
Every search engine should support this. It would make real life a lot easier for folks with dynamic sites, and artificial life a lot easier for spiders.






October 21st, 2005 at 3:55 pm
great find! thanx
October 22nd, 2005 at 4:52 pm
Thanks Dan
October 22nd, 2005 at 7:29 pm
useful information. thanks!
October 23rd, 2005 at 7:17 pm
Is this a must-add for users running sites like OSCommerce? Is the Sort feature a real cause of duplicate content?
October 23rd, 2005 at 8:34 pm
It could be. Picture a category with one product - no matter how you sort it, it’s the same page. Even with a bunch of items, do you really need the search engines to have every possible order?
October 24th, 2005 at 7:22 am
I’ve been using wildcards in robots.txt for…well ever. I had no idea it wasn’t part of the original protocol and I certainly had no idea it was only Google that supports it. Thanks for the info.
October 24th, 2005 at 5:15 pm
[…] Dan Thies has found a neat hidden protocol that can be used on your robots.txt file: The Wildcard. User-agent: Googlebot Disallow: /*sort= […]
November 7th, 2005 at 8:48 am
[…] I have been using wildcard in my robots.txt file for…ever (this is the definition of robots.txt, if you need it). Well, this morning, I happened upon Dan Thies’ Google’s Hidden Protocol post at SitePoint and I thought hmm…interesting. […]
December 6th, 2005 at 12:33 pm
If there is any URL on my site containing the word ‘calender’ and I don’t want Google to index it, then I will just add
User-agent: Googlebot
Disallow: /*calender
to my robots file,
and it does not matter which directory the URL containing the word ‘calender’ is under; it might even be my cgi-bin directory.
Right?
July 11th, 2006 at 11:57 pm
[…] SEO Book points to Dan Thies’s finding of a useful but non-standard robots.txt feature supported by the Google spider: […]
January 26th, 2007 at 5:13 am
That helps a lot. Thanks for the help!