Google’s Hidden Protocol
Google’s URL removal page contains a handy piece of information that isn’t on their webmaster info pages, where it belongs.
Google supports the use of “wildcards” in robots.txt files. This isn’t part of the original 1994 robots.txt protocol and, as far as I know, isn’t supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:
User-agent: Googlebot
Disallow: /*sort=
This would stop Googlebot from reading any URL that included the string “sort=” no matter where that string occurs in the URL.
So if you have a shopping cart, and use a variable called “sort” in some URLs, you can stop Googlebot from reading the sorted (but basically duplicate) content that your site produces for users.
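To check which of your own URLs a rule like this would catch, here is a rough Python sketch. It assumes that "*" stands for any run of characters and that the rule is matched from the start of the URL path, which is how I read the description above; it is not Google's actual matching code, and the function names and sample URLs are made up for illustration.

```python
import re


def wildcard_rule_to_regex(rule):
    """Turn a Disallow value such as "/*sort=" into a compiled regex.

    Assumption: "*" matches any run of characters and everything else
    is literal. This mirrors the behaviour described in the post, not
    any published Googlebot source.
    """
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))


def is_blocked(url_path, disallow_rules):
    """Return True if any non-empty rule matches url_path from its start."""
    return any(
        wildcard_rule_to_regex(rule).match(url_path) is not None
        for rule in disallow_rules
        if rule  # an empty Disallow line blocks nothing
    )


if __name__ == "__main__":
    rules = ["/*sort="]
    # Hypothetical shopping-cart URLs.
    for path in ["/products?sort=price", "/cart/view?id=3&sort=name", "/products?page=2"]:
        print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

Running your site's URL list through something like this before you publish the rule is a cheap way to make sure you aren't blocking pages you want indexed.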
Every search engine should support this. It would make real life a lot easier for folks with dynamic sites, and artificial life a lot easier for spiders.