More and more bots, such as Brandwatch's (brandwatch.net), are crawling sites looking for things said about their clients. They can suck down a lot of data and use a lot of resources. robots.txt isn't going to block them, since these bots tend to ignore it. Your best bet is .htaccess.
There are a couple of ways you can do it. You can block the bots by their user agent string. I have had success with this, although I have not been able to block Baiduspider no matter how many permutations I have tried in .htaccess. I have blocked the rest of the bots, though; when they visit they get a 403 Forbidden error page. I put this in .htaccess:
# Flag unwanted crawlers by user agent (case-insensitive).
# "^baidu" also covers Baiduspider, Baiduspider/2.0, etc.,
# and "^Yandex" covers YandexBot.
SetEnvIfNoCase User-Agent "^baidu" bad_bot
SetEnvIfNoCase User-Agent "^Yandex" bad_bot
SetEnvIfNoCase User-Agent "^magpie-crawler" bad_bot

# Apache 2.2 syntax: without "Order Allow,Deny" the default order
# is Deny,Allow and the Deny line never takes effect.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
magpie-crawler is the bot run by brandwatch.net.
As I said, I have not been successful at blocking Baidu using this method but have blocked everything else, so I'm going to have to resort to blocking Baidu by IP address range instead. (One possible reason the pattern fails: Baiduspider's user agent reportedly begins with "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)", so a regex anchored with ^ can never match it; dropping the ^ anchor may be all it takes.)
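For the IP route, the Apache 2.2 syntax looks like the block below. The range shown is only an example of the kind Baidu is often reported to crawl from — look up their currently published ranges yourself before relying on it:

```apache
Order Allow,Deny
Allow from all
# Example range only -- verify Baidu's current IP ranges before using
Deny from 180.76.0.0/16
```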
Another method you can use is to rewrite based on the user agent string as described here:
I don't know how efficient that is.
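A minimal sketch of the rewrite approach, assuming mod_rewrite is enabled (the user agent tokens are the same ones matched above):

```apache
RewriteEngine On
# Match the token anywhere in the user agent string, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (baiduspider|yandex|magpie-crawler) [NC]
# F = respond 403 Forbidden, L = stop processing further rules
RewriteRule .* - [F,L]
```

Note that because RewriteCond is not anchored to the start of the string, this also catches bots that bury their name in a "Mozilla/5.0 (compatible; ...)" prefix.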
Also see this for more ideas: