Prevent crawling

Am I right in thinking that the only way to reliably have a page not crawlable is to ‘orphan it’ - have it totally unlinked - as ‘bad’ robots will simply ignore a robots.txt file?
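Pretty much. robots.txt is only a polite request, so it's still worth having for the well-behaved crawlers, but nothing enforces it. A minimal example (paths here are just illustrative) that asks all bots to stay out of a directory:

```text
# robots.txt — honoured only by well-behaved crawlers
User-agent: *
Disallow: /private/
```

A 'bad' robot will read that and ignore it, or may even use it as a list of interesting places to look.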

Totally unlinked pages can still be found (daft as it sounds).

The best way to stop something being spidered is to use a .htaccess / .htpasswd login, so only the users who should be able to see a page can see it.
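Something like this in the directory's .htaccess would do it (the .htpasswd path is a placeholder; it should live outside the web root):

```apache
# .htaccess — require a valid login for everything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Create the password file with the `htpasswd` tool that ships with Apache.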


How can unlinked pages still be found?

So the .htaccess will prevent a robot spidering the page?

I’m not entirely sure if some sniffing goes on. For example, if I send a link via MSN, does that leak out to Bing?

Combined with various scanners intercepting URL traffic to scan it…

I don’t know exactly how they get out, but they can and do. Completely private URLs aren’t private unless you lock them down.

If it shouldn’t be out there at all, you really need authentication on the resource or said resource should not be on the internet. Period.

If you want to take reasonable efforts not to have it crawled, you could do some user agent or IP sniffing on the server-side to stop popular robots. But that won’t stop them all.
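A rough sketch of what that user-agent check might look like on the server side (the bot names are illustrative, not exhaustive). As said above, this only deters the polite crawlers, since a bad robot can send any User-Agent string it likes:

```python
# Crude User-Agent blocking: refuse requests whose UA matches a known crawler.
# A malicious bot can fake its User-Agent, so treat this as a deterrent only.

BLOCKED_AGENTS = ("googlebot", "bingbot", "slurp", "baiduspider")

def is_blocked(user_agent):
    """Return True if the User-Agent string looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BLOCKED_AGENTS)
```

In a real app you'd call this in a request hook and return a 403 on a match; combining it with an IP allow-list catches a few more cases.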

SSL can also help, at least with some robots – the overhead of dealing with it tends to slow down the crawl.