Prevent crawling

Am I right in thinking that the only way to reliably have a page not crawlable is to ‘orphan it’ - have it totally unlinked - as ‘bad’ robots will simply ignore a robots.txt file?
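Pretty much. robots.txt is only a polite request, so it's still worth having for the well-behaved crawlers, but nothing enforces it. A minimal example (paths here are just illustrative) that asks all bots to stay out of a directory:

```text
# robots.txt — honoured only by well-behaved crawlers
User-agent: *
Disallow: /private/
```

A 'bad' robot will read that and ignore it, or may even use it as a list of interesting places to look.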

Totally unlinked pages can still be found (daft as it sounds).

The best way to stop something being spidered is to use a .htaccess / .htpasswd login, so only the users who should be able to see a page can see it.
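Something like this in the directory's .htaccess would do it (the .htpasswd path is a placeholder; it should live outside the web root):

```apache
# .htaccess — require a valid login for everything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Create the password file with the `htpasswd` tool that ships with Apache.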


How can unlinked pages still be found?

So the .htaccess will prevent a robot spidering the page?

I’m not entirely sure if some sniffing goes on. For example, if I send a link via MSN, does that leak out to Bing?

Combined with various scanners intercepting URL traffic to scan it…

I don’t know exactly how they get out, but they can and do. Completely private URLs aren’t private unless you lock them down.

If it shouldn’t be out there at all, you really need authentication on the resource or said resource should not be on the internet. Period.

If you want to take reasonable efforts not to have it crawled, you could do some user agent or IP sniffing on the server-side to stop popular robots. But that won’t stop them all.
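A rough sketch of what that user-agent check might look like on the server side (the bot names are illustrative, not exhaustive). As said above, this only deters the polite crawlers, since a bad robot can send any User-Agent string it likes:

```python
# Crude User-Agent blocking: refuse requests whose UA matches a known crawler.
# A malicious bot can fake its User-Agent, so treat this as a deterrent only.

BLOCKED_AGENTS = ("googlebot", "bingbot", "slurp", "baiduspider")

def is_blocked(user_agent):
    """Return True if the User-Agent string looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BLOCKED_AGENTS)
```

In a real app you'd call this in a request hook and return a 403 on a match; combining it with an IP allow-list catches a few more cases.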

SSL can also help, at least with some robots – the overhead of dealing with it tends to slow down the crawl.