How to Block Web Spiders/Crawlers

How do I make it so that my website CANNOT be indexed on the Internet?

I do NOT want Google or Yahoo or any other search engine or The Internet Archive to index or store or archive any of my website.

The Internet Archive had a link on their page to http://pageresource.com/zine/robotstxt.htm which says to create a file called "robots.txt" in your Web Root and to place the following code inside of it…


User-agent: *
Disallow: / 


Is that really all I need to do??
:-/

It seems like that is a very old link/article (i.e. circa 1999) and I am wondering if I have to do more to block sites like Google??

Thanks,

Amy

Create a robots.txt file and upload it… that is the standard way to stop bots and crawlers.

Yes, that is the standard way. You need to create the robots.txt file.

You can also control this using meta tags, but the robots.txt is the best method.

If it’s absolutely crucial that your site isn’t indexed, you can always add some code to check the User-Agent field that is submitted to the web server, and check to see if it’s a known search engine crawler, and if so, don’t output any HTML.
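Roughly, that would look like this. A minimal sketch, assuming a Python WSGI application; the crawler substrings below are illustrative, not an exhaustive list:

    # Hypothetical WSGI app: refuse to serve HTML to recognised crawlers.
    KNOWN_CRAWLERS = ("googlebot", "slurp", "bingbot", "ia_archiver")

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in ua for bot in KNOWN_CRAWLERS):
            # Known crawler: send back an empty forbidden response.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>Hello, human visitor.</body></html>"]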

Currently, I have the following robots.txt file in my Web Root…


User-agent: *
Disallow: /

Thanks,

Amy

Yah, I have that - see last post.

You can also control this using meta tags, but the robots.txt is the best method.

Like this???

    <meta name="robots" content="noindex">

If it’s absolutely crucial that your site isn’t indexed, you can always add some code to check the User-Agent field that is submitted to the web server, and check to see if it’s a known search engine crawler, and if so, don’t output any HTML.

How exactly do you do that??

Amy

Checking the user agent for any purpose other than collecting statistics isn't a good idea, as any such attempted block can be easily bypassed simply by making a small change to the user agent string (and you wouldn't want to maintain checks for thousands of search engine user agents in the first place).
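For example, here is how trivially a scraper can masquerade as an ordinary browser (a hypothetical sketch using Python's standard library; the URL is a placeholder):

    # Any client can send whatever User-Agent string it likes,
    # so User-Agent blocking is bypassed with one header.
    import urllib.request

    req = urllib.request.Request(
        "http://example.com/",
        headers={"User-Agent": "Mozilla/5.0 (definitely not a crawler)"},
    )
    html = urllib.request.urlopen(req).read()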

Those two lines in the robots.txt file are all that is needed to stop legitimate search engines from indexing your site.

Also, if you don't have any incoming links from an indexed web page, then no spiders/robots will find your site anyway. That's why most search engines have indexed less than 10% of the web, and even Google hasn't found 20% of web pages yet.

Hello Stephen! Always a pleasure to read your posts!

Okay, I just woke up, and my brain is still warming up…

By incoming links, you mean if "I" link to an outside website??

If so, then you are saying that just by having a domain and URL I will not end up on Google?

And that due to that fact and my simple robots.txt file, I am almost guaranteed to stay in “stealth” mode online?

On a related note… Is there a way for me to PREVENT other people from linking their website to my website??

Should I even be concerned if someone linked their website to my website??

Thanks,

Amy

No, incoming links means that someone else has linked to your website.

No, though why would you want to? That is how the internet works! :slight_smile:

Because maybe the person/site linking to me is SCUMMY?! :eek:

I don’t want some nefarious site tapping into my site?!

And you are saying there is NO WAY to stop that?!

Amy

You have no control over incoming links to your site whatsoever. You can't even tell where the visitors to your site come from unless they tell you (via the user-controllable Referer header).

You can’t control that. Search Engines won’t “penalize” you for having links from “scummy” websites.

Unless someone hacks your server, they can’t “tap into” your site.

Correct. There is NO WAY to stop that. Just like it’s impossible for you to stop me writing “paranoid” right now :slight_smile:

Hi,

There are two ways to stop web spiders/crawlers from caching your web pages:

  1. Instruct them not to crawl via the robots.txt file.
  2. Instruct them in the page code itself (e.g. a robots meta tag).

Regards
Raghavendra

Hi Amy

If you block search engines from listing your site via robots.txt (which is the recognised way) then it's highly unlikely anyone will find the site anyway… so how will they link to it? A lucky guess at your URL maybe, but that's highly unlikely… akin to searching for a needle in a haystack… and the internet is one very large haystack :smiley:

As long as you have got the robots.txt in place you should be fine.

Joe

Blocking or restricting web spiders can prevent your website from losing bandwidth.
Allowing all web spiders can eat your bandwidth dramatically.
Select the list of useful spiders and allow only those to crawl your website.

robots.txt is an exclusion protocol used to prevent web spiders from accessing certain pages of a site. It can be used in the following ways.

  1. Allow all web robots to visit all files:

    User-agent: *
    Disallow:

  2. Keep all web robots out:

    User-agent: *
    Disallow: /

  3. Deny web spiders access to specific directories by naming each directory path in a Disallow line:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
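If you want to sanity-check your rules, you can test them the same way a well-behaved crawler would (a small sketch using Python's standard library; the URL is a placeholder for your own site):

    # Parse a live robots.txt and ask whether a given agent may fetch a path.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # fetches and parses the file

    # With "User-agent: *" / "Disallow: /", every path is off limits:
    print(rp.can_fetch("*", "http://example.com/"))          # False
    print(rp.can_fetch("Googlebot", "http://example.com/"))  # False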