By Craig Buckler

Why Pages Disallowed in robots.txt Still Appear in Google

robots.txt is a useful file which sits in your website’s root and tells search engine crawlers which pages they may access. One of the most useful declarations is “Disallow”: it stops search engines accessing private or irrelevant sections of your website, e.g.

Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html
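
To see how a well-behaved crawler might read those rules, here is a minimal sketch using Python's standard `urllib.robotparser` module. The `User-agent: *` line, the `example.com` domain, the `ExampleBot` name, and the test paths are assumptions added for illustration; real search engines use their own parsers, which may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# The three Disallow rules from above. A "User-agent: *" line is added
# (an assumption) so that the rules apply to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# example.com and ExampleBot are placeholders, not real endpoints or crawlers.
for path in (
    "/junk/old.html",
    "/section1/mysecretpage.html",
    "/section1/other.html",
    "/index.html",
):
    url = "https://example.com" + path
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(path, verdict)
```

Rules are matched as path prefixes, so everything under `/junk/` is blocked while `/section1/other.html` remains accessible.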

You can even block search engines from crawling every page on your domain:

User-agent: *
Disallow: /

I’m not sure why anyone would do this, but someone, somewhere will not want their site to appear in search engine results.

However, blocked pages can still appear in Google. Before you step on your soapbox to rant about Google’s violation of robots.txt and the company’s abusive control of the web, take a little while to understand how and why it happens.

Assume you have a page, secretpage.html, containing confidential information about your company’s new Foozle project. You may want to share that page with partners, but don’t want the information to be public knowledge just yet. Therefore, you block the page using a declaration in robots.txt:

User-agent: *
Disallow: /secretpage.html

A few weeks later, you’re searching for “Foozle” in Google and an entry showing your secret page’s URL appears in the results.

How could this happen? The first thing to note is that Google abides by your robots.txt instructions: it does not crawl or index the secret page’s text. However, the URL is still displayed because Google found a link elsewhere, e.g.

<a href="">Read about the new Foozle project…</a>

Google therefore associates the word “Foozle” with your secret page. Your URL might appear at a high position in the search results because Foozle is a rarely-used term and your page is the sole source of information.
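
The mechanism can be illustrated with a toy crawler. This is only a sketch of the general idea, not Google's actual algorithm: the `ToyBot` name, the `example.com` and `partner.example` domains, and the in-memory `PAGES` dictionary are all hypothetical. The crawler respects robots.txt and never reads the secret page, yet the page's URL still ends up in the index under the words of the link text.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from a page's <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


# Hypothetical sites: example.com hosts the secret page, partner.example links to it.
SECRET_URL = "https://example.com/secretpage.html"
PARTNER_URL = "https://partner.example/news.html"
PAGES = {
    PARTNER_URL: f'<a href="{SECRET_URL}">Read about the new Foozle project</a>',
    SECRET_URL: "<p>Confidential Foozle details</p>",
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /secretpage.html"])


def may_fetch(url: str) -> bool:
    """In this toy model only example.com publishes robots.txt rules."""
    if urlparse(url).netloc == "example.com":
        return robots.can_fetch("ToyBot", url)
    return True


index = defaultdict(set)  # word -> URLs associated with that word
fetched = []              # URLs whose content was actually read
queue = [PARTNER_URL]
seen = set(queue)

while queue:
    url = queue.pop(0)
    if not may_fetch(url):
        continue  # blocked: the page body is never read
    fetched.append(url)
    collector = AnchorCollector()
    collector.feed(PAGES.get(url, ""))
    for href, text in collector.links:
        # The link text is indexed against the link's target URL.
        for word in text.lower().split():
            index[word].add(href)
        if href not in seen:
            seen.add(href)
            queue.append(href)

print("fetched:", fetched)
print("results for 'foozle':", sorted(index["foozle"]))
```

The secret page never appears in `fetched`, but a search for "foozle" still returns its URL because the association came entirely from the partner page's link.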

In addition, Google can show a page description below the URL. Again, this is not a violation of robots.txt rules — it appears because Google found an entry for your secret page in a recognized resource such as the Open Directory Project. The description comes from that site rather than your page content.

Can Pages Be Blocked?

There are several solutions that will stop your secret pages appearing in Google search results.

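1. Set a “noindex” meta tag

Google must be able to crawl the page to see this tag, so the page must not also be disallowed in robots.txt. Add the following code to your HTML <head> and Google will not show your secret page or follow its links:

<meta name="robots" content="noindex, nofollow" />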
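
The directive values are the single words `noindex` and `nofollow`; a value split by a space is not a recognized directive. As a quick self-check, the following sketch uses Python's standard `html.parser` to extract the robots directives from a page. The class and function names are hypothetical helpers, and the parsing is deliberately simpler than whatever a real search engine does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and compared case-insensitively here.
            self.directives.update(
                token.strip().lower()
                for token in content.split(",")
                if token.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


correct = robots_directives('<head><meta name="robots" content="noindex, nofollow" /></head>')
misspelled = robots_directives('<head><meta name="robots" content="no index, no follow" /></head>')

print("noindex" in correct)     # the correctly spelled tag is detected
print("noindex" in misspelled)  # the spaced variant is not
```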

2. Use the URL removal tool

Google offers a URL removal tool within its Webmaster Tools.

3. Add authentication

Apache, IIS, and most other web servers offer basic authentication facilities. The visitor must enter a user ID and password before the page can be viewed. This may not stop Google showing the page URL in results, but it will stop unauthorized visitors reading the content.
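
To show why authentication keeps content away from crawlers, here is a rough sketch of the check a server performs for HTTP basic authentication. It is not how Apache or IIS are configured; the function names and the `partner` / `s3cret` credentials are made up for illustration. A crawler sends no `Authorization` header, so it receives a 401 response and never sees the page body.

```python
import base64
import hmac


def check_basic_auth(header, expected_user, expected_password):
    """Return True if an Authorization header carries the expected credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed base64 or not valid UTF-8
    if ":" not in decoded:
        return False  # credentials must have the form user:password
    expected = f"{expected_user}:{expected_password}".encode("utf-8")
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected)


def status_for(header):
    """HTTP status a protected page would return (hypothetical credentials)."""
    return 200 if check_basic_auth(header, "partner", "s3cret") else 401


partner_header = "Basic " + base64.b64encode(b"partner:s3cret").decode("ascii")

print(status_for(partner_header))  # a partner with the right credentials
print(status_for(None))            # a crawler sending no credentials
```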

4. Review your publishing policies

If you have top-secret content, perhaps you shouldn’t publish those documents on a publicly accessible network!