Why Pages Disallowed in robots.txt Still Appear in Google


robots.txt is a useful file that sits in your website's root and controls which parts of your site search engine crawlers may access. One of the most useful declarations is "Disallow", which stops search engines accessing private or irrelevant sections of your website, e.g.


Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html

You can even block search engines from accessing every page on your domain:


User-agent: *
Disallow: /

I’m not sure why anyone would do this, but someone, somewhere will not want their site to appear in search engine results.

However, blocked pages can still appear in Google. Before you step on your soapbox to rant about Google’s violation of robots.txt and the company’s abusive control of the web, take a little while to understand how and why it happens.

Assume you have a page at http://www.mysite.com/secretpage.html containing confidential information about your company’s new Foozle project. You may want to share that page with partners, but don’t want the information to be public knowledge just yet. Therefore, you block the page using a declaration in http://www.mysite.com/robots.txt:


User-agent: *
Disallow: /secretpage.html

A few weeks later, you’re searching for “Foozle” in Google and the following entry appears:

mysite.com/secretpage.html

How could this happen? The first thing to note is that Google abides by your robots.txt instructions: it does not index the secret page's text. However, the URL is still displayed because Google found a link to it elsewhere, e.g.


<a href="http://mysite.com/secretpage.html">Read about the new Foozle project…</a>

Google therefore associates the word “Foozle” with your secret page. Your URL might appear at a high position in the search results because Foozle is a rarely-used term and your page is the sole source of information.
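The crawl-blocking half of this behaviour can be sketched with Python's standard urllib.robotparser module, which applies the same Disallow matching that well-behaved crawlers use (the mysite.com URLs are the hypothetical ones from the example above):

```python
from urllib.robotparser import RobotFileParser

# Parse the example robots.txt rules (a real crawler fetches
# them from the site root before requesting any other URL)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /secretpage.html",
])

# A compliant crawler checks each URL before fetching it
print(rp.can_fetch("*", "http://www.mysite.com/secretpage.html"))  # False
print(rp.can_fetch("*", "http://www.mysite.com/index.html"))       # True
```

Note that this only governs *fetching*: nothing in robots.txt prevents a search engine from listing a URL it discovered through links on other sites.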

In addition, Google can show a page description below the URL. Again, this is not a violation of robots.txt rules — it appears because Google found an entry for your secret page in a recognized resource such as the Open Directory Project. The description comes from that site rather than your page content.

Can Pages Be Blocked?

There are several solutions that will stop your secret pages appearing in Google search results.

1. Set a "noindex" meta tag

Google will never show your secret page or follow its links if you add the following code to your HTML <head>. Note that the page must remain crawlable for this to work: if robots.txt blocks the page, Google never sees the tag.


<meta name="robots" content="noindex, nofollow" />

2. Use the URL removal tool

Google offers a URL removal tool within its Webmaster Tools.

3. Add authentication

Apache, IIS, and most other web servers offer basic authentication facilities. The visitor must enter a user ID and password before the page can be viewed. This may not stop Google showing the page URL in results, but it will stop unauthorized visitors reading the content.
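As a sketch, password-protecting a directory with Apache's basic authentication might look like the following (the protected path and password file location are hypothetical examples; IIS and other servers use different configuration):

```apache
# .htaccess in the directory to protect (hypothetical example)
AuthType Basic
AuthName "Restricted Area"
# Password file created beforehand with: htpasswd -c /etc/apache2/.htpasswd partner
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Visitors are then prompted for credentials before any file in that directory is served.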

4. Review your publishing policies

If you have top-secret content, perhaps you shouldn’t publish those documents on a publicly accessible network!


  • Anonymous

    User-agent: *
    Disallow: /

    I’m not sure why anyone would do this, but someone, somewhere will not want their site to appear in search engine results.

    I use this in the early development of a website if the client wants to keep the site secret until the official launch date, and after the official launch I delete the file.

  • http://www.mikehealy.com.au cranial-bore

    It’s the difference between crawling and indexing. Google won’t crawl a page disallowed in robots.txt, but it might index it if it learns about the page from other sources (external links). Logical when you know about it, but still quite unintuitive for a lot of people I would imagine.

    Google gave a good example of why they do it like this. There was a time when a lot of sites disallowed robots as a matter of course. The US Department of Motor Vehicles was apparently one. It wouldn’t help users to not index an authority site, so they used backlink info to get the site in the index.

  • http://www.optimalworks.net/ Craig Buckler

    @anonymous

    I use this in the early development of a website if the client wants to keep the site secret until the official launch date, and after the official launch I delete the file.

    That idea has a couple of flaws. First, Google loves sites that are changed. You can capitalize on that during the development phase. Also, on launch day, wouldn’t it be better to have your site fully indexed? That won’t happen if it’s been blocked.

    If a company genuinely doesn’t want to appear until launch day, then password protect the site or put it on a private server.

  • Tambu

    Webmaster Tools’ URL removal requires the page to return a 404 code, and that may not be the case if the page is still online. So you have to do some user-agent cloaking…

    The best solution is the noindex meta tag.

  • Anonymously

    Appears that there is a difference between crawling and indexing to Google… which leads me to question: how does one stop search engines from “indexing” non-HTML documents; for example: PDFs, PowerPoints, Word, text, etc.

    Thanks!

    (Guessing there is no way, just wanted to confirm…)

  • http://www.optimalworks.net/ Craig Buckler

    how does one stop search engines from “indexing” non-HTML documents; for example: PDFs, PowerPoints, Word, text, etc.

    If the page linking to those documents has <meta name="robots" content="nofollow" /> or a rel="nofollow" attribute in the link, the document shouldn’t be crawled or indexed.

  • http://www.soliantconsulting.com/ bengert

    If the page linking to those documents has <meta name="robots" content="nofollow" /> or a rel="nofollow" attribute in the link, the document shouldn’t be crawled or indexed.

    That works great if you have control of the page linking in, but it does not help if someone else is hot-linking to the document. At that point you could use tricks like returning a 404 if there is no Referer header or if the referrer is not your domain.
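The Referer check described in that last comment could be sketched in Apache mod_rewrite along these lines (mysite.com and the file extensions are placeholders; bear in mind the Referer header is easily spoofed or omitted, so this is a deterrent, not real protection):

```apache
RewriteEngine On
# Allow requests whose Referer is one of our own pages
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mysite\.com/ [NC]
# Everything else asking for a document gets a 404
RewriteRule \.(pdf|ppt|doc|txt)$ - [R=404,L]
```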