Why Pages Disallowed in robots.txt Still Appear in Google


robots.txt is a useful file that sits in your website’s root and controls how search engines crawl your pages. One of the most useful declarations is “Disallow”, which stops search engines accessing private or irrelevant sections of your website, e.g.

Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html
You can even block search engines from crawling every page on your domain:

User-agent: *
Disallow: /
I’m not sure why anyone would do this, but someone, somewhere will not want their site to appear in search engine results. However, blocked pages can still appear in Google. Before you step on your soapbox to rant about Google’s violation of robots.txt and the company’s abusive control of the web, take a little while to understand how and why it happens.

Assume you have a page at http://www.mysite.com/secretpage.html containing confidential information about your company’s new Foozle project. You may want to share that page with partners, but don’t want the information to be public knowledge just yet. Therefore, you block the page using a declaration in http://www.mysite.com/robots.txt:

User-agent: *
Disallow: /secretpage.html
A few weeks later, you’re searching for “Foozle” in Google and the following entry appears:

mysite.com/secretpage.html

How could this happen? The first thing to note is that Google abides by your robots.txt instructions: it does not index the secret page’s text. However, the URL is still displayed because Google found a link elsewhere, e.g.

<a href="http://mysite.com/secretpage.html">Read about the new Foozle project…</a>
Google therefore associates the word “Foozle” with your secret page. Your URL might appear at a high position in the search results because Foozle is a rarely-used term and your page is the sole source of information. In addition, Google can show a page description below the URL. Again, this is not a violation of robots.txt rules: it appears because Google found an entry for your secret page in a recognized resource such as the Open Directory Project. The description comes from that site rather than your page content.

Can Pages Be Blocked?

There are several solutions that will stop your secret pages appearing in Google search results.

1. Set a “noindex” meta tag

Google will never show your secret page or follow its links if you add the following code to your HTML <head> (note that Google must still be able to crawl the page to read the tag, so don’t also block it in robots.txt):

<meta name="robots" content="no index, no follow" />
2. Use the URL removal tool

Google offers a URL removal tool within its Webmaster Tools.

3. Add authentication

Apache, IIS, and most other web servers offer basic authentication facilities. The visitor must enter a user ID and password before the page can be viewed. This may not stop Google showing the page URL in results, but it will stop unauthorized visitors reading the content (a minimal Apache example follows this list).

4. Review your publishing policies

If you have top-secret content, perhaps you shouldn’t publish those documents on a publicly accessible network!
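As an illustration of option 3, here is a minimal sketch of HTTP Basic Authentication on Apache, assuming you have already created a password file with the htpasswd utility (the /etc/apache2/.htpasswd path is only an example). Place the following in an .htaccess file in the directory you want to protect:

AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user

IIS and other servers offer equivalent settings through their own configuration tools.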

Frequently Asked Questions (FAQs) about Pages Disallowed in Robots.txt

What is the purpose of a robots.txt file?

A robots.txt file is a text file that webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. It essentially acts as a set of rules for bots to follow, telling them which pages they can or cannot retrieve. So, when a search engine robot is looking at a site, it first checks for the robots.txt file. If it finds one, it will read the file’s instructions to see what it’s allowed to do.
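For illustration, a complete robots.txt is just a plain-text file served from the site root; a minimal example (the paths are made up) might be:

User-agent: *
Disallow: /admin/
Disallow: /tmp/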

Why are pages disallowed in robots.txt still appearing in Google search results?

Even if a page is disallowed in the robots.txt file, it can still appear in Google search results. This is because Google can discover it through other means, such as external links pointing to it. Google respects the robots.txt directives, but it doesn’t guarantee non-indexing. To prevent a page from appearing in Google’s search results, the noindex directive should be used.

What is the difference between disallow and noindex?

The Disallow directive in the robots.txt file tells search engine bots not to crawl a page. However, it doesn’t prevent the page from being indexed. On the other hand, the Noindex directive, which is used in a meta tag on the page itself, tells search engines not to index the page. If a page is not indexed, it won’t appear in search results.
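To make the distinction concrete, the first snippet below (placed in robots.txt) only blocks crawling, while the second (placed in the page’s <head>) blocks indexing; the path is hypothetical:

Disallow: /secretpage.html

<meta name="robots" content="noindex" />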

How can I prevent a page from appearing in Google’s search results?

To prevent a page from appearing in Google’s search results, you should use the Noindex directive. This can be done by adding a meta tag to the head section of your page with content="noindex". Remember, the Noindex directive controls indexing directly, whereas the Disallow directive in the robots.txt file only controls crawling.

Can I use both Disallow and Noindex directives for the same page?

Yes, you can use both Disallow and Noindex directives for the same page. However, it’s important to note that if a page is disallowed in the robots.txt file, search engine bots won’t be able to see the Noindex directive on the page. This is because the Disallow directive prevents bots from crawling the page, and they need to crawl the page to see the Noindex directive.

How can I check if a page is disallowed in the robots.txt file?

You can check if a page is disallowed in the robots.txt file by using the Robots Testing Tool provided by Google. This tool allows you to see exactly how Googlebot will interpret the directives in your robots.txt file.
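If you prefer to check programmatically, Python’s standard urllib.robotparser module evaluates robots.txt rules in a similar way. A minimal sketch, using the hypothetical site from earlier in this article:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt file (hypothetical URL from this article)
parser = RobotFileParser("http://www.mysite.com/robots.txt")
parser.read()

# can_fetch() reports whether the given user agent may crawl the URL
print(parser.can_fetch("*", "http://www.mysite.com/secretpage.html"))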

What happens if I disallow all pages in the robots.txt file?

If you disallow all pages in the robots.txt file, search engine bots won’t be able to crawl any pages on your site. However, this doesn’t mean that your pages won’t appear in search results. As mentioned earlier, pages can still be indexed if they’re discovered through other means, such as external links.

Can I disallow specific bots in the robots.txt file?

Yes, you can disallow specific bots in the robots.txt file. This can be done by specifying the User-agent of the bot you want to disallow. For example, to disallow Googlebot, you would write “User-agent: Googlebot” followed by “Disallow: /”.
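For example, the following robots.txt blocks Googlebot from the entire site while leaving all other bots unrestricted (an empty Disallow value means nothing is blocked):

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: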

How long does it take for changes in the robots.txt file to take effect?

The time it takes for changes in the robots.txt file to take effect can vary. It depends on how often search engine bots crawl your site. However, you can expedite the process by submitting your updated robots.txt file to Google via the Search Console.

Can I use the robots.txt file to block specific content types?

Yes, you can use the robots.txt file to block specific content types. For example, if you want to block all .jpg images on your site from being crawled, you can add a Disallow directive with a wildcard pattern such as “Disallow: /*.jpg$”. Note that the * and $ pattern-matching characters are supported by major crawlers such as Googlebot, but they are not part of the original robots.txt standard.
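A complete rule blocking all crawlers from .jpg files would therefore look like this:

User-agent: *
Disallow: /*.jpg$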

Craig Buckler

Craig is a freelance UK web consultant who built his first page for IE2.0 in 1995. Since that time he's been advocating standards, accessibility, and best-practice HTML5 techniques. He's created enterprise specifications, websites and online applications for companies and organisations including the UK Parliament, the European Parliament, the Department of Energy & Climate Change, Microsoft, and more. He's written more than 1,000 articles for SitePoint and you can find him @craigbuckler.
