robots.txt is a useful file that sits in your website’s root and tells search engine crawlers which pages they may access. One of the most useful declarations is “Disallow”: it stops search engines accessing private or irrelevant sections of your website, e.g.
Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html
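If you want to check which paths a set of Disallow rules really blocks, Python's standard `urllib.robotparser` module can evaluate them for you. The following is only a minimal sketch: it assumes the rules sit under a `User-agent: *` line (rules are ignored until a user agent has been named), and `is_crawlable` is a hypothetical helper name of my own.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the Disallow lines belong to a "User-agent: *" group,
# so they apply to every crawler that respects robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /junk/
Disallow: /temp/
Disallow: /section1/mysecretpage.html
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a crawler obeying robots_txt may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/junk/old.html", "/section1/mysecretpage.html"):
        print(path, is_crawlable(path))
```

Here `/index.html` is reported as crawlable while the two disallowed paths are not. Bear in mind that this only tells you what a well-behaved crawler will fetch; it says nothing about whether a URL can appear in search results.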
You can even block search engines indexing every page on your domain:
User-agent: *
Disallow: /
I’m not sure why anyone would do this, but someone, somewhere will not want their site to appear in search engine results.
However, blocked pages can still appear in Google. Before you step on your soapbox to rant about Google’s violation of robots.txt and the company’s abusive control of the web, take a little while to understand how and why it happens.
Assume you have a page at http://www.mysite.com/secretpage.html containing confidential information about your company’s new Foozle project. You may want to share that page with partners, but don’t want the information to be public knowledge just yet. Therefore, you block the page using a declaration in http://www.mysite.com/robots.txt:
User-agent: *
Disallow: /secretpage.html
A few weeks later, you’re searching for “Foozle” in Google and an entry showing your secret page’s URL appears in the results.
How could this happen? The first thing to note is that Google abides by your robots.txt instructions: it does not index the secret page’s text. However, the URL is still displayed because Google found a link elsewhere, e.g.
<a href="http://mysite.com/secretpage.html">Read about the new Foozle project…</a>
Google therefore associates the word “Foozle” with your secret page. Your URL might appear at a high position in the search results because Foozle is a rarely-used term and your page is the sole source of information.
In addition, Google can show a page description below the URL. Again, this is not a violation of robots.txt rules — it appears because Google found an entry for your secret page in a recognized resource such as the Open Directory Project. The description comes from that site rather than your page content.
Can Pages Be Blocked?
There are several solutions that will stop your secret pages appearing in Google search results.
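1. Set a “noindex” meta tag
Google will not show your secret page or follow its links if you add the following code to your HTML <head>. Note that Google must be able to crawl the page to see the tag, so the page must not also be blocked in robots.txt:
<meta name="robots" content="noindex, nofollow" />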
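The directive names are easy to mistype, and crawlers only recognize them when spelled as single words (`noindex`, `nofollow`). As a rough sanity check, you could parse a page and list the directives in its robots meta tag. This sketch uses Python's standard `html.parser`; the class and function names (`RobotsMetaParser`, `robots_directives`, `blocks_indexing`) are hypothetical and not part of any Google tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives declared in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def blocks_indexing(html):
    """Return True if the page declares a correctly spelled noindex directive."""
    return "noindex" in robots_directives(html)


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, nofollow" /></head>'
    print(sorted(robots_directives(page)), blocks_indexing(page))
```

A tag written as `content="no index, no follow"` would fail the `blocks_indexing` check, which is exactly the kind of slip this is meant to catch.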
2. Use the URL removal tool
Google offer a URL removal tool within their Webmaster Tools.
3. Add authentication
Apache, IIS, and most other web servers offer basic authentication facilities. The visitor must enter a user ID and password before the page can be viewed. This may not stop Google showing the page URL in results, but it will stop unauthorized visitors reading the content.
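In practice you would switch this on in your web server's configuration rather than write it yourself, but it can help to see what the check amounts to. The sketch below validates an HTTP Basic `Authorization` header against a single expected user ID and password. The function name `is_authorized` and the sample credentials are assumptions for illustration only, and Basic authentication should only be used over HTTPS because the credentials are merely base64-encoded.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Check a Basic 'Authorization' header against one credential pair."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


if __name__ == "__main__":
    # Hypothetical credentials a partner's browser might send.
    header = "Basic " + base64.b64encode(b"partner:foozle").decode("ascii")
    print(is_authorized(header, "partner", "foozle"))
```

When the check fails, the server responds with a 401 status, so neither a search engine crawler nor an unauthorized visitor receives the page content.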
4. Review your publishing policies
If you have top-secret content, perhaps you shouldn’t publish those documents on a publicly accessible network!
Craig is a freelance UK web consultant who built his first page for IE2.0 in 1995. Since that time he's been advocating standards, accessibility, and best-practice HTML5 techniques. He's created enterprise specifications, websites and online applications for companies and organisations including the UK Parliament, the European Parliament, the Department of Energy & Climate Change, Microsoft, and more. He's written more than 1,000 articles for SitePoint and you can find him @craigbuckler.