How can search engines crawl content that is protected by a password?

I am in need of suggestions for how I can allow search engines to crawl content that is protected by a password.

I have read several posts on this subject and the consensus is that this violates Google’s policies, since it requires presenting different content to the spider (the content itself) than to the site visitor (a login page). I may be “paddling against the current” on this but I would argue that what Google wants to do is present its users with SERP listings that link to “relevant content”. If that is an accurate understanding of what Google is trying to deliver, then wanting to get content indexed that is very relevant but requires a password is compatible with the spirit (if not the letter) of their “law”. I realize that the Google Webmaster pages say that attempting to show Google a page that is different from what a visitor sees is not “legit”, but there must be many, many other sites that have this legitimate need and it just doesn’t seem right to interpret this need as any sort of attempt to “cheat” for a higher SERP ranking.

Having said that, and also saying that I do not need DoD level security, can anyone give me some ideas on how to accomplish this?

I have some ideas that might solve “half” of the problem. For example, if a document (a PDF file) is placed in an obfuscated folder that is not linked to by any other pages in the site, it will not be “accessible” to a visitor. (I realize that if the URL is known the PDF can be accessed directly but, as I said, I don’t need a really high level of security.) Then (I think), if that URL is placed in a Site Map (that is not shown to the visitor), that will get the document crawled and indexed. When a request for the URL is presented to the server, code can look at the User-Agent to determine whether a spider or a “real person’s browser” is requesting the page and can serve up either the “raw” PDF for indexing or a Login page as appropriate. But this last part has some real problems, since I understand that Google and other search engines sometimes spoof the User-Agent for exactly the purpose of discovering such a technique. Use of the agent’s IP address might be harder to detect, but how do you determine and manage such IP addresses?
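On the IP question, one commonly documented check is a reverse DNS lookup on the requesting address followed by a forward lookup on the returned hostname, which avoids maintaining a list of addresses by hand. The sketch below is only an illustration of that idea: the function names and the hostname suffixes are assumptions (the suffixes should be checked against Google's own documentation), and it says nothing about whether serving different content to a verified crawler is allowed by Google's policies.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_dns(hostname):
    """Return the IPv4 addresses for hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, reverse=_reverse_dns, forward=_forward_dns):
    """Forward-confirmed reverse DNS check for a requesting IP address.

    The lookup functions are parameters so they can be replaced in tests
    or wrapped with a cache; the defaults do live DNS queries.
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    # The PTR name must sit under one of the expected crawler domains.
    if not hostname.lower().rstrip(".").endswith(CRAWLER_SUFFIXES):
        return False
    # The hostname must resolve back to the same address, since anyone
    # who controls the reverse zone for their IP can fake a PTR record.
    return ip in forward(hostname)
```

In practice the result would probably be cached per IP address, since doing two DNS lookups on every request is slow.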

I don’t know if this approach can be fixed up or if some other approach would be better but I would certainly appreciate any suggestions the community has to offer. Again, my intent is in no way to “game the system” and “cheat” my way to a higher SERP ranking but rather to just reconcile two completely legitimate requirements: providing a certain level of document protection and getting the SEO benefit of the rich content that is being protected.

Thank you for your help.

The question is … why would you want Google to be able to index content if the general public then can’t access it? Why does the page need to be password protected?

Hi Stevie,

Thanks for your reply. My answer is in two parts:

  1. The content being protected is a large collection of business methodologies and they will be available to literally anyone after a (no cost) simple registration. The registration is required by my client because they want to track the correlation between interest in the documents and people who have taken training classes, bought books, made previous inquiries or have been consulting clients. But the point is that the content is available to absolutely anyone and it also contains very rich content that is relevant to the focus of the site.

  2. I used the “content that is protected” description as a simple description of a more general problem. Other content (that requires no registration) is to be presented to visitors in a Flash based viewer that allows for full viewing but doesn’t allow for copying or editing or removal of my client’s name and copyright. (We realize that someone can manually retype the content but the client feels that this would be a sufficient barrier to misuse of the materials - not perfect but the best that can be done.) The particular Flash based technology doesn’t allow for Google to index the text content so this problem is like the “protected content” problem. Again, the content is available to “the general public” but it can’t be directly indexed. A solution to the “protected content” would probably also solve this other problem.

So, my thinking is that getting this “available to the general public content” indexed is a legitimate objective but in an attempt to prevent “cheating” Google seems to have made doing this a real challenge. I hope this better explains why this needs to be done. Can you provide any suggestions in that regard?

Thanks for your time.

Making people register to get content that is freely available is very unpopular. A large number of people will assume that you’re just doing it to collect email addresses that you can then send spam to. If people do register, you have no way of knowing whether they’ve used a fake name/email, so you won’t necessarily know if their name already appears on your list. You’re also likely to get people registering multiple accounts when they either forget their username/password, or forget even that they’ve registered before. As such, I don’t know how much meaningful data you can realistically expect to collect.

I would prefer to give out a different ‘landing page’ URL in your classes, books etc - one that is hidden from Google, but that then sends visitors on to the ‘real’ landing page that you steer Google to (or alternatively is just a copy of it). That way, you can count hits on the two entry routes separately. (Although TBH Google Analytics could probably do as good a job of tracking it).
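As a rough illustration of counting the two entry routes separately, here is a minimal sketch that tallies hits from web server access log lines. It assumes the log is in Common Log Format, and both URL paths are made up for the example.

```python
from collections import Counter

# Hypothetical entry routes: one handed out in classes and books, one public.
ENTRY_PATHS = {
    "/welcome-training": "classes and books",
    "/methodologies": "search and direct",
}


def count_entry_hits(log_lines):
    """Tally requests per entry route from Common Log Format lines."""
    hits = Counter()
    for line in log_lines:
        # The request is the first quoted field, e.g. "GET /path HTTP/1.1".
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()
        if len(request) < 2:
            continue
        # Ignore any query string when matching the path.
        path = request[1].split("?", 1)[0]
        if path in ENTRY_PATHS:
            hits[ENTRY_PATHS[path]] += 1
    return hits
```

Calling `count_entry_hits(open("access.log"))` (the file name is a placeholder) would return a counter keyed by route label. This counts raw requests, not unique visitors.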

Flash-based viewers suck. Seriously. If I encounter a site that makes me use a Flash-based viewer to access its content, without any option to go to a user-friendly, easy-to-use version that allows me to interact the way I want to using normal browser controls … well, more often than not, I’ll just close the site down and go somewhere else.

If you make your content available in an awful and inaccessible format like Flash, you are very likely to find that people who rate the content but hate the format do go to the effort of copying it and putting it on their own websites - not necessarily to make a profit, but as a service to all the other people who can now enjoy the content without having to suffer your Flash interface (and also to make the point that you can’t stop people from copying anything that’s on the internet). And because their site will be in an accessible format, Google will find it and will almost certainly rank it higher than yours.

I have already made most of your arguments to the client but simply putting up all these methodologies in “raw” PDF format is not acceptable to them.

I agree with most of what you say but there are a few things in this situation that mitigate some of the concerns that you express. First is that the target audience is not “the general public” but a very specific large enterprise business niche that is populated by people who are highly experienced in certain business skills but who are generally not technology savvy. The client believes that the mind set of this target market is not one to be aggravated by having to register or to use a Flash player to view the content and that they don’t have the time, inclination or skill set to move the content to another web site. (They are however concerned about making it easy for someone to take their content and replace their name and copyright. These are 40+ page documents so manual retyping while viewing a Flash player is unlikely.)

In any case, I must do my job and my job is to meet the client’s requirements (as long as they are not illegal or unethical, of course) so that is why I need to find a solution to this problem.

You could watermark the PDF content to prevent copying.

But, someone who’s determined enough to copy the content will copy the content no matter how difficult you make it.

If you have the full version of Adobe Acrobat, you can set the PDF to prevent printing, copying or extracting of content, so if you set the security settings high enough, you can prevent people from ripping it off by any method short of re-typing or Print Screen + OCR.

Hi Stevie,

Thanks for the suggestion. I had discussed that with the client and they aren’t happy with the fact that there are numerous (even free) apps available that will strip off the password restrictions (for Copy, Print, …). However, I realize that this (along with adding a watermark, per Force Flow’s post) may be the best that we can do.

Thanks for your comments on this.