Noindex dilemma - How to please the search engines

Hi all,

I’ve got the following dilemma.

There’s a folder on my Apache server I want to stop Google and the others from indexing. It’s not secure data, merely a bunch of .php files which are “sections” of the main PHP pages which build up the site. Select main PHP files use these via the include method. This way I can easily change around what gets shown, say, in a given column on a page - call it a modular approach.
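To illustrate, a main page looks roughly like this (folder and file names invented for the example):

    <?php // index.php: a "proper" page that pulls in section files ?>
    <html>
    <head>
    <title>Example page</title>
    <meta name="robots" content="index, follow">
    </head>
    <body>
    <div id="sidebar">
    <?php include 'sections/sidebar-news.php'; ?>
    </div>
    <div id="banners">
    <?php include 'sections/banner-rotator.php'; ?>
    </div>
    </body>
    </html>

Swapping what a column shows is then just a matter of pointing an include at a different section file.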

Simple enough right?

Right, the problem though is that almost all of these “sections” in this folder have some links in the code, usually external but also internal. Google Webmaster Tools likes to report these as crawl errors because I block the folder via robots.txt.
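For reference, the block is just the standard robots.txt folder rule (calling the folder /sections/ for the sake of the example):

    User-agent: *
    Disallow: /sections/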

To make Google happy I can unblock this folder (take it out of robots.txt), but I don’t want these PHP pages (sections) indexed. They’re not full pages, just parts of pages as stated above, so indexing them would be wrong.

What would you guys do in this case? The meta noindex method won’t work since, after all the PHP includes on a main page, there would be two or more instances of <meta name="robots" content="noindex"> in the rendered page (index, then noindex x n), which would also confuse Google and probably remove pages from their index.

I can’t block it via robots.txt as I’ll get crawl errors (even though that’s a minor problem).

I guess I could change these section pages to an unusual extension and then just block search engines from indexing anything with an extension such as “.abc”.
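Something like this, assuming the crawler supports wildcards in robots.txt (Google and the other major engines honour * and $, though they’re not part of the original robots.txt standard):

    User-agent: *
    Disallow: /*.abc$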

What would you guys and gals have in mind?

Thanks in advance,

…or am I thinking too far?

If these section PHP pages don’t have <head> or <body> sections (tags) etc. and contain only the code that gets read into the <body> … </body> section of the main page (whatever page the PHP include() is in), then won’t search engines skip indexing them anyway?

Then again, search engines index Word documents, PDFs and various other stuff, so perhaps what I posted above is a valid problem after all?

Thanks again,

I’m not sure exactly what your actual goal is… Could you elaborate, maybe in a more concise manner?

I believe you need to add all pages you do not want to be indexed to the robots.txt file.

Basically my website is made up of “proper” PHP pages, that is, pages that have a proper <head></head><body></body>, including meta tags, content type, language ID and so forth.

I then have a host of other “module” PHP pages that only contain raw HTML/PHP code, i.e. no <head></head><body></body> etc. Each “proper” PHP page includes one or more of these “module” PHP pages. They make up the content in the sidebars as well as display banners. This way I can easily change what gets displayed in a sidebar, or which banner, without any heavy editing involved.
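A “module” file is nothing more than a raw fragment, e.g. (contents invented for illustration):

    <?php // sidebar-news.php: just a fragment, no <head> or <body> ?>
    <h3>Latest news</h3>
    <ul>
    <li><a href="/news/item1.php">Internal link</a></li>
    <li><a href="http://www.example.com/banner-target">External banner link</a></li>
    </ul>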

The problem is that a lot of the “module” PHP pages contain 301 redirects to external websites (i.e. a banner link) that Google Webmaster Tools will say are blocked (thus a crawl error) if I block the folder containing these “module” PHP pages in robots.txt.

I can unblock the folder containing the “module” PHP files, but if I do this I risk Google and others possibly indexing these files, which aren’t complete pages. I’m not sure whether the search engines look for the <head></head><body></body> etc. before determining “right, this is a valid page, we’ll index it” or not.

I don’t want to put the meta noindex tag in these “module” PHP files since, if I do, then after the whole page is rendered (proper + n x module) there’ll be multiple meta tags, some reading index, follow, and all the others reading noindex, nofollow. As a result the search engines will get confused and not index these pages, or, if they’re already indexed, drop them from the search listings.

Does that make more sense now?

Thanks,

nsm,

My read of your first post is that you want to protect a folder from SE indexing. That folder’s contents are ONLY called by your site via PHP include() statements. That, to me, means that the folder is “invisible” to the outside world, i.e., it will NOT be indexed (because there are no links to its files).

The cgi-bin is another folder like this, one which typically exists outside the webspace allocated to a domain name. It is accessed via a server redirect, which makes it unavailable except during calls to a particular script. Since its files can be addressed externally, though, it’s not as “invisible” as your PHP scripts should be.

If you are completely annoyed by indexing problems, then move your “phpincludes” folder outside your webspace (have it at the same level as your httpdocs or public_html folder). PHP can still access the files via include statements but NO ONE can access them directly (via the HTTP protocol).
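Something along these lines (the paths are only an example; adjust to your host’s layout):

    <?php
    // public_html/index.php
    // "phpincludes" sits NEXT TO public_html, not inside it, so it
    // has no URL at all - but PHP can still read it from the filesystem.
    include dirname(__DIR__) . '/phpincludes/sidebar-news.php';
    // or, with a relative path:
    // include '../phpincludes/sidebar-news.php';
    ?>

With no URL pointing at those files there is nothing for a crawler to request, so robots.txt and noindex become irrelevant for them.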

Regards,

DK

This is a non-problem. You can block the “components” folder and it won’t have any effect on the rest of your site. Google doesn’t know the other pages are created by combining these files; the combined pages are independent web pages which are indexed all on their own. So the links they contain are part of indexed pages and won’t be lost by blocking the component files individually.

You only have to think about what the external world sees, not how your code works internally, as far as SEO goes.

Thanks, that makes things clear(er) now. Yes, no links, just PHP includes, so following your logic this would mean search engines won’t index these incomplete PHP pages which get loaded into the main PHP pages (i.e. to add sidebar content, banners etc.).

The folder in which all the add-on PHP files sit is in the root of the server’s webspace. Trying to access this folder from the browser yields a 403 error.
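I should say I’m not 100% sure whether that 403 covers the individual files too or is just directory listings being switched off (e.g. an Options -Indexes line in Apache’s config). Fully denying direct HTTP access would, I gather, take something like this in the folder’s .htaccess (Apache 2.2 syntax; 2.4 uses Require all denied instead):

    Order allow,deny
    Deny from all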

True, the site works fine if I block it out (the components folder) in robots.txt. It’s only Google Webmaster Tools that got me worried, since it still crawls this folder (I guess it’s their way of checking whether the site is spammy, with lots of hidden poor content) and reports that links within these component PHP files can’t be crawled due to being blocked by robots.txt.

Why not put the pages you don’t want indexed/viewed/etc. outside of public view? I.e. if your index.php is in /public_html, why not put your other non-complete files to be included in /include? You could then include them via include('../include/filename.php');, which would not only allow you to include them but also prevent direct access to those files by search engines or anybody else.

So what you’re saying is search engines don’t crawl and/or index any files within a folder on the server providing its name is “include”? …or will I need to block that in robots.txt too, in which case I’ll go back to square one?

Thanks,

Search engines will crawl any web-accessible folder/files regardless of the name, if they can find them (through a link or sitemap, generally).

Okay guys, so the general consensus stemming from all your input is that search engines won’t index any pages that are read into other “main” pages via the PHP include (or require) function? Agreed? In other words, search engines will only index pages we explicitly point them to, either via an href link or via the sitemap XML file.

Thanks,

If they don’t have a clue that a file or folder is there, they have no reason to crawl an imaginary URL. They have enough data to go through as it is. :)

Right, but they will know the folder, and thus the files, are there, as the folder isn’t blocked in robots.txt.

As long as the search engines won’t index any files that aren’t linked to (via hrefs) from other pages but instead are added into those pages (via PHP include or require calls), I’m all right with this. I just don’t want Google or others linking to various PHP files that make up the website but aren’t themselves “the” main website files. Allowing this would be unprofessional.

How would they know it’s there?

Nobody but you can read the PHP source of your pages and see that you’re including files from somewhere.

The only way they will know it is there is if you reference it in the robots.txt file. As long as it isn’t mentioned there they will not know that the folder exists unless something else they can see mentions it.

Right. I was under the impression that search engines crawl every file and folder they find starting from the root folder and digging their way through whatever they find (file or folder).

Given what you guys say, I guess my thinking is wrong; perhaps I’m too paranoid about what is a non-issue, even if I can’t find a logical explanation as to why search engines can’t see folders. Sure, they need to follow links, but basing all crawling activity on this alone seems fragile (no plan B, C etc. if plan A, following links, fails).

Looks like I’ll need to think the same as you guys and as a result put this worry to rest.

Search engine crawlers are no different from your own web browser. They have no ability to do anything beyond making the same HTTP requests you do when you type in URLs or click links. They do not have any access to the computer your website runs on. They have no way to get a list of files or directories that exist on that computer, just like we don’t. Web servers are normal PCs, you can’t get a list of directories on your neighbor’s personal computer, and Google can’t get a list of directories on the computer serving your website.

There is no plan B because the plan B you’re thinking of isn’t technically possible.

Okay. Good news then; it means I don’t have to exclude the folder in robots.txt, and hopefully GWT won’t report crawl errors any more for those “blocked” external links (which were 301s anyway).

Once again thanks for all your input.