Old Nov 4, 2009, 11:50   #1
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
How do Webcrawlers work?

Hello everyone. This is my first post on SitePoint. I wasn't sure where to put this thread.

Can someone please explain how Webcrawlers work?

Can Google, for example, see every directory and file in your Web Root?

Sincerely,


Mike
Old Nov 4, 2009, 17:50   #2
ShayneTilley
Team SitePoint
 
Join Date: Apr 2007
Posts: 1,058
Hi Mike, web crawlers are used by third parties to collect information about each page on your site, predominantly by search engines trying to match pages on your site with what people are searching for.

Putting it simply, a crawler will start on your home page, index that, then move to the pages you've linked to from your front page, index those, then go to the pages they link to, and so on ...
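The link-following behaviour described above can be sketched as a breadth-first walk. This is only an illustration under stated assumptions: the site is a hypothetical in-memory dictionary (`SITE`) rather than real HTTP fetches, and the names `LinkExtractor` and `crawl` are made up for this example, not taken from any real search engine.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL path -> HTML body.
# A real crawler would fetch each page over HTTP instead.
SITE = {
    "/": '<a href="/about">About</a> <a href="/events">Events</a>',
    "/about": '<a href="/">Home</a> <a href="/contact">Contact</a>',
    "/events": '<a href="/about">About</a>',
    "/contact": "<p>No links here.</p>",
    "/config.inc.php": "<p>Nothing links to this page.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start="/"):
    """Visit pages breadth-first, following links, each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        path = queue.popleft()
        html = SITE.get(path)
        if html is None:
            continue  # broken link: nothing to index
        order.append(path)  # "index" the page
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl())
```

In this toy site the page that nothing links to is never reached by link-following alone. That does not make it private: anyone who guesses or learns the URL can still request it.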

You can control which pages are indexed through the robots.txt file or with some code on each individual page, and you can also control which specific crawlers are allowed to crawl your site, e.g. you might not want Google to crawl your site.
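As a rough illustration of per-crawler rules, the sketch below feeds a made-up robots.txt to Python's standard `urllib.robotparser` and asks what a rule-abiding crawler would conclude. The rules, the `/private/` path and the `SomeOtherBot` name are assumptions for the example only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Googlebot out entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a well-behaved crawler would decide for each request.
print(parser.can_fetch("Googlebot", "/index.php"))
print(parser.can_fetch("SomeOtherBot", "/index.php"))
print(parser.can_fetch("SomeOtherBot", "/private/members.php"))
```

These answers only describe what a compliant crawler chooses to do. Nothing in robots.txt stops a client from requesting the page anyway.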

You'll probably hear about things like adding a 'site map' to ensure all pages are crawled, but these days crawlers are smarter and a well structured site won't need a map to get the whole site indexed.

If you want to see what a crawler sees, turn off CSS, images and JavaScript and browse your site.

Are you having trouble getting your site indexed in search engines or was it more general interest?
Old Nov 4, 2009, 19:35   #3
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by ShayneTilley View Post
Are you having trouble getting your site indexed in search engines or was it more general interest?
Hi Shayne. Thank you for the response!

To answer your question, I guess my question is a bit broader, and pertains to a few websites I am working on - some for me and some to help others out for free (e.g. church).

Originally I wasn't crazy about letting Google molest - I mean "crawl" - my website projects.

However, last week I had a humbling experience...

I had been helping build a website for a non-profit organization dear to my heart, and after building a great website - at least for a web newbie like me - I discovered that there are LOTS of dummies out there that will type your URL into a Search Engine Box on Yahoo or Google and have no clue what the "Address Bar" is?! So, unless the website is "indexed" on Google, Yahoo, etc. then nobody knows how to find the website?!

Our ISP is telling us it could take 2 MONTHS before our website gets listed after we manually register with all of the search engines or pay them like $40 to do it for us?!

That is the dumbest IT mistake I have ever made?!

Now my church and this other organization have to spend all of their time explaining to interested people what an "Address Bar" is, and how to type the URL in the Address Bar. And how they should NOT be searching when we give them a flyer with the church's URL already on it?!

(That is like me giving you my phone number, and you going home and doing a reverse lookup to find my name so you can then do a regular lookup to find the phone number I just gave you?! Man are people dense sometimes...)

So anyways...

In one case we decided to break down and pay the ISP for their service to get our website "listed" - even though that will take 2 months. However, now that I have to change the Robots.txt file from blocking everyone to opening things up, I am concerned about security issues and that maybe the Spiders, Bots, Crawlers, etc will somehow expose our Church's site and its members' info?!

Does that make sense?

It scares me silly to think that Google might get access to and list our Database Config File or Website Config File or anything else that is sensitive...

I have been working with computers most of my life but have never done anything online until now, and I am really worried that I'll do something seemingly innocent that will expose our entire congregation's dataset to the outside world.

Sorry for the rant, but that is where my original post was intended to eventually go.

Sincerely,


Mike

P.S. I hear God really socks it to IT people who compromise the data of his "flock"...
Old Nov 5, 2009, 00:14   #4
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
As Shayne said, if you turn off images, Javascript and stylesheets, you can imagine what a webcrawler sees.

SO... if your website config and database files are THAT open that a webcrawler can link to it... so too is it open to any random person out there! And I'd be more scared of that!

If your config files are sufficiently tucked away that a random person cannot access them, the webcrawler cannot access them either. That is the case if they are password protected, above your document root, etc.

Google isn't the devil.

--

Whatever you do, do NOT rely on nofollow links or robots.txt files to be the only barrier blocking your sensitive data!

Those nofollow instructions are merely suggestions. Good, rule-abiding webcrawlers, like Google's, will ignore the links you tell them to ignore.

Malicious webcrawlers will do what they want, regardless of your instructions. So again, just as you would for human users, password-protect or otherwise secure your sensitive data!

Old Nov 5, 2009, 01:02   #5
annescoffield
SitePoint Evangelist
 
Join Date: Jul 2008
Posts: 459
Search engine bots crawl from one webpage to another through links, so it's important to check our internal pages and make sure there are no broken links.
Old Nov 5, 2009, 01:15   #6
miak123
SitePoint Member
 
Join Date: Sep 2009
Posts: 6
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
Old Nov 5, 2009, 01:50   #7
juve20
SitePoint Addict
 
Join Date: Jul 2005
Posts: 269
Good post! I just thought they only visit your site and that's it, lol.
Old Nov 5, 2009, 02:27   #8
ShayneTilley
Team SitePoint
 
Join Date: Apr 2007
Posts: 1,058
Mike, as Shaun said, you want to make sure all your sensitive data, anything that shouldn't be available to everyone, is securely tucked away behind a security wall where a user name and password are required for access.

You can always make only your front page open - so people can find that through a search engine and that's it.

Crawlers are getting better at finding pages within a site, regardless of whether there are links to them or not.
Old Nov 5, 2009, 04:02   #9
exoticshawls
SitePoint Enthusiast
 
Join Date: Aug 2009
Posts: 63
very helpful post..thanks buddy
Old Nov 5, 2009, 10:37   #10
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Wow! Looks like I opened up a can of worms now!

Quote:
Originally Posted by Shaun(OfTheDead) View Post
As Shayne said, if you turn off images, Javascript and stylesheets, you can imagine what a webcrawler sees.
So, am I to take that to mean that a Web Crawler will *only* see HTML pages and cannot see non-HTML pages (e.g. JavaScript, CSS, JPEG, PHP, etc)?



Quote:
Originally Posted by Shaun(OfTheDead);
SO... if your website config and database files are THAT open that a webcrawler can link to it... so too is it open to any random person out there! And I'd be more scared of that!
Gee... I/we may have some major issues then... (Now I know why I preferred developing in a Client/Server environment.)

Right now, my church and the non-profit are strapped for cash and so I got them "Shared Hosting" accounts at GoDaddy. Everyone at GoDaddy tells me that shared hosting is perfectly safe if we aren't doing credit card processing, and some have even said that a lot of people use shared hosting for e-commerce.

These websites won't involve $$$ at this point, but they will contain *sensitive* information like Members' Names, Addresses, E-mails, Passwords, and so on. And personally I think any information people trust you with should be protected, because I'd be mad as can be if someone gave out my contact info online!

I am *hoping* that I can build a reasonably secure website - using our shared hosting account - for church members to create accounts and register for upcoming events and service projects.

One big problem with a Shared Account is that all you get is a "Web Root" and cannot go above that. GoDaddy said that as long as I obscure where I put my Database_Connection file, I should be okay, although I know the PHP/MySQL books I have read suggest putting Database Connection and Website Config files outside the Web Root.

Back on topic...

So how do Web Crawlers relate to and possibly jeopardize the following files?? (e.g. "database_connect.php", "config.inc.php", ".htaccess", and "php.ini")


Quote:
If your config files are sufficiently tucked away that any random person cannot access it... neither can the webcrawler access it. So if they are password protected or above your document root, etc.
Okay, but the thing I want to be 100% certain of is whether or not a Web Crawler can "index" and "look inside" sensitive files in our Web Root (e.g. "database_connect.php", "config.inc.php", ".htaccess", and "php.ini")??

The more I think about it, the more Web Crawlers do seem evil if they index every single file in your Web Root?!


Quote:
Whatever you do, do NOT rely on nofollow links or robots.txt files to be the only barrier blocking your sensitive data!

Those nofollow instructions are merely suggestions. Good, rule-abiding webcrawlers, like Google's, will ignore the links you tell them to ignore.

Malicious webcrawlers will do what they want, regardless of your instructions. So again, just as you would for human users, password-protect or otherwise secure your sensitive data!
Well, I am storing User Log-In info in a MySQL database and encrypting the passwords, but if Web Crawlers tell the whole world where my database is and give out the connection info, then someone could easily hack into the database?!

I am starting to feel nauseous this morning...

I hope this isn't as bad as it is sounding, because I thought I was doing things in a secure manner?! And combined with the fact that we have a MAJOR ISSUE with no one being able to easily locate our websites, I need to get us listed on Google and Yahoo asap!!!

Hopefully there is a relatively easy solution here...

Sincerely,


Mike
Old Nov 5, 2009, 10:49   #11
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by ShayneTilley View Post
Mike, as Shaun said, you want to make sure all your sensitive data, anything that shouldn't be available to everyone, is securely tucked away behind a security wall where a user name and password are required for access.
Shayne, I will let you read my earlier reply, but to touch on what I said earlier, we chose to use "Shared Hosting" for now to see how the website is received and then maybe we can upgrade to a "Virtual Server" or a "Dedicated Server". But for now, shared hosting is really the only option until early 2010.

I do have to log in to see my files in our GoDaddy account, but I suppose that if you knew the file structure I created, you could easily snoop around by modifying the URL, right?

Then again, the book I was studying out of said that as long as you place all of your database connection and website configuration settings in a PHP file, a hacker cannot view the contents of these files...

(But I am feeling way over my head on all of this?!)

All I know is that the idea of giving "Web Crawlers" permission to map out our entire website and make it public on the Internet sounds really scary!!! (It feels as bad as letting Google Maps post pictures of your Street and House on the Internet for everyone to see?!)


Quote:
You can always make only your front page open - so people can find that through a search engine and that's it.
That is a good idea, and I had the same idea. However, if I only let our index.php be "indexed", can the Web Crawler (and bad people) just jump on all of the links on the Home Page (i.e. index.php) and then methodically map out where every file is in the Web Root??


Quote:
Crawlers are getting better at finding pages within a site, regardless of whether there are links to them or not.
That, and Shaun(OfTheDead) said that "Bad Bots" will just ignore things like robots.txt and "No Follow" comments anyways, right?

Man, what a mess...

Sincerely,


Mike

Last edited by MikeTheMechanic; Nov 5, 2009 at 10:54. Reason: Correction
Old Nov 5, 2009, 14:30   #12
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
They aren't bad. Without them, surfing the 'Net would be near impossible. Indexing pages is the only way today's search engines can know what's out there.

They can only see your parsed PHP... that is, the output of your PHP files. They won't see any of the variables or statements or database connections or anything. Only the output; not the PHP code itself. Same as it would be for any other web-browser.

Therefore, just as your hosts re-assured you, if your database connection file doesn't output/echo anything, there'd be nothing for the bots to read.

ALSO... if you password protect a folder, your server will not give away any of the contents inside; not to bots nor web-browsers nor anyone (unless they have the password, which they won't). The contents will be off limits. Your server will keep them that way.

As far as I know, web-crawlers cannot index your database directly; only the data you output from it via a web-page. But not the database itself. But even if they could, they'd again be stuck without the passwords, which they won't have if you keep 'em safe.

Breathe.

Don't worry, everything will be fine.


Quote:
Sincerely,

Mike
Earnestly,
Shaun

Old Nov 5, 2009, 14:41   #13
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
Quote:
Originally Posted by MikeTheMechanic
So, am I to take that to mean that a Web Crawler will *only* see HTML pages and cannot see non-HTML pages (e.g. JavaScript, CSS, JPEG, PHP, etc)?
They can read your Javascript but most web-crawlers today (as far as I know) are not able to execute it. Some can.

They can read your CSS, but again, most would not execute it as far as I know, nor do they have a reason to. CSS is for aesthetics. Bots don't have sight. They don't care about aesthetics, only data.

They can maybe read the data that make up a JPEG or other image file; but again... Bots don't have eyes. They can't look at the image and see what it is. They can assume sometimes, based on your alt tags or the data around, but they can't see pictures. Only data.

And PHP, as I said, they can only see what is parsed and output by you. They cannot see your code itself. Your server will not give that up. Your server interprets the code and only gives away the parsed data.

Old Nov 5, 2009, 15:16   #14
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by Shaun(OfTheDead) View Post
They aren't bad. Without them, surfing the 'Net would be near impossible. Indexing pages is the only way today's search engines can know what's out there.

They can only see your parsed PHP... that is, the output of your PHP files. They won't see any of the variables or statements or database connections or anything. Only the output; not the PHP code itself. Same as it would be for any other web-browser.
Okay, that is good.

I actually did a test this morning, and what you say seems to be true. If I paste in the URL to say, my database_connection.php file, all that comes up in the browser is a white screen since the file is pure PHP with no associated output.


Quote:
ALSO... if you password protect a folder, your server will not give away any of the contents inside; not to bots nor web-browsers nor anyone (unless they have the password, which they won't). The contents will be off limits. Your server will keep them that way.
Okay, but two follow-up questions...

1.) Do you think I can password-protect a file in my Web Root since we have a Shared Plan?

Is that something that requires root access at the server level, or is it a fairly common thing for a Web Host to let users do in their accounts?


2.) If I password protect a directory called, maybe, "Settings", wouldn't that also stop PHP from accessing it?

The logic being if the password locks you and me out, wouldn't it break, say, a PHP "include"?


Quote:
As far as I know, web-crawlers cannot index your database directly; only the data you output from it via a web-page. But not the database itself. But even if they could, they'd again be stuck without the passwords, which they won't have if you keep 'em safe.
I'm not worried about the database - well not much - just more all of those plain-text files like my Database Connection Setting (i.e. username and password), and .htaccess and php.ini


Quote:
Breathe.

Don't worry, everything will be fine.
Ha ha. Okay, just wanted to be safe versus sorry.

So one more question...

Would it still be prudent to either set up my Robots.txt file to "disregard" directories like...

/database
/images
/includes

and/or password-protect them, or is that not necessary?

Sincerely,


Mike
Old Nov 5, 2009, 18:46   #15
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
Quote:
Originally Posted by MikeTheMechanic
Okay, that is good.

I actually did a test this morning, and what you say seems to be true. If I paste in the URL to say, my database_connection.php file, all that comes up in the browser is a white screen since the file is pure PHP with no associated output.
:)

Quote:
1.) Do you think I can password-protect a file in my Web Root since we have a Shared Plan?
Of course!

It should be in your hosting control panel somewhere. If you can't find it, a quick question to your host's customer service should guide you right to it.


Quote:
2.) If I password protect a directory called, maybe, "Settings", wouldn't that also stop PHP from accessing it?

The logic being if the password locks you and me out, wouldn't it break, say, a PHP "include"?
Nope.

Quote:
I'm not worried about the database - well not much - just more all of those plain-text files like my Database Connection Setting
Don't save them as plain text! Save them as .php.

Quote:
and .htaccess and php.ini
The server would forbid those (403 error).

Quote:
Would it still be prudent to either set up my Robots.txt file to "disregard" directories like...
I actually don't bother to set up robots.txt files. I used to before, but it became too tedious and I didn't see any real benefit.

Quote:
and/or password-protect them, or is that not necessary?
If it'd make you feel more comfortable, password protect them. But as long as there's no sensitive data output when the files are parsed, it may be an irrelevant measure.

Quote:
Sincerely,

Mike
Hungrily,
Shaun

Old Nov 5, 2009, 19:11   #16
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by Shaun(OfTheDead) View Post
Of course!

It should be in your hosting control panel somewhere. If you can't find it, a quick question to your host's customer service should guide you right to it.
Okay, I will ask my ISP tonight after supper.


Quote:
Don't save them as plain text! Save them as .php.
Oops. I didn't mean plain-text files literally, I just meant PHP files that aren't password protected and that contain visible, plain-text.


Quote:
The server would forbid those (403 error).
You lost me here.

Were you saying that they cannot access the .htaccess and php.ini files directly through the web browser or something else?


Quote:
If it'd make you feel more comfortable, password protect them. But as long as there's no sensitive data output when the files are parsed, it may be an irrelevant measure.
The only files that have sensitive content would be my configuration files which are always .php

(Well, I guess it would be bad news if someone hacked the .htaccess or php.ini but it sounds like that is harder to do.)


Quote:
Hungrily,
Shaun
Ha ha. You certainly have a unique way of signing your messages?!

Sincerely,


Mike
Old Nov 5, 2009, 19:50   #17
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
Quote:
Originally Posted by MikeTheMechanic
Okay, I will ask my ISP tonight after supper.
Your ISP is hosting your website?

Quote:
Oops. I didn't mean plain-text files literally, I just meant PHP files that aren't password protected and that contain visible, plain-text.
Well, unless there's sensitive information output on there (like passwords) then having them indexed would be a good thing!

You'll get more website traffic for your customer. More people will see your site. People searching for your church online will be more likely to find it.

Repeat after me, "Indexing is good!"


Quote:
You lost me here.
Browse to your .htaccess file and see what happens :)

A bot will get the same message.

Quote:
Were you saying that they cannot access the .htaccess and php.ini files directly through the web browser or something else?
A bot is essentially an automated web-browser.

Any web-browser, Firefox, Safari, a webcrawler, will not see those files directly.

They can only see what the server serves them. And the server is trained to not serve those files.
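The idea that the server decides what to hand out, whoever is asking, can be sketched as a simple per-request check. This is an assumption-laden illustration, not how any particular host is configured: the names `FORBIDDEN_NAMES`, `is_forbidden` and `status_for` are invented here, and whether a host also blocks php.ini varies from host to host.

```python
import posixpath

# Assumed deny-list for illustration only. Many Apache setups refuse
# files whose names start with ".ht"; blocking php.ini is a host choice.
FORBIDDEN_NAMES = {"php.ini"}


def is_forbidden(url_path):
    """Return True if the server should refuse to serve this path."""
    name = posixpath.basename(url_path)
    return name.startswith(".ht") or name in FORBIDDEN_NAMES


def status_for(url_path):
    """HTTP status a browser or a bot would receive for the path."""
    return 403 if is_forbidden(url_path) else 200


for path in ("/.htaccess", "/php.ini", "/index.php"):
    print(path, status_for(path))
```

The check never looks at who is asking. A browser, a search engine bot and a malicious bot all get the same 403 for a forbidden file.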


Quote:
The only files that have sensitive content would be my configuration files which are always .php
Well there you go.

Quote:
(Well, I guess it would be bad news if someone hacked the .htaccess or php.ini but it sounds like that is harder to do.)
A webcrawler can't hack. A webcrawler can only browse, and is subject to what the server allows it to see.

Quote:
Ha ha. You certainly have a unique way of signing your messages?! :)
Generally,
Shaun

Old Nov 5, 2009, 20:25   #18
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by Shaun(OfTheDead) View Post
Your ISP is hosting your website?
Oops. No, sorry, got my acronyms mixed up!

My ISP at home is AT&T and our church's Web Host is GoDaddy.


Quote:
Well, unless there's sensitive information output on there (like passwords) then having them indexed would be a good thing!

You'll get more website traffic for your customer. More people will see your site. People searching for your church online will be more likely to find it.

Repeat after me, "Indexing is good!"
Ha ha. Yah, I see you keep saying that.


Quote:
Browse to your .htaccess file and see what happens

A bot will get the same message.
Okay, you are correct there, however, if I type in...

www.GreatChurch.org/php.ini

or

www.GreatChurch.org/robots.txt

all of the file contents are displayed.

Not sure how bad that is, but it can't be a good idea telling people how your Web Root is set up?!


Quote:
A webcrawler can't hack. A webcrawler can only browse, and is subject to what the server allows it to see.
Okay, I now see that it is a Read-Only thing - just like my browser - however, what about...

1.) Exposing our Web Root's Directory Structure?

2.) Files that can be "read" when they are browsed (e.g. php.ini, robots.txt, etc)?

Sincerely,


Mike
Old Nov 5, 2009, 20:41   #19
My220x
SitePoint Zealot
 
Join Date: Dec 2008
Location: United Kingdom
Posts: 136
Robots.txt shouldn't contain anything important, so don't worry if people can see it. As for php.ini, it's showing nothing private.
Old Nov 5, 2009, 20:58   #20
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
Quote:
Originally Posted by MikeTheMechanic
if I type in...

www.GreatChurch.org/robots.txt

all of the file contents are displayed.
Yeah, so what?

There's nothing really useful to learn from your robots.txt file other than the name of some of your folders maybe.

As I'd said before, I'd hope that a robots.txt file isn't the only thing standing in the way of your sensitive data!

--

They're a waste of time (robots files). Maybe someone more knowledgeable can correct me, but in my view they are pointless.


Interesting. On my host, the php.ini file is forbidden too. But...

Quote:
but it can't be a good idea telling people how your Web Root is set up?!
...Most web roots are set up the same way anyhow, eh.

The file structure of a server is no secret. That still doesn't benefit someone who has no access.

Old Nov 5, 2009, 23:11   #21
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Quote:
Originally Posted by Shaun(OfTheDead) View Post
Yeah, so what?

There's nothing really useful to learn from your robots.txt file other than the name of some of your folders maybe.
Okay.


Quote:
Interesting. On my host, the php.ini file is forbidden too. But...
Probably an Apache setting.


Quote:
Most web roots are set up the same way anyhow, eh.

The file structure of a server is no secret. That still doesn't benefit someone who has no access.
Okay, I'll trust you on this. (Just remember, that if you are wrong, my "boss" (i.e. God) might be very mad...)

I guess it sounds like I can "open the flood gates" to the world tomorrow and use GoDaddy's Site Rank Wizard contraption to help get our website indexed with the major search engines and maybe even ranked down the road.

If anyone sees anything that I am doing - or that has been said - as inaccurate or insecure, please speak up. Any coaching on all of this is very welcome!

I'm really tired and off to bed now, but I guess tomorrow I will create a basic Robots.txt file just to eliminate over-indexing of some directories and then click "submit" with GoDaddy and hope things work out as intended?!

Sincerely,


Mike
Old Nov 6, 2009, 09:54   #22
Shaun(OfTheDead)
The World is Very Sexy
SitePoint Award Recipient
 
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
Quote:
Originally Posted by MikeTheMechanic
Okay, I'll trust you on this. (Just remember, that if you are wrong, my "boss" (i.e. God) might be very mad...)
Well that's a risk I'll gladly accept.

Quote:
If anyone sees anything that I am doing - or that has been said - as inaccurate or insecure, please speak up.
So paranoid!

Don't worry, man. It will be fine.


Quote:
I guess it sounds like I can "open the flood gates" to the world tomorrow and use GoDaddy's Site Rank Wizard contraption to help get our website indexed with the major search engines and maybe even ranked down the road.
That sounds like a good plan

Good luck!


Quote:
Sincerely,

Mike
Ironically,
Shaun

Old Nov 6, 2009, 12:26   #23
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
Thanks for all of the comments, everyone!

Sincerely,


Mike
Old Nov 6, 2009, 12:43   #24
MikeTheMechanic
SitePoint Enthusiast
 
Join Date: Nov 2009
Location: Oregon
Posts: 48
You know, I actually have more questions. I am not sure where to post them or if I should just tack them on here since they seem to be related?!

I'll ask here, and if I should start a new thread, just say so, Mods.

On the website for my church, and also for this non-profit group, there will definitely be areas that I do not want people to know about or for Web Crawlers to index...

I was wondering if it would be easier if I created a directory structure like this in my Web Root...

WebRoot

WebRoot/Public <--- For everyone to see and to be indexed

WebRoot/MembersOnly <--- Registered members and should not be indexed

WebRoot/Utilities <--- Admin stuff like "config" files, images, includes, etc. that also should not be indexed


Here is my logic...

Web pages like Registration Pages, Password Resets, Log-In, User Preferences, and things which are "member-only" areas, do not need to be indexed by Web Crawlers since they are either "utility" type pages that no one would search for in Google, or that are for only members to see, and the members would already know where to look!!

Likewise, "utility" web pages and directories like all of the website's images, include files, configuration files, and Administrator web pages also do not need to be indexed by Google since they really are not for the public's eyes to see.

By segmenting all of our website's files into this structure, it would be very easy to change content from "public" to "private" and vice versa as things change.
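With a layout like the one proposed above, the matching robots.txt would be short. The sketch below is only an illustration: the directory names come from the post, while the file names and the `AnyBot` user agent are hypothetical. It uses Python's standard `urllib.robotparser` to show what a rule-abiding crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the proposed layout: everything is open
# to crawlers except the members-only and utility directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /MembersOnly/
Disallow: /Utilities/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="AnyBot"):
    """Would a rule-abiding crawler request this path?"""
    return rules.can_fetch(agent, path)


print(allowed("/Public/index.php"))
print(allowed("/MembersOnly/login.php"))
print(allowed("/Utilities/config.inc.php"))
```

As noted earlier in the thread, this only asks well-behaved crawlers to stay out. The members-only and utility areas still need real protection, such as a login or a password-protected directory.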

Does that seem like a good approach?

(I am very big on spending extra time up front organizing things so down the road the website (or whatever) remains orderly and is easier to scale.)

Interested in everyone's feedback!

Sincerely,


Mike
Old Nov 6, 2009, 18:07   #25
tomovuk
WEBINSANE MEDIA
 
Join Date: Oct 2005
Location: Montenegro
Posts: 770
I strongly suggest organizing your web site through Google Webmaster Tools.