#1 |
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
How do Webcrawlers work?
Hello everyone. This is my first post on SitePoint. I wasn't sure where to put this thread.
Can someone please explain how Webcrawlers work? Can Google, for example, see every directory and file in your Web Root?

Sincerely, Mike
|
|
|
|
|
#2 |
|
Team SitePoint
Join Date: Apr 2007
Posts: 1,058
|
Hi Mike, Webcrawlers are used by third parties to collect information about each page on your site, predominantly by search engines trying to match pages on your site with what people are searching for.
Putting it simply, a crawler will start on your home page, index that, then move to the pages you've linked to from your front page, index those, then go to the pages those link to, and so on...

You can control which pages are indexed through the robots.txt file or with some code on each individual page, and you can also control which specific crawlers are allowed to crawl the site, i.e. you might not want Google to crawl your site.

You'll probably hear about things like adding a 'site map' to ensure all pages are crawled, but these days crawlers are smarter and a well-structured site won't need a map to get the whole site indexed.

If you want to see what a crawler sees, turn off CSS, images and JavaScript and browse your site.

Are you having trouble getting your site indexed in search engines, or was it more general interest?
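The "start on the home page, then follow the links" idea can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `SITE_LINKS` table and its page paths are made up, and a real crawler would fetch pages over HTTP rather than read a dictionary.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about", "/events"],
    "/about": ["/", "/contact"],
    "/events": ["/events/2009", "/"],
    "/events/2009": [],
    "/contact": [],
    "/orphan": ["/"],  # no page links here
}

def crawl(start, links):
    """Breadth-first walk: index a page, then queue every page it links to."""
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        indexed.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed

print(crawl("/", SITE_LINKS))
# ['/', '/about', '/events', '/contact', '/events/2009']
```

Note that `/orphan` never shows up: a crawler that only follows links cannot reach a page nothing links to, although real crawlers may learn about URLs in other ways.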
|
|
|
|
|
#3 | |
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Quote:
To answer your question, I guess my question is a bit broader, and pertains to a few websites I am working on - some for me and some to help others out for free (e.g. church).

Originally I wasn't crazy about letting Google molest - I mean "crawl" - my website projects. However, last week I had a humbling experience...

I had been helping build a website for a non-profit organization dear to my heart, and after building a great website - at least for a web newbie like me - I discovered that there are LOTS of dummies out there that will type your URL into a Search Engine Box on Yahoo or Google and have no clue what the "Address Bar" is?! So, unless the website is "indexed" on Google, Yahoo, etc., nobody knows how to find the website?!

Our ISP is telling us it could take 2 MONTHS before our website gets listed after we manually register with all of the search engines or pay them like $40 to do it for us?! That is the dumbest IT mistake I have ever made?!

Now my church and this other organization have to spend all of their time explaining to interested people what an "Address Bar" is, and how to type the URL in the Address Bar. And how they should NOT be searching when we give them a flyer with the church's URL already on it?! (That is like me giving you my phone number, and you going home and doing a reverse lookup to find my name so you can then do a regular lookup to find the phone number I just gave you?! Man are people dense sometimes...)

So anyways... In one case we decided to break down and pay the ISP for their service to get our website "listed" - even though that will take 2 months. However, now that I have to change the Robots.txt file from blocking everyone to opening things up, I am concerned about security issues and that maybe the Spiders, Bots, Crawlers, etc. will somehow expose our Church's site and expose its members' info?! Does that make sense?

It scares me silly to think that Google might get access to and list our Database Config File or Website Config File or anything else that is sensitive... I have been working with computers most of my life but have never done anything online until now, and I am really worried that I'll do something seemingly innocent that will expose our entire congregation's dataset to the outside world.

Sorry for the rant, but that is where my original post was intended to eventually go.

Sincerely, Mike

P.S. I hear God really socks it to IT people who compromise the data of his "flock"...
|
|
|
|
|
|
#4 |
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
As Shayne said, if you turn off images, JavaScript and stylesheets, you can imagine what a webcrawler sees.
SO... if your website config and database files are THAT open that a webcrawler can link to them... so too are they open to any random person out there! And I'd be more scared of that!

If your config files are sufficiently tucked away (password-protected, above your document root, etc.) that any random person cannot access them... neither can the webcrawler. Google isn't the devil.

--

Whatever you do, do NOT rely on nofollow links or robots.txt files to be the only barrier blocking your sensitive data! Those instructions are merely suggestions. Good, rule-abiding webcrawlers, like Google's, will ignore the links you tell them to ignore. Malicious webcrawlers will do what they want, regardless of your instructions. So again, just as you would for human users, password-protect or otherwise secure your sensitive data!
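To make the "merely suggestions" point concrete, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the `ExampleBot` name and the example.org URLs are all invented for illustration. The key observation is that the check happens inside the crawler: the server does nothing to enforce it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def polite_crawler_may_fetch(url, robots_txt=ROBOTS_TXT, user_agent="ExampleBot"):
    """A rule-abiding crawler consults robots.txt before requesting a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def rude_crawler_may_fetch(url):
    """A malicious crawler never looks at robots.txt at all."""
    return True

url = "http://www.example.org/private/members.csv"
print(polite_crawler_may_fetch(url))  # False
print(rude_crawler_may_fetch(url))    # True
```

If `/private/members.csv` were readable without a password, the second crawler would simply download it, which is why the real barrier has to be on the server.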
|
|
|
|
|
#5 |
|
SitePoint Evangelist
Join Date: Jul 2008
Posts: 459
|
Search engine bots crawl from one webpage to another through links, so it's important to check our internal pages and make sure there are no broken links.
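A broken-link check along those lines might look like the following Python sketch. It assumes the pages are already available as HTML strings in a hypothetical `PAGES` dictionary keyed by path; a real checker would request each URL and look at the HTTP status instead.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical site content, keyed by path.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/events">Events</a>',
    "/about": '<a href="/">Home</a> <a href="/staff">Staff</a>',
    "/events": '<a href="/">Home</a>',
}

def find_broken_links(pages):
    """Return (page, href) pairs whose target is not a known page."""
    broken = []
    for path, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            if href not in pages:
                broken.append((path, href))
    return broken

print(find_broken_links(PAGES))  # [('/about', '/staff')]
```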
|
|
|
|
|
|
#6 |
|
SitePoint Member
Join Date: Sep 2009
Posts: 6
|
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.
|
|
|
|
|
|
#7 |
|
SitePoint Addict
Join Date: Jul 2005
Posts: 269
|
Good post! I just thought they only visit your site and that's it, lol.
|
|
|
|
|
|
#8 |
|
Team SitePoint
Join Date: Apr 2007
Posts: 1,058
|
Mike, as Shaun said, you want to make sure all your sensitive data, anything that shouldn't be available to everyone, is securely tucked away behind a security wall where a user name and password are required for access.
You can always make only your front page open - so people can find that through a search engine and that's it. Crawlers are getting better at finding pages within a site, regardless of whether there are links to them or not.
|
|
|
|
|
#9 |
|
SitePoint Enthusiast
Join Date: Aug 2009
Posts: 63
|
very helpful post..thanks buddy
|
|
|
|
|
|
#10 | ||||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Wow! Looks like I opened up a can of worms now!
Quote:
Quote:
(Now I know why I preferred developing in a Client/Server environment.)

Right now, my church and the non-profit are strapped for cash and so I got them "Shared Hosting" accounts at GoDaddy. Everyone at GoDaddy tells me that shared hosting is perfectly safe if we aren't doing credit card processing, and some have even said that a lot of people use shared hosting for e-commerce.

These websites won't involve $$$ at this point, but they will contain *sensitive* information like Members' Names, Addresses, E-mails, Passwords, and so on. And personally I think any information people trust you with should be protected, because I'd be mad as can be if someone gave out my contact info online!

I am *hoping* that I can build a reasonably secure website - using our shared hosting account - for church members to create accounts and register for upcoming events and service projects.

One big problem with a Shared Account is that all you get is a "Web Root" and you cannot go above that. GoDaddy said that as long as I obscure where I put my Database_Connection file, I should be okay, although I know the PHP/MySQL books I have read suggest putting Database Connection and Website Config files outside the Web Root.

Back on topic... So how do Web Crawlers relate to and possibly jeopardize the following files?? (e.g. "database_connect.php", "config.inc.php", ".htaccess", and "php.ini")

Quote:
The more I think about it, the more Web Crawlers do seem evil if they index every single file in your Web Root?!

Quote:
I am starting to feel nauseous this morning... I hope this isn't as bad as it is sounding, because I thought I was doing things in a secure manner?! And combined with the fact that we have a MAJOR ISSUE with no one being able to easily locate our websites, I need to get us listed on Google and Yahoo asap!!! Hopefully there is a relatively easy solution here...

Sincerely, Mike
||||
|
|
|
|
|
#11 | |||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Quote:
I do have to log in to see my files in our GoDaddy account, but I suppose that if you knew the file structure I created, you could easily snoop around by modifying the URL, right?

Then again, the book I was studying out of said that as long as you place all of your database connection and website configuration settings in a PHP file, a hacker cannot view the contents of these files... (But I am feeling way over my head on all of this?!)

All I know is that the idea of giving "Web Crawlers" permission to map out our entire website and make it public on the Internet sounds really scary!!! (It feels as bad as letting Google Maps post pictures of your Street and House on the Internet for everyone to see?!)

Quote:
Quote:
Man, what a mess...

Sincerely, Mike

Last edited by MikeTheMechanic; Nov 5, 2009 at 10:54. Reason: Correction
|||
|
|
|
|
|
#12 | |
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
They aren't bad. Without them, surfing the 'Net would be near impossible. Indexing pages is the only way today's search engines can know what's out there.
They can only see your parsed PHP... that is, the output of your PHP files. They won't see any of the variables or statements or database connections or anything. Only the output; not the PHP code itself. Same as it would be for any other web-browser. Therefore, just as your hosts re-assured you, if your database connection file doesn't output/echo anything, there'd be nothing for the bots to read.

ALSO... if you password-protect a folder, your server will not give away any of the contents inside; not to bots nor web-browsers nor anyone (unless they have the password, which they won't). The contents will be off limits. Your server will keep them that way.

As far as I know, web-crawlers cannot index your database directly; only the data you output from it via a web-page. But not the database itself. But even if they could, they'd again be stuck without the passwords, which they won't have if you keep 'em safe.

Breathe. Don't worry, everything will be fine.

Quote:
Shaun
|
|
|
|
|
|
#13 | |
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
Quote:
They can read your CSS, but again, most would not execute it as far as I know, nor do they have a reason to. CSS is for aesthetics. Bots don't have sight. They don't care about aesthetics, only data.

They can maybe read the data that make up a JPEG or other image file; but again... Bots don't have eyes. They can't look at the image and see what it is. They can assume sometimes, based on your alt text or the data around it, but they can't see pictures. Only data.

And PHP, as I said, they can only see what is parsed and output by you. They cannot see your code itself. Your server will not give that up. Your server interprets the code and only gives away the parsed data.
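A rough way to picture that text-only view is to strip a page down to its visible text plus image alt text. The Python sketch below does this with the standard `html.parser` module. The sample page is invented, and this is a simplification for illustration, not a description of how any real search engine processes pages.

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Keeps visible text and image alt text; skips <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append("[image: " + alt + "]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)

def crawler_view(html):
    viewer = CrawlerView()
    viewer.feed(html)
    viewer.close()
    return " ".join(viewer.parts)

# Hypothetical page; the server has already run any PHP, so only output remains.
PAGE = """
<html><head><style>h1 { color: red; }</style></head>
<body><h1>Great Church</h1>
<img src="choir.jpg" alt="Choir at the autumn service">
<script>trackVisit();</script>
<p>Sunday service at 10 am.</p></body></html>
"""

print(crawler_view(PAGE))
# Great Church [image: Choir at the autumn service] Sunday service at 10 am.
```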
|
|
|
|
|
|
#14 | ||||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Quote:
I actually did a test this morning, and what you say seems to be true. If I paste in the URL to, say, my database_connection.php file, all that comes up in the browser is a white screen since the file is pure PHP with no associated output.

Quote:
1.) Do you think I can password-protect a file in my Web Root since we have a Shared Plan? Is that something that requires root access at the server level, or is it a fairly common thing for a Web Host to let users do in their accounts?

2.) If I password-protect a directory called, maybe, "Settings", wouldn't that also stop PHP from accessing it? The logic being if the password locks you and me out, wouldn't it break, say, a PHP "include"?

Quote:
Quote:
So one more question... Would it still be prudent to either set up my Robots.txt file to "disregard" directories like...

/database
/images
/includes

and/or password-protect them, or is that not necessary?

Sincerely, Mike
||||
|
|
|
|
|
#15 | ||||||||
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
Quote:
Quote:
It should be in your hosting control panel somewhere. If you can't find it, a quick question to your host's customer service should guide you right to it.

Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Shaun |
||||||||
|
|
|
|
|
#16 | |||||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Quote:
Quote:
Quote:
Were you saying that they cannot access the .htaccess and php.ini files directly through the web browser or something else?

Quote:
(Well, I guess it would be bad news if someone hacked the .htaccess or php.ini but it sounds like that is harder to do.)

Quote:
Sincerely, Mike
|||||
|
|
|
|
|
#17 | |||||||
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
Quote:
Quote:
You'll get more website traffic for your customer. More people will see your site. People searching for your church online will be more likely to find it. Repeat after me, "Indexing is good!"

Quote:
A bot will get the same message.

Quote:
Any web-browser, Firefox, Safari, a webcrawler, will not see those files directly. They can only see what the server serves them. And the server is trained to not serve those files.

Quote:
Quote:
Quote:
Shaun
|||||||
|
|
|
|
|
#18 | |||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Oops. No, sorry, got my acronyms mixed up!
My ISP at home is AT&T and our church's Web Host is GoDaddy.

Quote:
Quote:
www.GreatChurch.org/php.ini or www.GreatChurch.org/robots.txt all of the file contents are displayed. Not sure how bad that is, but it can't be a good idea telling people how your Web Root is set up?!

Quote:
1.) Exposing our Web Root's Directory Structure?
2.) Files that can be "read" when they are browsed (e.g. php.ini, robots.txt, etc.)?

Sincerely, Mike
|||
|
|
|
|
|
#19 |
|
SitePoint Zealot
Join Date: Dec 2008
Location: United Kingdom
Posts: 136
|
Robots.txt shouldn't contain anything important, so don't worry if people can see it. As for php.ini, it's showing nothing private.
|
|
|
|
|
|
#20 | |||
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
Quote:
There's nothing really useful to learn from your robots.txt file other than the names of some of your folders, maybe. As I'd said before, I'd hope that a robots.txt file isn't the only thing standing in the way of your sensitive data! -- They're a waste of time (robots files). Maybe someone more knowledgeable can correct me, but in my view they are pointless.

Quote:
Quote:
The file structure of a server is no secret. That still doesn't benefit someone who has no access.
|||
|
|
|
|
|
#21 | |||
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Quote:
Quote:
Quote:
I guess it sounds like I can "open the flood gates" to the world tomorrow and use GoDaddy's Site Rank Wizard contraption to help get our website indexed with the major search engines and maybe even ranked down the road.

If anyone sees anything that I am doing - or that has been said - as inaccurate or insecure, please speak up. Any coaching on all of this is very welcome!

I'm really tired and off to bed now, but I guess tomorrow I will create a basic Robots.txt file just to eliminate over-indexing of some directories and then click "submit" with GoDaddy and hope things work out as intended?!

Sincerely, Mike
|||
|
|
|
|
|
#22 | ||||
|
The World is Very Sexy
Join Date: Nov 2005
Location: Trinidad
Posts: 2,067
|
Quote:
Quote:
Don't worry, man. It will be fine.

Quote:
Good luck!

Quote:
Shaun
||||
|
|
|
|
|
#23 |
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
Thanks for all of the comments, everyone!
Sincerely, Mike
|
|
|
|
|
#24 |
|
SitePoint Enthusiast
Join Date: Nov 2009
Location: Oregon
Posts: 48
|
You know, I actually have more questions. I am not sure where to post them or if I should just tack them on here since they seem to be related?!
I'll ask here, and if I should start a new thread, just say so, Mods.

On the website for my church, and also for this non-profit group, there will definitely be areas that I do not want people to know about or for Web Crawlers to index... I was wondering if it would be easier if I created a directory structure like this in my Web Root...

WebRoot
WebRoot/Public <--- For everyone to see and to be indexed
WebRoot/MembersOnly <--- Registered members and should not be indexed
WebRoot/Utilities <--- Admin stuff like "config" files, images, includes, etc. that also should not be indexed

Here is my logic...

Web pages like Registration Pages, Password Resets, Log-In, User Preferences, and things which are "member-only" areas do not need to be indexed by Web Crawlers since they are either "utility" type pages that no one would search for in Google, or they are for only members to see, and the members would already know where to look!!

Likewise, "utility" web pages and directories like all of the website's images, include files, configuration files, and Administrator web pages also do not need to be indexed by Google since they really are not for the public's eyes to see.

By segmenting all of our website's files into this structure, it would be very easy to change content from "public" to "private" and vice versa as things change.

Does that seem like a good approach? (I am very big on spending extra time up front organizing things so down the road the website (or whatever) remains orderly and is easier to scale.)

Interested in everyone's feedback!

Sincerely, Mike
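For a layout like this, the matching robots.txt is short enough to generate and sanity-check with a few lines of Python. The sketch below builds the file from a list of private directories and then uses the standard library's `urllib.robotparser` to confirm what a well-behaved crawler would skip. The `ExampleBot` name and example.org URLs are placeholders, and this only covers the indexing side: as noted earlier in the thread, robots.txt is a request to polite crawlers and does not protect anything, so the members-only and admin areas would still need real access control.

```python
from urllib.robotparser import RobotFileParser

# Directories from the proposed layout that should stay out of search indexes.
PRIVATE_DIRS = ["/MembersOnly/", "/Utilities/"]

def build_robots_txt(private_dirs):
    """Build robots.txt text asking all crawlers to skip the given directories."""
    lines = ["User-agent: *"]
    lines += ["Disallow: " + directory for directory in private_dirs]
    return "\n".join(lines) + "\n"

def is_indexable(robots_txt, url, user_agent="ExampleBot"):
    """True if a rule-abiding crawler would be allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots_txt = build_robots_txt(PRIVATE_DIRS)
print(robots_txt)
print(is_indexable(robots_txt, "http://www.example.org/Public/index.php"))       # True
print(is_indexable(robots_txt, "http://www.example.org/MembersOnly/login.php"))  # False
print(is_indexable(robots_txt, "http://www.example.org/Utilities/config.inc.php"))  # False
```

Moving content between "public" and "private" then means moving the files and, if a new private directory appears, adding it to `PRIVATE_DIRS`.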
|
|
|
|
|
#25 |
|
WEBINSANE MEDIA
Join Date: Oct 2005
Location: Montenegro
Posts: 770
|
I strongly suggest organizing your web site through Google Webmaster Tools.
|
|
|
|
All times are GMT -7.