  1. #1
    SitePoint Evangelist spinmaster's Avatar
    Join Date
    Mar 2005
    Posts
    456

    spider crawling & php-include files?

    Hi,

    How does a search engine spider access my .php pages? In particular, how does it handle PHP include files in my pages?

    For example, let's say I am including my header, footer, navigation, etc. via php-includes. Does this really affect the outcome of spider crawling and the overall ranking of my pages?

  2. #2
    SitePoint Guru MikeBigg's Avatar
    Join Date
    Jun 2004
    Location
    Reading, UK
    Posts
    970
    I put my include files in a separate folder which I exclude from the attentions of the well-behaved robots with the robots.txt file.

    Alternatively, I guess they could be placed in a folder above public_html in the directory tree, making them inaccessible to robots and everyone else.

    Mike

  3. #3
    He's No Good To Me Dead silver trophybronze trophy stymiee's Avatar
    Join Date
    Feb 2003
    Location
    Slave I
    Posts
    23,424
    Spiders don't see PHP. They only see the HTML it produces.

  4. #4
    SitePoint Guru MikeBigg's Avatar
    Join Date
    Jun 2004
    Location
    Reading, UK
    Posts
    970
    Quote Originally Posted by stymiee
    Spiders don't see PHP. They only see the HTML it produces.
    Indeed, but if the HTML they produce is just footer, header or column information, or if the included files are templates containing tags like <##page-title##>, is this a good thing or a bad thing?

    In my opinion templates and includes should not be indexed by the search engines, no matter how their content is produced.

    Mike

  5. #5
    He's No Good To Me Dead silver trophybronze trophy stymiee's Avatar
    Join Date
    Feb 2003
    Location
    Slave I
    Posts
    23,424
    Quote Originally Posted by MikeBigg
    Indeed, but if the HTML they produce is just footer, header or column information, or if the included files are templates containing tags like <##page-title##>, is this a good thing or a bad thing?

    In my opinion templates and includes should not be indexed by the search engines, no matter how their content is produced.

    Mike
    It's a good thing, as a webmaster should want all of their content indexed.

    How do you propose to stop them from indexing this content if it is all produced server-side?

  6. #6
    SitePoint Guru MikeBigg's Avatar
    Join Date
    Jun 2004
    Location
    Reading, UK
    Posts
    970
    Well, the content in include files would be indexed when it is included with a page.

    Generally the include files are only partial pages, for example only the bottom part of a page, which won't have any reference to style sheets and certainly won't be a complete HTML document in its own right. I wouldn't want a visitor to arrive at that part of my site having found it in a search engine.

    As for stopping robots indexing the include files ... I already answered this in my first post in this thread. The well-behaved robots will not index the include files if I put them in a folder and ask them not to index that folder using robots.txt. If I really didn't want them to index those files, I could put them outside the public HTML area of the server's directory structure so they can't get to them.

    Mike

  7. #7
    He's No Good To Me Dead silver trophybronze trophy stymiee's Avatar
    Join Date
    Feb 2003
    Location
    Slave I
    Posts
    23,424
    Search engines won't crawl standalone include files because they can't find them. Since the files are included server-side, the bots never see the code, and so never learn where the files are located.

    If you are worried about prying eyes finding your includes, which is a much more valid concern, you should put them in a directory outside your web root. That way they cannot be requested through a web browser or similar means. Putting your includes in a public directory and then listing that directory in your robots.txt is like inviting hackers to your private information (many people use includes for their database connection info, which usually contains logins and passwords).
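
The "outside the web root" advice can be checked mechanically. The following is a minimal sketch in Python (not PHP), assuming a hypothetical layout in which `/home/site/public_html` is the web root; the function name and all paths are illustrative and do not come from the thread.

```python
import posixpath
from pathlib import PurePosixPath


def is_inside_web_root(file_path: str, web_root: str) -> bool:
    """Return True if file_path lies inside web_root after normalising '..' segments."""
    file_norm = PurePosixPath(posixpath.normpath(file_path))
    root_norm = PurePosixPath(posixpath.normpath(web_root))
    try:
        # relative_to raises ValueError when file_norm is not under root_norm
        file_norm.relative_to(root_norm)
        return True
    except ValueError:
        return False


WEB_ROOT = "/home/site/public_html"  # hypothetical web root

# An include kept in a public folder can be requested by URL.
public_include = is_inside_web_root("/home/site/public_html/includes/db.php", WEB_ROOT)

# An include kept beside the web root has no URL at all.
private_include = is_inside_web_root("/home/site/includes/db.php", WEB_ROOT)

print(public_include, private_include)
```

A file for which this returns False can still be pulled in by a server-side script through its filesystem path, but no browser or bot can request it directly.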

  8. #8
    SitePoint Guru MikeBigg's Avatar
    Join Date
    Jun 2004
    Location
    Reading, UK
    Posts
    970
    Agreed.

    Mike

  9. #9
    SitePoint Evangelist spinmaster's Avatar
    Join Date
    Mar 2005
    Posts
    456
    Hey,

    Thanks for all your replies so far! This forum is absolutely great!

    Someone else mentioned this in another forum:

    Bots and spiders read your server's HTML-format output. They don't care and can't detect how you construct that output: you can include() and require() as many files as you like.

    So I take it that only the pure HTML-output is important for SE-spiders...
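
To make that point concrete, here is a rough Python analogy of what a server-side include does (it is not PHP, and the fragment names and the `include` and `render_page` helpers are made up for illustration). The server stitches the fragments together before responding, so the response carries the fragments' content but no trace of where they were stored.

```python
# Hypothetical fragments standing in for files such as a header and a footer include.
FRAGMENTS = {
    "includes/header.html": "<html><head><title>Demo</title></head><body>",
    "includes/footer.html": "<p>Footer text</p></body></html>",
}


def include(path: str) -> str:
    """Return a fragment's content, the way a server-side include inlines a file."""
    return FRAGMENTS[path]


def render_page(body: str) -> str:
    """Assemble the complete response that a browser or crawler receives."""
    return include("includes/header.html") + body + include("includes/footer.html")


response = render_page("<h1>Hello</h1>")
print(response)
```

The printed response contains the footer text, yet the string `includes/` appears nowhere in it, which is why a spider cannot tell that includes were used at all.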

  10. #10
    SitePoint Guru MikeBigg's Avatar
    Join Date
    Jun 2004
    Location
    Reading, UK
    Posts
    970
    Quote Originally Posted by spinmaster
    Bots and spiders read your server's HTML-format output. They don't care and can't detect how you construct that output: you can include() and require() as many files as you like.

    So I take it that only the pure HTML-output is important for SE-spiders...
    They are able to process any text output, whether or not it is well-formed HTML. Google, and probably others, will also index PDF, Word and, I believe, Flash files.

    Mike

  11. #11
    SitePoint Addict Duilen's Avatar
    Join Date
    Jun 2004
    Location
    Mountain View
    Posts
    254
    Quote Originally Posted by MikeBigg
    Well, the content in include files would be indexed when it is included with a page.

    Generally the include files are only partial pages, for example only the bottom part of a page, which won't have any reference to style sheets and certainly won't be a complete HTML document in its own right. I wouldn't want a visitor to arrive at that part of my site having found it in a search engine.

    As for stopping robots indexing the include files ... I already answered this in my first post in this thread. The well-behaved robots will not index the include files if I put them in a folder and ask them not to index that folder using robots.txt. If I really didn't want them to index those files, I could put them outside the public HTML area of the server's directory structure so they can't get to them.

    Mike
    I really don't think it matters if you put your include files on the moon. Bots should not be able to distinguish the content produced by the includes from the rest of your content, since all of that takes place server-side. Therefore they should index that content as part of your pages, even if the include files themselves sit in a folder that you have told them not to crawl through a robots.txt file.
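
The distinction can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot` and the `example.com` URLs below are assumptions made up for illustration. A rule that blocks the includes folder only stops direct requests for URLs inside that folder; it says nothing about pages that pull those files in on the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides an includes folder from well-behaved bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The page that uses the includes is still crawlable...
page_allowed = parser.can_fetch("ExampleBot", "http://example.com/index.php")

# ...while a direct request for the include file is disallowed.
include_allowed = parser.can_fetch("ExampleBot", "http://example.com/includes/footer.php")

print(page_allowed, include_allowed)
```

So the footer's content is still indexed as part of `index.php`. Note also that robots.txt is itself publicly readable, so listing a folder there tells anyone who looks where that folder is.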

