Thread: Php crawler

  1. #1
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Php crawler

    I'm trying to build a PHP crawler from scratch and have made it this far:


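    <?php
    // Snoopy.class.php must be in the same directory or on the include path.
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    // fetchlinks() puts the links found on the page into $snoopy->results.
    $snoopy->fetchlinks("http://www.msn.com/");
    print $snoopy->results;

    var_dump($snoopy->results);
    ?>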

    This basically prints all the links to the browser. Now I need to make the program follow all the internal links, so that it crawls the site and indexes all the text. Does anyone have any ideas on how to do that?

  2. #2
    Non-Member
    Join Date
    Oct 2008
    Posts
    372
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Insert all the links into an array; that way you only insert each link once.

    Then use file_get_contents or cURL. The first option is rather slow if it's a big site; assuming you need to search the entire page, you might want to go with cURL instead.
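A minimal sketch of the "insert each link once" idea, assuming the links arrive as a plain array of absolute URLs (which is how `$snoopy->results` appears to behave in this thread). The helper name `uniqueInternalLinks` and the example URLs are hypothetical. It uses the URL as the array key so that duplicates collapse, and it keeps only links on the same host as the start page, since the goal stated above is to follow internal links.

```php
<?php
// Hypothetical helper: keep each link on the start page's host exactly once.
function uniqueInternalLinks(array $links, string $startUrl): array
{
    $startHost = strtolower((string) parse_url($startUrl, PHP_URL_HOST));
    $seen = [];
    foreach ($links as $link) {
        // Drop any #fragment so "page" and "page#top" count as one URL.
        $url = explode('#', (string) $link, 2)[0];
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || strtolower($host) !== $startHost) {
            continue; // relative, malformed, or external link
        }
        // Using the URL as the array key makes duplicates overwrite each other.
        $seen[$url] = true;
    }
    return array_keys($seen);
}

// Placeholder data standing in for $snoopy->results.
$links = [
    'http://example.com/a',
    'http://example.com/a#top',
    'http://other.example.org/x',
    'http://example.com/b',
];
print_r(uniqueInternalLinks($links, 'http://example.com/'));
```

Keying by URL avoids a linear in_array() search on every insert, which may matter once the list grows to thousands of links.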

  3. #3
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by 9three
    Insert all the links into an array; that way you only insert each link once.

    Then use file_get_contents or cURL. The first option is rather slow if it's a big site; assuming you need to search the entire page, you might want to go with cURL instead.

    I think the result ($snoopy->results) comes as an array. However, I couldn't put it in a database directly for some reason.

    Concerning cURL: isn't that a command-line tool? Can I use it on a website, and is it very hard to make it work?

  4. #4
    SitePoint Zealot Gman's Avatar
    Join Date
    Jan 2002
    Location
    Sarasota, FL
    Posts
    154
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    http://www.php.net/manual/en/function.curl-init.php
    You just have to enable the cURL extension in your php.ini.
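For readers who have only met curl as a command-line tool: PHP exposes the same library through the curl_* functions once the extension is enabled. The following is a rough sketch, not a complete HTTP client; the helper name `fetchPage` and the 10-second timeout are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the body of $url, or null on failure.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    if (!extension_loaded('curl')) {
        return null; // the cURL extension is not enabled in php.ini
    }
    $ch = curl_init($url);
    // Return the response body as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects (Location headers).
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Give up if the whole request takes longer than this many seconds.
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    $body = curl_exec($ch);
    // curl_exec() returns false on failure when RETURNTRANSFER is set.
    return is_string($body) ? $body : null;
}

// Usage (performs a real HTTP request, so it is left commented out):
// $html = fetchPage('http://example.com/');
```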

  5. #5
    Floridiot joebert's Avatar
    Join Date
    Mar 2004
    Location
    Kenneth City, FL
    Posts
    823
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You can save yourself some time if you have access to wget on your server. Let wget worry about recursively crawling the pages; while it's crawling, you can work on the part of your application that processes the saved copies of the pages on your server.
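As a sketch of the wget approach, the snippet below only builds the command string; it does not run it. The helper name `buildWgetCommand`, the save directory, the depth of 2 and the one-second wait are all assumptions chosen for illustration. The flags themselves (`--recursive`, `--level`, `--no-parent`, `--wait`, `--directory-prefix`) are standard wget options.

```php
<?php
// Hypothetical helper: build (but do not run) a recursive wget command.
function buildWgetCommand(string $url, string $saveDir, int $depth = 2): string
{
    return sprintf(
        'wget --recursive --level=%d --no-parent --wait=1 --directory-prefix=%s %s',
        $depth,
        escapeshellarg($saveDir), // quote arguments so the shell sees each as one token
        escapeshellarg($url)
    );
}

echo buildWgetCommand('http://example.com/', '/tmp/crawl'), PHP_EOL;

// To actually start the crawl, the string could be passed to exec(),
// provided the host allows shell commands.
```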

  6. #6
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK I got this far:

    <?php
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchlinks("http://www.msn.com/");
    print $snoopy->results;

    $links = $snoopy->results;
    echo $links[0];
    ?>

    Now I need to get the information stored in the array "$links" into a MySQL database. When I tried, it seemed to be empty; at least it did not insert anything into the database.

    Also, when I echo $links[0], for example, I get something like this: arrayhttp://example.com

    Why do I get the word "array" in the printed result, and how do I prevent all the links from going into the database looking like this: arrayhttp://...

  7. #7
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Do var_dump($links) and it will help you determine the structure of the variable.
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.
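A short illustration of what var_dump is likely to reveal here, using a placeholder array in place of `$snoopy->results`. When PHP converts an array to a string it produces the literal word "Array" (and raises an "Array to string conversion" notice or warning), so a `print` of the whole array followed by an `echo` of the first element shows the word glued to the first URL.

```php
<?php
// Stand-in for $snoopy->results; the URLs are placeholders.
$links = ['http://example.com/', 'http://example.com/about'];

// Loop over the elements to print one link per line.
foreach ($links as $index => $link) {
    echo $index, ': ', $link, PHP_EOL;
}

// Or join the elements into a single string explicitly.
$linkstring = implode(',', $links);
echo $linkstring, PHP_EOL;
```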

  8. #8
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by SilverBulletUK
    Do var_dump($links) and it will help you determine the structure of the variable.

    Yes, I managed to fix the problem, and now I only get a lot of links and not the word "array". I converted the array to a string with "implode".

    Now I want to put all the links in the string into a MySQL database, but it still doesn't write anything to the database. I tried this:

    mysql_query ("INSERT INTO testtable(text) VALUES ('$linkstring')");

    However, if I replace the variable with a literal word, the insert works fine.

  9. #9
    Non-Member
    Join Date
    Oct 2008
    Posts
    372
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Remove the single quotes from within the VALUES

  10. #10
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by 9three
    Remove the single quotes from within the VALUES

    Like this?

    mysql_query ("INSERT INTO testtable(text) VALUES ($linkstring)");

    It still doesn't write anything to the database.

  11. #11
    Non-Member
    Join Date
    Oct 2008
    Posts
    372
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Does it throw an error? Use mysql_error() to check for error messages.

    Echo $linkstring to make sure there is something there.

  12. #12
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by 9three
    Does it throw an error? Use mysql_error() to check for error messages.

    Echo $linkstring to make sure there is something there.

    Yes, there are a lot of links in $linkstring; I tried echoing it. Perhaps I'm using mysql_error() wrongly, but as it is I get no error message.

    This is the whole code I use:

    <?php
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchlinks("http://www.msn.com/");

    $links = $snoopy->results;
    //echo $links[0];//

    $linkstring = implode(',' , $links);

    $con = mysql_connect("localhost","root","");
    if (!$con)
    {
        die('Could not connect: ' . mysql_error());
    }

    mysql_select_db("one", $con);

    mysql_query("CREATE TABLE testtable(
        text VARCHAR(100), INDEX (text)
    )")
    or die(mysql_error());

    mysql_query ("INSERT INTO testtable(text) VALUES ($linkstring)");
    mysql_error();

    mysql_close($con);

    ?>

  13. #13
    Non-Member
    Join Date
    Oct 2008
    Posts
    372
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    mysql_query ("INSERT INTO testtable(text) VALUES ($linkstring)") or die(mysql_error());

    Also escape your data before building the query, and keep the return value:

    $linkstring = mysql_real_escape_string($linkstring);

  14. #14
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by 9three
    mysql_query ("INSERT INTO testtable(text) VALUES ($linkstring)") or die(mysql_error());

    also escape your data:

    mysql_real_escape_string($linkstring);

    OK, thanks.

    I get this message now:

    You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'http://www.e24.se/ego/startsidan_s189/http://bors.e24.se/bors24.se/site/overview' at line 1

    It seems there is one link that it can't insert for some reason. I'm using VARCHAR in the MySQL database; it's a long link, but making the VARCHAR column much longer didn't help.

    So I wonder if there may be special characters in the link or something like that. Otherwise, would it be possible to just skip that link and move on to the next one if it's simply too weird?

  15. #15
    SitePoint Zealot Gman's Avatar
    Join Date
    Jan 2002
    Location
    Sarasota, FL
    Posts
    154
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Use the TEXT type instead of VARCHAR for your field.

  16. #16
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Gman
    use text instead of varchar type for your field.

    You mean like this?

    mysql_query("CREATE TABLE testtable(

    thetext text(600), INDEX (thetext)

    There still seems to be something wrong. Do you write it differently if it's TEXT instead of VARCHAR?

  17. #17
    Non-Member
    Join Date
    Oct 2008
    Posts
    372
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Change your data type from varchar to varchar(255).

  18. #18
    SitePoint Enthusiast
    Join Date
    May 2007
    Posts
    61
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Write a function & call it recursively...
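One way the recursive suggestion could look, as a sketch only. The function name `crawl`, the depth limit of 2 and the injected `$getLinks` callable are assumptions; in the thread's setup the callable would presumably wrap Snoopy's fetchlinks(). A shared "visited" list is needed, otherwise two pages that link to each other would recurse forever.

```php
<?php
// Hypothetical recursive crawler. $getLinks is any callable that takes a URL
// and returns the array of URLs found on that page. $visited is passed by
// reference so every level of the recursion shares one "already seen" list.
function crawl(string $url, callable $getLinks, array &$visited, int $maxDepth = 2): void
{
    if ($maxDepth < 0 || isset($visited[$url])) {
        return; // too deep, or this page was already crawled
    }
    $visited[$url] = true;
    foreach ($getLinks($url) as $link) {
        crawl($link, $getLinks, $visited, $maxDepth - 1);
    }
}

// Usage with a fake three-page site instead of real HTTP requests.
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b' => [],
];
$visited = [];
crawl('http://example.com/', function (string $u) use ($site): array {
    return $site[$u] ?? [];
}, $visited);
print_r(array_keys($visited));
```

Injecting the link fetcher keeps the recursion logic testable without network access; on a large site an explicit queue may be preferable to deep recursion.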

  19. #19
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK, I fixed the text part in the database. The problem was that you can't index a TEXT field the way you can a VARCHAR.

    It still doesn't work, though. I get the same problem with the TEXT field as with the VARCHAR.

    I echoed the links, and the problem link is the first one that is wrapped in single quotes (''),
    like this: 'http://example.com/example'

    The error message is: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'http://www.e24.se/ego/startsidan_s189/http://bors.e24.se/bors24.se/site/overview' at line 1

    Does anyone know a solution for this problem?

  20. #20
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I added
    mysql_real_escape_string($linkstring) or die(mysql_error());
    but that didn't help either.

  21. #21
    SitePoint Zealot Gman's Avatar
    Join Date
    Jan 2002
    Location
    Sarasota, FL
    Posts
    154
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Can you paste your class?

  22. #22
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Gman
    can you paste your class

    I obviously didn't write the Snoopy class myself. It's a lot of code, but you can get it here: http://sourceforge.net/projects/snoopy/

  23. #23
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    That error message is the generic one that MySQL returns when the SQL statement you send it is not syntactically correct.

    Try doing this:
    PHP Code:
    $sql = "INSERT INTO testtable (`text`) VALUES ('" . mysql_real_escape_string($linkstring) . "')";

    mysql_query($sql) or die(mysql_error());

    // a line of debug
    echo $sql;
    Now you can see the SQL statement printed on your page, so you can paste it into phpMyAdmin or whatever you use, debug the statement, and figure out where you are going wrong from there.

    You are using an SQL keyword (text) as a column name too, so I showed you how to quote it with backticks to avoid any problem there.

  24. #24
    SitePoint Wizard Blake Tallos's Avatar
    Join Date
    Jun 2008
    Location
    Cuyahoga Falls, Ohio.
    Posts
    1,511
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    What is a PHP Crawler? Can anyone explain to me what it is?
    Blake Tallos - Software Engineer for Sanctuary
    Software Studio, Inc. C# - Fanatic!
    http://www.sancsoft.com/


  25. #25
    SitePoint Member
    Join Date
    Dec 2005
    Posts
    24
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Cups
    That error message is the generic one that MySQL returns when the SQL statement you send it is not syntactically correct.

    Try doing this:
    PHP Code:
    $sql = "INSERT INTO testtable (`text`) VALUES ('" . mysql_real_escape_string($linkstring) . "')";

    mysql_query($sql) or die(mysql_error());

    // a line of debug
    echo $sql;
    Now you can see the SQL statement printed on your page, so you can paste it into phpMyAdmin or whatever you use, debug the statement, and figure out where you are going wrong from there.

    You are using an SQL keyword (text) as a column name too, so I showed you how to quote it with backticks to avoid any problem there.

    You did it! Now at least it writes to the database.

    However, everything goes into the same row. It was an array that I converted (with implode) to a string in which all the links are separated by commas (,).

    I just need each link to go into its own row in the database, and then I should be able to build something to crawl them.
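A possible way to get one row per link is to skip implode() and loop over the array, inserting each element separately. The sketch below swaps in PDO with a prepared statement instead of the mysql_* functions used in this thread, because the mysql extension was deprecated in PHP 5.5 and removed in PHP 7.0. The helper name `saveLinks`, the table `links` and its column `url` are hypothetical.

```php
<?php
// Hypothetical helper: insert each link as its own row using a prepared
// statement. Assumes a table `links` with a single text column `url`.
function saveLinks(PDO $pdo, array $links): int
{
    $stmt = $pdo->prepare('INSERT INTO links (url) VALUES (:url)');
    $count = 0;
    foreach ($links as $link) {
        // The value is bound as a parameter, so quotes inside a URL
        // cannot break the SQL statement.
        $stmt->execute([':url' => $link]);
        $count++;
    }
    return $count;
}

// Usage sketch; the DSN and credentials are placeholders.
// $pdo = new PDO('mysql:host=localhost;dbname=one', 'root', '');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// saveLinks($pdo, $snoopy->results);
```

Setting the error mode to exceptions plays the role that `or die(mysql_error())` played earlier in the thread: a failed insert stops the script with a message instead of failing silently.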

