  1. #1
    SitePoint Enthusiast
    Join Date
    Apr 2007
    Posts
    63

    what would be the best way to code this?

    hello everyone,

    I am trying to find the best way to achieve this:
    I need to create a web crawler that runs continuously: it crawls a page, collects the links on it, crawls those links, and then moves on to the links found on those pages. I was wondering whether PHP would be a good choice for this. Can a PHP script be coded to run forever (maybe by setting the timeout to infinite)? Would that be a good idea? Are there better options for this kind of thing?

    LAMP

  2. #2
    SitePoint Zealot
    Join Date
    Mar 2007
    Posts
    196
    Use a cron job to execute a PHP script every X amount of time. Use file_get_contents to get the contents of the page to crawl, use a link regex to pull the links out of the contents (have a look here http://regexlib.com/Search.aspx?k=link), then put the new links in a database.

    On each run, the cron job could process a limited number of links from the database and add any newly found links back to it. That is basically a queue that keeps growing as links are processed, so the script continues to crawl without crawling too much at one time. Limiting the number of links per run keeps it all automated without crashing your server. I have never tried this, but I believe it would probably be a good way to do it.

    Actually, don't remove links from the database. Just have a column called date, and when you crawl a link, store a timestamp of when it was crawled. This way you don't end up crawling the same links over and over again (the script has to check the timestamp), and you could also write the script so that a link is crawled again if it has been over X amount of time since it was last crawled.
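
The timestamp-based queue described above can be sketched roughly as follows. The sketch is in Python with SQLite rather than PHP, purely for illustration; the table name `links`, the column `crawled_at`, the function names, the batch size of 10, and the 24-hour recrawl interval are all assumptions, not something from this thread. `fetch_links` stands in for whatever code downloads a page and returns the links found on it.

```python
import sqlite3
import time

RECRAWL_AFTER = 24 * 60 * 60  # assumed recrawl interval in seconds ("X amount of time")
BATCH_SIZE = 10               # assumed cap on links crawled per cron run


def open_queue(path=":memory:"):
    """Open (or create) the link table; crawled_at is NULL until first crawl."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " crawled_at REAL)"
    )
    return db


def add_links(db, urls):
    """Queue new links; INSERT OR IGNORE leaves already-known links untouched."""
    db.executemany(
        "INSERT OR IGNORE INTO links (url, crawled_at) VALUES (?, NULL)",
        [(u,) for u in urls],
    )
    db.commit()


def next_batch(db, now, limit=BATCH_SIZE, recrawl_after=RECRAWL_AFTER):
    """Return links never crawled, or last crawled more than recrawl_after ago."""
    rows = db.execute(
        "SELECT url FROM links"
        " WHERE crawled_at IS NULL OR crawled_at < ?"
        " ORDER BY crawled_at IS NOT NULL, crawled_at, url"
        " LIMIT ?",
        (now - recrawl_after, limit),
    ).fetchall()
    return [row[0] for row in rows]


def mark_crawled(db, url, now):
    """Record the crawl time instead of deleting the row."""
    db.execute("UPDATE links SET crawled_at = ? WHERE url = ?", (now, url))
    db.commit()


def run_once(db, fetch_links, now=None):
    """One cron-style run: crawl a limited batch and queue whatever it finds."""
    now = time.time() if now is None else now
    for url in next_batch(db, now):
        found = fetch_links(url)   # hypothetical callable: url -> list of links
        mark_crawled(db, url, now)
        add_links(db, found)
```

A cron entry would then call `run_once` on a file-backed database every few minutes, so no single process has to run forever.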
    Kayzio - We don't hesitate, we accelerate.

  3. #3
    SitePoint Evangelist Andrewaclt's Avatar
    Join Date
    Dec 2003
    Location
    Raleigh, NC
    Posts
    535
    I personally wouldn't use PHP for this; I would go with Perl or Python. Both languages have plenty of freely available classes and libraries that do a lot of the hard work for you.
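
As one possible illustration of existing classes doing the hard work, Python's standard library ships `html.parser.HTMLParser` and `urllib.parse.urljoin`, which can pull links out of a page without a hand-written regex. This is only a minimal sketch: the class name `LinkCollector` and the function `extract_links` are made up for the example, and skipping non-HTTP links is an assumption about what a crawler would want.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            # Resolve relative links against the page URL and drop "#fragment".
            absolute, _fragment = urldefrag(urljoin(self.base_url, value))
            is_web_link = absolute.startswith(("http://", "https://"))
            if is_web_link and absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The page text itself could be fetched with `urllib.request.urlopen`, which plays the role that file_get_contents plays in PHP.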

