  1. #1
    SitePoint Evangelist vhogarth's Avatar
    Join Date
    Nov 2003
    Location
    Taxachussets
    Posts
    415

    Is this the most efficient way to scrape a page?

    Hey guys,

    I have to write a script to grab a title from a remote page. The script works, but it takes waaay too long. Right now the script runs through a loop of 240 entries, pulling that page each time. Is there a way I can limit how much data gets captured? I know where the keyword is located on the page (near the top), so I don't need to capture the rest. Here's what I have that works:

    $i = 0;
    while ($i < 250 && ($row = mysql_fetch_assoc($result))) { // $result assumed from an earlier query

    $ID = $row['ID'];

    $url = "site.php?ID=$ID";
    $data = implode("", file($url));
    preg_match_all ("/<font size=\"-1\">([^`]*?)<\/font>/", $data, $matches);

    $title = $matches[1][0]; // [1] = the captured text between the tags

    $i++;
    }

    As you can tell, the data I need to grab is between the <font size="-1"> tags and that's it. Is there a way to make preg_match_all stop after the first occurrence? Thanks
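    For reference, a quick sketch of that point (the sample markup here is made up): preg_match(), without the _all, returns as soon as it finds the first match, so it never scans past the title near the top of the page.

    ```php
    <?php
    // Made-up sample of the remote markup described above
    $data = 'junk <font size="-1">First Title</font> ... <font size="-1">Second</font>';

    // preg_match() stops at the first occurrence; $matches[1] holds the captured text
    if (preg_match('/<font size="-1">([^`]*?)<\/font>/', $data, $matches)) {
        $title = $matches[1];
    }
    ```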

  2. #2
    SitePoint Enthusiast spamonkey8's Avatar
    Join Date
    Feb 2006
    Posts
    98
    For the page load itself, the cURL extension is far faster than fopen(). I highly recommend it.

    As far as the scraping goes, strpos() and substr() have much less overhead than a regular expression library; using regex for such a simple task is a waste. Regex was really made for more complicated things.
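    A rough sketch of both suggestions (the function names and URL here are made up, and the curl_* calls need the cURL extension enabled in your PHP build):

    ```php
    <?php
    // Fetch a page with cURL -- generally faster than file()/fopen() over HTTP
    function fetch_page($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on a dead host
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }

    // Pull out the text between two markers with strpos()/substr() -- no regex
    function extract_between($data, $open, $close) {
        $start = strpos($data, $open);
        if ($start === false) return null;
        $start += strlen($open);
        $end = strpos($data, $close, $start);
        if ($end === false) return null;
        return substr($data, $start, $end - $start);
    }

    // Usage:
    // $html  = fetch_page("http://example.com/site.php?ID=$ID");
    // $title = extract_between($html, '<font size="-1">', '</font>');
    ```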

  3. #3
    SitePoint Wizard holmescreek's Avatar
    Join Date
    Mar 2001
    Location
    Northwest Florida
    Posts
    1,707
    It looks like you're pulling values from a MySQL result query. Why not add something like this to your SQL query:

    "select id,name from mytable where xyz=zyx LIMIT 250"

    Just note the LIMIT 250.
    intragenesis, llc professional web & graphic design

  4. #4
    SitePoint Evangelist vhogarth's Avatar
    Join Date
    Nov 2003
    Location
    Taxachussets
    Posts
    415
    Quote Originally Posted by holmescreek
    It looks like you're pulling values from a MySQL result query. Why not add something like this to your SQL query:

    "select id,name from mytable where xyz=zyx LIMIT 250"

    Just note the LIMIT 250.

    I limited it during the loop for the hell of it. I don't want to limit the result set from MySQL because I'm actually doing something else once it reaches 250+. I have to split the results into another file.

  5. #5
    SitePoint Evangelist vhogarth's Avatar
    Join Date
    Nov 2003
    Location
    Taxachussets
    Posts
    415
    Quote Originally Posted by spamonkey8
    For the page load itself, the cURL extension is far faster than fopen(). I highly recommend it.

    As far as the scraping goes, strpos() and substr() have much less overhead than a regular expression library; using regex for such a simple task is a waste. Regex was really made for more complicated things.
    Do I have to download and install the cURL library? I've never used cURL before, so I'm clueless. How would I use strpos() and substr() to get it? Hmm... search for the first string and find its position, then search for the second string and find its position, then parse the stuff between those two locations?

  6. #6
    SitePoint Wizard holmescreek's Avatar
    Join Date
    Mar 2001
    Location
    Northwest Florida
    Posts
    1,707
    I re-read your original post where you are getting titles from "remote" pages. I'm assuming you're using fopen with a URL, i.e. $fp = fopen("http://....

    In this case, it's going to take a while. In addition, you're going to have to check that the remote URL isn't timing out, and either skip it or keep retrying (if the site goes down, your script will be stuck in a loop).

    I would use a cron job that runs nightly to fetch the titles from the remote pages.

    Finally, you're on the right track: you could read, say, 4096 bytes from the file, then use a regex to extract the contents between the <title>Hello World</title> tags. Keep in mind, though, that some sites have a lot of overhead (JavaScript, etc.), so to be safe you should keep doing your fread() until </title> has been read, then process the string. That's faster than reading the entire file.
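    That approach might look something like this (a sketch; the fetch_title() name and the 4096-byte chunk size are just illustrative):

    ```php
    <?php
    // Read a page in 4096-byte chunks, stopping as soon as </title> shows up,
    // rather than pulling down the whole document.
    function fetch_title($url) {
        $fp = @fopen($url, 'r');
        if (!$fp) {
            return null;            // site down or timed out -- skip it
        }
        $data = '';
        while (!feof($fp)) {
            $data .= fread($fp, 4096);
            if (stripos($data, '</title>') !== false) {
                break;              // we have what we need; stop reading
            }
        }
        fclose($fp);
        if (preg_match('/<title>(.*?)<\/title>/is', $data, $m)) {
            return trim($m[1]);
        }
        return null;
    }
    ```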
    intragenesis, llc professional web & graphic design

