  1. #1
    SitePoint Zealot
    Join Date
    Jun 2006
    Posts
    186
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Using libcurl to scrape contents of each link.

    I've used libcurl to screen scrape before, but now I need to use libcurl and a regex to match specific links on a page, and then scrape content from the pages those links point to. Hope that makes sense. Is it possible to embed one libcurl request within another? If anyone can point me to an example, that would be much appreciated.

    So my scraping php file generally starts with:
    $ch = curl_init("http://www.thewebsite.com/");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    But I won't know in advance the addresses that the links on this website point to. So I'd have to scrape those and then feed each one into another libcurl request (and there could be any number of links). Thanks in advance!

  2. #2
    php_daemon
    Join Date
    Mar 2006
    Posts
    5,284
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
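    PHP Code:
    //...
    // $response holds the HTML returned by the first curl_exec() call.
    // Capture the href value of every matching anchor into $matches[1].
    preg_match_all('#<a href="(.+?)">.+?</a>#i', $response, $matches);

    foreach ($matches[1] as $url) {
        // A fresh handle per link; no nesting of handles is needed.
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        curl_close($ch);

        // get and process the contents of $url
        //...
    }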
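
    Two things tend to trip up a loop like this: the pattern only matches anchors written exactly as <a href="...">, and many hrefs are relative, which curl_init() cannot fetch on its own. Below is a minimal sketch of one way to handle both, not a definitive implementation. The function names (extract_links, resolve_url, fetch, scrape_linked_pages), the 15 second timeout, and the decision to follow redirects are all assumptions made for illustration. resolve_url() is deliberately simplified: it does not collapse ../ segments or honour a <base> tag.

```php
<?php
// Hypothetical helper: collect href values from anchor tags, tolerating
// extra attributes and either quote style. Still a regex, so it can be
// fooled by unusual markup.
function extract_links(string $html): array
{
    preg_match_all('#<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1#is', $html, $matches);
    return array_values(array_unique($matches[2]));
}

// Hypothetical helper: turn an href into an absolute URL using the page it
// was found on. Simplified: "../" segments are not collapsed.
function resolve_url(string $base, string $href): string
{
    if ($href === '') {
        return $base;
    }
    // Already absolute (has a scheme such as http:): use as-is.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }
    $parts = parse_url($base);
    // Protocol-relative link (//host/path): borrow the base scheme.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Root-relative link (/path).
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Document-relative link: append to the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// Hypothetical helper: one cURL request, returning the body or false.
function fetch(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // assumption: follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumption: 15 second cap
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Fetch the start page, then fetch every page it links to.
// Returns an array mapping absolute URL => page body.
function scrape_linked_pages(string $start): array
{
    $pages = [];
    $index = fetch($start);
    if ($index === false) {
        return $pages;
    }
    foreach (extract_links($index) as $href) {
        $url  = resolve_url($start, $href);
        $body = fetch($url);
        if ($body !== false) {
            $pages[$url] = $body;
        }
    }
    return $pages;
}
```

    Calling scrape_linked_pages("http://www.thewebsite.com/") would then give one array entry per linked page, which can be processed however is needed. The requests run one after another; for a large number of links, the curl_multi_* functions are worth a look.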

    Saul

