  1. #1
    SitePoint Evangelist
    Join Date
    Jan 2005
    Posts
    425

    "Scraping content" from my other standalone site

    I want to scrape some specific content from my other HTML site.

    I want all the code between, and including, a DIV with class="recommended".

    I have done the following:
    PHP Code:
    $target_url = "http://www.mysite.com";
    ..
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
        exit;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') == "recommended") {
            $recommended_div = $div->nodeValue . "";
        }
    }
    echo $recommended_div; // Just outputs ALL the scraped text, but all HTML tags are stripped. (Which I don't want to happen). I want all text and HTML tags to be in place.
    However, this only gets the visible text. It does not fetch the HTML tags inside the DIV, such as links, other DIVs, images, etc.

    How can I get ALL the content within the DIV (including the DIV itself), and not just the text?
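One possible way to keep the markup, sketched here under assumptions: `nodeValue` only returns text content, whereas `DOMDocument::saveHTML()` accepts an optional node argument (PHP 5.3.6 or later) and serializes that node together with its descendants. The sample markup and the `$recommended_html` variable below are hypothetical stand-ins for the fetched page, not taken from the real site.

```php
<?php
// Hypothetical sample markup standing in for the page fetched with cURL.
$html = '<html><body>'
      . '<div class="recommended"><div class="item"><a href="/a">A</a></div></div>'
      . '<p>Other</p>'
      . '</body></html>';

$dom = new DOMDocument();
// Collect parser warnings quietly instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$recommended_html = '';
foreach ($dom->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'recommended') {
        // Passing the node serializes the DIV itself plus everything inside it.
        $recommended_html = $dom->saveHTML($div);
        break; // Stop at the first match.
    }
}

echo $recommended_html;
```

The exact attribute comparison only matches when the class attribute is exactly "recommended"; a DIV carrying several classes would need a looser check. The serialized output is whatever the parser rebuilt, so whitespace may differ slightly from the original source.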

    NB: The "recommended" DIV contains a number of nested DIVs, so I can't work out how to do it with regular expressions either.
    PHP Code:
    preg_match_all('/<div class="recommended">(.*?)<\/div>/s',...... 
    ...fails due to the nested divs.
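To illustrate why that pattern falls short, here is a small sketch on hypothetical markup (the `$html` string and the use of `preg_match` rather than `preg_match_all` are assumptions made for the demo). The lazy `(.*?)` stops at the first closing `</div>`, which belongs to the inner DIV, so the capture is cut off early.

```php
<?php
// Hypothetical markup: a "recommended" DIV with one nested DIV inside it.
$html = '<div class="recommended"><div class="inner">one</div><p>two</p></div>';

// Same pattern as in the post; the lazy group ends at the FIRST closing div tag.
preg_match('/<div class="recommended">(.*?)<\/div>/s', $html, $m);

echo $m[1]; // prints: <div class="inner">one
```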
    Last edited by LuckyB; May 6, 2008 at 19:40.

  2. #2
    catweasel (SitePoint Evangelist)
    Join Date
    Apr 2007
    Location
    Goldfields, VIC, Australia
    Posts
    518
    Seems like a lot of trouble for something that could be done any number of other ways. Since both sites are yours you could, at the most basic level, just copy and paste the content. If the content is updated frequently, both sites could pull the same content from the same database. You could put that particular section of content in a separate include file which both sites could grab. You could deliver the content as an XML/web service so both sites could treat the content in their own way.

  3. #3
    SitePoint Evangelist
    Join Date
    Jan 2005
    Posts
    425
    Both sites are mine, but they are on different servers, and yes, the content is dynamic.

    I can't get the code any other way than scraping. It isn't just the data I want either, so I can't just connect to the DB and pull it. I want the formatting as well, which is done by the template.

    So scraping is the only way.

  4. #4
    Cups (SitePoint Wizard)
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    You will have to look carefully through the document for some other unique identifier that marks the end of the nested divs.

    Seeing as you are the one generating that content, why can't you output some unique marker into the HTML of the target site? Like:

    <!--ENDOFRECOMMENDEDDIV-->
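A minimal sketch of how such a marker could be used, assuming the target site prints the comment immediately after the closing tag of the "recommended" DIV. The sample markup and the variable names are hypothetical; in practice `$html` would be the string returned by cURL.

```php
<?php
// Hypothetical page source; the marker sits right after the DIV closes.
$html = '<body><div class="recommended"><div>one</div><div>two</div></div>'
      . '<!--ENDOFRECOMMENDEDDIV--><p>rest</p></body>';

$start_tag  = '<div class="recommended">';
$end_marker = '<!--ENDOFRECOMMENDEDDIV-->';

// Find where the DIV opens, then the first marker after that point.
$start = strpos($html, $start_tag);
$end   = ($start === false) ? false : strpos($html, $end_marker, $start);

$recommended_html = '';
if ($start !== false && $end !== false) {
    // Slice from the opening tag up to (not including) the marker.
    $recommended_html = substr($html, $start, $end - $start);
}

echo $recommended_html;
// prints: <div class="recommended"><div>one</div><div>two</div></div>
```

Plain string slicing avoids parsing the HTML at all, so nesting depth does not matter, but it depends on the opening tag being written exactly as in `$start_tag`.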

