  1. #1
    SitePoint Member

    DOMDocument vs Curl plain text

    Hi,

    I'm experimenting with data scraping, and I'm wondering whether DOMDocument can load a page as plain text instead of HTML, or whether cURL is the best way to do this. DOMDocument loads images etc. when using loadHTMLFile(), which is very slow when you're processing a few pages at the same time. Is there a way to ignore images so they won't slow down the process? I know I can strip the tags afterwards, but that's after the fact: by then they've already been loaded and have slowed down the processing.

    Thank you!

  2. #2
    SitePoint Guru aamonkey
    DOMDocument does not "load images". It simply pulls all the HTML/XML from a page and puts it into a document tree.
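
    A minimal sketch of what that means in practice. The markup and the image URL below are made up for illustration; the point is that the <img> ends up as an ordinary node whose src is just an attribute string, and the plain text can be read straight off the tree.

```php
<?php
// Hypothetical markup; the image URL is never requested by DOMDocument.
$html = '<html><body><h1>Title</h1>'
      . '<img src="http://example.com/big.jpg" alt="x">'
      . '<p>Some text</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// Each <img> is only a node in the tree; its src is a plain attribute value.
$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

// textContent concatenates the text nodes under <body>, with no tags.
$text = $dom->getElementsByTagName('body')->item(0)->textContent;

print_r($sources);
echo $text, "\n";
```

    So there is nothing to "ignore": if you only want the text, read textContent and leave the image nodes alone.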
    aaron-fisher.com - PHP articles and more

  3. #3
    SitePoint Member
    Ok thanks for that. Why would it be slow to load several pages at once? Is it when I'm parsing the HTML out that the images are loaded?

  4. #4
    SitePoint Guru aamonkey
    Originally posted by LindenWalsh:
    Ok thanks for that. Why would it be slow to load several pages at once? Is it when I'm parsing the HTML out that the images are loaded?
    Probably because the servers you are hitting are slow. The images are never "loaded" by DOMDocument. If you are taking the URLs from the image tags and downloading the files to your server, that might take some time. And if you are outputting the image tags to a browser, then of course the browser will need to request each of the images from the server.
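
    If slow remote servers are the bottleneck, one option (not something suggested earlier in this thread) is to fetch the pages concurrently with PHP's curl_multi functions, so the waits overlap instead of adding up. A rough sketch: fetch_all() is a hypothetical helper name, and error handling is deliberately left out.

```php
<?php
// Hypothetical helper: fetch several URLs at once, return bodies keyed by URL.
// Assumes the curl extension is loaded; failed transfers are not detected here.
function fetch_all(array $urls, int $timeout = 30): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $bodies;
}
```

    Each returned body can then be passed to loadHTML() one at a time; the parsing itself is normally quick compared with the network wait.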

  5. #5
    SitePoint Enthusiast
    I think it would be better to fetch the page yourself with cURL and pass the result to loadHTML() instead of using loadHTMLFile().

    HTML pages often contain MANY errors, so you should suppress the warnings the parser raises for them.
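
    Besides the @ operator, libxml offers its own switch for this. A small sketch, assuming you would rather collect the parser warnings than discard them; the broken markup is made up, and which warnings you actually get depends on the libxml version installed.

```php
<?php
// Hypothetical broken markup, for illustration only.
$html = '<div><p>Unclosed paragraph<span>stray</div><foo>unknown tag</foo>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // free the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

echo $loaded ? 'parsed' : 'failed', ' with ', count($errors), " warning(s)\n";
foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
```

    This keeps genuine PHP errors visible while still silencing the noise from invalid HTML.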

    Set appropriate cURL options so that the pages are fetched as quickly as possible.

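    PHP Code:
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');   // accept compressed responses
    curl_setopt($ch, CURLOPT_FAILONERROR, true);          // treat HTTP codes >= 400 as failures
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the page instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, false);              // was true, which would prepend the response headers to $html
    curl_setopt($ch, CURLOPT_NOBODY, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);                // give up after 30 seconds
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');

    if ( ! ($html = curl_exec($ch)))
    {
        // Request failed (or returned an empty body): show the error and transfer details
        echo curl_error($ch).'<pre>'.print_r(curl_getinfo($ch), true).'</pre>';
    }
    else
    {
        curl_close($ch);

        $dom = new DOMDocument;

        if (@$dom->loadHTML($html)) // suppress warning errors about invalid HTML
        {
            // (the code in the original post is cut off here)
        }
    }
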