Issues with the Simple HTML DOM Parser library

So I’m trying to learn the Simple HTML DOM Parser library. There’s a tutorial ( http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/ ), which might be relevant to the issue because I pretty much copy/pasted the source code I downloaded from that site.

I’m trying to parse through the example website from the tutorial (the URL is in the code file if you need to see it), and when I run the code I get these errors:


Notice: Trying to get property of non-object in C:\xampp\htdocs\parse\index.php on line 30

Fatal error: Call to a member function first_child() on a non-object in C:\xampp\htdocs\parse\index.php on line 31

Lines 30-31:

$articles[] = array($post->children(3)->outertext, //line 30 error here
                $post->children(6)->first_child()->outertext); //line 31 error here

So it’s saying $post or children(3) isn’t an object. Now, because I copy/pasted this code from the tutorial, I suspect that whatever is at children(3) might not exist, or something to that effect. But I don’t know how to begin troubleshooting for stuff like that. I feel like if I can figure this out, I can mess around with the tutorial and work out how to parse other pages. Eventually I want to use this library to write a PHP program that parses NBA.com box scores and grabs all the stats.
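For what it’s worth, this is the kind of check I was thinking of adding inside the loop to see what each $post actually contains (just a guess on my part; I’m assuming children() with no arguments returns the full array of child nodes):

foreach($items as $post) {
    # print the index, tag name, and text of every child node
    # so I can see which index the title actually lives at
    foreach($post->children() as $i => $child) {
        echo $i . ': <' . $child->tag . '> ' . trim($child->plaintext) . "\n";
    }
}

Is that a sane way to poke at it?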

index.php

<?php

    # don't forget the library
    include('simple_html_dom.php');

    # this is the global array we fill with article information
    $articles = array();

    # passing in the first page to parse, it will crawl to the end
    # on its own
    getArticles('http://net.tutsplus.com/page/78/');



function getArticles($page) {
    global $articles;

    $html = new simple_html_dom();
    $html->load_file($page);

    $items = $html->find('div[class=preview]');

    foreach($items as $post) {
        # remember comments count as nodes
        $articles[] = array($post->children(3)->outertext, //line 30 error here
                $post->children(6)->first_child()->outertext); //line 31 error here
    }

    # let's see if there's a next page
    if($next = $html->find('a[class=nextpostslink]', 0)) {
        $URL = $next->href;
        echo "going on to $URL <<<\
";
        # memory leak clean up
        $html->clear();
        unset($html);

        getArticles($URL);
    }
}

?>


<html>
<head>
    <style>
        #main {
            margin: 80px auto;
            width: 600px;
        }
        h1 {
            font: bold 20px/30px verdana, sans-serif;
            text-decoration: none;
        }
        p {
            font: 10px/14px verdana, sans-serif;
        }
    </style>
</head>
<body>
    <div id="main">
<?php
    foreach($articles as $item) {
        echo $item[0];
        echo $item[1];
    }
?>
    </div>
</body>
</html>

Hi Jeff,

Since that tutorial was published, nettuts has changed the layout of its article listing page, so when the script tries to parse http://net.tutsplus.com/page/78/, the markup is different from what the code expects.


$articles[] = array($post->children(3)->outertext, //line 30 error here
                $post->children(6)->first_child()->outertext); //line 31 error here

The line $post->children(3)->outertext is supposed to get the title of each article, but the title is now the first child, not the third. The next line should get the article preview, but the article list no longer includes preview text, for some reason.
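Something along these lines should work against the current markup (untested on my end, and it assumes the title really is the first child now):

foreach($items as $post) {
    # the title is now the first child; the preview text is gone,
    # so store an empty placeholder in its place
    if($title = $post->children(0)) {
        $articles[] = array($title->outertext, '');
    }
}

The if guard means a future layout change will just skip posts instead of throwing notices.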

Ok, so just comment out the second part and change the first index to 0?

I had a feeling it was something like that, because I didn’t see the comment nodes or anything else the tutorial mentions.

Would you happen to know of other good tutorials, or perhaps even books on the subject? I might be able to figure it out without more help, but obviously web scraping is going to be a big part of my project.

The Simple HTML DOM docs are actually pretty good. There are plenty of examples, so it should get you headed in the right direction.
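For example, the quick-start pattern from the docs goes roughly like this (from memory, so double-check the details against the docs):

include('simple_html_dom.php');

# build a DOM from a URL and list every link on the page
$html = file_get_html('http://net.tutsplus.com/');
foreach($html->find('a') as $element) {
    echo $element->href . "\n";
}
$html->clear();

Once you're comfortable with find() and the element properties (href, src, plaintext, outertext), scraping the box scores is mostly a matter of picking the right selectors.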