  1. #1
    SitePoint Wizard
    Join Date
    Mar 2002
    Location
    Bristol, UK
    Posts
    2,240
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    DOMDocument::LoadHTMLFile problem

    I've been playing around with some code, adapted from the code SilverBulletUK gave me in a previous post, hoping to load an HTML document into a DOMDocument object.

    According to the manual, "Unlike loading XML, HTML does not have to be well-formed to load." -- however, loading this page using the LoadHTMLFile function is throwing the following warning:

    Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Opening and ending tag mismatch: td and font in http://news.google.com/news?ie=utf8&...gs&hl=en&gl=us, line: 25 in /home/.../test4.php on line 5
    Here is my code:

    PHP Code:
    <?php
    if ($_SERVER['REQUEST_METHOD'] == 'POST') {

        $oDOMDoc = new DOMDocument();

        $oDOMDoc->loadHTMLFile($_POST['url']);

        $oNodeList = $oDOMDoc->getElementsByTagName('link')
            or die("No Elements Found");

        $aLinkHrefs = array();
        foreach ($oNodeList as $oLinkNode)
        {
            if ($oLinkNode->hasAttribute('href'))
            {
                if (strlen($oLinkNode->getAttribute('href')) > 0)
                {
                    $aLinkHrefs[] = $oLinkNode->getAttribute('href');
                }
            }
        }
    }

    echo '<form action="" method="post">';
    echo '<input type="text" name="url" id="url" style="width:700px;" />';
    echo '<input type="submit" value="Submit" />';
    echo '<textarea style="width:800px;height:600px;display:block;">';
    if (is_array($aLinkHrefs)) { print_r($aLinkHrefs); }
    echo '</textarea>';
    echo '</form>';
    ?>
    Furthermore, the $aLinkHrefs array is empty despite there being two <link> elements with href attributes in the source for the page I'm loading.

    Any help would be much appreciated

  2. #2
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    PHP Code:
    <?php
    $oNodeList = $oDOMDoc->getElementsByTagName('link');
    ?>
    'link' should be the name of the tag to find; for hyperlinks in XHTML, that is 'a'.
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.

  3. #3
    SitePoint Wizard
    Join Date
    Mar 2002
    Location
    Bristol, UK
    Posts
    2,240
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Actually I'm looking for these elements:

    HTML Code:
    <link rel="alternate" type="application/rss+xml" title="RSS - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=us&nolr=1&output=rss">
    <link rel="alternate" type="application/atom+xml" title="ATOM - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=us&nolr=1&output=atom">
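If only the feed links are wanted, rather than every <link> on the page, an XPath predicate on the rel attribute can narrow the selection. The following is a minimal sketch, assuming the markup parses cleanly; the inline HTML string and the example.com URLs are hypothetical stand-ins for the fetched page.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<html><head>'
      . '<link rel="stylesheet" type="text/css" href="/style.css">'
      . '<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/feed.rss">'
      . '<link rel="alternate" type="application/atom+xml" title="Atom" href="http://example.com/feed.atom">'
      . '</head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$feeds = array();

// Select only <link> elements whose rel attribute is exactly "alternate",
// then index each href by its declared MIME type.
foreach ($xpath->query('//link[@rel="alternate"]') as $link) {
    $feeds[$link->getAttribute('type')] = $link->getAttribute('href');
}

print_r($feeds);
```

The stylesheet link is skipped because its rel value does not match the predicate.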

  4. #4
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    I guess I should keep my mouth shut!

    Just having a go at getting the link tags from jQuery.com and getting nowhere too...

    I'm sure I'm missing something!

    EDIT:

    This works fine...
    PHP Code:
    <?php
    $sHTML = file_get_contents('http://jquery.com/');
    $oDocument = new DOMDocument();
    @$oDocument->loadHTML($sHTML);
    $oNodes = $oDocument->getElementsByTagName('link');
    if ($oNodes->length > 0)
    {
        foreach ($oNodes as $oNode)
        {
            if ($oNode->hasAttribute('href'))
            {
                echo $oNode->getAttribute('href') . '<br />';
            }
        }
    }
    ?>

  5. #5
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    As does this:-
    PHP Code:
    <?php
    $sHTML = file_get_contents('http://jquery.com/');
    $oDocument = new DOMDocument();
    @$oDocument->loadHTML($sHTML);
    $oXPath = new DOMXPath($oDocument);
    $oQuery = $oXPath->query("head/link//@href");
    $aLinks = array();
    if ($oQuery->length > 0)
    {
        foreach ($oQuery as $oNode)
        {
            $aLinks[] = $oNode->nodeValue;
        }
    }
    print_r($aLinks);
    /*
    Array
    (
        [0] => http://static.jquery.com/files/rocker/css/reset.css
        [1] => http://static.jquery.com/files/rocker/css/screen.css
        [2] => http://jquery.com/blog/feed/
        [3] => http://static.jquery.com/favicon.ico
    )
    */
    ?>
    As you're receiving warnings about improper HTML, it indicates you're obtaining your targeted data; the parser could not complain about markup it had not fetched.

    I'm flummoxed I'm afraid.

  6. #6
    SitePoint Member
    Join Date
    Oct 2008
    Posts
    10
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This issue is not your fault. I've used DOMDocument to scan numerous pages and had these warnings thrown all over the place. They occur when the target HTML page contains a syntax error, such as a tag that isn't closed properly. It is only a warning, although elements that aren't closed correctly may not be read properly.

    The script still works and I was able to get the following output:

    PHP Code:
    Array
    (
        [0] => http://news.google.com/news?ie=utf8&...gs=&hl=en&gl=us&output=rss
        [1] => http://news.google.com/news?ie=utf8&...gs=&hl=en&gl=us&output=atom
    )

    Try the same script on a page that validates (try one in my sig *pats himself on the back for finally finding a use for code validation*) and you'll see that no errors are thrown.

    I would suggest adding the following to the top of your script so the warnings aren't displayed. They're inevitable, as most people don't validate their markup.

    PHP Code:
    error_reporting(0); 
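A narrower alternative to switching off all error reporting is to have libxml collect its parse errors internally, so they can be inspected or discarded without silencing unrelated errors. The following is a minimal sketch using libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors(); the malformed markup and the example.com URL are hypothetical, and which errors (if any) are reported depends on the installed libxml2 version.

```php
<?php
// Hypothetical malformed markup: <font> is never closed before </td>.
$html = '<html><head><link rel="alternate" href="http://example.com/feed.rss"></head>'
      . '<body><table><tr><td><font>text</td></tr></table></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

// Report whatever the parser complained about, one line per error.
foreach ($errors as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . "\n";
}

// The document is still usable despite the malformed body.
$hrefs = array();
foreach ($doc->getElementsByTagName('link') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

Restoring the previous setting afterwards keeps the change local to this one load, which is the same idea as saving and restoring the error_reporting level.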

  7. #7
    Salathe (@php.net)
    Join Date
    Dec 2004
    Location
    Edinburgh
    Posts
    1,398
    Mentioned
    65 Post(s)
    Tagged
    1 Thread(s)
    It doesn't sit very nicely with me, but I generally temporarily turn off error_reporting when loading in HTML documents.

    PHP Code:
    $ER = error_reporting(0);
    // Instance call: current PHP versions no longer allow calling loadHTMLFile() statically.
    $doc = new DOMDocument();
    $doc->loadHTMLFile('http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=us');
    error_reporting($ER);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//link/@href');
    $hrefs = array();
    foreach ($links as $link)
    {
        $hrefs[] = $link->value;
    }

    var_dump($hrefs);
    Salathe
    Software Developer and PHP Manual Author.

  8. #8
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Salathe
    It doesn't sit very nicely with me, but I generally temporarily turn off error_reporting when loading in HTML documents.

    PHP Code:
    $ER = error_reporting(0);
    // Instance call: current PHP versions no longer allow calling loadHTMLFile() statically.
    $doc = new DOMDocument();
    $doc->loadHTMLFile('http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=us');
    error_reporting($ER);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//link/@href');
    $hrefs = array();
    foreach ($links as $link)
    {
        $hrefs[] = $link->value;
    }

    var_dump($hrefs);
    You've beaten me to the punch, Salathe; that XPath expression was bothering me all the way through my dinner!

    I had to dig a book out to find the proper expression, then lo and behold...


  9. #9
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Just use @ to suppress the warning messages.

  10. #10
    SitePoint Wizard
    Join Date
    Mar 2002
    Location
    Bristol, UK
    Posts
    2,240
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    SilverBulletUK, tried both your code samples and they worked fine with the JQuery URL, no such luck with my Google Alerts URL though -- blank page served for the first, empty array generated by the second.

    Salathe, tried yours as well, getting an empty array as well.

    This is confusing the hell out of me!

  11. #11
    AnthonySterling (Twitter: @AnthonySterling)
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    OK, I've sussed it. It appears the markup served by Google is fubar'd.

    With a little coercion from us, we can grab the markup, then have PHP make a best guess at repairing it by telling the DOMDocument to output it as XML.

    We then load the nice new PHP-cleaned markup into a new document for querying. Yay!

    I feel quite shifty indeed.

    PHP Code:
    <?php
    $oDOMDocument = new DOMDocument();
    $oDOMDocument->formatOutput = true;
    @$oDOMDocument->loadHTML(file_get_contents('http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=us'));

    // Serialise the repaired tree as XML, then reparse it into a fresh document.
    // Instance call: current PHP versions no longer allow calling loadXML() statically.
    $sCleanXML = $oDOMDocument->saveXML();
    $oDOMDocument = new DOMDocument();
    $oDOMDocument->loadXML($sCleanXML);

    $oXPath = new DOMXPath($oDOMDocument);
    $links = $oXPath->query('//link/@href');
    $hrefs = array();
    foreach ($links as $link)
    {
        $hrefs[] = $link->value;
    }
    print_r($hrefs);
    /*
    Array
    (
        [0] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=us&nolr=1&output=rss
        [1] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=us&nolr=1&output=atom
    )
    */
    ?>
    Although I'm unsure why you're not just appending '&output=rss' or '&output=atom' to the original URL...
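The suggestion of appending the output parameter directly can be sketched as follows. The feedUrl() helper is hypothetical and introduced only for illustration; whether the service accepts the parameter on a given URL is an assumption taken from the feed links quoted in this thread.

```php
<?php
// Hypothetical helper: append an "output" parameter to a search URL,
// using "?" when the URL has no query string yet and "&" otherwise.
function feedUrl($searchUrl, $format)
{
    $separator = (strpos($searchUrl, '?') === false) ? '?' : '&';
    return $searchUrl . $separator . 'output=' . rawurlencode($format);
}

$base = 'http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=us';

echo feedUrl($base, 'rss') . "\n";
echo feedUrl($base, 'atom') . "\n";
```

This skips fetching and parsing the HTML page altogether, at the cost of relying on the URL scheme staying the same.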

