simpleXML Data

This is a more complicated approach than I would normally like, but I have a PHP page that I am opening read-only with fopen. This page is going to be created by a user and is a news article. I will then have the user go to another page and enter the URL of the recently created article, and it will send out a mass e-mail. Please do not bother asking why I do not combine them.

I need to scan the PHP page for certain elements. They will all be within a certain parent HTML element (100% sure about this), although the formatting can obviously differ depending on how the user creates it. I came across simpleXML and thought it could be useful, particularly xpath. It appears to be for XML only though?

http://php.net/manual/en/simplexml.examples-basic.php

Is it possible to read my opened file for particular data (HTML)?

It would certainly work if your HTML conforms to the XML rules.

nevertheless, you can load HTML explicitly with DOMDocument. it even has a dedicated method for that.
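A minimal sketch of that dedicated method, DOMDocument::loadHTML(). The sample markup is made up for illustration: it has an unclosed <p> and a bare <br>, which an XML parser would reject but the HTML loader tolerates.

```php
<?php
// Hypothetical sample markup, not taken from the article page.
$html = '<div id="news"><p>First paragraph<br><p>Second paragraph</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // parses HTML, so it does not have to be well-formed XML
libxml_clear_errors();

$paragraphs = $doc->getElementsByTagName('p');
echo $paragraphs->length, "\n";
```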

If by that you mean every tag has a closing tag, like in XHTML, then no, I cannot guarantee that.

As far as the DomDocument goes, yes it looks good: http://php.net/manual/en/domdocument.loadhtmlfile.php

It has this example:

$elements = $doc->getElementsByTagName('div');

How would I go about selecting all children WITHIN a certain HTML element? E.g. how would I select this element's children?

$elements[0]->childNodes

if you have more than one source element, you need a loop. if you know (plain) JavaScript/DOM, that knowledge will be of great help here.
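A rough sketch of such a loop, assuming made-up markup with two <div> source elements. It only collects the tag names of each div's children.

```php
<?php
// Hypothetical markup with more than one source element.
$html = '<div><h1>Title</h1><p>Body</p></div><div><p>Other</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$names = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    // childNodes holds the direct children of this particular div.
    foreach ($div->childNodes as $child) {
        $names[] = $child->nodeName;
    }
}
echo implode(', ', $names), "\n";
```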

Yeah I know Javascript; I recognize the above code as Javascript.

I’ll play with it and see what I can come up with. There should be only a few elements within this specific HTML element. Thanks.

Is it possible to load the HTML page via an include file? I’m getting an error that the HTML string is needed, not a resource.

  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $article = fopen($url, 'r');
    $doc = new DOMDocument();
    $doc->loadHTML($article);
    echo $doc->saveHTML();
    fclose($article);
  }

I need it to pull the HTML from the $article file.

Ignore the less than perfect HTML…stupid CMS “tidying” it up.

as stated in the Manual: loadHTML() expects the HTML source as a string.

you might have overlooked DOMDocument::loadHTMLFile(). nevertheless, an fopen resource ain’t necessary in either case.
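A sketch of loadHTMLFile() without any fopen. The temp file here is only a stand-in so the snippet runs on its own. In the form handler the argument would be the submitted URL, which as far as I know also needs allow_url_fopen to be enabled.

```php
<?php
// Hypothetical stand-in for the article page.
$path = tempnam(sys_get_temp_dir(), 'article');
file_put_contents($path, '<html><body><h1>Headline</h1></body></html>');

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadHTMLFile($path);   // takes a filename or URL string, not a resource
libxml_clear_errors();

$headline = $loaded ? $doc->getElementsByTagName('h1')->item(0)->textContent : null;
echo $headline, "\n";

unlink($path);
```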

Ah I was doing fopen AND loadhtmlfile.

I was grabbing the POST value into a variable. Then fopening that variable (and storing that fopen in a variable).

Then I tried loadHTMLFile on that fopen variable. I needed to cut out the middle man. My bad. Stupid oversight.

I’m getting a completely blank page upon submit. No errors in my log.

if (true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument();
    $doc->loadHTML($url);
    $elements = $doc->getElementsByTagName('div');

    $tags = $doc->getElementsByTagName('a');

    foreach ($tags as $tag)
    {
      echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
    }
  }
}

Ignore the first “if(true)” part. Took out the condition there for security reasons.

Also, I assume I can do absolute link URLS or relative right? Both don’t work and give me a blank page.

Got it working.

$url=$_POST['URL'];
$doc = new DOMDocument;
$doc->loadHTMLFile($url);
$tags = $doc->getElementsByTagName('a');

foreach ($tags as $tag)
{
  echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}

The previous example wrongly called loadHTML() instead of loadHTMLFile().

better use $tag->textContent than $tag->nodeValue.

I’m trying to begin looping through my data ONLY if it finds a parent with the class page-main-content. From there, I want to select the FIRST h1 that occurs. How can I do that?

if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->loadHTMLFile($url);

    $xpath=new DomXPath($doc);

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
    if ($results->length > 0)
    {
      $links = array();
      foreach($results as $container)
      {
        $arr = $container->getElementsByTagName("a");
        foreach($arr as $item)
        {
          $href =  $item->getAttribute("href");
          $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
          $links[] = array(
            'href' => $href,
            'text' => $text
          );
        }
        for($i=0;$i<sizeof($links);$i++)
        {
          echo $links[$i]['text'];
        }
      }
    }
  }
}

I was messing with anchors in the above example just trying to get the logic worked out but I’m failing. Thanks in advance.

it's been a long time since I used XPath, but try (//*[contains(@class, 'page-main-content')]//h1)[1] (XPath positions start at 1, so h1[0] would match nothing).
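A sketch of that query in PHP. XPath positions are 1-based, and wrapping the path in parentheses makes [1] pick the first h1 in document order instead of every h1 that is the first among its siblings. The heading texts are made up so the two can be told apart.

```php
<?php
// Sample page modelled on the thread's test file; heading texts are hypothetical.
$html = <<<HTML
<html><body>
<div class="page-main-content">
<h1>First heading</h1>
<h1>Second heading</h1>
<p><a href="mypage1.html">Hello World!</a></p>
</div>
<p>THIS SHOULD NOT BE OUTPUTTED</p>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// item(0) is null when the wrapper or the h1 is missing.
$first = $xpath->query("(//*[contains(@class, 'page-main-content')]//h1)[1]")->item(0);
$title = $first !== null ? $first->textContent : null;
echo $title, "\n";
```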

Sorry I have updated my code. This works so far

<?php
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->loadHTMLFile($url);

    $xpath=new DomXPath($doc);

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
   
    if (!is_null($results))
    {
      foreach ($results as $element)
      {
        echo "<br/>[". $element->nodeName. "]";

        $nodes = $element->childNodes;
        foreach ($nodes as $node)
        {
          echo $node->nodeValue. "\n";
        }
      }
    }
  }
}
?>

The test page with the HTML I’m inputting is on http://www.codefundamentals.com/test2.php

The paragraph that says SHOULD NOT BE OUTPUTTED is not outputted when I load this URL from my original test form page. I’m happy so far. This is so over my head.

How can I use normalize space to remove all the empty nodes it’s looping over? I’m getting many random <br> tags in my output due to the HTML white space. I tried putting it on the xpath query but I’m only getting errors. Could you help? I’ve looked at examples but so far nothing has worked for me.

To clarify, I’m trying to strip all the white space and format it myself.

Right now this is my HTML file. I want all the line breaks removed so it’s one LONG string (unless you have reasons for not wanting me to do that.)

<html>
<head>
<title>My Page</title>
</head>
<body>
<div class="page-main-content">
<h1>h1 test</h1>
<h1>h1 test</h1>
<p><a href="mypage1.html">Hello World!</a></p>
<p><a href="mypage2.html">Another Hello World!</a></p>
</div>
<p>THIS SHOULD NOT BE OUTPUTTED</p>
</body>
</html>

Got it. Dunno if this is optimal though.

<?php
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    $doc->loadHTMLFile($url);
    $emailContents=array();
    $xpath=new DomXPath($doc);

    $h1Found=false;

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
    if (!is_null($results))
    {
      foreach ($results as $element)
      {
        $nodes = $element->childNodes;
        foreach ($nodes as $node)
        {
          if(trim($node->nodeValue, " \n\r\t\0\xC2\xA0")!=='' && $node->nodeName==='h1' && !$h1Found)
          {
            echo "THIS IS FINDING THE H1-END<br>";
            $h1Found=true;
          }
          elseif(trim($node->nodeValue, " \n\r\t\0\xC2\xA0")!=='')
          {
            echo $node->nodeValue. "<br>";
          }
        }
      }
    }
  }
}
?>

nope.

first mistake is that you assume that $element->childNodes would return elements. it returns nodes (text and comment nodes as well as elements). therefore you could simply filter/skip based upon the node's class name (e.g. DOMElement vs. DOMText).
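A sketch of that filter, assuming markup shaped like the test page. The line breaks between the tags show up in childNodes as DOMText nodes, and the instanceof check skips them.

```php
<?php
// Markup shaped like the thread's test page, including the line breaks between tags.
$html = "<div class=\"page-main-content\">\n<h1>h1 test</h1>\n<p>Hello World!</p>\n</div>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$container = $doc->getElementsByTagName('div')->item(0);

$lines = [];
foreach ($container->childNodes as $node) {
    if (!$node instanceof DOMElement) {
        continue;   // skip whitespace text nodes and comments
    }
    $lines[] = $node->nodeName . ': ' . trim($node->textContent);
}
echo implode("\n", $lines), "\n";
```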

but since you’re interested only in <h1>, why loop at all? fetch all the h1 tags there are (didn’t the XPath work?) and use the first one:

$h1 = $element->getElementsByTagName('h1')->item(0);

btw. if there is only one such wrapper element, you wouldn’t even use a loop:

$h1 = $results->item(0)->getElementsByTagName('h1')->item(0);

I will need to loop over all of the elements in the return set. I estimate about 10 elements total that I’ll have to pluck from the data (random P tags, some with classes, some not…an <img>…etc)

Those elements will have classes and what not. I was looking online and I see there isn’t really a getElementsByClassname in DOMDocument. What can I do about that?
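One possible stand-in is an XPath query on the class attribute. The sketch below assumes a hypothetical helper named elementsByClass() (it is not part of the DOM extension) and made-up class names. Padding the attribute with spaces stops "intro" from also matching "introduction". The class name is interpolated as-is, so it must not contain quotes.

```php
<?php
// Hypothetical helper, not part of the DOM extension.
function elementsByClass(DOMXPath $xpath, string $class, ?DOMNode $context = null): DOMNodeList
{
    // Match the class as a whole space-separated token.
    $query = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    return $xpath->query($query, $context);
}

// Made-up markup for illustration.
$html = '<div class="page-main-content"><p class="intro lead">Intro</p><p class="introduction">Other</p></div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$texts = [];
foreach (elementsByClass($xpath, 'intro') as $el) {
    $texts[] = $el->textContent;
}
echo implode(', ', $texts), "\n";
```

Passing the wrapper element as the third argument would restrict the search to its descendants.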

I’m afraid I’m not very good at PHP…how would you incorporate your suggestions into my code?

and to be really mean, you can import DOMNodes into SimpleXML … simplexml_import_dom()
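A sketch of that route, assuming markup like the test page: parse with DOMDocument, hand the tree to simplexml_import_dom(), then use SimpleXML's xpath() on it.

```php
<?php
// Markup modelled on the thread's test page.
$html = '<div class="page-main-content"><h1>h1 test</h1><p><a href="mypage1.html">Hello World!</a></p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Hand the parsed DOM tree to SimpleXML.
$sx = simplexml_import_dom($doc);

$hrefs = [];
foreach ($sx->xpath("//*[contains(@class, 'page-main-content')]//a") as $a) {
    $hrefs[] = (string) $a['href'];   // attributes are read with array syntax
}
echo implode("\n", $hrefs), "\n";
```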