simpleXML Data

RyanReese · March 16, 2015, 5:50pm

This is a more complicated approach than I would normally like, but I have a PHP page that I am opening read-only with fopen. This page will be created by a user and is a news article. I will then have the user go to another page and enter the URL of the recently created article, which will send out a mass e-mail. Please do not bother asking why I do not combine them.

I need to scan the PHP page for certain elements. They will all be within a certain parent HTML element (I am 100% sure about this), although the formatting can obviously differ depending on how the user creates it. I came across simpleXML and thought it could be useful, particularly xpath. It appears to be for XML only, though?

http://php.net/manual/en/simplexml.examples-basic.php

Is it possible to read my opened file for particular data (HTML)?

Dormilich · March 16, 2015, 5:59pm

It would certainly work if your HTML conforms to the XML rules.

nevertheless, you can load HTML explicitly with DOMDocument. it even has a dedicated method for that.
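A minimal sketch of that approach, assuming an inline HTML string stands in for the fetched article and PHP 7.1 or later. The sample markup and the function name firstHeadline are illustrative, not from the thread. DOMDocument::loadHTML() tolerates markup that is not well-formed XML, such as the unclosed <p> and <br> here:

```php
<?php
// Hypothetical helper: returns the text of the first <h1>, or null if none exists.
function firstHeadline(string $html): ?string
{
    // Collect parser warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $h1 = $doc->getElementsByTagName('h1')->item(0);
    return $h1 !== null ? trim($h1->textContent) : null;
}

// Assumed sample input: the <p> and <br> are never closed, which an XML parser would reject.
$html = '<html><body><div class="page-main-content"><h1>Title</h1><p>First<br>Second</div></body></html>';
echo firstHeadline($html), "\n";
```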

RyanReese · March 16, 2015, 6:06pm

If by that you mean every tag has a closing tag, like in XHTML, then no, I cannot assure that.

As far as the DomDocument goes, yes it looks good: http://php.net/manual/en/domdocument.loadhtmlfile.php

It has this example:

$elements = $doc->getElementsByTagName('div');

How would I go about selecting all children WITHIN a certain HTML element? E.g. how would I select this element's children?

Dormilich · March 16, 2015, 6:13pm

$elements[0]->childNodes

if you have more than one source element, you need a loop. if you know (plain) JavaScript/DOM, that knowledge will be of utmost help here.
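The loop over several source elements might look roughly like this. It is a sketch under assumptions: the function name childNodeNames and the sample markup are made up for illustration, and the HTML comes from a string rather than a URL.

```php
<?php
// Hypothetical helper: lists the node names of every child of every matching element.
function childNodeNames(string $html, string $tagName): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $names = [];
    // Outer loop: one pass per source element (e.g. each <div>).
    foreach ($doc->getElementsByTagName($tagName) as $element) {
        // Inner loop: childNodes yields elements and also text/comment nodes.
        foreach ($element->childNodes as $child) {
            $names[] = $child->nodeName;
        }
    }
    return $names;
}

print_r(childNodeNames('<div><h1>A</h1> <p>B</p></div><div><img src="x.png"></div>', 'div'));
```

Text nodes show up under the name #text, so whitespace between tags can appear in the list alongside the real elements.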

RyanReese · March 16, 2015, 6:15pm

Yeah I know Javascript; I recognize the above code as Javascript.

I’ll play with it and see what I can come up with. There should be only a few elements within this specific HTML element. Thanks.

RyanReese · March 16, 2015, 6:55pm

Is it possible to load the HTML page via an include file? I’m getting an error that the HTML string is needed, not a resource.

  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $article = fopen($url, 'r');
    $doc = new DOMDocument();
    $doc->loadHTML($article);
    echo $doc->saveHTML();
    fclose($article);
  }

I need it to pull the HTML from the $article file.

Ignore the less than perfect HTML…stupid CMS “tidying” it up.

Dormilich · March 17, 2015, 8:09am

as stated in the Manual.

you might have overlooked DOMDocument::loadHTMLFile(). nevertheless, an fopen resource ain’t necessary in either case.
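As a rough illustration of the difference: loadHTML() expects the markup itself as a string, while loadHTMLFile() expects a path or URL and does the reading on its own, so no fopen handle is involved. The helper name loadArticle and the temporary file below are assumptions for the sake of a runnable example; loading a remote URL this way additionally depends on allow_url_fopen being enabled.

```php
<?php
// Hypothetical helper: loads an HTML document straight from a path or URL.
function loadArticle(string $pathOrUrl): DOMDocument
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();

    // loadHTMLFile() opens and reads the source itself; it returns false on failure.
    if (!$doc->loadHTMLFile($pathOrUrl)) {
        throw new RuntimeException("Could not load HTML from $pathOrUrl");
    }
    libxml_clear_errors();
    return $doc;
}

// A temporary file stands in for the article URL in this sketch.
$path = tempnam(sys_get_temp_dir(), 'article');
file_put_contents($path, '<html><body><h1>Hello</h1></body></html>');

$doc = loadArticle($path);
echo $doc->getElementsByTagName('h1')->item(0)->textContent, "\n";

unlink($path);
```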

RyanReese · March 17, 2015, 11:33am

Ah I was doing fopen AND loadhtmlfile.

I was grabbing the POST value into a variable, then calling fopen on that variable (and storing the fopen result in another variable).

Then I tried loadHTMLFile on that fopen variable. I needed to cut out the middle man. My bad. Stupid oversight.

RyanReese · March 17, 2015, 11:38am

I’m getting a completely blank page upon submit. No errors in my log.

if (true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="url" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument();
    $doc->loadHTML($url);
    $elements = $doc->getElementsByTagName('div');

    $tags = $doc->getElementsByTagName('a');

    foreach ($tags as $tag)
    {
      echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
    }
  }
}

Ignore the first “if(true)” part. Took out the condition there for security reasons.

Also, I assume I can use either absolute or relative URLs, right? Neither works; both give me a blank page.

RyanReese · March 17, 2015, 12:23pm

Got it working.

$url=$_POST['URL'];
$doc = new DOMDocument;
$doc->loadHTMLFile($url);
$tags = $doc->getElementsByTagName('a');

foreach ($tags as $tag)
{
  echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}

The previous example wrongly called loadHTML() where loadHTMLFile() was needed.

Dormilich · March 17, 2015, 1:08pm

better use $tag->textContent than $tag->nodeValue.

RyanReese · March 17, 2015, 1:32pm

I’m trying to begin looping through my data ONLY if it finds a parent with the class page-main-content. From there, I want to select the FIRST h1 that occurs. How can I do that?

if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->loadHTMLFile($url);

    $xpath=new DomXPath($doc);

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
    if ($results->length > 0)
    {
      $links = array();
      foreach($results as $container)
      {
        $arr = $container->getElementsByTagName("a");
        foreach($arr as $item)
        {
          $href =  $item->getAttribute("href");
          $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
          $links[] = array(
            'href' => $href,
            'text' => $text
          );
        }
        for($i=0;$i<sizeof($links);$i++)
        {
          echo $links[$i].text;
        }
      }
    }
  }
}

I was messing with anchors in the above example just trying to get the logic worked out but I’m failing. Thanks in advance.

Dormilich · March 17, 2015, 4:23pm

it has been a long time since I used XPath. note that XPath positions start at 1, not 0, so try (//*[contains(@class, 'page-main-content')]//h1)[1]
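A sketch of how such a query could be run from PHP, assuming PHP 7.1 or later. The function name firstH1InContainer and the sample markup are invented for the example. XPath positions are 1-based, and wrapping the path in parentheses makes the [1] apply to the whole result set rather than to each parent separately.

```php
<?php
// Hypothetical helper: text of the first <h1> inside the first element carrying $class.
// Assumption: $class is a trusted literal without quotes, since it is pasted into the query.
function firstH1InContainer(string $html, string $class): ?string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // (...)[1] picks the first node of the combined result, in document order.
    $nodes = $xpath->query("(//*[contains(@class, '$class')]//h1)[1]");

    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$html = '<html><body><h1>Outside</h1>'
      . '<div class="page-main-content"><h1>First</h1><h1>Second</h1></div>'
      . '</body></html>';
echo firstH1InContainer($html, 'page-main-content'), "\n";
```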

RyanReese · March 17, 2015, 4:25pm

Sorry I have updated my code. This works so far

<?php
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->loadHTMLFile($url);

    $xpath=new DomXPath($doc);

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
   
    if (!is_null($results))
    {
      foreach ($results as $element)
      {
        echo "<br/>[". $element->nodeName. "]";

        $nodes = $element->childNodes;
        foreach ($nodes as $node)
        {
          echo $node->nodeValue. "\n";
        }
      }
    }
  }
}
?>

Test page with the HTML I’m inputting is at http://www.codefundamentals.com/test2.php

The paragraph that says SHOULD NOT BE OUTPUTTED is not outputted when I load this URL from my original test form page. I’m happy so far. This is so over my head.

RyanReese · March 17, 2015, 5:53pm

How can I use normalize space to remove all the empty nodes it’s looping over? I’m getting many random <br> tags in my output due to the HTML white space. I tried putting it on the xpath query but I’m only getting errors. Could you help? I’ve looked at examples but so far nothing has worked for me.

RyanReese · March 17, 2015, 5:55pm

To clarify, I’m trying to strip all white space, and let me, myself format it.

Right now this is my HTML file. I want all breaks removed so it’s one LONG string (unless you have reasons for not wanting me to do that.)

<html>
<head>
<title>My Page</title>
</head>
<body>
<div class="page-main-content">
<h1>h1 test</h1>
<h1>h1 test</h1>
<p><a href="mypage1.html">Hello World!</a></p>
<p><a href="mypage2.html">Another Hello World!</a></p>
</div>
<p>THIS SHOULD NOT BE OUTPUTTED</p>
</body>
</html>
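One way normalize-space() could be applied to this markup is as a predicate on the child nodes, so that whitespace-only nodes never reach the loop. This is a sketch, not a tested drop-in: the function name nonEmptyChildTexts is made up, and the HTML is passed as a string. Note that normalize-space() does not treat a non-breaking space as whitespace.

```php
<?php
// Hypothetical helper: texts of all non-blank child nodes of the container.
function nonEmptyChildTexts(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // node() matches elements and text nodes alike; [normalize-space()] keeps
    // only those whose text is non-empty once surrounding whitespace is stripped.
    $query = "//*[contains(@class, 'page-main-content')]/node()[normalize-space()]";

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse line breaks and runs of spaces into single spaces.
        $texts[] = trim(preg_replace('/\s+/u', ' ', $node->textContent));
    }
    return $texts;
}

$html = "<html><body>\n<div class=\"page-main-content\">\n<h1>h1 test</h1>\n"
      . "<p><a href=\"mypage1.html\">Hello World!</a></p>\n</div>\n"
      . "<p>THIS SHOULD NOT BE OUTPUTTED</p>\n</body></html>";

// Joining the pieces gives the one long string asked for above.
echo implode(' ', nonEmptyChildTexts($html)), "\n";
```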

RyanReese · March 17, 2015, 6:15pm

Got it. Dunno if this is optimal though.

<?php
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL you entered was invalid. Please try again</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="submit"><input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=$_POST['URL'];
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    $doc->loadHTMLFile($url);
    $emailContents=array();
    $xpath=new DomXPath($doc);

    $h1Found=false;

    //Find element with class="page-main-content"
    $results=$xpath->query("//*[contains(@class, 'page-main-content')]");
    if (!is_null($results))
    {
      foreach ($results as $element)
      {
        $nodes = $element->childNodes;
        foreach ($nodes as $node)
        {
          if(trim($node->nodeValue, " \n\r\t\0\xC2\xA0")!=='' && $node->nodeName==='h1' && !$h1Found)
          {
            echo "THIS IS FINDING THE H1-END<br>";
            $h1Found=true;
          }
          elseif(trim($node->nodeValue, " \n\r\t\0\xC2\xA0")!=='')
          {
            echo $node->nodeValue. "<br>";
          }
        }
      }
    }
  }
}
?>

Dormilich · March 17, 2015, 6:52pm

nope.

first mistake is that you assume that $element->childNodes would return elements. it returns nodes. therefore you could simply filter/skip based upon the class name.

but since you’re interested only in <h1>, why loop at all? fetch all h1 tags (didn’t the XPath work?) that there are and use the first one:

$h1 = $element->getElementsByTagName('h1')->item(0);

btw. if there is only one such wrapper element, you wouldn’t even use a loop:

$h1 = $results->item(0)->getElementsByTagName('h1')->item(0);
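Put together, those two suggestions could look something like the following, assuming PHP 7.1 or later. This is only a sketch: the function name extractArticle and the returned array shape are invented here, and reading "class name" as the PHP class of the node (DOMElement versus DOMText) is an interpretation, not something the thread spells out.

```php
<?php
// Hypothetical helper: headline plus the text of the remaining child elements.
function extractArticle(string $html): ?array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = $xpath->query("//*[contains(@class, 'page-main-content')]");
    if ($results->length === 0) {
        return null; // no wrapper element found
    }

    // Single wrapper assumed, so no outer loop is needed.
    $container = $results->item(0);
    $h1 = $container->getElementsByTagName('h1')->item(0);

    $body = [];
    foreach ($container->childNodes as $node) {
        // Skip text/comment nodes, and skip the headline itself.
        if (!($node instanceof DOMElement)) {
            continue;
        }
        if ($h1 !== null && $node->isSameNode($h1)) {
            continue;
        }
        $text = trim($node->textContent);
        if ($text !== '') {
            $body[] = $text;
        }
    }

    return [
        'headline' => $h1 !== null ? trim($h1->textContent) : null,
        'body'     => $body,
    ];
}

$html = "<div class=\"page-main-content\">\n<h1>h1 test</h1>\n<h1>h1 test</h1>\n"
      . "<p><a href=\"mypage1.html\">Hello World!</a></p>\n</div>\n<p>THIS SHOULD NOT BE OUTPUTTED</p>";
print_r(extractArticle($html));
```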

RyanReese · March 17, 2015, 6:55pm

I will need to loop over all of the elements in the return set. I estimate about 10 elements I’ll need total that I’ll have to pluck from the data (random P tags, some with classes, some not…an <img>…etc)

Those elements will have classes and what not. I was looking online and I see there isn’t really a getElementsByClassname in DOMDocument. What can I do about that?

I’m afraid I’m not very good in PHP…how would you incorporate your suggestions into my code? I
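DOMDocument indeed has no getElementsByClassName, but an XPath query can stand in for it. The sketch below assumes PHP 7.1 or later; the function name elementsByClassName and its signature are hypothetical, not part of the DOM extension. Padding the class attribute with spaces makes the test match whole class names, so "lead" does not match "leading".

```php
<?php
// Hypothetical stand-in for getElementsByClassName, built on DOMXPath.
// Assumption: $class is a trusted literal without quotes, since it is pasted into the query.
function elementsByClassName(DOMDocument $doc, string $class, ?DOMNode $context = null): DOMNodeList
{
    $xpath = new DOMXPath($doc);
    // " lead intro " contains " lead ", while " leading " does not.
    $query = sprintf(
        ".//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    // Search below $context if given, otherwise below the root element.
    return $xpath->query($query, $context ?? $doc->documentElement);
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div class="page-main-content"><p class="lead intro">A</p><p class="leading">B</p><p>C</p></div>');
libxml_clear_errors();

foreach (elementsByClassName($doc, 'lead') as $element) {
    echo $element->nodeName, ': ', $element->textContent, "\n";
}
```

Passing the wrapper element as the third argument would restrict the search to its descendants.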

Dormilich · March 17, 2015, 6:55pm

and to be really mean, you can import DOMNodes into SimpleXML … simplexml_import_dom()
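For completeness, a sketch of that route: the HTML is parsed leniently by DOMDocument and then handed to SimpleXML, which brings back the xpath() method from the original question. The function name linksViaSimpleXml and the sample markup are assumptions made for the example.

```php
<?php
// Hypothetical helper: collects href/text pairs of links inside the container via SimpleXML.
function linksViaSimpleXml(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Converts the DOM tree into a SimpleXMLElement rooted at <html>.
    $sxml = simplexml_import_dom($doc);

    $links = [];
    foreach ($sxml->xpath("//*[contains(@class, 'page-main-content')]//a") as $a) {
        $links[] = [
            'href' => (string) $a['href'], // attribute access via array syntax
            'text' => trim((string) $a),   // direct text content of the <a>
        ];
    }
    return $links;
}

$html = '<html><body><div class="page-main-content">'
      . '<p><a href="mypage1.html">Hello World!</a></p>'
      . '</div><p><a href="skip.html">Outside</a></p></body></html>';
print_r(linksViaSimpleXml($html));
```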