simpleXML Data

RyanReese · March 17, 2015, 6:55pm

Dormilich · March 17, 2015, 6:56pm

that’s because classes do not exist in XML. and DOM targets XML as well as HTML.

RyanReese · March 17, 2015, 6:59pm

Yeah, I will only need to loop through HTML. Knowing that I still have about 9 more elements to pull, do you still recommend looping?

[quote=“Dormilich, post:18, topic:115635”]
(didn’t the XPath work?)
[/quote]
Couldn’t get it working due to my noobness. How does your XPath factor into the code I have now? How can I rewrite it to remove the unneeded loops and make it optimal?

RyanReese · March 17, 2015, 7:00pm

I cannot guarantee there will be only one such wrapper. The user uses a WYSIWYG editor and who knows…

Mittineague · March 17, 2015, 7:27pm

If you have Windows, even if you don’t use the SDK the chm in it is a valuable reference for things XML

vvv download page vvv
https://www.microsoft.com/en-us/download/details.aspx?id=3988

Dormilich · March 17, 2015, 7:43pm

it depends which elements you need, up till now only the h1 was mentioned.

RyanReese · March 17, 2015, 8:11pm

I figured once I get one element, I can sort of use the same logic to get the other elements.

The number of elements can vary depending on how many paragraph tags there are. I’ll need at least the h1, img, and a handful of paragraph tags. Also 1 span.

You see now in my current code I use if() to determine if the header matches my criteria. I was going to just add conditions to match what I need. Maybe switches.

RyanReese · March 17, 2015, 8:12pm

@Mittineague, I’m afraid I’m very restricted here at work, so that will not be of help to me. It looks like I have to download that to use it.

RyanReese · March 18, 2015, 3:42pm

Update: I have this so far, which finds the first h1 and sets it into a variable.

I do have a question though: how can I use getElementsByTagName (or something similar) as part of an IF condition? I need to find out whether the element I’m on in my loop has a certain tag name, and if so add it to a variable.

<?php
error_reporting(E_ALL);
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="email">Enter the E-mail of the person this will go to:</label> <input id="email" name="email" type="text" />
    <input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false  || filter_var($_POST['email'], FILTER_VALIDATE_EMAIL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL or e-mail address you entered was invalid. Please try again.</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="email">Enter the E-mail of the person this will go to:</label> <input id="email" name="email" type="text" />
    <input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=filter_var($_POST['URL'], FILTER_SANITIZE_URL);
    $email=filter_var($_POST['email'], FILTER_SANITIZE_EMAIL);

    $masthead="";
    $title="";
    $datetime="";
    $leftImage="";
    $article="";
    $footer="";

    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    $doc->loadHTMLFile($url);
    $xpath=new DomXPath($doc);

    $results = $xpath->query("//div[contains(@class, 'page-main-content')]");
    foreach($results as $cr)
    {
      $title=$cr->getElementsByTagName('h1')->item(0)->textContent;
    }
    echo $title;
  }
}
?>
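
On the tag-name question above, here is a minimal sketch of one way to test an element's tag inside a loop. It assumes a made-up HTML string in place of the fetched page, and it walks the wrapper's direct children rather than calling getElementsByTagName. The nodeName property and the instanceof DOMElement check are standard DOM extension features; the sample markup is hypothetical.

```php
<?php
// Hypothetical sample markup standing in for the fetched article page.
$html = '<div class="page-main-content"><h1>Title</h1><p>One</p><span>Date</span><p>Two</p></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$title = '';
$article = '';

foreach ($xpath->query("//div[contains(@class, 'page-main-content')]") as $wrapper) {
    foreach ($wrapper->childNodes as $node) {
        // Skip text and comment nodes; only elements have a tag name.
        if (!($node instanceof DOMElement)) {
            continue;
        }
        // nodeName is lowercase for documents loaded with loadHTML().
        if ($node->nodeName === 'h1') {
            $title = $node->textContent;
        } elseif ($node->nodeName === 'p') {
            $article .= $node->textContent . "<br>";
        }
    }
}

echo $title, "\n", $article, "\n";
```

A switch on $node->nodeName would work the same way if more tags need separate handling.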

RyanReese · March 18, 2015, 6:42pm

So far, everything is figured out. I’m beginning to transfer my content over to the e-mail template. I don’t think I’ll run into any more issues but I’ll let you know if I do! Thanks for everything.

Dormilich · March 19, 2015, 8:31am

you are aware that the code as given only uses the last found h1?
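
To illustrate the point about the last h1, a small sketch: it assumes hypothetical markup with two matching wrappers, and compares assigning on every pass with stopping at the first match. The variable names $lastTitle and $firstTitle are made up for the comparison.

```php
<?php
// Hypothetical markup with two matching wrappers, each holding an h1.
$html = '<div class="page-main-content"><h1>First</h1></div>'
      . '<div class="page-main-content"><h1>Second</h1></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$results = $xpath->query("//div[contains(@class, 'page-main-content')]");

// Assigning on every pass keeps only the value from the last wrapper.
$lastTitle = '';
foreach ($results as $cr) {
    $lastTitle = $cr->getElementsByTagName('h1')->item(0)->textContent;
}

// Stopping at the first wrapper that contains an h1 keeps the first one.
$firstTitle = '';
foreach ($results as $cr) {
    $h1 = $cr->getElementsByTagName('h1')->item(0);
    if ($h1 !== null) {
        $firstTitle = $h1->textContent;
        break;
    }
}

echo $lastTitle, "\n", $firstTitle, "\n";
```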

RyanReese · March 19, 2015, 9:48am

Yes. I should note that my code has changed dramatically since the last post; I almost entirely redid it.

RyanReese · March 19, 2015, 11:23am

So I have this now:

$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
$doc->loadHTMLFile($url);
$xpath=new DomXPath($doc);

$results = $xpath->query("//div[contains(@class, 'page-main-content')]");

foreach($results as $cr)
{
  //find first h1, add it to $title element
  $title=$cr->getElementsByTagName('h1')->item(0)->textContent;

  //find first img, store its src in $leftImage
  $leftImage=$cr->getElementsByTagName('img')->item(0)->getAttribute('src');
  $url_info = parse_url($leftImage);

  if (!isset($url_info['host']))
  {
    $path = $url_info['path'];
    if (substr($path,0,1) !== '/') $path = '/'.$path;
    $leftImage= $dir.$path;
  }
  $leftImageAlt=$cr->getElementsByTagName('img')->item(0)->getAttribute('alt');

  //find all paragraph tags, add them to $article
  for($i=0;$i<$cr->getElementsByTagName('p')->length;$i++)
    $article.=$cr->getElementsByTagName('p')->item($i)->textContent."<br>";
}

I have the h1 and the image being grabbed. Now, besides these elements (first image, first h1), how can I set it up to grab ALL OTHER elements and add them to the $article variable? Right now it’s just grabbing all paragraphs, but realistically I need every other node/element in there, aside from the few elements I’m pulling out separately.

I want the equivalent of me copy/pasting the other page’s HTML and putting it all inside $article (aside from a few select elements).

Dormilich · March 19, 2015, 12:42pm

all other elements of what?

if you can live without line breaks: $cr->textContent, otherwise you need a clear picture of what elements should be displayed how.
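
A quick sketch of what that trade-off looks like, assuming a hypothetical wrapper with two paragraphs and a list: textContent concatenates the text of every descendant with nothing in between, so element boundaries are lost.

```php
<?php
// Hypothetical markup: two paragraphs with a list between them.
$html = '<div class="page-main-content"><p>One</p><ul><li>Item</li></ul><p>Two</p></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cr = $xpath->query("//div[contains(@class, 'page-main-content')]")->item(0);

// All descendant text is joined together, with no separators added.
$text = $cr->textContent;

echo $text, "\n"; // prints "OneItemTwo"
```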

RyanReese · March 19, 2015, 12:50pm

[quote=“Dormilich, post:34, topic:115635”]
if you can live without line breaks: $cr->textContent
[/quote]
I’m using that but concatenating a <br> onto it.

Let’s say I have this sort of structure (pseudo code)

div.page-main-content
--h1
--span of date time etc
--p
--p
--p
--img
--ul
----li
--/ul
--p
--p
--/div

Now, I want EVERY element there to be added to $article, except for the img and the h1 (only the first occurrence of each is excluded; any later ones stay in).

It can be any sort of tags that has the text. Not just paragraphs and ULs. It could be other spans or anything.

Basically this entire div holds an article. I want the whole article as the value for $article except for a few key elements.

http://www.codefundamentals.com/test2.php

See the header “Middle East Forum Presents: Dr. Robert Rubinstein, “Culture, Interagency Dynamics, and Health in the Middle East””?

From there, until the ending “The Regional Forum Lectures are sponsored by the Class of 1993.”…all that needs to be in my $article variable minus a few elements.

RyanReese · March 19, 2015, 12:58pm

If you go to codefundamentals.com/test.php

Enter in the test2.php URL and then a random email (it does nothing so far) you’ll see what it’s outputting now.

It works great except I need to exclude the last 2 paragraphs, and also I don’t account for any spans or ul/li or any other tags that might appear. Right now it’d be easy to miss chunks of text or lists with my method.

RyanReese · March 19, 2015, 2:46pm

I just went through all the other articles as a baseline and they only use paragraphs…so I think I’ll be fine.

If grabbing EVERY element and then sorting out from there to exclude the first h1/img etc is too much work, then perhaps we can just move on.

That being said, I believe my script is finished unless you can optimize it.

Dormilich · March 19, 2015, 3:20pm

in your for() loop, use length - 2 as break condition.
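
As a sketch of that loop condition, assuming hypothetical markup where the last two paragraphs are the ones to drop: the $skip variable is made up here so the count is easy to adjust, and the node list is fetched once instead of on every iteration.

```php
<?php
// Hypothetical markup: two article paragraphs followed by two footer paragraphs.
$html = '<div class="page-main-content"><p>A</p><p>B</p><p>Footer 1</p><p>Footer 2</p></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cr = $xpath->query("//div[contains(@class, 'page-main-content')]")->item(0);

$paragraphs = $cr->getElementsByTagName('p'); // fetch the list once
$skip = 2;                                    // trailing paragraphs to leave out
$article = '';

for ($i = 0; $i < $paragraphs->length - $skip; $i++) {
    $article .= $paragraphs->item($i)->textContent . "<br>";
}

echo $article, "\n"; // prints "A<br>B<br>"
```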

RyanReese · March 19, 2015, 3:24pm

Yes, I realized that shortly after I posted. I had to make it 3 actually, but nevertheless.

Do you have any suggestions as to the more specific grabbing of elements as noted in post #37? I’m fine if you don’t.

Dormilich · March 19, 2015, 5:23pm

grabbing all elements gives you a plain (one-dimensional) list of items. the term first then may lose its necessary context. if that doesn’t matter, get that list, find the desired items and remove them (removeChild() returns the removed element to you, so you can still grab its content).
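
A rough sketch of the removeChild() idea, under a few assumptions: the markup is a made-up stand-in for the article page, "first" is taken to mean first in document order inside the wrapper, and the remaining children are serialized with DOMDocument::saveHTML() so that tags such as ul/li survive. This is one possible reading of the suggestion, not necessarily the exact approach meant.

```php
<?php
// Hypothetical markup standing in for the article wrapper.
$html = '<div class="page-main-content"><h1>Title</h1><p>One</p>'
      . '<img src="/a.png" alt="Pic"><ul><li>Item</li></ul><p>Two</p></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cr = $xpath->query("//div[contains(@class, 'page-main-content')]")->item(0);

// removeChild() returns the removed node, so its content is still readable.
$h1 = $cr->getElementsByTagName('h1')->item(0);
$title = $h1->parentNode->removeChild($h1)->textContent;

$img = $cr->getElementsByTagName('img')->item(0);
$img = $img->parentNode->removeChild($img);
$leftImage = $img->getAttribute('src');
$leftImageAlt = $img->getAttribute('alt');

// Whatever is left inside the wrapper becomes the article markup.
$article = '';
foreach ($cr->childNodes as $child) {
    $article .= $doc->saveHTML($child);
}

echo $title, "\n", $leftImage, "\n", $article, "\n";
```

Real pages would need null checks before each removeChild() call, since item(0) returns null when the wrapper has no h1 or img.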
