simpleXML Data

:cry:

that’s because classes do not exist in XML, and DOM targets XML as well as HTML.

Yeah, I will only need to loop through HTML. Knowing that I still need to pull about 9 more elements, do you still recommend looping?

[quote="Dormilich, post:18, topic:115635"]
(didn’t the XPath work?)
[/quote]
Couldn’t get it working due to my noobness. How does your XPath factor into the code I have now? How can I rewrite it to remove the unneeded loops and make it optimal?

I cannot guarantee there will be only one such wrapper. The user uses a WYSIWYG editor and who knows…
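One way the XPath could replace the loop, as a minimal sketch: the sample markup below is made up for illustration, and only the `page-main-content` class comes from this thread. Wrapping the path in parentheses before `[1]` picks the first h1 in document order, even when more than one wrapper exists.

```php
<?php
// Hypothetical markup with two wrappers, standing in for the real page.
$html = '<div class="page-main-content"><h1>First</h1></div>'
      . '<div class="page-main-content"><h1>Second</h1></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The parentheses make [1] apply to the combined result set,
// so this selects the first h1 across all matching wrappers.
$h1 = $xpath->query("(//div[contains(@class, 'page-main-content')]//h1)[1]")->item(0);

// item(0) returns null when nothing matched, so guard before reading the text.
$title = $h1 !== null ? $h1->textContent : '';
echo $title, "\n";
```

With this, no foreach is needed for the title, and a page with several wrappers no longer changes which h1 is used.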

If you have Windows, even if you don’t use the SDK, the chm in it is a valuable reference for things XML.

vvv download page vvv
https://www.microsoft.com/en-us/download/details.aspx?id=3988

it depends which elements you need, up till now only the h1 was mentioned.

I figured once I get one element, I can sort of use the same logic to get the other elements.

The number of elements can vary depending on how many paragraph tags there are. I’ll need at least the h1, img, and a handful of paragraph tags. Also 1 span.

You see now in my current code I use if() to determine if the header matches my criteria. I was going to just add conditions to match what I need. Maybe switches.

@Mittineague, I’m afraid I’m very restricted here at work, so that will not be of help to me. It looks like I have to download that to use it.

Update: I have this so far, which finds the first h1 and sets it into a variable.

I do have a question though: how can I use getElementsByTagName (or something similar) as part of an if condition? I need to check whether the element I’m on in my loop has a certain tag name, and if so add it to a variable.

<?php
error_reporting(E_ALL);
if(true)
{
  if(!isset($_POST['submit']))
  {
  ?>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="email">Enter the E-mail of the person this will go to:</label> <input id="email" name="email" type="text" />
    <input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else if(filter_var($_POST['URL'], FILTER_VALIDATE_URL) === false  || filter_var($_POST['email'], FILTER_VALIDATE_EMAIL) === false)
  {
  ?>
    <div class="error"><p>Error: The URL or e-mail address you entered was invalid. Please try again.</p></div>
    <form action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>" method="post">
    <label for="url">Enter the URL of the article:</label> <input id="url" name="URL" type="text" />
    <label for="email">Enter the E-mail of the person this will go to:</label> <input id="email" name="email" type="text" />
    <input id="submit" class="button" name="submit" type="submit" /></form>
  <?php
  }
  else
  {
    $url=filter_var($_POST['URL'], FILTER_SANITIZE_URL);
    $email=filter_var($_POST['email'], FILTER_SANITIZE_EMAIL);

    $masthead="";
    $title="";
    $datetime="";
    $leftImage="";
    $article="";
    $footer="";

    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    $doc->loadHTMLFile($url);
    $xpath=new DomXPath($doc);

    $results = $xpath->query("//div[contains(@class, 'page-main-content')]");
    foreach($results as $cr)
    {
      $title=$cr->getElementsByTagName('h1')->item(0)->textContent;
    }
    echo $title;
  }
}
?>
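On the question of testing the tag name inside a condition: each DOM node exposes its tag name as `nodeName`, so it can be compared directly in an `if`. The following is only a sketch with made-up sample markup. As far as I know `DOMDocument::loadHTML` reports HTML tag names in lowercase, and the `strtolower()` call is just a guard against that assumption being wrong.

```php
<?php
// Hypothetical sample markup standing in for the fetched article page.
$html = '<div class="page-main-content"><h1>Title</h1><span>Jan 1</span>'
      . '<p>First</p><p>Second</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$title = '';
$article = '';

// "/*" selects only the element children of the wrapper, in document order.
foreach ($xpath->query("//div[contains(@class, 'page-main-content')]/*") as $node) {
    $tag = strtolower($node->nodeName); // tag name of the current element
    if ($tag === 'h1' && $title === '') {
        // Only the first h1 is kept as the title.
        $title = $node->textContent;
    } elseif ($tag === 'p') {
        $article .= $node->textContent . "<br>";
    }
}

echo $title, "\n", $article, "\n";
```

A `switch ($tag)` would work the same way once more tag names need their own handling.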

So far, everything is figured out. I’m beginning to transfer my content over to the e-mail template. I don’t think I’ll run into any more issues but I’ll let you know if I do! Thanks for everything.

you are aware that the code as given only uses the h1 of the last matched div (each loop iteration overwrites $title)?

Yes: I should note that my code has changed dramatically since the last post. I almost entirely redid it.

So I have this now:

$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
$doc->loadHTMLFile($url);
$xpath=new DomXPath($doc);

$results = $xpath->query("//div[contains(@class, 'page-main-content')]");

foreach($results as $cr)
{
  //find first h1, store its text in $title
  $title=$cr->getElementsByTagName('h1')->item(0)->textContent;

  //find first img, store its src in $leftImage
  $leftImage=$cr->getElementsByTagName('img')->item(0)->getAttribute('src');
  $url_info = parse_url($leftImage);

  if (!isset($url_info['host']))
  {
    $path = $url_info['path'];
    if (substr($path,0,1) !== '/') $path = '/'.$path;
    $leftImage= $dir.$path;
  }
  $leftImageAlt=$cr->getElementsByTagName('img')->item(0)->getAttribute('alt');

  //find all paragraph tags, add them to $article (look the list up once, not on every pass)
  $paragraphs=$cr->getElementsByTagName('p');
  for($i=0;$i<$paragraphs->length;$i++)
    $article.=$paragraphs->item($i)->textContent."<br>";
}

I have the h1 and the image being grabbed. Now, how can I set it up so that, besides the elements I’m already grabbing (first image, first h1), it grabs ALL OTHER elements and adds them to the $article variable? Right now it’s just grabbing all paragraphs, but realistically I need all other nodes/elements added to it, aside from the few elements I’m grabbing separately.

I want the equivalent of copy/pasting the other page’s HTML and putting it all inside $article (aside from a few select elements).

all other elements of what?

if you can live without line breaks: $cr->textContent, otherwise you need a clear picture of what elements should be displayed how.

[quote="Dormilich, post:34, topic:115635"]
if you can live without line breaks: $cr->textContent
[/quote]
I’m using that but concatenating a <br> onto it.

Let’s say I have this sort of structure (pseudo code)

div.page-main-content
--h1
--span of date time etc
--p
--p
--p
--img
--ul
----li
--/ul
--p
--p
--/div

Now, I want EVERY element there to be added to $article, except for the img and h1 (first occurrence of each, ignoring all others).

It can be any sort of tag that holds text, not just paragraphs and ULs. It could be other spans or anything.

Basically this entire div holds an article. I want the whole article as the value for $article except for a few key elements.

http://www.codefundamentals.com/test2.php

See the header “Middle East Forum Presents: Dr. Robert Rubinstein, “Culture, Interagency Dynamics, and Health in the Middle East””?

From there, until the ending “The Regional Forum Lectures are sponsored by the Class of 1993.”… all that needs to be in my $article variable minus a few elements.

If you go to codefundamentals.com/test.php

Enter the test2.php URL and then a random email (it does nothing so far), and you’ll see what it’s outputting now.

It works great except I need to exclude the last 2 paragraphs, and also I don’t account for any spans or ul/li or any other tags that might appear. Right now it’d be easy to miss chunks of text or lists with my method.

I just went through all the other articles as a baseline and they only use paragraphs…so I think I’ll be fine.

If grabbing EVERY element and then sorting out from there to exclude the first h1/img etc is too much work, then perhaps we can just move on.

That being said, I believe my script is finished unless you can optimize it.

in your for() loop, use length - 2 as break condition.
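A minimal sketch of that break condition, using made-up paragraphs in place of the real article. The `$skip` variable is a hypothetical name for however many trailing paragraphs should be left out, and the paragraph list is fetched once before the loop.

```php
<?php
// Hypothetical article: three body paragraphs followed by two trailing notes.
$html = '<div class="page-main-content"><p>One</p><p>Two</p><p>Three</p>'
      . '<p>Sponsor note</p><p>Contact note</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cr = $xpath->query("//div[contains(@class, 'page-main-content')]")->item(0);

$paragraphs = $cr->getElementsByTagName('p'); // look the list up once
$skip = 2; // assumed number of trailing paragraphs to leave out
$article = '';

// Stop $skip items before the end; if there are fewer paragraphs than $skip,
// the condition is false from the start and nothing is added.
for ($i = 0; $i < $paragraphs->length - $skip; $i++) {
    $article .= $paragraphs->item($i)->textContent . "<br>";
}

echo $article, "\n";
```

Changing `$skip` is then the only adjustment needed if the number of trailing paragraphs turns out to be different.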

Yes, I realized that shortly after I posted :slight_smile: . I had to make it 3 actually but nevertheless.

Do you have any suggestions as to the more specific grabbing of elements as noted in post #37? I’m fine if you don’t.

grabbing all elements gives you a plain (one-dimensional) list of items. the term "first" may then lose its necessary context. if that doesn’t matter, get that list, find the desired items and remove them (removeChild() returns the removed element to you, so you can still grab its content).
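The remove-then-collect idea might look roughly like this. It is a sketch under assumptions: the markup is invented, and `saveHTML()` is used here (my choice, not something suggested in the thread) to serialize whatever is left in the wrapper so that lists and spans keep their tags.

```php
<?php
// Hypothetical article markup; only the wrapper class comes from the thread.
$html = '<div class="page-main-content"><h1>Title</h1><p>Intro</p>'
      . '<img src="/a.png" alt="A"><ul><li>Item</li></ul><p>End</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cr = $xpath->query("//div[contains(@class, 'page-main-content')]")->item(0);

// Take the first h1 out of the tree; removeChild() hands the node back.
$title = '';
$h1 = $cr->getElementsByTagName('h1')->item(0);
if ($h1 !== null) {
    $title = $h1->parentNode->removeChild($h1)->textContent;
}

// Same for the first img, keeping its attributes.
$leftImage = '';
$leftImageAlt = '';
$img = $cr->getElementsByTagName('img')->item(0);
if ($img !== null) {
    $removed = $img->parentNode->removeChild($img);
    $leftImage = $removed->getAttribute('src');
    $leftImageAlt = $removed->getAttribute('alt');
}

// Everything still inside the wrapper is serialized as HTML into $article.
$article = '';
foreach ($cr->childNodes as $child) {
    $article .= $doc->saveHTML($child);
}

echo $title, "\n", $leftImage, "\n", $article, "\n";
```

Note that the removal changes the in-memory document, so anything that still needs the original h1 or img has to read it before this step or use the returned node.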