How to remove some elements using DOMDocument

I’m trying to filter an HTML string to strip out some undesirable elements, but my code doesn’t achieve the desired goal.

function FunctionName($file) {
    $doc = new DOMDocument();

    libxml_use_internal_errors(true);
    $doc->loadHTML($file);

    $book = $doc->documentElement;

    // remove all these tags
    $arr = [
        $book->getElementsByTagName('script'),
        $book->getElementsByTagName('iframe'),
        $book->getElementsByTagName('noscript'),
    ];

    $domElemsToRemove = array();
    for ($i = 0; $arr < count($arr); $i++) {
        foreach ($arr[$i] as $domElement) {
            $domElemsToRemove[] = $domElement;
        }
    }

    foreach ($domElemsToRemove as $domElement) {
        $domElement->parentNode->removeChild($domElement);
    }

    if ($book->getElementsByTagName('script')->length <= 0) {
        FunctionName($file);
    }
    die($doc->save('file.php'));
}

FunctionName($rawFile);

But the resulting file still contains the same elements, as if nothing happened. How do I go about removing them without a third-party library?

Does each of your arrays, $arr and $domElemsToRemove, contain exactly what you expect it to contain?
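
For instance, a quick check along these lines (a rough sketch, reusing the variable names from your code) would show whether the removal list ever gets populated:

// Dump the size of each NodeList and of the removal queue
foreach ($arr as $i => $nodeList) {
    echo "list $i: " . $nodeList->length . " node(s)\n";
}
echo count($domElemsToRemove) . " node(s) queued for removal\n";

If $domElemsToRemove comes back empty, look closely at the condition in your for loop: it compares $arr (an array) against count($arr) instead of comparing $i, so the loop body never runs.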

I tested the code with just the script tag at first and it worked, but the problematic script is still being inserted, ostensibly from one of the iframes or the noscript tag, so I guess those elements need to be picked up as nodes and removed as well.
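
Something like this is what I’m leaning towards now, draining each live NodeList from the end so removals don’t shift the list out from under me (just a rough sketch, and stripTags is only a placeholder name):

function stripTags(string $html): string {
    $doc = new DOMDocument();

    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Remove every <script>, <iframe> and <noscript> element.
    foreach (['script', 'iframe', 'noscript'] as $tag) {
        $nodes = $doc->getElementsByTagName($tag);

        // getElementsByTagName() returns a *live* list, so keep
        // removing the last item until nothing is left.
        while ($nodes->length > 0) {
            $node = $nodes->item($nodes->length - 1);
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}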

Please do as you deem best, but if it were me, I wouldn’t spend much time writing code to remove mark-up. I would spend more time poring through my theme and plugin files trying to figure out where the mark-up originated.

Then I would comment out or otherwise remove that code.

I’m building a crawler, and this particular page has frame busters in it. The buster isn’t present in the page markup itself, so I need to get rid of the frames that dynamically insert it. I don’t have the luxury of sniffing around or commenting things out; I can’t even pin down the buster when the page is fetched, since it doesn’t exist at that point. I can bust the frame buster, but then I can’t post the crawled data to my server, because the crawler ends up stranded on the current page. Unless I use Ajax, in which case the server receives the crawled data for the current page but can’t progress to the other links. I wonder if all of that makes any sense.

In that case I would contact the page author and ask what API (e.g. RSS, JSON) I needed to use to be able to use their page content in my site.

If you are attempting to scrape their content and they have gone to the trouble of making that difficult to do, the chances are they did so because they don’t want their pages scraped. Hence, it’s best to ask them.
