Date: 2023-04-23

@tracknut

It’s nearly sunrise here and I got sleep in my eyes.

I tell you what, let me open up to you and then see what happens.

You see, I am trying to build a web crawler for my search engine project.

I found 2 crawler codes.

First this simple one:

<?php
//1.
//General Page Crawler. Not Xml Sitemap Crawler.
//---
include_once('simplehtmldom_1_9_1/simple_html_dom.php');
//---

//FAILS
//$url = "https://www.rocktherankings.com/post-sitemap.xml";
//$url = "https://bytenota.com/sitemap.xml";
//$url = "https://www.rocktherankings.com/sitemap_index.xml";

//WORKS
$url = "https://www.rocktherankings.com/footer-links-seo/";
//WORKS
//$url = ""; //Commented out so the empty value does not overwrite the working url above.

$html = new simple_html_dom();
$html->load_file($url);
//--
foreach($html->find("a") as $link)
{
    echo $link->href."<br>";
}
?>

Note the //FAILS & //WORKS comments: they list the urls the crawler failed to extract links from and the ones it passed on.

I do not have experience with crawlers, and so I only learnt later on that this particular crawler can only extract links from hrefs of a tags, not from Xml files (SiteMaps).

Hence, I noted it down as:

//General Page Crawler. Not Xml Sitemap Crawler.

I then thought it best to use a crawler that can extract links from SiteMaps, as most websites, or at least the professional ones, use SiteMaps (Xml files) to feed crawlers.

So, it has to succeed in extracting links from Xml files. That is how I came across this 2nd one:


<?php
//2.
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will just echo the found xml files and not extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html
include_once('simplehtmldom_1_9_1/simple_html_dom.php');

//WORKS.
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

//FAILS. Shows blank page.
$sitemap = "https://bytenota.com/sitemap.xml";

$html = new simple_html_dom();
$html->load_file($sitemap);

foreach($html->find("loc") as $link)
{
    echo $link->innertext."<br>";
}
?>

Notice that this crawler does not search for a tags.

I was hoping this one would be able to extract links from all Xml files (SiteMaps) but it does not.

Notice the comments on which links it passed and on which it failed.

And so, I went hunting for another SiteMap crawler. Found this one:

<?php
//3.
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

// sitemap url or sitemap file
//FAILS.
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
//WORKS
$sitemap = "https://bytenota.com/sitemap.xml"; //Uncommented so that $sitemap is defined.
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;

    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';
    echo '<br>---<br>';
}
?>

Again, I was hoping this one would be able to extract links from all Xml files (SiteMaps) but it does not.

Notice the comments on which links it passed and on which it failed.

This 3rd one failed to extract links from the SiteMap (Xml file) that the 2nd one passed on. But it managed to extract links from a SiteMap (Xml file) that the 2nd one failed on.
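As far as I can tell from the Sitemap Protocol page linked in the code comments, there are two kinds of SiteMap files: a urlset that lists pages in url/loc elements, and a sitemapindex that lists further SiteMaps in sitemap/loc elements. The 3rd code only loops over $xml->url, which would explain the blank page on sitemap_index.xml. Below is a minimal sketch of handling both kinds, assuming the content is already downloaded and is well-formed Xml. The function name extract_sitemap_locs() and the example.com urls are made up by me for illustration.

```php
<?php
// Sketch only: extract_sitemap_locs() is a made-up helper, not from the tutorials.
// It assumes $xmlString is well-formed Xml that was already downloaded.
function extract_sitemap_locs(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return ['type' => 'invalid', 'locs' => []];
    }

    // The root element name tells the two SiteMap kinds apart.
    $root = $xml->getName();
    if ($root === 'sitemapindex') {
        $entries = $xml->sitemap; // each <sitemap> points at another Xml file
    } elseif ($root === 'urlset') {
        $entries = $xml->url;     // each <url> points at a page
    } else {
        return ['type' => 'unknown', 'locs' => []];
    }

    $locs = [];
    foreach ($entries as $entry) {
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $locs[] = $loc;
        }
    }
    return ['type' => $root, 'locs' => $locs];
}

// Made-up sample data (example.com), just to show the two shapes.
$index = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>'
       . '</sitemapindex>';
$urlset = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>https://example.com/page-1/</loc></url>'
        . '</urlset>';

$fromIndex = extract_sitemap_locs($index);
$fromUrlset = extract_sitemap_locs($urlset);
echo $fromIndex['type'] . ': ' . implode(', ', $fromIndex['locs']) . "\n";
echo $fromUrlset['type'] . ': ' . implode(', ', $fromUrlset['locs']) . "\n";
```

The idea is that the caller checks 'type': for a sitemapindex the locs are more Xml files to visit, for a urlset they are the page links themselves.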

And so, I tried mixing the 2 codes up (2nd & 3rd code) so that I end up with a SiteMap crawler that manages to extract links from all Xml SiteMaps. But why stop there? Why not get it to extract links from non-SiteMap pages (non-Xml files) too?

So, best mix all 3 crawlers together.

Do test the above 2 SiteMap crawlers before testing my work below as the above 3 codes are from tutorial sites.

This was my attempt and I messed it up.

Tried UNSUCCESSFULLY combining all 3 crawlers into 1:

<?php
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html

$urls_from_xml_file_1 = array();
$urls_from_xml_file_2 = array();
$urls_to_extract_data_from = array();

//1.
// sitemap url or sitemap file
//WORKS.
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://bytenota.com/sitemap.xml";
//FAILS
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
$sitemap = "https://www.rocktherankings.com/footer-links-seo/";

// get sitemap content
$content = file_get_contents($sitemap);

//1
// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->sitemap as $urlElement)
{
    // get properties
    $urls_from_xml_file_1[] = $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;

    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';
    echo '<br>--';
}
echo __LINE__;
echo '<br>';
///1

//2
include_once('simplehtmldom_1_9_1/simple_html_dom.php');

//WORKS.
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
//FAILS. Shows blank page.
//$sitemap = "https://bytenota.com/sitemap.xml";
//$sitemap = "https://www.rocktherankings.com/footer-links-seo/";

$html = new simple_html_dom();
$html->load_file($sitemap);

foreach($html->find("loc") as $link)
{
    $urls_from_xml_file_2[] = $link->innertext;
    echo $link->innertext."<br>";
}
echo __LINE__;
echo '<br>';
///2

foreach($urls_from_xml_file_1 AS $item_1)
{
    echo $item_1;
    echo '<br>';
}
echo __LINE__;
echo '<br>';

foreach($urls_from_xml_file_2 AS $item_2)
{
    echo $item_2;
    echo '<br>';
}
echo __LINE__;
echo '<br>';

$urls_to_extract_data_from = array_merge($urls_from_xml_1,$urls_from_xml_2);

foreach($urls_to_extract_data_from AS $item_3)
{
    echo $item_3;
    echo '<br>';
}
echo __LINE__;
echo '<br>';

//3.
include_once('simplehtmldom_1_9_1/simple_html_dom.php');

//FAILS
//$url = "https://www.rocktherankings.com/post-sitemap.xml";
//$url = "https://bytenota.com/sitemap.xml";
//$url = "https://www.rocktherankings.com/sitemap_index.xml";
//WORKS
$url = "https://www.rocktherankings.com/footer-links-seo/";

$html = new simple_html_dom();
$html->load_file($url);
//--
foreach($html->find("a") as $link)
{
    $urls_to_extract_data_from[] = $link->href;
    echo $link->href."<br>";
}
///3
echo __LINE__;
echo '<br>';

foreach($urls_to_extract_data_from AS $item_4)
{
    echo $item_4;
    echo '<br>';
}
echo __LINE__;
echo '<br>';

foreach($urls_to_extract_data_from AS $item_4)
{
    $urls_to_extract_data_from[] = $link->href;

    // Assuming the above tags are at www.example.com
    $tags = get_meta_tags($item_4);

    // Notice how the keys are all lowercase now, and
    // how . was replaced by _ in the key.
    //echo $tags['author']; // name
    //echo $tags['keywords']; // php documentation
    echo $tags['description']; // a php manual
    //echo $tags['geo_position']; // 49.33;-86.59
}
?>

I get this error over and over again:

Warning: simplexml_load_string(): Entity: line 3: parser error : xmlParseEntityRef: no name in C:\wamp64\www\Work\buzz\Templates\crawler_Test.php on line 177

Call Stack

# Time Memory Function Location

1 0.0011 363240 {main}( ) …\crawler_Test.php:0

2 2.7371 533576 simplexml_load_string( $data = ‘



if(navigator.userAgent.match(/MSIE Internet Explorer/i) navigator.userAgent.match(/Trident\/7\…*?rv:11/i)){var href=document.location.href;if(!href.match(/[?&]nowprocket/)){if(href.indexOf(“?”)==-1){if(href.indexOf(“#”)==-1){document.location.href=href+“?nowprocket=1”}else{document.location.href=href.replace(“#”,“?nowprocket=1#”)}}else{if(href.indexOf(“#”)==-1){document.location.href=href+“&nowprocket=1”}else{document.location.href=href.’… ) …\crawler_Test.php:177
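Looking at that dump, the $data handed to simplexml_load_string() looks like the JavaScript and Html of the footer-links-seo page rather than Xml, so I suspect the Xml parser is choking on a normal web page. Here is a small guard I could put in front of the parser: a sketch only, where parse_xml_or_null() is a name I made up, and I am assuming that switching on libxml_use_internal_errors() is an acceptable way to collect the parser messages instead of printing warnings.

```php
<?php
// Sketch only: parse_xml_or_null() is a made-up helper name.
// Returns a SimpleXMLElement, or null when $content is not well-formed Xml.
// Parser messages are collected into $errors instead of being printed as warnings.
function parse_xml_or_null(string $content, array &$errors = []): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true); // keep libxml quiet
    libxml_clear_errors();

    $xml = ($content === '') ? false : simplexml_load_string($content);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the old setting

    return $xml === false ? null : $xml;
}

// Made-up inputs: a web page with a bare & in a link, and a tiny SiteMap.
$page = '<html><body><a href="/x?a=1&b=2">x</a></body></html>';
$sitemap = '<urlset><url><loc>https://example.com/</loc></url></urlset>';

$pageErrors = [];
$pageXml = parse_xml_or_null($page, $pageErrors);
$sitemapXml = parse_xml_or_null($sitemap);

echo $pageXml === null ? "page: not Xml\n" : "page: Xml\n";
echo $sitemapXml === null ? "sitemap: not Xml\n" : "sitemap: Xml\n";
```

With a check like this, the crawler could skip the SiteMap steps when null comes back and go straight to looking for a tags.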

If you do not mind, can you first see where I went wrong and point it out to me?

And then start from scratch yourself by taking the first 3 codes that I got from the tutorials and try to combine them, so that the built crawler manages to extract links both from a tags as well as from Xml files, regardless of whether these Xml files list page links or list further links to more Xml files. It should work. It should manage to extract links from Xml files no matter what the element is on the Xml file.

You do not want to build a crawler that fails to extract links from some Xml files, now do you?
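To make the goal concrete, here is a rough sketch of the shape I am after. It is not tested against the real sites. All the names (links_from_content(), crawl_links(), $fetch, $maxFetches) are made up by me, and I am assuming PHP's built-in DOMDocument can stand in for simple_html_dom when looking for a tags.

```php
<?php
// Rough sketch only; every name in here is hypothetical.

// Returns [type, links] where type is 'index', 'urlset' or 'html'.
function links_from_content(string $content): array
{
    $previous = libxml_use_internal_errors(true); // no parser warnings on screen
    $type = 'html';
    $links = [];

    $xml = ($content === '') ? false : simplexml_load_string($content);
    $root = ($xml === false) ? '' : $xml->getName();

    if ($root === 'sitemapindex' || $root === 'urlset') {
        // SiteMap: collect every <loc>, whichever of the two kinds it is.
        $type = ($root === 'sitemapindex') ? 'index' : 'urlset';
        $entries = ($type === 'index') ? $xml->sitemap : $xml->url;
        foreach ($entries as $entry) {
            $links[] = trim((string) $entry->loc);
        }
    } elseif ($content !== '') {
        // Not a SiteMap: treat it as a web page and collect a tag hrefs.
        $dom = new DOMDocument();
        $dom->loadHTML($content);
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            if ($href !== '') {
                $links[] = $href;
            }
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return [$type, $links];
}

// $fetch is any callable that takes a url and returns the content or false.
function crawl_links(string $startUrl, callable $fetch, int $maxFetches = 20): array
{
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $found = [];
    $fetches = 0;

    while ($queue !== [] && $fetches < $maxFetches) {
        $url = array_shift($queue);
        $content = $fetch($url);
        $fetches++;
        if (!is_string($content)) {
            continue; // download failed, move on
        }
        [$type, $links] = links_from_content($content);
        foreach ($links as $link) {
            if ($type === 'index') {
                // A SiteMap index lists more Xml files: visit those too.
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[] = $link;
                }
            } else {
                $found[] = $link;
            }
        }
    }
    return array_values(array_unique($found));
}
```

I imagine calling it like crawl_links($sitemap, 'file_get_contents') with any of the urls from the codes above. The $maxFetches limit is only there so a big SiteMap index cannot keep the script running forever.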