I’ve been trying things out with the ‘Simple HTML DOM Parser’. At this stage it’s not a full sitemap generator; I’m going one step at a time, so all it’s meant to do is print the crawl results on screen.
The first steps went well: I got it to crawl the homepage and list the URLs on screen.
Next I filtered out external URLs, and that worked.
Then I filtered out other things I did not want to list, like ‘page index’ anchors to IDs (those beginning with #), and that worked.
Next I made it distinguish between web pages and images and put them into separate arrays, and that worked.
I got all this working, but just on the one home page.
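For reference, the single-page version boiled down to something like this (a compressed sketch rather than my exact code, and the site address is made up):
<?php
include $_SERVER["DOCUMENT_ROOT"] . "/includes/simple_html_dom.php";

$imagetype = array('.jpg', '.jpeg', '.png', '.tif', '.tga', '.gif', '.bmp');
$urlarr = array();
$imgarr = array();

$html = file_get_html('http://www.example.com/index.php'); // made-up address

foreach ($html->find('a') as $element) {
    $href = $element->href;
    // Skip external URLs (absolute URLs validate, relative ones do not)
    if (filter_var($href, FILTER_VALIDATE_URL, FILTER_NULL_ON_FAILURE)) { continue; }
    // Skip id anchors and javascript links
    if (substr($href, 0, 1) == '#' || stristr($href, 'javascript')) { continue; }
    // Sort into images or pages by file extension
    $isimage = false;
    foreach ($imagetype as $ext) {
        if (stristr($href, $ext)) { $isimage = true; }
    }
    if ($isimage) { $imgarr[] = $href; } else { $urlarr[] = $href; }
}

echo "<h2>Relative URLs</h2>\n";
foreach (array_unique($urlarr) as $url) { echo "<p>" . $url . "</p>\n"; }
echo "<h2>Images Onsite</h2>\n";
foreach (array_unique($imgarr) as $img) { echo "<p>" . $img . "</p>\n"; }
?>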
The tricky bit is getting it to go through all the pages. This is the code I tried (which does not work):
<?php
include $_SERVER["DOCUMENT_ROOT"] . "/includes/simple_html_dom.php";

$imagetype = array('.jpg', '.jpeg', '.png', '.tif', '.tga', '.gif', '.bmp');

echo "<h1>My First Crawler</h1>\n";

// Start with the root
$root = 'http://www.example.com/'; // Set this to an appropriate website
$page = 'index.php';               // Set this to the page you want to start with
$tocrawl[] = $page;

while ( count($tocrawl) > 0 ) {
    $html = file_get_html($root.$page);
    $donecrawl[] = $page;

    // Find all links
    foreach ($html->find('a') as $element) {
        $urlval = filter_var($element->href, FILTER_VALIDATE_URL, FILTER_NULL_ON_FAILURE); // remove external URLs
        if ($urlval == false) {
            $hashtest = substr($element->href, 0, 1);      // remove id anchors
            $java = stristr($element->href, 'javascript'); // remove javascript links
            if (($hashtest != '#') && (!$java)) {
                $isimage = false;
                $i = 0;
                foreach ($imagetype as $key) {
                    $img = stristr($element->href, $imagetype[$i]);
                    if ($img) { $isimage = true; }
                    $i++;
                }
                if ($isimage == true) {
                    $imgarr[] = $element->href;
                } else {
                    $urlarr[] = $element->href;
                    $crawled = true;
                    $crawled = in_array($element->href, $donecrawl);
                    if ($crawled == false) { $tocrawl[] = $element->href; }
                }
            } // end not hash
        } // end if val
    } // end foreach

    $tocrawl = array_unique($tocrawl);
    $thiskey = array_search($page, $tocrawl);
    $tocrawl = array_slice($tocrawl, $thiskey, 1);
    $page = $tocrawl[0];

    $html->clear();
    unset($html);
} // end while count

$urlarr = array_unique($urlarr);
$imglarr = array_unique($imgarr);

echo "<h2>Relative URLs</h2>\n";
$u = 0;
foreach ($urlarr as $key) {
    echo "<p>".$urlarr[$u]."</p>\n";
    $u++;
}

$u = 0;
echo "<h2>Images Onsite</h2>\n";
foreach ($imgarr as $key) {
    echo "<p>".$imgarr[$u]."</p>\n";
    $u++;
}
?>
There is probably something glaringly obvious that’s wrong, but I don’t see it yet.
I was not sure of the best way to loop through the URLs. I did think of a foreach over the $tocrawl array, but since the array keeps being added to during the loop, I was not sure that would work.
Instead I went for a while loop that runs until the $tocrawl array is empty.
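In outline, the loop I was after works something like this (a stripped-down sketch of just the queue handling; the array_shift here is one way of taking the next URL off the front, not what my code above actually does):
$tocrawl = array('index.php');  // queue of pages still to visit
$donecrawl = array();           // pages already visited

while (count($tocrawl) > 0) {
    // Take the next page off the front of the queue (this also re-indexes the array)
    $page = array_shift($tocrawl);
    $donecrawl[] = $page;

    // ... parse $page and push any new, unvisited links onto $tocrawl ...
}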
This bit is meant to remove the current URL ($page) from the $tocrawl array:
$thiskey = array_search($page, $tocrawl);
$tocrawl = array_slice($tocrawl, $thiskey, 1);
I thought there must be a better way to remove something from an array than this, but I couldn’t find it.
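Writing this up, unset() plus array_values(), or array_splice(), look like they might be what I was after (untested in the crawler itself):
// Option 1: unset the element, then re-index the numeric keys
$thiskey = array_search($page, $tocrawl);
unset($tocrawl[$thiskey]);
$tocrawl = array_values($tocrawl);

// Option 2: array_splice removes the element in place and re-indexes for you
$thiskey = array_search($page, $tocrawl);
array_splice($tocrawl, $thiskey, 1);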
The bit that I think may be the problem is the part that tells it which URL to crawl next. I used:
$page = $tocrawl[0];
That is supposed to set $page to the first thing in the $tocrawl array. But do the array keys get re-numbered (from 0 up to the total count) when I use array_slice? Or did I just remove $tocrawl[0]?
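To show what I mean, a quick test like this should demonstrate what array_slice actually does, as far as I understand the manual:
$tocrawl = array('index.php', 'about.php', 'contact.php');
$page    = 'about.php';

$thiskey = array_search($page, $tocrawl);       // 1
$slice   = array_slice($tocrawl, $thiskey, 1);  // array_slice returns the piece it cuts out
var_dump($slice);                               // just 'about.php', keyed from 0 again
var_dump($tocrawl);                             // still holds all three entries
If that is right, then my $tocrawl = array_slice($tocrawl, $thiskey, 1); line is replacing the whole queue with just the current page rather than removing it from the queue, which would explain why the crawler never moves on. Is that the glaringly obvious bit?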