Creating a sitemap crawler

I know there are several online sitemap generators, but I’m trying to make a simple crawler to create a sitemap for my sites. This is the code I’ve got so far. It does what I want, but only for a single root URL so far. I realise I will need to make it recursive, and there will no doubt be other issues to contend with, but can anyone tell me whether or not I’m heading in the right direction? Thanks

<?php
$base = 'http://www.domain.org';
$urls = array();

$html = file_get_contents($base);
$dom  = new DOMDocument;
$dom->loadHTML($html);

foreach ( $dom->getElementsByTagName('a') as $node ) {
  $href = $node->getAttribute('href');
  // skip the self-link ('.') and anything containing '//', i.e. off-site URLs
  if ( $href !== '.' && strpos($href, '//') === false ) {
    $urls[$href] = '';   // keyed by href so duplicates are ignored
  }
}

echo $base, PHP_EOL;
foreach ( $urls as $key => $url ) {
  echo $base, '/', $key, PHP_EOL;
}
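
I imagine the recursive version would look something like this; just a rough sketch, assuming every internal link is written relative to the site root (the helper name is mine):

<?php
// Sketch only: assumes every internal href is a simple path relative to
// the site root (no query strings, fragments or '../' segments).
function crawl($base, $path = '', array &$visited = [])
{
    $url = rtrim($base, '/') . '/' . ltrim($path, '/');
    if (isset($visited[$url])) {
        return $visited;            // already seen this page
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        return $visited;            // unreachable page, skip it
    }

    $dom = new DOMDocument;
    @$dom->loadHTML($html);         // silence warnings from imperfect markup
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        if ($href !== '' && $href !== '.' && strpos($href, '//') === false) {
            crawl($base, $href, $visited);   // recurse into internal links
        }
    }
    return $visited;
}

foreach (array_keys(crawl('http://www.domain.org')) as $url) {
    echo $url, PHP_EOL;
}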

How are the pages generated? If they are individual pages ending in .html, I would try PHP’s glob(): for each element, if is_dir($element) is true, call glob() on it again; otherwise check whether it ends in .html and add it to an array.

If the pages are generated from a database then check the tables.

Edit:
The script should be in a function and call itself when is_dir($element) is true.
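
Something along these lines (just a rough sketch; the .html extension and the starting directory are assumptions):

<?php
// Sketch: collect .html pages with glob(), recursing into sub-directories
function list_pages($dir)
{
    $pages = [];
    foreach (glob($dir . '/*') ?: [] as $element) {   // glob() may return false on error
        if (is_dir($element)) {
            $pages = array_merge($pages, list_pages($element));   // recurse
        } elseif (substr($element, -5) === '.html') {
            $pages[] = $element;                                  // keep .html files
        }
    }
    return $pages;
}

print_r(list_pages('public_html'));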

Thanks John. They’re hand-crafted PHP pages. I only want to list pages on my own domain, which is why I exclude any href that contains ‘//’.

Yes, the next step is to turn (most of) what I have into a function, assuming I’m not too far off the mark with my approach.

I’m not sure where glob() fits in with what I have so far?

How closely do your URLs correspond to the filesystem? It might be more efficient to use an Iterator class than DOM parsing.

Thanks Mittineague. Most of the sites correspond pretty well, but one of the sites I’m testing with has the extensions removed. Most of the sites are fewer than 30 pages.

Er… where to begin with an iterator class?

The Iterator classes I’ve used work with the filesystem so the files would have the extension. As long as something like /widgets/foo.php corresponds to http://example.com/widgets/foo it should be doable.

My use is to auto-populate navigation lists (too busy to edit them manually), but I don’t bother with extensionless file names in paths; I don’t care if the URL is http://example.com/widgets/foo.php. But strrpos() and substr() should work OK for stripping the extension.
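
Roughly something like this (off the top of my head, just a sketch; the function name is only an example):

<?php
// Sketch: turn /widgets/foo.php into /widgets/foo using strrpos() and substr()
function strip_extension($path)
{
    $dot = strrpos($path, '.');
    return $dot === false ? $path : substr($path, 0, $dot);
}

echo strip_extension('/widgets/foo.php');   // /widgets/foo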

I’m not at my desktop now, but I’ll post some code ASAP.

[quote=“gandalf458, post:6, topic:290008”]
Er… where to begin with an iterator class?
[/quote]

Try this:


I needed to find a way to get the full path of all files in the directory and all subdirectories of a directory. 
Here's my solution: Recursive functions! 

<?php
function find_all_files($dir)
{
    $result = [];                          // make sure we always return an array
    foreach (scandir($dir) as $value) {
        if ($value === '.' || $value === '..') {
            continue;                      // skip the dot entries
        }
        if (is_file("$dir/$value")) {
            $result[] = "$dir/$value";     // plain file: record its full path
            continue;
        }
        // it's a directory, so recurse into it and merge the results
        foreach (find_all_files("$dir/$value") as $file) {
            $result[] = $file;
        }
    }
    return $result;
}

Nicked from: http://php.net/manual/en/function.scandir.php
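
Re-using find_all_files() above, it can be called like this (the directory is only a placeholder):

<?php
// List every file under the chosen directory and all of its sub-directories
print_r(find_all_files('/var/www/example'));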

Edit:
This may also be of interest if file differences are required.

Thanks John. I’ve used scandir() before, but never recursively.

I had not thought of creating a sitemap this way. It means a lot of directories and files need to be omitted.

Sorry for the delay. My xfinity modem dropped the connection (again!!) and I’ve spent a good part of the day watching the UpStream DownStream light blink.

Anyway, this is something I found (pruned version):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 'true');

$path = '.';

// Walk the directory tree and keep only files with the extensions below
$rDirectoryIterator = new RecursiveDirectoryIterator($path);
$rIteratorIterator  = new RecursiveIteratorIterator($rDirectoryIterator);
$iterator_filter_pattern = '/^.+\.(php|csv|gz)$/i';
$RegexIterator = new RegexIterator($rIteratorIterator, $iterator_filter_pattern, RecursiveRegexIterator::GET_MATCH);

// Anything whose path contains this string is skipped; set it to a real folder name to exclude one
$folder_exclude = "oneimpossiblylongstringthathasnochanceofbeingafoldername";

$found_array = [];
foreach ($RegexIterator as $path_file => $object) {
  if (file_exists($path_file) && (strpos($path_file, $folder_exclude) === false)) {
    $found_array[] = $path_file;
  }
}
$found_array = array_unique($found_array);

echo "<pre>";
print_r($found_array);
echo "</pre>";
?>

I think @Gandalf’s idea of creating a crawler is less performant than scanning the file system, but it is more universal. There can be 101 ways of mapping files to web pages, so the scanner would need to be specifically tailored to each web site. Sometimes there may be no mapping at all, with some pages fetched from a database and others generated dynamically in other ways.

A (semi-)external HTTP crawler will work for any kind of site. You can use the last-modified meta tag on pages, which the crawler would read and feed into the sitemap. A similar thing could be done with the priority value (I don’t know if there is a priority meta tag, but why not create your own?).

But I think the important thing is to fill the sitemap with information that a normal search crawler will not get, like priority, update frequency, etc. If your crawler or scanner just reads the same data that a search engine crawler does, you will end up with a sitemap that brings no substantial benefit.
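
To illustrate, the crawler could pull those hints out of custom meta tags and build each sitemap entry like this (a sketch only; the tag names “last-modified” and “sitemap-priority” are ones I made up, not any standard):

<?php
// Sketch: read custom meta tags from a fetched page and build a sitemap <url> entry
function sitemap_entry($url, $html)
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html);                             // tolerate imperfect markup

    $lastmod  = null;
    $priority = '0.5';                                  // fallback priority
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('name') === 'last-modified') {
            $lastmod = $meta->getAttribute('content');
        } elseif ($meta->getAttribute('name') === 'sitemap-priority') {
            $priority = $meta->getAttribute('content');
        }
    }

    $entry  = "  <url>\n    <loc>" . htmlspecialchars($url) . "</loc>\n";
    if ($lastmod !== null) {
        $entry .= "    <lastmod>$lastmod</lastmod>\n";
    }
    $entry .= "    <priority>$priority</priority>\n  </url>\n";
    return $entry;
}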

I wondered about scanning the file system, but that might include files that have no business being in the site map, old versions of things, work in progress, and so on. If it’s running on the live site then you probably wouldn’t (or shouldn’t) have those files in place, but I have plenty of stuff in my dev machine directories that I don’t want in the sitemap.

It would be simple to add any number of entries to a “disallowed” array, and only add the file and path to the sitemap if !in_array($file, $disallowed) is true.

http://php.net/manual/en/function.in-array.php
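
For example (the file names in $disallowed are placeholders, and find_all_files() is the function from earlier in the thread):

<?php
// Sketch: filter out disallowed files before they reach the sitemap
$disallowed = ['test.php', 'old-index.php', 'notes.txt'];

$sitemap = [];
foreach (find_all_files('.') as $file) {
    if (!in_array(basename($file), $disallowed)) {
        $sitemap[] = $file;
    }
}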

Thanks guys. It looks like I have plenty to play with there. I shall have to come back to this a bit later as I hear a client calling…
