Need Help with making my own crawler in PHP

First the background on why.
When my site was only 15 - 20 pages, I made a sitemap manually, but updating the lastmod was still a drag. I then started using ‘phpSitemapNG’ from ‘enarion.net’, which served me well and did what I needed it to do.
As I started to learn a bit of PHP, then SQL, and added more content to the site, it has expanded to nearly 300 pages. But it seems that phpSitemapNG can’t handle that for some unknown reason. It crawls the whole site and lists all the URLs in its crawl results, but when it writes the XML file it misses many of them off the end.
So I have been looking for alternatives, but nothing seems to fit my needs exactly.

Cut to the Chase
I thought of making my own sitemap generator, one that works the way I want it to work.
Listing the files in a directory, that’s easy, I can do that.
But since most pages are generated by PHP reading from SQL, the same page file with a different variable, e.g. ‘thispage.php?id=143’, I need the script to read the pages and identify the URLs that are on-site, i.e. crawl the site.
That’s the bit that has become the stumbling block for me, how do you make a crawler script?

I highly recommend you use Zend Search Lucene instead of building from scratch. You can check the documentation here: http://framework.zend.com/manual/1.12/en/zend.search.lucene.html

Thanks. I will look into that.
I did think of one or two ideas about how to do it, but have not had time to try them out yet.

The Zend Lucene library is useful for adding a search facility to a site, but as far as I know it can’t be used to generate a sitemap.

To build a crawler, I’d start with an HTML parsing library. I’ve used Simple HTML DOM Parser in the past and it’s worked well for me. There’s also a tutorial for it here: http://code.tutsplus.com/tutorials/html-parsing-and-screen-scraping-with-the-simple-html-dom-library–net-11856

Parse the home page of your site and loop over all the URLs. Any that start with your domain, or are relative URLs, add to a ‘to crawl’ array. Keep a second array of crawled URLs that you can check against as you crawl new pages, to avoid your crawler going round in circles.
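In rough outline, the loop could look something like this (just an untested sketch; the include path and domain are placeholders):

<?php
// Untested sketch of the crawl loop; 'simple_html_dom.php' and the domain are placeholders.
include 'simple_html_dom.php';

$domain  = 'http://www.example.com';
$tocrawl = [$domain . '/']; // URLs still to visit
$crawled = [];              // URLs already visited

while ($tocrawl) {
    $url       = array_shift($tocrawl);
    $crawled[] = $url;

    $html = file_get_html($url);
    if (!$html) {
        continue; // fetch or parse failed, skip this URL
    }

    foreach ($html->find('a') as $link) {
        $href = $link->href;
        if ($href === '' || $href[0] === '#' || strpos($href, 'mailto:') === 0) {
            continue; // skip anchors and mail links
        }
        if (strpos($href, 'http') === 0 && strpos($href, $domain) !== 0) {
            continue; // skip external links
        }
        if (strpos($href, 'http') !== 0) {
            // crude handling of relative links: treat them as relative to the site root
            $href = $domain . '/' . ltrim($href, '/');
        }
        if (!in_array($href, $crawled) && !in_array($href, $tocrawl)) {
            $tocrawl[] = $href;
        }
    }
    $html->clear();
}

print_r($crawled);
?>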

That looks promising, I will have a go at that. Thanks for the links.

Do you really need a crawler here? Can’t you just go through the database, create the URLs for all the articles in there and stick those in a sitemap.xml?

That was an option I was considering. But not all pages and sub-pages are created from the database, though on the other hand, there are not that many of those. I just wanted to explore the idea of a crawler, to find out whether it was something fairly simple or too complex for me to try.
At present the site has only 24 PHP files that are actual pages. Just two of those PHP pages show varying content, effectively different pages, according to variables in the URL. The vast majority of these come from an ‘id=#’ variable, which relates to the ID number of an entry in the SQL tables, so yes, these can be found by querying the database, and the 24 PHP pages can be found by searching the content of the root folder. One of the PHP pages has some additional variables that do not query the DB; they just tell the page which PHP include to use to change the content. There are not a lot of these, so they could be coded into the generator script. I just thought a crawler would fully automate this process, if it were relatively simple for me to do.
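If I did go the database route, I imagine it would be something along these lines (just a sketch to illustrate the idea; the connection details, table and column names are placeholders, not my actual schema):

<?php
// Sketch only: connection details, table and column names are placeholders.
$db = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$urls = [];

// The static PHP pages sitting in the root folder
foreach (glob($_SERVER['DOCUMENT_ROOT'] . '/*.php') as $file) {
    $urls[] = 'http://www.example.com/' . basename($file);
}

// The database-driven pages, e.g. thispage.php?id=#
foreach ($db->query('SELECT id FROM articles') as $row) {
    $urls[] = 'http://www.example.com/thispage.php?id=' . $row['id'];
}

// The few include-driven variants could simply be hard-coded
$urls[] = 'http://www.example.com/other_page.php?section=about';

print_r($urls);
?>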

I’ve been trying things out with the ‘Simple HTML DOM Parser’. At this stage it’s not a full sitemap generator; I’m going one step at a time, so all it’s meant to do is print the crawl results on screen.
The first steps went well, I got it to crawl the homepage and list the URLs on screen.
Next I filtered out external URLs, that worked.
Then I filtered out other stuff I did not want to list, like ‘page index’ anchors to IDs, those beginning with #, and that worked.
Next I made it distinguish between web pages and images, and put them into separate arrays, and that worked.
I got all this working, but just on the one home page.
The tricky bit is getting it to go through all the pages. This is the code I tried (which does not work)

<?php
include $_SERVER["DOCUMENT_ROOT"] . "/includes/simple_html_dom.php";
$imagetype = array('.jpg', '.jpeg', '.png', '.tif', '.tga', '.gif', '.bmp') ;
echo "<h1>My First Crawler</h1>\
" ;
// Start with the root
$root = 'http://www.example.com/' ; // Set this to an appropriate website
$page = 'index.php' ; // Set this to the page you want to start with
$tocrawl[] = $page ;
while ( count($tocrawl) > 0 ) {
    $html = file_get_html($root.$page);
    $donecrawl[] = $page ;
    // Find all links
    foreach($html->find('a') as $element) {
            $urlval = filter_var($element->href, FILTER_VALIDATE_URL, FILTER_NULL_ON_FAILURE) ; // remove external URLs
        if ($urlval == false){
            $hashtest = substr($element->href, 0, 1); // remove id anchors
            $java = stristr($element->href, 'javascript'); // remove javascript links
            if (($hashtest != '#') && (!$java)){
                $isimage = false ;
                $i =0 ;
                foreach($imagetype as $key) {
                    $img = stristr($element->href, $imagetype[$i]);
                    if ($img) { $isimage = true ;}
                    $i++;
                }
                if ( $isimage == true ) { $imgarr[] = $element->href ;}
                else {
                    $urlarr[] = $element->href ;
                    $crawled = true ;
                    $crawled = in_array($element->href, $donecrawl) ;
                    if ($crawled == false) { $tocrawl[] = $element->href ; }
                    }
            } // end not hash
        } // end if val
    } // end foreach
    $tocrawl = array_unique($tocrawl);
    $thiskey = array_search($page, $tocrawl);
    $tocrawl = array_slice($tocrawl, $thiskey, 1);
    $page = $tocrawl[0];
    $html->clear();
    unset($html);
} // end while count
$urlarr = array_unique($urlarr);
$imgarr = array_unique($imgarr);
echo "<h2>Relative URLs</h2>\
";
$u = 0;
foreach($urlarr as $key) {
    echo "<p>".$urlarr[$u]."</p>\
" ;
    $u++;
}
$u = 0;
echo "<h2>Images Onsite</h2>\
";
foreach($imgarr as $key) {
    echo "<p>".$imgarr[$u]."</p>\
" ;
    $u++;
}
?>

There is probably something glaringly obvious that’s wrong, but I don’t see it yet.
I was not sure of the best method to loop through the URLs. I did think of a foreach on the $tocrawl array, but since the array is being added to during the loop, I was not sure it would work.
Instead I went for a while loop, that goes until the $tocrawl array is empty.
This bit is meant to remove the current URL ($page) from the $tocrawl array

$thiskey = array_search($page, $tocrawl);
$tocrawl = array_slice($tocrawl, $thiskey, 1);

I thought there must be a better way to remove something from an array than this, but I couldn’t find it.
The bit that I think may be the problem is the bit that tells it which URL to crawl next, I used:-

$page = $tocrawl[0];

Which is supposed to set $page to the first thing in the $tocrawl array. But do the array keys get re-set (from 0 to the total number) when I use array_slice? Or did I just remove $tocrawl[0]?

Here is what I came up with:


<?php

function crawl($url) {
    curl_setopt_array($ch = curl_init(), array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    $res = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);

    if ($info['http_code'] !== 200) {
        throw new Exception('Invalid response received for '.$url);
    }

    return $res;
}

function listUrls($title, $urls) {
    echo "<h2>$title</h2>\
";
    if (count($urls) > 0) {
        foreach ($urls as $url) {
            echo "<p>$url</p>\
";
        }
    } else {
        echo "<p>~ none ~</p>\
";
    }
}

include "./includes/simple_html_dom.php";
$imageRegex = '~\\.(jpg|jpeg|png|tiff?|tga|gif|bmp|pdf)~i';
echo "<h1>My First Crawler</h1>\
" ; 
// Start with the root 
$urlarr = $imgarr = $errorUrls = [];
$scheme = 'http'; // http or https
$root = 'www.example.com' ; // Set this to an appropriate website 
$tocrawl[] = $scheme.'://'.$root.'/'; // Start with the homepage at /
while (count($tocrawl) > 0) {
    $crawl = array_pop($tocrawl); // gets a page from the array and removes it from the array
    $donecrawl[] = $crawl;
    try {
        $data = crawl($crawl);
    } catch (Exception $e) {
        // crawling went wrong, skip this url
        $errorUrls[] = $crawl;
        continue;
    }
    $html = str_get_html($data);
    // Find all links 
    foreach ($html->find('a') as $element) {
        $testUrl = strtolower($element->href); // work with lowercase URL so we don't have any hassle with case sensitivity
        if (strpos($testUrl, '//') === 0) {
            $testUrl = $scheme.'://'.$testUrl; // when a URL starts with // assume it uses the same scheme as our main site
        }
        $host = parse_url($testUrl, PHP_URL_HOST);
        if (($host !== null && strtolower($host) === strtolower($root)) || strpos($testUrl, '/') === 0) { // absolute || relative
            $hashtest = substr($element->href, 0, 1); // check for id anchors
            $isJavascript = strpos($testUrl, 'javascript:') === 0; // remove javascript links 
            $isEmail = strpos($testUrl, 'mailto:') === 0; // remove mailto links
            $isRelative = strpos($testUrl, 'http://') === false && strpos($testUrl, 'https://') === false;
            if ($hashtest !== '#' && !$isJavascript && !$isEmail) {
                $isimage = preg_match($imageRegex, $element->href);
                if ($isimage) {
                    $imgarr[] = $element->href;
                } else {
                    $url = $isRelative ? $scheme . '://' . $root . $element->href : $element->href;
                    if (!in_array($url, $donecrawl) && !in_array($url, $tocrawl)) {
                        $tocrawl[] = $url;
                        $urlarr[] = $url;
                    }
                }
            } // end not hash
        } // end if val
    } // end foreach
    $html->clear();
    unset($html);
} // end while count 

// No need for array_unique: the in_array check inside the loop
// already keeps the URLs unique

listUrls("Found URLs", $urlarr);
listUrls("Error URLs", $errorUrls);
listUrls("Images onsite", $imgarr);

Works fine for the few local domains I tested.
If you have any questions, just ask. 🙂

Thank you.
I will give it a try and see how it goes.

OK, I have tried the script out on a few sites and on some it works, but on others it doesn’t. It does run, but just finds ‘~ none ~’.
More importantly, my site is one of the ones it doesn’t work for.
I’ll be honest, I don’t understand everything in the script. As you may have gathered, I am a novice programmer without an in-depth knowledge of PHP, so I’m not certain why it doesn’t work on all sites.
But here is one observation on the types of sites it does work for and those it doesn’t.
Those that don’t work are fairly small, simple ‘hand built’ sites, where the pages are HTML/PHP files all in the root folder. I first tried it on my site, then a couple of small sites that belong to friends; I didn’t want to try it on anything big, or it might take too long. These all failed to find anything. The first site that did work was the site for my employer. I did not make it, but I have admin access and know it is made with WordPress, so it is structured differently from the ‘hand built’ sites.
The ones the script works on are like:-

www.example.com/this_folder/
www.example.com/that_folder/
www.example.com/other_folder/

I guess all those folders just contain an ‘index.php’ file.
Whereas on my site, all pages (that I want crawled) are in the root:-

www.example.com/index.php
www.example.com/this_page.php
www.example.com/that_page.php
www.example.com/other_page.php
www.example.com/other_page.php?variable=value
www.example.com/other_page.php?variable=othervalue
www.example.com/other_page.php?othervariable=differentvalue
etc...

So I need something that works for that kind of website.

I think I know what’s going on, but am on mobile now so no time to investigate properly. Will respond in full tomorrow.

OK, thanks for your time.
I have been trying a few things out myself. I’m doing another version that is sort of a hybrid between what I did and your script. There is slight progress: I have it partly working on a friend’s website, but not on mine. That website is a very basic, pure HTML site with just a handful of pages. I have no idea why it works on that one now, but still won’t on mine.

I’m making a bit of headway now. I had it working well on another site, but not mine, and I couldn’t work out why. The error log said:-

...PHP Fatal error:  Call to a member function find() on a non-object in etc...

I added an echo inside the while loop to see what it’s crawling and where it gets stuck:

echo "<p>Crawling <q>".$root.$crawl."</q></p>\
" ;

Turns out it was trying to crawl an mp3 file.
So I added another test so that only web pages (php, htm or html files) get added to the $tocrawl array.
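The extra test is roughly this (the variable names follow the script above; it just looks at the extension of the URL path before queueing it):

// Sketch of the extra test; $testUrl, $url, $donecrawl and $tocrawl are as in the script above
$path = (string) parse_url($testUrl, PHP_URL_PATH);
$ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
$isPage = in_array($ext, array('', 'php', 'htm', 'html')); // '' covers folder-style URLs ending in /
if ($isPage && !in_array($url, $donecrawl) && !in_array($url, $tocrawl)) {
    $tocrawl[] = $url;
    $urlarr[]  = $url;
}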

My thinking is it isn’t working on your site because you are using true relative links, like <a href="index.php?foo=bar"> and not <a href="/index.php?foo=bar">, which is what my script assumes.
You could try changing this line


if (($host !== null && strtolower($host) === strtolower($root)) || strpos($testUrl, '/') === 0) { // absolute || relative

to


if (($host !== null && strtolower($host) === strtolower($root)) || strpos($testUrl, 'http') !== 0) { // absolute || relative

and see if that helps anything.

So… here’s the thing. “Sitemaps are a way to tell Google about pages on your site [it] might not otherwise discover. … Creating and submitting a Sitemap helps make sure that Google knows about all the pages on your site, including URLs that may not be discoverable by Google’s normal crawling process.” (https://support.google.com/webmasters/answer/156184?hl=en)

If your sitemap is nothing more than the result of a crawling process, then it probably won’t be useful. You’re doing the same thing that Google’s crawler would already accomplish on its own.

Though, just for the fun of coding, if I were to make a crawler, I would absolutely use the Guzzle library. It gives you all the simplicity of file_get_… and all the power of Curl.

$client = new GuzzleHttp\Client();

$response = $client->get('https://api.github.com/user', [
    'auth' =>  ['user', 'pass']
]);

echo $response->getStatusCode();           // 200
echo $response->getHeader('content-type'); // 'application/json; charset=utf8'
echo $response->getBody();                 // {"type":"User"...'
var_export($response->json());             // Outputs the JSON decoded data

I’d also use the Goutte library. Goutte uses Guzzle under the hood, but it adds the ability to move from page to page like a browser would, and it lets you sift through the response using CSS selectors.

$client = new Goutte\Client();

// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);

// Get the latest post in this category and display the titles
$crawler->filter('h2.post > a')->each(function ($node) {
    print $node->text()."\n";
});

Oh, I had not seen these replies until I visited the thread just now. For some reason I’m not getting email notifications for replies.
But in the mean time I have been working on it further and have the crawler working on my site.
I added a bit to get the ‘lastmod’ time for each URL, which is a little smarter than just using ‘filemtime()’, because that just gets the time for the main page file, not other sources of content on a page, such as database data and content includes, which may be updated more frequently than the base code for the page. I’m updating the script in my footer that displays the modification date to take account of these things. The DOM Parser then reads from that to get the lastmod time.
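In outline, the footer calculation is something like this (a sketch; the function name, table and column names and file paths are just stand-ins for my real ones):

// Sketch of the 'smarter lastmod': take the newest of the page file, its content include,
// and the most recently updated database row. Names here are stand-ins, not my real schema.
function pageLastMod(PDO $db, $pageFile, $includeFile, $articleId) {
    $times = array(filemtime($pageFile), filemtime($includeFile));
    $stmt = $db->prepare('SELECT UNIX_TIMESTAMP(MAX(updated)) FROM articles WHERE id = ?');
    $stmt->execute(array($articleId));
    $dbTime = (int) $stmt->fetchColumn();
    if ($dbTime > 0) {
        $times[] = $dbTime;
    }
    return date('Y-m-d', max($times)); // W3C date format for <lastmod>
}
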
The next stage was to make the script record the results to an SQL table, this enables me to store additional settings for each URL.
I built the script into a web page interface that resides in a private admin section of the site. This displays the list of URLs from the database in a form/table and allows me, if I wish, to alter and save settings such as <changefreq> and <priority>, as well as uncheck to exclude any entry. Not unlike ‘phpSitemapNG’, which I was using; hopefully this will be better, being a bespoke system for my site, with the interface and functions I want.
The form has a number of submit buttons: Crawl, to run the crawler script and report any changes; Update, to save URL settings from the table/form; and Create, to write the sitemap XML file. I may add a Ping Google button too. And why not, a big fat ‘do the whole lot in one action’ button.
I’m just at the stage now where I’m writing the script to write the XML file (the easy bit), so it’s not far off finished. I may post something when it’s done, if anyone’s interested.
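In the meantime, the XML-writing part I have in mind is roughly this (a sketch using XMLWriter; the ‘sitemap_urls’ table and its columns are placeholders for what I’m actually storing):

<?php
// Sketch of the sitemap writer; 'sitemap_urls' and its columns are placeholder names.
$db = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');

$xml = new XMLWriter();
$xml->openURI($_SERVER['DOCUMENT_ROOT'] . '/sitemap.xml');
$xml->startDocument('1.0', 'UTF-8');
$xml->setIndent(true);
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$rows = $db->query('SELECT url, lastmod, changefreq, priority FROM sitemap_urls WHERE included = 1');
foreach ($rows as $row) {
    $xml->startElement('url');
    $xml->writeElement('loc', $row['url']);
    $xml->writeElement('lastmod', $row['lastmod']);
    $xml->writeElement('changefreq', $row['changefreq']);
    $xml->writeElement('priority', $row['priority']);
    $xml->endElement(); // </url>
}

$xml->endElement(); // </urlset>
$xml->endDocument();
$xml->flush();
?>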

If your sitemap is nothing more than the result of a crawling process, then it probably won’t be useful. You’re doing the same thing that Google’s crawler would already accomplish on its own.

Good point, I hear what you are saying. I guess it gets drummed into us how important it is to have a complete and up-to-date sitemap. But anyway, I have already spent enough time (and made enough progress) on this to see it through, and I’m sure it won’t do any harm.
I’m not sure how important <lastmod> tags are for the bots, whether they take note and act upon them, but they are something that can stagnate in a sitemap that is not refreshed. They may say a page has not been updated in months when that’s not true. Thinking about it, it’s probably best not to use them if you are not keeping them up to date. My point being that now I can keep them up to date with little effort. Also, the standard crawlers won’t have the smarter lastmod recognition I’m building into the site. For example, an ‘Events’ page, a file called ‘events.php’, could sit on a server for years untouched, with ‘filemtime()’ saying so. However, the content it is fed from a database could be buzzing with life. I can record the time an entry in the database was added or edited; the page can read that and pass it to the footer script, which compares it with the dates of the other things that make up the page content, e.g. the page file or an article include file, then writes out the most recent, both for display and for my crawler to read.
Maybe this stuff isn’t important, but I’m learning some new tricks doing it, so I’m happy.