Comparing Crawled Page Text

First and foremost, I am a PHP novice so, if there’s better or more efficient way of doing what I’m trying to do, please feel free to point it out :slight_smile:

I came across an old PHP script that was used to crawl a site and check the response code on the pages found. I have modified it to do a duplicate content check. It’s using the similar_text function to compare 1 page’s content (specified by the user) against the content of each page it finds.

It’s a little slow but, its working. The only problem that I’m having is that it stops after about the first 10 links and I can’t figure out why.

Any help is greatly appreciated.


<form action="<?php echo $_SERVER['PHP_SELF']; ?>" method="post">       
<div class="row"><label for="page1" class="small label"><strong>Page? </strong>: </label><input type="text" name="page1" id="page1" value="" size="40" /></div>         
<div class="row"><label for="url" class="small label"><strong>Please Enter URL </strong>: </label><input type="text" name="url" id="url" value="" size="40" /></div>
<div class="row"><label for="maxlinks" class="small label"><strong>Number of links to get </strong>: </label><input type="text" name="maxlinks" id="maxlinks" value="25" size="3"  maxlength="3" /></div>
<div class="row"><label for="linkdepth" class="small label"><strong>Links Maximum depth</strong> : </label> <select name="linkdepth" id="linkdepth" ><option value="1">1</option>
<option value="2" selected="selected">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
<option value="6">6</option>
</select></div> 
<input type="submit" name="submit" style="font-weight: bold" value="Check links" id="submit" />
</form>
<?php 
if (isset($_POST['submit'])){
    $page1 = ($_POST['page1']);
    $baseurl = ($_POST['url']);
    $pages = array();
    $i=($_POST['linkdepth']);
    $maxlinks = (integer)$_POST['maxlinks'];

$domain= extract_domain_name($baseurl); 
echo '<p class="small">Extracted domain name: <strong>'.$domain.'</strong>. ';
echo 'Maximum depth: <strong>'.$i.'</strong></p>';
function get_urls($page){
    global  $domain, $i;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HEADER, true);
    /* Spoof the User-Agent header value; just to be safe */
    curl_setopt($ch, CURLOPT_USERAGENT, 
      'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
    /* I set timeout values for the connection and download
    because I don't want my script to get stuck 
    downloading huge files or trying to connect to 
    a nonresponsive server. These are optional. */
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 100);
    curl_setopt($ch, CURLOPT_TIMEOUT, 100);
    /* This ensures 404 Not Found (and similar) will be 
    treated as errors */
    curl_setopt($ch, CURLOPT_FAILONERROR, 0);

    /* Download the page */
    $html = curl_exec($ch);
  /* in case of an error*/  
    if(curl_exec($ch) === false)
        {
        echo '<p class="small">Error. Please check URL: <strong style="color:#ae3100">' . curl_error($ch).'</p></strong>';
        }

    curl_close($ch);

    if(!$html)   return false;
    /* Extract the BASE tag (if present) for
      relative-to-absolute URL conversions later */
        if(preg_match('/<base[\\s]+href=\\s*[\\"\\']?([^\\'\\" >]+)[\\'\\" >]/i',$html, $matches)){

        $base_url=$matches[1];
        echo $base_url;
            } else {
                    $base_url=$page; //base url = strani4ka s kotoroy na4inaetsa novaja porverka
                    }
            $links=array();
            $html = str_replace("\
", ' ', $html);


            preg_match_all('/<a[\\s]+[^>]*href\\s*=\\s*[\\"\\']?([^\\'\\" >]+)[\\'\\" >]/i', $html, $m);
        /* this regexp is a combination of numerous 
            versions I saw online*/
                foreach($m[1] as $url) {
                $url=trim($url);
                /* get rid of PHPSESSID, #linkname, & and javascript: */
                $url=preg_replace(
                    array('/([\\?&]PHPSESSID=\\w+)$/i','/(#[^\\/]*)$/i', '/&/','/^(javascript:.*)/i'),
                    array('','','&',''),
                    $url);

                /* turn relative URLs into absolute URLs. 
                  relative2absolute() is defined further down 
                  below on this page. */

                  $url =  relative2absolute($base_url, $url);

                     // check if in the same (sub-)$domain
                if(preg_match("/^http[s]?:\\/\\/[^\\/]*".str_replace('.', '\\.', $domain)."/i", $url)) 
                {
                $depth= substr_count($url, "/")-2 ; 

                /* Counts slashes in URL
                Responsible for link depth
                */

        if ($depth <= $i){

            if(!in_array($url, $links, check))  $links[]=$url; 
                }  } 
        } 

     return $links; 

}  

// Functions to crawl the next page
function next_page(){
    global $pages;
$k=0;
    foreach( array_keys($pages) as $k=> $page){

        if($pages[$page] == NULL){
            $k++;

            echo "[$k] - ";
            return $page;
        }
    }
    return NULL;
}

function add_urls($page){ // ads new unique urls in to array and checks each url for Server Header Status
    global $pages, $maxlinks;

    $start = microtime();
    $urls = get_urls($page);
    $resptime = microtime() - $start; // with microtime it is possible to find out on which page the crowler stops responding.

    //Start checking for Server Header
    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    // Execute
    curl_exec($ch);
    $info = curl_getinfo($ch);

    print "$page";

// If the status code os 200, then print OK, else = NO
//       if($info['http_code']==200) {
$page1 = ($_POST['page1']);
$page1data = file_get_contents($page1);
$page2 = file_get_contents($page);

$i = similar_text($page1data, $page2, $p);
$p = round($p, 2);

        echo ' -  Match Percentage:' . $p . '%';
//      } else {
//               echo '<strong style="color:#ba3d00"> NO </strong>';} 

            /* echo substr(($resptime),0,5). " seconds"; */ // Activate ths to see how much time it takes to crawl
            echo '<br/>';

        curl_close($ch); // Close handle

    $pages[$page] = array ('resptime' => floor($resptime * 9000), 'url' => $page);

    foreach($urls as $url){
        if(!array_key_exists($url, $pages)  && !in_array($url, $pages) && count($pages)<$maxlinks){
            $pages[$url] = NULL;
        } 

    }

}

echo '[1] - '; // this is for the first input url, as it will be extracted from input
add_urls($baseurl);

while(($page= next_page())  != NULL ) //while there are urls available


{
add_urls($page);

}   

    echo '<p class="small">Amount of crawled links: <strong>'.count ($pages).'</strong></p>'; 
    if (count($pages)<$maxlinks) echo '<p class="small">Sorry, no more links to crawl!!</p>';// count all extracted Urls
}

?><?php 
function extract_domain_name($url){
    /* old domain extractor 
    if(preg_match('@^(?:http:\\/\\/)?([^\\/]+)@i', $url, $matches)) {
        return trim(strtolower($matches[1]));
    } else {
        return '';
    }*/
    preg_match("/^(http:\\/\\/)?([^\\/]+)/i", $url, $matches);
    $host = $matches[2];
    // get last two segments of host name
    preg_match("/[^\\.\\/]+\\.[^\\.\\/]+$/", $host, $matches);
    return $matches[0];

}

function relative2absolute($absolute, $relative) {
$p = parse_url($relative);
if($p["scheme"])return $relative;
extract(parse_url($absolute));
$path = dirname($path);
if($relative{0} == '/')
{
$newPath = array_filter(explode("/", $relative));
}
else
{
$aparts = array_filter(explode("/", $path));
$rparts = array_filter(explode("/", $relative));
$cparts = array_merge($aparts, $rparts);
$k = 0;
$newPath = array();
foreach($cparts as $i => $part)
{
if($part == '..')
{
$k = $k - 1;
$newPath[$k] = null;
}
else
{
$newPath[$k] = $cparts[$i];
$k = $k + 1;
}
}
$newPath = array_filter($newPath);
}
$path = implode("/", $newPath);
$url = "";
if($scheme)
{
$url = "$scheme://";
}
if($user)
{
$url .= "$user";
if($pass)
{
$url .= ":$pass";
}
$url .= "@";
}
if($host)
{
$url .= "$host/";
}
$url .= $path;
return $url;
} 

##################################################
?>

How long does it take before it hangs? About 30 seconds?

PHP uses a time limit, which is set to 30 seconds by default. If the script hasn’t finished by that time, it stops.

All is not lost though, as you can change the value of the time limit using [fphp]set_time_limit()[/fphp].

If that’s not it, let us know :slight_smile:

Add these lines to the beginning of your script:

<?php
ini_set("error_reporting", E_ALL);
ini_set("display_errors", "On");
?>

See if your script complains about something before dying.

I did actually try setting that after each opening PHP statement and it didnt help.

Yeah, its odd because it runs through the first 10 fine but, when I turn error reporting on, its throwing errors for each result :confused:

Here are the errors that seem to repeat for each result:

Notice: Use of undefined constant check - assumed ‘check’ in /home/content/path/crawler.php on line 103

Notice: Undefined index: scheme in /home/content/path/crawler.php on line 212

Notice: Undefined variable: path in /home/content/path/crawler.php on line 214

Notice: Undefined variable: user in /home/content/path/crawler.php on line 248

My brain is having trouble wrapping around this function, but let me stick my toe in the water here a bit…


if($p["scheme"])return $relative;

Now… ignoring for the moment the missed space in there… parse_url is supposed to fail on relative URL’s… so… what you’re saying is if i got some data, return $relative because it’s really an absolute?

        if(!in_array($url, $links, check))  $links[]=$url;  Spot the missing $

Notice: Undefined index: scheme in /home/content/path/crawler.php on line 212

Notice: Undefined variable: path in /home/content/path/crawler.php on line 214

Notice: Undefined variable: user in /home/content/path/crawler.php on line 248

This indicates that your extract() didnt actually extract anything with that name from the URL parse. Check your inputs, maybe print_r(parse_url($absolute)) and figure out where the script is failing.

I’ll be honest with you, you lost me right there :stuck_out_tongue:

What’s the function SUPPOSED to do?

That particular function? I have no idea. That was part of the original code. As I said, if there is a better way of doing what I’m trying to accomplish, please feel free to point it out.

In order of my thought pattern:

The phrase ‘dont do it’ rings a bell…

A DOMDocument crawler would do things more efficiently than preg_matching all over the place…

Why do you feel the need to go out and compare content of two webpages on the same site?

similar_text is an undefined function.

relative2absolute can be rendered into a very short function based on existing information.

Its for seo purposes, identifying duplicate content

You mean something companies like google spend millions of dollars on you want to duplicate in a 300 line PHP script you picked up on some guy’s server with no knowledge of how it works… okay…

Anyway… Relative to Absolute. Metacode.
function (input) {
If exists (Domain root in Input) { Already Absolute. Return }
Else {Return Domain Root + Subdirs + Input }
}

Absolute.

The sarcasm isn’t needed or appreciated. I am in no way attempting to build something as complex as Google or any algorithm they are using. I know how to compare 2 pages with similar_text. I’m simply trying to figure out how to compare 1 page with the rest of the site, using similar_text.

I have no idea what to do with that.

Well, since you know how to compare text…

Compare the Potential-URL to the domain. The script already has the domain information.
If the Potential URL contains the domain, then it’s already absolute.
If it doesnt, (the url looks something like ‘quack.php’ or ‘moo/quack.php’), then stick the domain on the front of it, along with any subdirectories you’re currently in, and you’ve got a absolute address.

Okay . . .

Okay…

At this point you can give what i suggested a stab (You coded a SEO comparison function in PHP, so you’re not a novice. Try.), or wait for someone to try and fix your existing function.

I’m not going to just hand you a codeblock.

I’m not asking for you to hand me a codeblock. The script works up until the 10th result and I was just asking for a little guidance.

Regardless of what you might think, I am a PHP novice. I just barely pieced together these 2 scripts and the first 10-15 combos I had, didn’t work. Coupled with the fact that I have no idea what to do with the “advice” you gave on absolute and relative URL’s.

If you don’t feel like actually helping ( and I would say that’s blindingly obvious at this point ), feel free to skip this thread and move on.