Incy Wincy Spider

Hi all,

I am writing a spider that parses all the links on a page.

The spider gets a link with a relative path.

How can I create a new URL to parse using the original URL and the link?

I need to count how many levels I need to go up and then concatenate my link on to that.

URL
domain.com/folderA/folderB/folderC/page1.html

LINK on page1.html
folder1/folder2/folder3/page2.html

New URL to parse
domain.com/folder1/folder2/folder3/page2.html

Thanks

You could track where your spider is in the site.

I think I want to separate out the url by “/”

domain.com / folderA / folderB / folderC / page1.html

and then remove a layer from the URL for each layer in the link.

unless folder names at the same level match.
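A minimal sketch of that layer-counting idea, assuming PHP 7+ and a page URL that includes the scheme. The function name guess_url_by_layers is made up for illustration, and the "unless folder names match" exception is not handled. This is a heuristic that reproduces the examples in this thread; it is not how browsers resolve relative links.

```php
<?php
// Hypothetical helper: drop one trailing folder from the page URL for every
// folder in the link, then append the link to what is left.
function guess_url_by_layers(string $url, string $link): string
{
    $parts = parse_url($url);

    // Folders of the page URL, without the page name.
    $url_folders = explode('/', ltrim($parts['path'] ?? '/', '/'));
    array_pop($url_folders);

    // The link's segments; every segment except the last is a folder.
    $link_folders = explode('/', $link);
    $layers = count($link_folders) - 1;

    // Keep only the URL folders that remain after removing $layers of them.
    $keep = array_slice($url_folders, 0, max(0, count($url_folders) - $layers));

    $path = implode('/', array_merge($keep, $link_folders));
    return $parts['scheme'] . '://' . $parts['host'] . '/' . $path;
}

echo guess_url_by_layers(
    'http://www.domain.com/folderA/folderB/folderC/page1.html',
    'folder1/folder2/folder3/page2.html'
), "\n";
// prints http://www.domain.com/folder1/folder2/folder3/page2.html
```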

If you are doing some spidering then you should know about wget.

This page from the manual will give you some ideas of what can be done with the --convert-links option

http://www.delorie.com/gnu/docs/wget/wget_31.html

Or is there some overwhelming need to do this using PHP?

I am keen to do this using php for a number of reasons.

I want to be able to index dynamic files.

The files might not be on my server.

More than anything it’s an adventure, a learning curve and hopefully I’ll be able to apply it to other things.

I want to be able to index dynamic files.
The files might not be on my server.

So what?

Ok I know how I want to go about doing this.

The code below gives me the folder structure of both the URL and the link.

First I need to see if folderC = folder3

So I need to see if the last but 1 array items match and if they do I should

If not, I need to compare the last but 2, etc., ad nauseam.


<?php
$domain = 'www.domain.com';

$url = 'http://www.domain.com/folderA/folderB/folderC/page1.html';

$link = 'folder1/folder2/folder3/page2.html';

echo "<p>domain is $domain </p>";
echo "<p>url is $url </p>";
echo "<p>link is $link</p>";

//remove http:// and domain from url
$url_base = 'http://'.$domain.'/';

$url_base_removed =  str_replace ($url_base, '', $url);

echo "$url_base_removed <br><br>";

$url_folders = explode ("/",$url_base_removed);

$link_folders = explode ("/",$link);


print count($url_folders)." url folders <br>";
print count($link_folders)." link folders <br>";

//remove page name 


?>
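As an alternative to stripping a hard-coded domain with str_replace, PHP's built-in parse_url can pull the path out of any URL. A minimal sketch, assuming PHP 7+; the helper name url_folders is made up for illustration.

```php
<?php
// Hypothetical helper: return the folder names of a URL, without the page name.
function url_folders(string $url): array
{
    // e.g. "/folderA/folderB/folderC/page1.html"; null when the URL has no path
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return array();
    }
    $segments = explode('/', ltrim($path, '/'));
    array_pop($segments);   // remove the page name (or the empty string after a trailing slash)
    return $segments;
}

print_r(url_folders('http://www.domain.com/folderA/folderB/folderC/page1.html'));
```

Because only the leading slash is trimmed, a URL that ends in a folder (trailing slash) keeps all of its folders.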

How do you delete the first item in an array?


$url_folders_reverse = array_reverse($url_folders);
$link_folders_reverse = array_reverse($link_folders);

// a while loop cannot have an else branch, so this is written as if/else
if ($url_folders_reverse[0] == $link_folders_reverse[0]) {
	// folders match
}
else {
	// folders don't match: delete the first item from each array and try again
}

See unset

Ok Unset removes the item from the array, but it doesn’t move the index.

How do I increment the array counter?

So if the first items in the arrays don’t match


$i = 0;
$array_match = false;
while (!$array_match && isset($arrayA[$i], $arrayB[$i])) {
	if ($arrayA[$i] != $arrayB[$i]) {
		// items differ: move both arrays on to the next index
		$i++;
	}
	else {
		$array_match = true;
	}
}

Let’s take this a bit more slowly, to ensure that everyone is on the same page.

We start with two arrays

$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');

and we unset the first item of $arrayB
unset($arrayB[0]);

so that we have

$arrayA = array(0 => 'dog', 1 => 'cat', 2 => 'mouse');
$arrayB = array(1 => 'cat', 2 => 'mouse');

You want to check whether both arrays have the same contents.
Something like array_diff_assoc would do the job:

$array_match = empty(array_diff_assoc($arrayA, $arrayB));
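To make the index behaviour concrete, here is a small sketch, assuming PHP 7+. It contrasts unset (keys stay as they were) with array_values and array_shift (keys are renumbered from 0). The helper arrays_match is a made-up name; it runs array_diff_assoc in both directions, because a single call only reports entries of the first array that are missing from the second.

```php
<?php
$arrayB = array('dog', 'cat', 'mouse');

unset($arrayB[0]);                    // keys are now 1 and 2; there is no index 0
$reindexed = array_values($arrayB);   // a copy with keys 0 and 1

$arrayC = array('dog', 'cat', 'mouse');
$first = array_shift($arrayC);        // removes 'dog' and renumbers the rest from 0

// Hypothetical helper: true when both arrays hold the same key => value pairs.
function arrays_match(array $a, array $b): bool
{
    return !array_diff_assoc($a, $b) && !array_diff_assoc($b, $a);
}

var_dump(array_keys($arrayB));        // keys 1 and 2
var_dump(array_keys($arrayC));        // keys 0 and 1
```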

Paul,

I haven’t explained this very well.

I am trying to create a new file path from a full url and a relative path.

Starting URL http://www.domain.com/folder1/folder2/folder3/file1.html

The starting URL is the page I am spidering for links.

file1.html has a link on it to file2.html

and the path of the link could be relative or absolute, it could be expressed in a number of ways:

PATH 1 folderA/file2.html

PATH 2 folder2/folderA/file2.html

PATH 3 http://www.domain.com/folder1/folder2/folderA/file2.html

what I want to do is create a new full path from the PATH.

  1. Turn these addresses into arrays - DONE
  2. remove domain name off LINK1 (I know if the PATH contains
  3. then reverse the arrays - DONE
  4. remove the file names (first item in each array)
  5. Then compare each level (array item) until we get a match.

Starting URL array
$array1[0] = file1.html
$array1[1] = folder3
$array1[2] = folder2
$array1[3] = folder1

LINK 2 array
$array2[0] = file2.html
$array2[1] = folderA

LINK 3 array
$array3[0] = file2.html
$array3[1] = folderA
$array3[2] = folder2

How can I compare my array items

to create the new link

http://www.domain.com/folder1/folder2/folderA/file2.html

If it’s a relative path, you add the path on from the directory that contains the current file.
An initial forward slash means start from the root directory, and .. means go up to the parent directory.
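Those rules can be sketched roughly as follows, assuming PHP 7+ and a base URL that includes the scheme and host. The function name resolve_link is made up for illustration, and the sketch ignores query strings, fragments, ports and links that start with //.

```php
<?php
// Hypothetical helper: resolve $link against the page it was found on.
// Handles full URLs, root-relative links ("/x"), and "." and ".." segments.
function resolve_link(string $base, string $link): string
{
    if (is_string(parse_url($link, PHP_URL_SCHEME))) {
        return $link;                             // already a full URL
    }

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];

    if (substr($link, 0, 1) === '/') {
        $segments = array();                      // initial slash: start from the root
    } else {
        $segments = explode('/', ltrim($parts['path'] ?? '/', '/'));
        array_pop($segments);                     // drop the page name
    }

    foreach (explode('/', ltrim($link, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($segments);                 // go up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;               // descend into a folder, or add the file name
        }
    }

    return $root . '/' . implode('/', $segments);
}

$page = 'http://www.domain.com/folder1/folder2/folder3/file1.html';
echo resolve_link($page, 'folderA/file2.html'), "\n";
echo resolve_link($page, '../folderA/file2.html'), "\n";
echo resolve_link($page, '/folderA/file2.html'), "\n";
```

Under these rules folderA/file2.html resolves to a folder inside folder3. To reach folder1/folder2/folderA/file2.html from that page, the link would have to be written as ../folderA/file2.html.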

I suggest that you have a look at sphider, specifically the get_links function, url_purify function, and the parent_url_parts function from its spiderfuncs.php file