Incy Wincy Spider

tangledman · February 10, 2010, 10:51am

Hi all,

I am writing a spider that parses all the links on a page.

The spider gets a link with a relative path.

How can I create a new URL to parse using the original URL and the link?

I need to count how many levels I need to go up and then concatenate my link on to that.

URL
domain.com/folderA/folderB/folderC/page1.html

LINK on page1.html
folder1/folder2/folder3/page2.html

New URL to parse
domain.com/folder1/folder2/folder3/page2.html

Thanks

hash · February 10, 2010, 10:57am

You could track where your spider is in the site.

tangledman · February 10, 2010, 11:05am

I think I want to separate out the url by “/”

domain.com / folderA / folderB / folderC / page1.html

and then remove a layer in the URL for each layer in the link.

unless folder names at the same level match.

Cups · February 10, 2010, 11:17am

If you are doing some spidering then you should know about wget.

This page from the manual will give you some ideas of what can be done with the --convert-links option

http://www.delorie.com/gnu/docs/wget/wget_31.html

Or is there some overwhelming need to do this using PHP?

tangledman · February 10, 2010, 11:35am

I am keen to do this using PHP for a number of reasons.

I want to be able to index dynamic files.

The files might not be on my server.

More than anything it’s an adventure, a learning curve and hopefully I’ll be able to apply it to other things.

system · February 10, 2010, 12:14pm

I want to be able to index dynamic files.
The files might not be on my server.

So what?

tangledman · February 10, 2010, 12:33pm

Ok I know how I want to go about doing this.

The code below gives me the folder structure of both the URL and the link.

First I need to see if folderC = folder3

So I need to see if the last but 1 array items match and if they do I should

if not I need to compare the last but 2, etc., ad nauseam.


<?php
$domain = 'www.domain.com';

$url = 'http://www.domain.com/folderA/folderB/folderC/page1.html';

$link = 'folder1/folder2/folder3/page2.html';

echo "<p>domain is $domain </p>";
echo "<p>url is $url </p>";
echo "<p>link is $link</p>";

//remove http:// and domain from url
$url_base = 'http://'.$domain.'/';

$url_base_removed =  str_replace ($url_base, '', $url);

echo "$url_base_removed <br><br>";

$url_folders = explode ("/",$url_base_removed);

$link_folders = explode ("/",$link);


print count($url_folders)." url folders <br>";
print count($link_folders)." link folders <br>";

//remove page name 


?>
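
As an alternative to stripping the domain with str_replace, PHP's built-in parse_url can split the URL into its parts. This is only a minimal sketch using the example URL from the post above; the variable names are illustrative.

```php
<?php
$url = 'http://www.domain.com/folderA/folderB/folderC/page1.html';

// parse_url() returns an associative array with keys such as 'scheme', 'host' and 'path'
$parts = parse_url($url);

// Trim the leading slash so explode() does not produce an empty first element
$url_folders = explode('/', ltrim($parts['path'], '/'));

// Remove the page name (the last element), leaving only the folders
$page = array_pop($url_folders);

echo $parts['host'], "\n";              // www.domain.com
echo implode('/', $url_folders), "\n";  // folderA/folderB/folderC
echo $page, "\n";                       // page1.html
```

Because the host comes from the URL itself, this also works when the files are on someone else's server and the domain is not known in advance.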

tangledman · February 10, 2010, 1:02pm

How do you delete the first item in an array?


$url_folders_reverse = array_reverse($url_folders);
$link_folders_reverse = array_reverse($link_folders);

if ($url_folders_reverse[0] == $link_folders_reverse[0]) {
    // folders match
}
else {
    // folders don't match: delete the first item from each array and try again
}

DarthGuido · February 10, 2010, 1:12pm

See unset

tangledman · February 11, 2010, 11:09am

OK, unset removes the item from the array, but it doesn’t renumber the remaining indexes.

How do I increment the array counter?

So if the first items in the arrays don’t match


$array_match = "0";
while ($array_match == "0") {
	if ($arrayA[0] !=  $arrayB[0]) {
	//increment array indexes
	}
	else {

	$array_match="1";
	}
}
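
One way to avoid tracking an index at all: array_shift removes the first element and renumbers the rest from 0, whereas unset leaves a gap in the keys (array_values can renumber them afterwards). A short sketch with made-up folder names:

```php
<?php
$folders = array('folder3', 'folder2', 'folder1');

// unset() removes the element but keeps the remaining keys as they were
$a = $folders;
unset($a[0]);
print_r(array_keys($a));   // keys are now 1 and 2; $a[0] no longer exists

// array_values() renumbers the keys from 0 again
$a = array_values($a);

// array_shift() removes the first element, returns it, and renumbers in one step
$b = $folders;
$removed = array_shift($b);

echo $removed, "\n";   // folder3
echo $b[0], "\n";      // folder2
```

With array_shift, the "current" item is always at index 0, so a loop can keep comparing $arrayA[0] with $arrayB[0] and shift whenever they differ.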

Paul_Wilkins · February 11, 2010, 12:15pm

Let’s take this a bit more slowly, to ensure that everyone is on the same page.

We start with two arrays

$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');

and we unset the first item of $arrayB
unset($arrayB[0]);

so that we have

$arrayA = array(0 => 'dog', 1 => 'cat', 2 => 'mouse');
$arrayB = array(1 => 'cat', 2 => 'mouse');

You want to check if both arrays contain the same contents.
Something like array_diff_assoc would do the job:

$array_match = empty(array_diff_assoc($arrayA, $arrayB));
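
To make the effect concrete, here is a small sketch of what array_diff_assoc returns for the two arrays above. Note that it only reports entries of the first array that are missing from the second, so a full equality check would need the comparison in both directions.

```php
<?php
$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');
unset($arrayB[0]);

// array_diff_assoc() returns the entries of the first array whose
// key/value pair does not appear in the second array
$diff = array_diff_assoc($arrayA, $arrayB);
print_r($diff);   // Array ( [0] => dog )

// Storing the result in a variable first keeps empty() valid on PHP
// versions before 5.5, where empty() only accepts variables
$array_match = empty($diff);
var_dump($array_match);   // bool(false)
```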

tangledman · February 11, 2010, 3:11pm

Paul,

I haven’t explained this very well.

I am trying to create a new file path from a full url and a relative path.

Starting URL http://www.domain.com/folder1/folder2/folder3/file1.html

The starting URL is the page I am spidering for links.

file1.html has a link on it to file2.html

and the path of the link could be relative or absolute, it could be expressed in a number of ways:

PATH 1 folderA/file2.html

PATH 2 folder2/folderA/file2.html

PATH 3 http://www.domain.com/folder1/folder2/folderA/file2.html

What I want to do is create a new full path from the PATH.

Turn these addresses into arrays - DONE
remove domain name off LINK1 (I know if the PATH contains
then reverse the arrays - DONE
remove the file names (first item in each array)
Then compare each level (array item) until we get a match.

Starting URL array
$array1[0] = file1.html
$array1[1] = folder3
$array1[2] = folder2
$array1[3] = folder1

LINK 2 array
$array2[0] = file2.html
$array2[1] = folderA

LINK 3 array
$array3[0] = file2.html
$array3[1] = folderA
$array3[2] = folder2

How can I compare my array items

to create the new link

http://www.domain.com/folder1/folder2/folderA/file2.html

Paul_Wilkins · February 11, 2010, 6:47pm

If it’s a relative path, you add the path on from where the filename is.
An initial slash (/) means start from the root directory, and .. means go up to the parent directory.

I suggest that you have a look at sphider, specifically the get_links function, url_purify function, and the parent_url_parts function from its spiderfuncs.php file
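
The rules above (a plain relative path is appended to the current page's folder, an initial slash starts from the root, and .. goes up one folder) can be sketched roughly as follows. The function name resolve_relative_url is made up for this example and is not taken from sphider. It assumes the page URL is an absolute http(s) URL, and it ignores query strings, fragments, links of the form //host/path, and trailing slashes on the link.

```php
<?php
// Hypothetical helper (not from sphider): resolve $link against the page URL $base.
function resolve_relative_url($base, $link)
{
    // A link that already has a scheme (http://...) is absolute: use it unchanged
    if (is_string(parse_url($link, PHP_URL_SCHEME))) {
        return $link;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($link, 0, 1) === '/') {
        // Initial slash: start from the root directory
        $segments = array();
    } else {
        // Otherwise start from the folder of the current page (drop the file name)
        $segments = explode('/', ltrim($path, '/'));
        array_pop($segments);
    }

    foreach (explode('/', $link) as $segment) {
        if ($segment === '..') {
            array_pop($segments);       // go up one folder
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;     // go down into a folder, or add the file name
        }
    }

    return $root . '/' . implode('/', $segments);
}

$base = 'http://www.domain.com/folder1/folder2/folder3/file1.html';

echo resolve_relative_url($base, 'folderA/file2.html'), "\n";
// http://www.domain.com/folder1/folder2/folder3/folderA/file2.html
echo resolve_relative_url($base, '../folderA/file2.html'), "\n";
// http://www.domain.com/folder1/folder2/folderA/file2.html
echo resolve_relative_url($base, '/folderA/file2.html'), "\n";
// http://www.domain.com/folderA/file2.html
```

Under these rules no folder names need to be compared: a link written as folderA/file2.html on file1.html always points inside folder3, and only ../folderA/file2.html reaches folder2/folderA.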
