Hi all,
I am writing a spider that parses all the links on a page.
The spider gets a link with a relative path.
How can I create a new URL to parse using the original URL and the link?
I need to count how many levels I need to go up and then concatenate my link on to that.
URL
domain.com/folderA/folderB/folderC/page1.html
LINK on page1.html
folder1/folder2/folder3/page2.html
New URL to parse
domain.com/folder1/folder2/folder3/page2.html
Thanks
hash
February 10, 2010, 10:57am
2
You could track where your spider is in the site.
I think I want to separate out the url by “/”
domain.com / folderA / folderB / folderC / page1.html
and then remove a layer from the URL for each layer in the link,
unless folder names at the same level match.
Cups
February 10, 2010, 11:17am
4
If you are doing some spidering then you should know about wget.
This page from the manual will give you some ideas of what can be done with the --convert-links option
http://www.delorie.com/gnu/docs/wget/wget_31.html
Or is there some overwhelming need to do this using PHP?
I am keen to do this using php for a number of reasons.
I want to be able to index dynamic files.
The files might not be on my server.
More than anything it’s an adventure, a learning curve and hopefully I’ll be able to apply it to other things.
Ok I know how I want to go about doing this.
The code below gives me the folder structure of both the URL and the link.
First I need to see if folderC = folder3
So I need to see if the last but 1 array items match and if they do I should
If not, I need to compare the last but 2, etc., ad nauseam.
<?php
$domain = 'www.domain.com';
$url = 'http://www.domain.com/folderA/folderB/folderC/page1.html';
$link = 'folder1/folder2/folder3/page2.html';
echo "<p>domain is $domain </p>";
echo "<p>url is $url </p>";
echo "<p>link is $link</p>";
//remove http:// and domain from url
$url_base = 'http://'.$domain.'/';
$url_base_removed = str_replace ($url_base, '', $url);
echo "$url_base_removed <br><br>";
$url_folders = explode ("/",$url_base_removed);
$link_folders = explode ("/",$link);
print count($url_folders)." url folders <br>";
print count($link_folders)." link folders <br>";
//remove page name
?>
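As a possible alternative to stripping the domain with str_replace, PHP's built-in parse_url can pull the path out of a URL before it is split on "/". This is only a sketch: split_url_path is a hypothetical helper name, and it assumes the URL is absolute (scheme and host present).

```php
<?php
// Hypothetical helper: return the folders and file name of a URL's path.
// Assumes $url is an absolute URL such as http://www.domain.com/a/b/page.html
function split_url_path($url)
{
    // PHP_URL_PATH asks parse_url for the path component only.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return array(); // no path, or the URL could not be parsed
    }
    // Trim the outer slashes, split on "/", and drop empty pieces.
    $segments = explode('/', trim($path, '/'));
    return array_values(array_filter($segments, 'strlen'));
}

$url_folders = split_url_path('http://www.domain.com/folderA/folderB/folderC/page1.html');
$page_name   = array_pop($url_folders); // remove the page name from the end
```

After this runs, $url_folders holds folderA, folderB and folderC, and $page_name holds page1.html.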
How do you delete the first item in an array?
$url_folders_reverse = array_reverse($url_folders);
$link_folders_reverse = array_reverse($link_folders);
if ($url_folders_reverse[0] == $link_folders_reverse[0]) {
//folders match
}
else {
//folders don't match, delete the first item from each array and try again
}
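One way to delete the first item of an array is PHP's array_shift, which removes the item, returns it, and renumbers the remaining numeric keys from 0. A minimal sketch, using made-up sample folder names:

```php
<?php
// Sample data (assumed for illustration): folders already reversed.
$url_folders_reverse = array('page1.html', 'folderC', 'folderB', 'folderA');

// array_shift removes and returns the first item, then reindexes the rest.
$removed = array_shift($url_folders_reverse);

// $removed is now 'page1.html'
// $url_folders_reverse is now array(0 => 'folderC', 1 => 'folderB', 2 => 'folderA')
```

Because the keys are renumbered, $url_folders_reverse[0] always refers to the next item after each call.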
Ok Unset removes the item from the array, but it doesn’t move the index.
How do I increment the array counter?
So if the first items in the arrays don’t match
$array_match = "0";
while ($array_match == "0") {
if ($arrayA[0] != $arrayB[0]) {
//increment array indexes
}
else {
$array_match="1";
}
}
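Two possible ways around the unset problem, sketched with sample data: reindex the array with array_values after unsetting, or leave the arrays alone and walk them with an explicit counter. The variable names $arrayC, $match_index and $limit are assumptions for illustration.

```php
<?php
// Option 1: reindex after unset.
$arrayB = array('dog', 'cat', 'mouse');
unset($arrayB[0]);               // keys are now 1 and 2, there is no key 0
$arrayB = array_values($arrayB); // keys are 0 and 1 again

// Option 2: keep the arrays intact and increment a counter instead.
$arrayA = array('folder3', 'folder2', 'folder1');
$arrayC = array('folderA', 'folder2'); // hypothetical link folders
$match_index = -1;                     // -1 means no match was found
$limit = min(count($arrayA), count($arrayC));
for ($i = 0; $i < $limit; $i++) {
    if ($arrayA[$i] == $arrayC[$i]) {
        $match_index = $i; // first position where both arrays agree
        break;
    }
}
```

The loop stops at the shorter array's length, so it cannot read past the end of either array or run forever.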
Let’s take this a bit more slowly, to ensure that everyone is on the same page.
We start with two arrays
$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');
and we unset the first item of $arrayB
unset($arrayB[0]);
so that we have
$arrayA = array(0 => 'dog', 1 => 'cat', 2 => 'mouse');
$arrayB = array(1 => 'cat', 2 => 'mouse');
you want to check if both arrays contain the same contents.
Something like array_diff_assoc would do the job
$array_match = empty(array_diff_assoc($arrayA, $arrayB));
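To make the behaviour concrete, here is a small sketch of that check before and after the unset. It uses count() rather than empty(), because empty() only accepts a function result from PHP 5.5 onwards. The variable names $same_before, $same_after and $diff are assumptions for illustration.

```php
<?php
$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');

// Identical keys and values: nothing in $arrayA is missing from $arrayB.
$same_before = count(array_diff_assoc($arrayA, $arrayB)) === 0;

unset($arrayB[0]);

// array_diff_assoc compares key and value together, so key 0 is reported.
$diff = array_diff_assoc($arrayA, $arrayB); // array(0 => 'dog')
$same_after = count($diff) === 0;
```

Note that array_diff_assoc only lists entries of the first array that are missing from the second, so a full equality check needs the comparison in both directions, or simply $arrayA == $arrayB.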
Paul,
I haven’t explained this very well.
I am trying to create a new file path from a full url and a relative path.
Starting URL http://www.domain.com/folder1/folder2/folder3/file1.html
The starting URL is the page I am spidering for links.
file1.html has a link on it to file2.html
and the path of the link could be relative or absolute, it could be expressed in a number of ways:
PATH 1 folderA/file2.html
PATH 2 folder2/folderA/file2.html
PATH 3 http://www.domain.com/folder1/folder2/folderA/file2.html
what I want to do is create a new full path from the PATH.
Turn these addresses in to arrays - DONE
remove domain name off LINK1 (I know if the PATH contains
then reverse the arrays - DONE
remove the file names (first item in each array)
Then compare each level (array item) until we get a match.
Starting URL array
$array1[0] = file1.html
$array1[1] = folder3
$array1[2] = folder2
$array1[3] = folder1
LINK 2 array
$array2[0] = file2.html
$array2[1] = folderA
LINK 3 array
$array3[0] = file2.html
$array3[1] = folderA
$array3[2] = folder2
How can I compare my array items
to create the new link
http://www.domain.com/folder1/folder2/folderA/file2.html
If it’s a relative path, you add the path on from where the filename is.
An initial forward slash means start from the root directory, and .. means go up to the parent directory.
I suggest that you have a look at sphider, specifically the get_links function, url_purify function, and the parent_url_parts function from its spiderfuncs.php file
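The rules above can be sketched in a few lines of PHP. This is not sphider's code: resolve_url is a hypothetical helper, and it assumes the base is an absolute URL with a scheme and host. It ignores query strings, fragments and ports, and drops any trailing slash on the link.

```php
<?php
// Hypothetical helper: resolve $link against the page it was found on.
function resolve_url($base, $link)
{
    // A link with its own scheme is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($link, 0, 1) === '/') {
        // Initial forward slash: start from the root directory.
        $segments = array();
    } else {
        // Otherwise start from the directory that holds the current page.
        $dir      = substr($path, 0, strrpos($path, '/'));
        $segments = array_values(array_filter(explode('/', $dir), 'strlen'));
    }

    foreach (explode('/', $link) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;             // nothing to do for empty or "." parts
        }
        if ($segment === '..') {
            array_pop($segments); // go up to the parent directory
            continue;
        }
        $segments[] = $segment;   // go down into a folder, or add the file
    }

    return $root . '/' . implode('/', $segments);
}

$base = 'http://www.domain.com/folder1/folder2/folder3/file1.html';
$up   = resolve_url($base, '../folderA/file2.html');
```

Here $up is http://www.domain.com/folder1/folder2/folderA/file2.html. Under these rules a plain folderA/file2.html resolves inside folder3, so no comparison of folder names is involved.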