  1. #1
    SitePoint Evangelist tangledman's Avatar
    Join Date
    Sep 2005
    Location
    Puerto de Mazarron, Murcia, Spain
    Posts
    425
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Incy Wincy Spider -

    Hi all,

    I am writing a spider that parses all the links on a page.

    The spider gets a link with a relative path.

    How can I create a new URL to parse from the original URL and the link?

    I need to count how many levels to go up and then concatenate my link onto that.

    URL
    domain.com/folderA/folderB/folderC/page1.html

    LINK on page1.html
    folder1/folder2/folder3/page2.html


    New URL to parse
    domain.com/folder1/folder2/folder3/page2.html


    Thanks

  2. #2
    SitePoint Wizard
    Join Date
    Nov 2005
    Posts
    1,191
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You could track where your spider is in the site.

  3. #3
    SitePoint Evangelist tangledman's Avatar
    I think I want to split the URL on "/"

    domain.com / folderA / folderB / folderC / page1.html

    and then remove a layer from the URL for each layer in the link,

    unless folder names at the same level match.

  4. #4
    SitePoint Wizard Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Mentioned
    17 Post(s)
    Tagged
    1 Thread(s)
    If you are doing some spidering then you should know about wget.

    This page from the manual will give you some ideas of what can be done with the --convert-links option

    http://www.delorie.com/gnu/docs/wget/wget_31.html

    Or is there some overwhelming need to do this using PHP?

  5. #5
    SitePoint Evangelist tangledman's Avatar
    I am keen to do this using PHP for a number of reasons.

    I want to be able to index dynamic files.

    The files might not be on my server.

    More than anything it's an adventure, a learning curve, and hopefully I'll be able to apply it to other things.

  6. #6
    Non-Member
    Join Date
    Oct 2009
    Posts
    1,852
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by tangledman
    I want to be able to index dynamic files.
    The files might not be on my server.
    So what?

  7. #7
    SitePoint Evangelist tangledman's Avatar
    OK, I know how I want to go about doing this.

    The code below gives me the folder structure of both the URL and the link.

    First I need to see if folderC = folder3

    So I need to see if the last-but-one array items match, and if they do I should

    if not, I need to compare the last-but-two, etc... ad nauseam.

    PHP Code:
    <?php
    $domain = 'www.domain.com';

    $url = 'http://www.domain.com/folderA/folderB/folderC/page1.html';

    $link = 'folder1/folder2/folder3/page2.html';

    echo "<p>domain is $domain </p>";
    echo "<p>url is $url </p>";
    echo "<p>link is $link</p>";

    // remove http:// and domain from url
    $url_base = 'http://' . $domain . '/';

    $url_base_removed = str_replace($url_base, '', $url);

    echo "$url_base_removed <br><br>";

    // split both paths into their folder (and file) parts
    $url_folders = explode("/", $url_base_removed);

    $link_folders = explode("/", $link);

    print count($url_folders) . " url folders <br>";
    print count($link_folders) . " link folders <br>";

    // remove page name

    ?>

  8. #8
    SitePoint Evangelist tangledman's Avatar
    How do you delete the first item in an array?


    PHP Code:
    $url_folders_reverse = array_reverse($url_folders);
    $link_folders_reverse = array_reverse($link_folders);

    if ($url_folders[0] == $link_folders[0]) {
        // folders match
    }
    else {
        // folders don't match: delete the first item from each array and try again
    }


  9. #9
    From Italy with love
    guido2004's Avatar
    Join Date
    Sep 2004
    Posts
    9,500
    Mentioned
    163 Post(s)
    Tagged
    4 Thread(s)
    Quote Originally Posted by tangledman View Post
    How do you delete the first item in an array?
    See unset
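
As a rough illustration of that suggestion: unset() removes the element but leaves the remaining keys as they were, while array_shift() (or array_values() after unset()) renumbers them from 0. The folder names below are made up for the example, and this is only a sketch of the difference, not a full answer to the spider problem.

```php
<?php
// Hypothetical folder list, reversed so the file name comes first.
$folders = array('page1.html', 'folderC', 'folderB', 'folderA');

// unset() removes the element but leaves the remaining keys untouched.
$viaUnset = $folders;
unset($viaUnset[0]);          // keys are now 1, 2, 3; index 0 no longer exists

// array_values() renumbers the keys from 0 again.
$reindexed = array_values($viaUnset);

// array_shift() removes the first element and renumbers in one step.
$viaShift = $folders;
$removed = array_shift($viaShift);

print_r($viaUnset);
print_r($reindexed);
print_r($viaShift);
```

If the code that follows expects $array[0] to be the first remaining item, array_shift() is probably the more convenient of the two.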

  10. #10
    SitePoint Evangelist tangledman's Avatar
    OK, unset removes the item from the array, but it doesn't renumber the indexes.

    How do I increment the array counter?

    So if the first items in the arrays don't match:

    PHP Code:
    $array_match = "0";
    while ($array_match == "0") {
        if ($arrayA[0] != $arrayB[0]) {
            // increment array indexes
        }
        else {
            $array_match = "1";
        }
    }
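
One way to "increment the array counter" is to compare by a counter variable instead of always looking at index 0. The sketch below assumes both arrays are numbered from 0 with no gaps (so not straight after an unset()), and the folder names and the $matchIndex variable are invented for the example.

```php
<?php
// Hypothetical reversed folder lists, file names already removed.
$arrayA = array('folder3', 'folder2', 'folder1');
$arrayB = array('folderA', 'folder2');

$i = 0;
$limit = min(count($arrayA), count($arrayB)); // never read past the shorter array
$matchIndex = -1;                             // -1 means no matching position found

while ($i < $limit) {
    if ($arrayA[$i] == $arrayB[$i]) {
        $matchIndex = $i;                     // items at this position match
        break;
    }
    $i++;                                     // move on to the next index
}

echo "first match at index: $matchIndex\n";
```

The $limit check also stops the loop when nothing matches, which a flag-only loop would not do.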


  11. #11
    Unobtrusively zen
    paul_wilkins's Avatar
    Join Date
    Jan 2007
    Location
    Christchurch, New Zealand
    Posts
    14,705
    Mentioned
    102 Post(s)
    Tagged
    4 Thread(s)
    Let's take this a bit more slowly, to ensure that everyone is on the same page.

    We start with two arrays

    $arrayA = array('dog', 'cat', 'mouse');
    $arrayB = array('dog', 'cat', 'mouse');

    and we unset the first item of $arrayB
    unset($arrayB[0]);

    so that we have

    $arrayA = array(0 => 'dog', 1 => 'cat', 2 => 'mouse');
    $arrayB = array(1 => 'cat', 2 => 'mouse');

    You want to check whether both arrays have the same contents.
    Something like array_diff_assoc would do the job:

    $array_match = empty(array_diff_assoc($arrayA, $arrayB));
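
Spelled out for the arrays above, a sketch of that check might look like this. Two assumptions on my part: calling empty() directly on a function result needs PHP 5.5 or later, so count() is used instead, and because array_diff_assoc() only reports entries of its first argument that are missing from the second, the comparison is done in both directions.

```php
<?php
$arrayA = array('dog', 'cat', 'mouse');
$arrayB = array('dog', 'cat', 'mouse');
unset($arrayB[0]);

// Entries of $arrayA whose key/value pair is not present in $arrayB.
$diff = array_diff_assoc($arrayA, $arrayB);
print_r($diff);

// Compare in both directions so extra entries on either side are noticed.
$array_match = count(array_diff_assoc($arrayA, $arrayB)) === 0
            && count(array_diff_assoc($arrayB, $arrayA)) === 0;

var_dump($array_match);
```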
    Programming Group Advisor
    Reference: JavaScript, Quirksmode Validate: HTML Validation, JSLint
    Car is to Carpet as Java is to JavaScript

  12. #12
    SitePoint Evangelist tangledman's Avatar
    Paul,

    I haven't explained this very well.

    I am trying to create a new file path from a full url and a relative path.


    Starting URL http://www. domain.com/folder1/folder2/folder3/file1.html

    The starting URL is the page I am spidering for links.

    file1.html has a link on it to file2.html

    and the path of the link could be relative or absolute, it could be expressed in a number of ways:

    PATH 1 folderA/file2.html

    PATH 2 folder2/folderA/file2.html

    PATH 3 http://www. domain.com/folder1/folder2/folderA/file2.html


    what I want to do is create a new full path from the PATH.

    1. Turn these addresses into arrays - DONE
    2. remove domain name off LINK1 (I know if the PATH contains
    3. then reverse the arrays - DONE
    4. remove the file names (first item in each array)
    5. Then compare each level (array item) until we get a match.
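
The first few steps in that list could be sketched with parse_url(), which separates the host from the path so the domain does not have to be stripped by hand. The split_path() helper is a made-up name for this example, and the sketch leaves the folders in their original order; array_reverse() could be applied afterwards if the reversed form is preferred.

```php
<?php
// Hypothetical helper: split a URL or relative link into folders and a file name.
function split_path($address)
{
    // parse_url() returns only the path part, for full URLs and relative links alike.
    $path = (string) parse_url($address, PHP_URL_PATH);
    $parts = explode('/', trim($path, '/'));
    $file = array_pop($parts);            // the last segment is the file name
    return array('folders' => $parts, 'file' => $file);
}

$start = split_path('http://www.domain.com/folder1/folder2/folder3/file1.html');
$link  = split_path('folder2/folderA/file2.html');

print_r($start);
print_r($link);
```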



    Starting URL array
    $array1[0] = file1.html
    $array1[1] = folder3
    $array1[2] = folder2
    $array1[3] = folder1

    LINK 2 array
    $array2[0] = file2.html
    $array2[1] = folderA

    LINK 3 array
    $array3[0] = file2.html
    $array3[1] = folderA
    $array3[2] = folder2

    How can I compare my array items

    to create the new link

    http://www. domain.com/folder1/folder2/folderA/file2.html
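
For what it's worth, the "find a shared folder" idea could be sketched as below. This is only a heuristic under my own assumptions: join_on_shared_folder() is an invented name, the folders are in their original (not reversed) order, and a browser would not resolve links this way; it would simply append a relative link to the directory of the current page.

```php
<?php
// Hypothetical helper implementing the "match a shared folder" idea from this post.
function join_on_shared_folder(array $baseFolders, array $linkFolders, $file)
{
    // Look for the link's first folder among the folders of the starting URL.
    $pos = count($linkFolders) > 0
        ? array_search($linkFolders[0], $baseFolders, true)
        : false;

    if ($pos !== false) {
        // Keep only the folders above the shared one; the link supplies the rest.
        $baseFolders = array_slice($baseFolders, 0, $pos);
    }

    return implode('/', array_merge($baseFolders, $linkFolders, array($file)));
}

$base  = array('folder1', 'folder2', 'folder3');
$path2 = join_on_shared_folder($base, array('folder2', 'folderA'), 'file2.html');
$path1 = join_on_shared_folder($base, array('folderA'), 'file2.html');

echo $path2, "\n";   // shared folder found
echo $path1, "\n";   // no shared folder, so the link is appended to the full base
```

With PATH 1 there is no folder in common, so nothing in the link says how far to go up, and the sketch falls back to appending it to the current directory.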

  13. #13
    Unobtrusively zen
    paul_wilkins's Avatar
    If it's a relative path, you append it to the directory that contains the current file.
    An initial slash (/) means start from the root directory, and .. means go up to the parent directory.
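
Those rules (append a relative link to the page's directory, a leading slash starts from the root, .. goes up one level) might be sketched roughly as follows. resolve_link() is a hypothetical name, and the sketch is deliberately simplified: it ignores query strings, fragments, ports, links starting with "//", and drops any trailing slash.

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it was found on.
function resolve_link($pageUrl, $link)
{
    // A link that already has a scheme (http://, https://, ...) is complete.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }

    $base = parse_url($pageUrl);
    $root = $base['scheme'] . '://' . $base['host'];
    $basePath = isset($base['path']) ? $base['path'] : '/';

    if (substr($link, 0, 1) === '/') {
        $path = $link;                    // leading slash: start from the root
    } else {
        // Relative: drop the page's file name and append the link.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $link;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);              // go up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $root . '/' . implode('/', $out);
}

$page = 'http://www.domain.com/folder1/folder2/folder3/file1.html';
echo resolve_link($page, 'folderA/file2.html'), "\n";
echo resolve_link($page, '../folderA/file2.html'), "\n";
echo resolve_link($page, '/folderA/file2.html'), "\n";
```

Under these rules, the link would have to be written as ../folderA/file2.html to end up in folder1/folder2/folderA.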

    I suggest that you have a look at Sphider, specifically the get_links, url_purify, and parent_url_parts functions in its spiderfuncs.php file.

