  1. #1
    mkoenig
    Join Date: Aug 2007
    Posts: 1,232

    PHP Script to Grab External Links Only

    Trying to build a small crawler.

    I want it only to grab external pages, and then truncate those to the main url.

    I will however settle for a script that just gets external urls on page.

    I've seen a few scripts that grab all links. I want external only however.

    Thanks A Lot

  2. #2
    Sillysoft
    Join Date: May 2002
    Location: United States
    Posts: 1,691
    You can try the Snoopy class; it grabs all the links off a page. You can then run a separate test to verify whether each one is external or not.

  3. #3
    affordablemagic
    Join Date: Jul 2004
    Location: Salem, OR
    Posts: 272
    mkoenig

    So the scenario is: your crawler visits a page at http://domain.com/index.php.

    That page (index.php) contains two links:

    Code:
    <a href="/folder/index.php">Link 1</a>
    and

    Code:
    <a href="http://www.php.net/manual/en/function.str-replace.php">Link 2</a>
    You could read the document for each link and check: a) whether the link contains a domain name (if not, it must be internal); b) if it does contain a domain name, whether that domain name is domain.com (if it isn't, the link must be external).
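    That check can be sketched with PHP's parse_url(): a relative link has no host component, and when a host is present you compare it against the page's own domain. This is only an illustration; is_external and the www-stripping are my own naming, not from the thread.

    ```php
    <?php
    // Hypothetical helper: decide whether $href is external relative to
    // the crawled site's hostname ($base_host).
    function is_external($href, $base_host)
    {
        // A relative link like "/folder/index.php" has no host component.
        $host = parse_url($href, PHP_URL_HOST);
        if ($host === null || $host === false) {
            return false; // no domain name, so internal
        }
        // Strip a leading "www." so www.domain.com still counts as domain.com.
        $host      = preg_replace('/^www\./i', '', strtolower($host));
        $base_host = preg_replace('/^www\./i', '', strtolower($base_host));
        return $host !== $base_host;
    }

    var_dump(is_external('/folder/index.php', 'domain.com'));        // bool(false)
    var_dump(is_external('http://www.php.net/manual/en/function.str-replace.php',
                         'domain.com'));                             // bool(true)
    ?>
    ```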

  4. #4
    mkoenig
    Quote Originally Posted by affordablemagic
    You could read the document for each link and check: a) whether the link contains a domain name (if not, it must be internal); b) if it does contain a domain name, whether that domain name is domain.com (if it isn't, the link must be external).
    I thought about that, but I was wondering how long or short to make the comparison; if I make it short, e.g. 4 characters, I'd run into problems with domains whose first 4 characters are the same.

    I'm going to try this "Snoopy class".

    Thanks Sillysoft

  5. #5
    mkoenig
    It looks like the Snoopy class can't be run from a shared host? It needs to be placed in the PHP folder, right?

  6. #6
    Sillysoft
    Quote Originally Posted by mkoenig
    It looks like the Snoopy class can't be run from a shared host? It needs to be placed in the PHP folder, right?
    It uses core functions of PHP, I believe, so it should work on a shared host. Are you getting any errors? Can you show the code you're trying to use it with?

  7. #7
    mkoenig
    Quote Originally Posted by Sillysoft
    It uses core functions of PHP, I believe, so it should work on a shared host. Are you getting any errors? Can you show the code you're trying to use it with?
    Yeah, thanks!

    "Snoopy: error while fetching document: Invalid protocol ""\n"

    I was getting some other include error because it wouldn't read snoopy.inc.php or whatever, so I renamed it snoopy.php.

    I've tried it on a Windows dedicated server and an Apache shared host; both get the same error.

    The code is...

    PHP Code:
    <?php
    /* You need the Snoopy.class.php from http://snoopy.sourceforge.net/ */

    include("snoopy.php");

    $snoopy = new Snoopy;

    $url = $_POST['url'];

    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";

    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.jonasjohn.de/";

    // set some cookies:
    $snoopy->cookies["SessionID"] = '238472834723489';
    $snoopy->cookies["favoriteColor"] = "blue";

    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";

    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    // set username and password (optional)
    //$snoopy->user = "joe";
    //$snoopy->pass = "bloe";

    // fetch the text of the website:
    if($snoopy->fetch($url)){
        // other methods: fetch, fetchlinks, fetchform, submittext and submitlinks
        // response code:
        //print "response code: ".$snoopy->response_code."<br/>\n";

        // print the headers:
        //print "<b>Headers:</b><br/>";
        //while(list($key, $val) = each($snoopy->headers)){
        //    print $key.": ".$val."<br/>\n";
        //}
        //print "<br/>\n";

        // print the text of the website:
        //print "<pre>".htmlspecialchars($snoopy->results)."</pre>\n";
        echo $snoopy->results;
        echo "<br><br>";
        echo "<a href='http://www.nombyte.com/~nomb/index.php'>Back to nombyte.com - spiders</a>";
    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
    ?>
    See anything?

  8. #8
    Sillysoft
    The problem is it can't figure out what URL you're passing, http or https. It uses a switch to determine what to do, but it's not finding http or https in your URL variable. Make sure that's a valid URL.
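    One way to guard against that (a sketch, assuming the URL comes straight from the form as in the code above) is to check for the scheme before handing the value to Snoopy:

    ```php
    <?php
    // Sketch: Snoopy switches on the url's scheme, so make sure the
    // submitted value actually starts with http:// or https://.
    $url = isset($_POST['url']) ? trim($_POST['url']) : '';
    if (!preg_match('#^https?://#i', $url)) {
        // Assumption: a bare "www.example.com" was meant as plain http.
        $url = 'http://' . $url;
    }
    ?>
    ```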

  9. #9
    mkoenig
    That was it. I put in a URL and it worked fine. It actually displays the page within that page. Crazy.

    Do you know how to filter so only external links are returned?

    Thanks

  10. #10
    Sillysoft
    Instead of $snoopy->fetch($url), use $snoopy->fetchlinks($url).

    That should store the links in a variable or array, I can't remember which. From there you can apply a function to verify whether each link is external or not.
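    As a sketch of that second step (keep_external is a made-up helper name, and 'domain.com' stands in for your own site's hostname), the results array from fetchlinks() can be filtered down to external links:

    ```php
    <?php
    include "Snoopy.class.php";

    // Hypothetical filter: keep only links whose hostname differs from ours.
    function keep_external($link)
    {
        $host = parse_url($link, PHP_URL_HOST);
        return $host !== null && $host !== false
            && strtolower($host) !== 'domain.com';
    }

    $url    = 'http://domain.com/';          // page to crawl (example value)
    $snoopy = new Snoopy;
    $snoopy->fetchlinks($url);

    $all_links = $snoopy->results;           // fetchlinks() fills this array
    $external  = array_filter($all_links, 'keep_external');
    ?>
    ```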

  11. #11
    mkoenig
    Snoopy fetchlinks only returns a page that has the text "Array" and nothing else?

    I'm close... I can feel it. lol

  12. #12
    mkoenig
    Here is some code that gets all links.

    PHP Code:
    <?php
    $url = 'http://www.google.com';
    $var = fread_url($url);

    preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
                   "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
                   $var, $matches);

    $matches = $matches[1];
    $list = array();

    foreach($matches as $var)
    {
        print($var."<br>");
    }

    // The fread_url function allows you to get a complete
    // page. If cURL is not installed, replace the contents with
    // a fopen / fgets loop.

    function fread_url($url, $ref="")
    {
        if(function_exists("curl_init")){
            $ch = curl_init();
            $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
                          "Windows NT 5.0)";
            curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
            curl_setopt($ch, CURLOPT_HTTPGET, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_REFERER, $ref);
            curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
            $html = curl_exec($ch);
            curl_close($ch);
        }
        else{
            $html = "";
            $hfile = fopen($url, "r");
            if($hfile){
                while(!feof($hfile)){
                    $html .= fgets($hfile, 1024);
                }
            }
        }
        return $html;
    }
    ?>

  13. #13
    Sillysoft
    Here is how you use Snoopy to grab links:

    PHP Code:
    //Include the class needed to run Snoopy
    //and then create an instance of it and assign
    //it to a variable to be used throughout the script
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    //We use the fetchlinks function provided by
    //the Snoopy class. This goes out and grabs
    //all links on the webpage the url points to
    $snoopy->fetchlinks($url);

    //Now we grab the results and assign them to a variable.
    //This variable is an array, so we count how many entries
    //are in it. If we don't, we will get an error on the foreach
    //loop later in this script
    $links = $snoopy->results;
    $link_count = count($links);

    //Ok, if there are entries in the array we go ahead and
    //loop through the array of links
    if($link_count > 0)
    {
        foreach($links as $curr_link)
        {
            //Here I just clean it up a little bit. You can clean it
            //up any way you want
            $curr_link = trim($curr_link);

            //Uncomment the line below to see all the links Snoopy
            //grabs from the url you provided above
            //echo $curr_link .'<br>';

            //From here you can then test to see if the link is external.
            //If it matches your criteria, escape the url and add
            //code to insert it into the database
        }

        //Once done you can delete the original url from the db
    }

    Database-wise, I suggest making the url field a unique key and then using an INSERT IGNORE statement when inserting the data, so you don't have to test whether the url is already there or not.
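    As a sketch of that suggestion (the table and column names here are made up, and the mysql_* functions are simply what was current at the time):

    ```php
    <?php
    // One-time table setup, with the url column as a unique key:
    //   CREATE TABLE crawled_urls (
    //       id  INT AUTO_INCREMENT PRIMARY KEY,
    //       url VARCHAR(255) NOT NULL,
    //       UNIQUE KEY (url)
    //   );

    // INSERT IGNORE silently skips rows that would violate the unique key,
    // so there is no need to SELECT first to see if the url already exists.
    $url = mysql_real_escape_string($curr_link);
    mysql_query("INSERT IGNORE INTO crawled_urls (url) VALUES ('$url')");
    ?>
    ```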

    Silly

