  1. #1
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Spidering a website

    I have coded a link exchange script: somebody submits their website and adds our link, and their site is shown once we have received a hit from it.

    I want to extend it so that a cron-driven spider automatically checks X websites in our database to see whether our link is still on their site. If it is not, the script should set their link to inactive in our database and e-mail them.

    I know how to do all of this apart from checking whether our link is on their site. I have never done this before, so I was wondering how it is done and whether anybody can help.

    Thanks
    :-)

  2. #2
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    It sounds like you have the logic sorted and you just need a pointer or two.

    Use cURL to obtain the website homepage, parse it using DOMDocument & XPath to extract all the links, then loop through these links to see if your site is one of them.
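
    The steps above can be sketched roughly as follows. This is only a minimal illustration, not a finished spider: the function names `fetchHomepage`, `normalizeHost` and `pageLinksTo` are made up for this example, and it assumes that "our link" means any anchor whose host matches our domain (with or without a leading "www.").

```php
<?php
// Hypothetical helper: download a page with cURL, returning '' on failure.
function fetchHomepage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // do not hang the cron job
    $body = curl_exec($ch);
    curl_close($ch);
    return is_string($body) ? $body : '';
}

// Hypothetical helper: lower-case a host name and drop a leading "www.".
function normalizeHost(string $host): string
{
    $host = strtolower($host);
    return strpos($host, 'www.') === 0 ? substr($host, 4) : $host;
}

// Returns true if $html contains an <a href="..."> pointing at $ourHost.
function pageLinksTo(string $html, string $ourHost): bool
{
    if (trim($html) === '') {
        return false; // nothing to parse (e.g. the download failed)
    }

    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $host = parse_url($anchor->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative or malformed link, so it cannot be ours
        }
        if (normalizeHost($host) === normalizeHost($ourHost)) {
            return true;
        }
    }
    return false;
}
```

    Usage would be along the lines of `pageLinksTo(fetchHomepage($partnerUrl), 'example.com')`, where `$partnerUrl` comes from your database. Comparing parsed host names rather than raw strings avoids a false match when the domain merely appears in the page text.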

    Depending on whether or not you want to go into sub-pages, cycle through the links to find all those that contain the correct base address, then start the process again for each of those pages.

    This could quickly become quite an intensive and lengthy process. As such, you should probably implement some kind of queuing system for a script (launched via cron) to cycle through.
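
    One simple way to approximate such a queue, without any extra infrastructure, is to store a last-checked timestamp per site and have each cron run take only the least recently checked entries. The sketch below assumes each site is an array with hypothetical `id`, `url` and `last_checked` keys; in a real script the same ordering would more likely be done in the database query itself.

```php
<?php
// Hypothetical helper: pick the $batchSize sites that were checked longest ago.
function nextBatch(array $sites, int $batchSize): array
{
    // Oldest check first, so every site eventually gets its turn.
    usort($sites, function (array $a, array $b): int {
        return $a['last_checked'] <=> $b['last_checked'];
    });

    return array_slice($sites, 0, $batchSize);
}

// Example data standing in for rows loaded from the database.
$sites = [
    ['id' => 1, 'url' => 'http://a.example/', 'last_checked' => 300],
    ['id' => 2, 'url' => 'http://b.example/', 'last_checked' => 100],
    ['id' => 3, 'url' => 'http://c.example/', 'last_checked' => 200],
];

foreach (nextBatch($sites, 2) as $site) {
    // Here the cron script would fetch $site['url'], look for the link,
    // and then store the current time as the new last_checked value.
    echo "Would check site {$site['id']}\n";
}
```

    Keeping the batch small means each cron run finishes quickly, and a slow or unreachable site only delays its own batch.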
    @AnthonySterling: I'm a PHP developer, a consultant for oopnorth.com and the organiser of @phpne, a PHP User Group covering the North-East of England.

  3. #3
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK Thanks a lot

    Do you know of any tutorials? I get a better understanding when reading :P
    :-)

  4. #4
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Sorry, no. Although a quick search using the tools mentioned above as keywords should sort you out.

  5. #5
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    XPath = XML?

    Do I need XML experience for this?
    :-)

  6. #6
    Twitter: @AnthonySterling silver trophy AnthonySterling's Avatar
    Join Date
    Apr 2008
    Location
    North-East, UK.
    Posts
    6,111
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    No. HTML is structured, especially XHTML, so you can provide a pattern to target specific elements.

    XPath does originate from XML, though.
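
    To give an idea of what such a pattern looks like, here is a small sketch. The function name `matchXPath` and the domain example.com are placeholders for illustration; `DOMDocument::loadHTML()` does the work of turning ordinary HTML into a tree that XPath can query.

```php
<?php
// Hypothetical helper: count the nodes in $html that match an XPath expression.
function matchXPath(string $html, string $expression): int
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($expression);
    return $nodes === false ? 0 : $nodes->length; // false means a bad expression
}

$html = '<ul>'
      . '<li><a href="http://www.example.com/">Partner</a></li>'
      . '<li><a href="http://other.test/">Someone else</a></li>'
      . '<li><a name="top">No href here</a></li>'
      . '</ul>';

// Every anchor that has an href attribute.
$allLinks = matchXPath($html, '//a[@href]');

// Only the anchors whose href contains our domain.
$ourLinks = matchXPath($html, '//a[contains(@href, "example.com")]');
```

    The expressions read much like file paths with filters in square brackets, so no XML background is needed beyond knowing the tag and attribute names you are after.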

  7. #7
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    OK Thanks a lot.

    Do you think many people would pay for a link exchange system like this if I decided to sell it?

    http://www.glavna.com/view/exchange (It tracks hits from the actual referring domain rather than a unique counting URL, so it is good for SEO too.)
    :-)

  8. #8
    SitePoint Enthusiast
    Join Date
    Nov 2007
    Posts
    63
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yep, cURL should do it.

  9. #9
    SitePoint Member
    Join Date
    Feb 2009
    Posts
    1
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

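    I am going to give you a tip for searching another website for your link's code with cURL.

    Code:
    // Remove the spaces from the URL before running this (see the P.S. below).
    $ch = curl_init("http : / / remoteurl . com");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $content = curl_exec($ch);                      // false if the request failed
    curl_close($ch);
    // Strict comparison matters: strpos() returns 0 when the match is at the very start.
    if ($content === false || strpos($content, "yourhtmllinkcode") === false) {
        echo "The code is not there :( ";
    }
    I hope you find this useful.

    Regards,
    Shadow.

    P.S. I had to put spaces in the URL in order to submit this post.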

  10. #10
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Thank you
    :-)

  11. #11
    SitePoint Addict Latox's Avatar
    Join Date
    Dec 2008
    Location
    Australia
    Posts
    389
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Done and coded.

    A sample e-mail a user gets when the cron is run:

    Hello,

    We are sending you this e-mail because our automatic link exchange spider has visited your website, http://www.youtubetomp3.net and cannot find our website link.

    Please follow the steps on http://www.glavna.com/validate/22/exchange to re-activate your link.

    Kind regards,
    Glavna.com Admin
    :-)

