  1. #1
    Not Bad, eh? Justin Sampson
    Join Date
    Aug 2000
    Location
    N.S., Canada
    Posts
    487
    Hey,
    I'm working on a tool that checks the URLs on a page to see if they are broken.

    How can I take, for example, this:

    <html>
    <head>
    </head>
    <body>
    some text, a link: <a href="http://site.com">Visit Site.com</a><p>
    some more text, another link: <a href="http://site1.com">Visit Site1.com</a>
    </body>
    </html>

    and get all the URLs out of it?

    Thanks,
    Justin Sampson

  2. #2
    SitePoint Enthusiast
    Join Date
    Feb 2001
    Location
    Monmouth Junction, NJ
    Posts
    88
    It's like making a spider engine: you will have to make a fsocket connection. I think it's fsocket, but I'm not sure; it could be something else.
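
For reference, PHP's socket function is `fsockopen`. The sketch below shows one way a link checker might use it: send a HEAD request and read back the status line. The helper names (`build_head_request`, `parse_status_code`, `link_is_alive`) are hypothetical, and the sketch assumes plain `http://` URLs and PHP 7.1 or later. It also treats any status below 400 as "not broken", which is a simplification.

```php
<?php
// Build a minimal HTTP/1.0 HEAD request for the given URL (http only in this sketch).
function build_head_request(string $url): string
{
    $parts = parse_url($url);
    $host  = $parts['host'] ?? '';
    $path  = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return "HEAD $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n";
}

// Pull the numeric status code out of a status line such as "HTTP/1.1 404 Not Found".
function parse_status_code(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: true when the server answers with a status below 400.
function link_is_alive(string $url, int $timeout = 5): bool
{
    $parts = parse_url($url);
    if (empty($parts['host'])) {
        return false; // nothing to connect to
    }
    $fp = @fsockopen($parts['host'], $parts['port'] ?? 80, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false; // could not connect at all
    }
    fwrite($fp, build_head_request($url));
    $statusLine = fgets($fp, 1024); // first line of the response
    fclose($fp);
    $code = is_string($statusLine) ? parse_status_code($statusLine) : null;
    return $code !== null && $code < 400;
}
```

Calling `link_is_alive("http://site.com")` would then return true or false for a single link. Some servers reject HEAD requests, so a real checker may need to fall back to GET.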

  3. #3
    Non-Member
    Join Date
    Apr 2000
    Location
    Waco, Texas.
    Posts
    188
    Here is a very BASIC 'spider'.

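    <?php
    // Fetch the page (needs allow_url_fopen enabled in php.ini).
    $html = file_get_contents("http://sitepointforums.com");
    // Capture the quoted href value of every <a> tag.
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $urls, PREG_PATTERN_ORDER);
    // $urls[1] holds the captured URLs, one per link found.
    foreach ($urls[1] as $url)
    {
        echo "$url\n";
    }
    ?>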


    Now if you run this, you will see that not all of the links are full URLs starting with http://; some are relative to the page they appear on. You will need to use a regexp to add the site's full URL to the front of those links before checking whether they are broken.
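
As a rough illustration of that step, here is a minimal sketch of turning an extracted link into a full URL. The function name `absolutize` and its parameters are hypothetical. It assumes the base URL is the address of the page the link was found on, uses `parse_url` plus one small regexp, and covers only the common cases: it does not collapse `../` segments, and it does not special-case `mailto:` links or `#fragment` links, which would need to be filtered out first.

```php
<?php
// Hypothetical helper: turn a link found on a page into an absolute URL.
// $base is assumed to be the full URL of the page the link came from.
function absolutize(string $link, string $base): string
{
    // Already absolute (has a scheme such as http:// or https://).
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    $b    = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    // Protocol-relative link: //host/path
    if (strpos($link, '//') === 0) {
        return $b['scheme'] . ':' . $link;
    }
    // Root-relative link: /path
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }
    // Document-relative link: resolve against the directory of the base URL.
    $path = $b['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}
```

Each URL printed by the loop above could be passed through `absolutize($url, "http://sitepointforums.com")` before it is checked.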

    Hope that helps.

  4. #4
    Not Bad, eh? Justin Sampson
    Join Date
    Aug 2000
    Location
    N.S., Canada
    Posts
    487
    Thanks, Rob. That will get me started.

