  1. #1
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    UK
    Posts
    47

    Regular expression to search a page for links?

    I am developing a link checker program.

    I have a page in a variable and need to search it for links.

    It should only find relative or absolute links, and only http links,
    i.e. http://etc or mypage.php.
    It should ignore https, ftp, mailto, javascript links etc.

    I never got my head round regular expressions,
    but I have some code:
    PHP Code:
    $pattern  = "/((@import\s+[\"'`]([\w:?=@&\/#._;-]+)[\"'`];)|";
    $pattern .= "(:\s*url\s*\([\s\"'`]*([\w:?=@&\/#._;-]+)";
    $pattern .= "([\s\"'`]*\))|<[^>]*\s+(href)\=[\s\"'`]*";
    $pattern .= "([\w:?=@&\/#._;-]+)[\s\"'`]*[^>]*>))/i";
    But it's too complicated for my needs. Has anyone got a decent regular expression to find simple links?

  2. #2
    beetle (Web-coding NINJA!)
    Join Date
    Jul 2002
    Location
    Dallas, TX
    Posts
    2,900
    Are you trying to capture the entire anchor element? Or just the URL itself?

    A basic problem with what you're attempting is that you can't do this with just one pattern. Note: Can't in PHP. If PHP's PCRE engine supported variable-length look-behind assertions, then you could.

    What will probably be easiest is to capture all the links, then remove the offending ones from the resulting array.
    PHP Code:
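     // Capture the href value of every anchor element.
     $pattern = '/<a +(?:.*?)href="(.+?)"(?:.*?)>/';

     // Links to discard: https/ftp URLs and mailto:/javascript: pseudo-links.
     // Both alternatives are anchored to the start of the link.
     $cleaningPattern = "/^(?:(?:https|ftp):\/\/|(?:mailto|javascript):)/";

     preg_match_all( $pattern, $html, $matches );

     $output = array();
     foreach ( $matches[1] as $match )
     {
         if ( !preg_match( $cleaningPattern, $match ) )
         {
             $output[] = $match;
         }
     }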
    Now, $output is an array of all the links that meet your criteria.

    This pattern makes certain assumptions about the HTML
    • only spaces separate the '<a' and the element's attributes
    • the href property is delimited by double quotes
    • the anchor elements are written with a lowercase a
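    The three assumptions above can each be relaxed with small changes to the pattern. The following is only a sketch, not a full HTML parser: the function names extractHrefs and keepCheckable are made up for illustration, and it still assumes the href value is quoted (with either quote style).

```php
<?php
// Hypothetical helper (not from the thread): pull href values out of anchor tags.
function extractHrefs(string $html): array
{
    // <a followed by any attributes, then whitespace (spaces, tabs, newlines),
    // then href = "..." or href = '...'. The i flag also matches <A HREF=...>;
    // \1 requires the closing quote to be the same character as the opening one.
    $pattern = '/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

// Same capture-then-filter idea as the post above, made case-insensitive.
function keepCheckable(array $links): array
{
    $reject = '/^(?:https|ftp|mailto|javascript):/i';
    return array_values(array_filter($links, function ($link) use ($reject) {
        return !preg_match($reject, trim($link));
    }));
}

$html = "<A HREF='page.php'>x</A> "
      . "<a class=\"c\"\n href=\"http://example.com/\">y</a> "
      . "<a href=\"https://example.com/\">z</a>";
print_r(keepCheckable(extractHrefs($html)));
```

    With the sample string above, the https link is dropped and the other two are kept. Unquoted href values are still missed by this pattern.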
    beetle a.k.a. Peter Bailey




  3. #3
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    UK
    Posts
    47
    Thanks beetle,

    Yes, I only want the url itself.

    I will try to make some modifications to your code to make it more general (i.e. find both a and A, etc.)
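    One way to get that generality without growing the regex is to let PHP's DOM extension parse the page and read the href attributes from it. This swaps in a different technique from the one in the thread and is only a sketch: it assumes the DOM extension is available and that $html is a non-empty string, and the function name extractHrefsWithDom is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): collect href values via DOMDocument.
function extractHrefsWithDom(string $html): array
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    // The HTML parser lowercases tag names, so <A HREF=...> is found as well,
    // and single-quoted or unquoted attribute values are handled by the parser.
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = trim($anchor->getAttribute('href'));
        }
    }
    return $hrefs;
}

$html = "<p><A HREF='page.php'>x</A> <a href=other.php>y</a></p>";
print_r(extractHrefsWithDom($html));
```

    The resulting array can then be filtered with the same kind of cleaning pattern used earlier in the thread.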

