  1. #1
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852

    Plain-Text Hyperlink Matching with Regex

    I've got some regex used for finding plain-text hyperlinks (i.e. http://www.example.com/page) and converting them to HTML hyperlinks. I'm using preg_replace to achieve this. Below is the code I have...

    PHP Code:
    $string = preg_replace("/(http:\/\/)?([a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+([\/]([a-zA-Z0-9_\/\-.?&%=+])*)*)/", '<a href="http://$2">$2</a>', $string);
    That works absolutely wonderfully, except that it will apply its magic to http links which have already been hyperlinked. For example, it will make <a href="http://www.site.com">http://www.site.com</a> into something like...

    Code HTML4Strict:
    <a href="<a href="http://www.site.com"><a href="http://www.site.com">http://www.site.com</a></a>">http://www.site.com</a>

    ...which is an absolute mess and obviously not what I'm after. So, using regex lookbehind functionality, can anyone add to my current regex so that it only matches link strings (i.e. http://www.example.com) which aren't preceded by a " character (quote) or an <a> tag?

    Any help is much appreciated as usual!

  2. #2
    hi galen's Avatar
    Join Date
    Jan 2006
    Location
    New Haven, CT
    Posts
    1,228

  3. #3
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Oh, I've tried. I just don't seem to have the magic touch. Whenever the lookbehind fails, it just ignores the optional (http:\/\/) part of the regex and matches the rest, even if I enclose everything but the lookbehind in a set of brackets. I was wondering if someone could give me a working example that causes the entire regex to fail if the lookbehind fails, and basically give me an example of a complexish lookbehind (such as matching an <a> tag with any attributes in the lookbehind/lookaround).
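
    A minimal sketch of one way to get that behaviour, not a definitive answer. The idea is to put every character that can occur inside a URL into the negative lookbehind as well, so the engine cannot simply restart the match one character further along (which is what makes the optional http:// part look like it is being ignored). Because PCRE lookbehinds must be fixed length, the <a> check is done with a lookahead instead. The linkify() name, the ~ delimiter and the single optional path group are assumptions of this sketch; the host part is the pattern from the first post.

```php
<?php
// Hypothetical helper (not from the thread): link bare URLs, skip ones already in markup.
function linkify(string $text): string
{
    $pattern = '~'
        // Not preceded by a quote or by any character that can be part of a URL,
        // so a failed match cannot restart in the middle of the same URL.
        . '(?<![\w"\'/:.\-])'
        // URL pattern from the first post, with the path as one optional group.
        . '(http://)?([a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+(?:/[a-zA-Z0-9_/\-.?&%=+]*)?)'
        // Not followed by "</a>" before the next "<", i.e. not existing link text.
        . '(?![^<]*</a>)'
        . '~i';

    return preg_replace($pattern, '<a href="http://$2">$2</a>', $text);
}

echo linkify('Visit http://www.example.com/page or <a href="http://www.site.com">http://www.site.com</a> today'), "\n";
```

    With that input the bare URL should be wrapped while the existing anchor is left alone, and running the function a second time should change nothing. It still cannot tell an <a> tag from other markup in every case, so treat it as a starting point.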

  4. #4
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    Given my mediocre regex skills, I used a different solution when faced with a similar problem. If you're ok with a partial regex, partial string function solution, I can probably dig it up.

  5. #5
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    I don't quite understand what you're suggesting, crmalibu. I'd rather keep the regex I have and simply implement an appropriate lookbehind.

  6. #6
    play of mind Ernie1's Avatar
    Join Date
    Sep 2005
    Posts
    1,252
    PHP Code:
    $string = '<a href="<a href="http://www.site.com"><a href="http://www.site.com">http://www.site.com</a></a>">http://www.site.com</a>';

    $pattern = '/(?<=[>]).+(?=[<][^<][^\/][>]["])/i';

    preg_match($pattern, $string, $matches);

    print_r($matches);

    Array
    (
        [0] => <a href="http://www.site.com">http://www.site.com</a>
    )


  7. #7
    SitePoint Wizard bronze trophy
    Join Date
    Jul 2008
    Posts
    5,757
    Quote Originally Posted by Wardrop View Post
    I don't quite understand what you're suggesting, crmalibu. I'd rather keep the regex I have and simply implement an appropriate lookbehind.
    I had to hyperlink strings in html, but also needed to make sure I didn't hyperlink anything that was already inside an html tag as an attribute, or the anchor text, which is pretty much what you're doing too.

    I couldn't figure out how to do this entirely in regex. I think I gave up because I had read that PCRE doesn't support variable-length lookbehind, and it seemed like that's one of the things I needed for this. I wouldn't know exactly what would precede my match, so I figured one of the things I'd need to do would be to look back to find an '<a' token, but I would have no idea how far back the regex would need to look for it, so it would be variable length.

    Code:
    <a title="variable length value" href="http://www.example.com"
    I figured I could instead look forward. Basically,
    • look ahead and see if I find the '</a>' token. If not, I can't possibly be inside a link. End.
    • look ahead and see if I find the '<a' token. If not, I'm inside a link. End. If so, I'm inside a link if and only if the next '<a' token is farther away than the next '</a>' token. End.


    But I couldn't figure out how to apply that logic in a regular expression. So instead, I just used preg_match_all() with PREG_OFFSET_CAPTURE, looped through the results, and applied my logic with the assistance of PHP code. This logic could fail with certain malformed HTML (like a forgotten closing </a> tag), as well as if I ever came across something like
    Code:
    <tag title="learn all about the <a> tag!"
    But I think those conditions are getting into sgml parser territory.
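
    A rough sketch of that look-forward check, under stated assumptions: a URL is treated as being inside a link when a '</a>' lies ahead and no '<a' opens before it. The function names isInsideAnchor() and linkifyOutsideAnchors() are made up for illustration, and the URL pattern is a simplified one that requires the http:// prefix. It shares the weaknesses described above (malformed HTML, '<a>' appearing inside attribute values).

```php
<?php
// Hypothetical helper: does the text at $offset appear to sit inside <a ...>...</a>?
function isInsideAnchor(string $html, int $offset): bool
{
    $close = stripos($html, '</a>', $offset);
    if ($close === false) {
        return false; // no closing tag ahead, so we cannot be inside a link
    }
    // Find the next opening <a tag after the offset, if there is one.
    if (preg_match('~<a\b~i', $html, $m, PREG_OFFSET_CAPTURE, $offset)) {
        return $m[0][1] > $close; // inside only if the link closes before another opens
    }
    return true; // a close ahead and no open ahead: we are inside a link
}

// Hypothetical helper: wrap only the URLs that are not already part of a link.
function linkifyOutsideAnchors(string $html): string
{
    // Simplified pattern (assumption): the http:// prefix is required here.
    $urlPattern = '~http://[a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+(?:/[a-zA-Z0-9_/\-.?&%=+]*)?~';
    preg_match_all($urlPattern, $html, $matches, PREG_OFFSET_CAPTURE);

    $out = '';
    $pos = 0;
    foreach ($matches[0] as [$url, $offset]) {
        $out .= substr($html, $pos, $offset - $pos); // text before this URL
        $out .= isInsideAnchor($html, $offset)
            ? $url
            : '<a href="' . $url . '">' . $url . '</a>';
        $pos = $offset + strlen($url);
    }
    return $out . substr($html, $pos); // remaining tail
}

echo linkifyOutsideAnchors('See http://www.example.com and <a title="x" href="http://www.site.com">http://www.site.com</a>'), "\n";
```

    Rebuilding the string from the captured offsets avoids a second regex pass over text that has already been wrapped.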

    Probably an easier solution would be to just first run a preg_replace_callback() on the entire text, which extracts all html links, and replaces each with a sufficiently unique identifier. Then apply your regex to this safe text, linking whatever urls are left. Then reinject the links you stripped out.
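
    A minimal sketch of that extract, link, reinject idea. The function name linkifyWithPlaceholders() and the NUL-delimited token format are assumptions chosen for illustration; the linking step reuses the pattern from the first post with the path collapsed into one optional group.

```php
<?php
// Hypothetical helper: protect existing links, link the rest, then restore.
function linkifyWithPlaceholders(string $html): string
{
    $stash = [];

    // 1. Swap every existing <a>...</a> for a token the URL regex cannot match.
    $protected = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use (&$stash): string {
            $token = "\x00LINK" . count($stash) . "\x00";
            $stash[$token] = $m[0];
            return $token;
        },
        $html
    );

    // 2. Link whatever URLs are left in the now "safe" text.
    $linked = preg_replace(
        '~(http://)?([a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+(?:/[a-zA-Z0-9_/\-.?&%=+]*)?)~',
        '<a href="http://$2">$2</a>',
        $protected
    );

    // 3. Put the original links back in place of their tokens.
    return strtr($linked, $stash);
}

echo linkifyWithPlaceholders('Visit http://www.example.com/page or <a href="http://www.site.com">http://www.site.com</a>'), "\n";
```

    URLs sitting in attributes of other tags (an img src, say) would still be rewritten by step 2, so a fuller version would probably stash those tags as well.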

    Anyway, good luck. If you happen across a solution in regex, please make sure it gets posted here.

  8. #8
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Is there any way I can have the entire regex NOT match if a quote (") is the next character after the match? If I could do that, then any http://www... string followed by a " (quote) or the string "</a>" would not match. But it seems there's no way to do this in regex.
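
    One possible way to express this in PCRE, offered as a sketch rather than a definitive answer: a plain negative lookahead lets the engine backtrack and accept a shorter URL, but wrapping the URL in an atomic group (?>...) stops it from giving characters back, so the lookahead then rejects the whole match. The function name linkifyUnquoted() is hypothetical, and the path is simplified to one optional group.

```php
<?php
// Hypothetical helper: link URLs unless the very next character is a quote or "</a>".
function linkifyUnquoted(string $text): string
{
    $pattern = '~'
        // Atomic group: once the URL has matched, the engine may not shorten it.
        . '(?>(http://)?([a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+(?:/[a-zA-Z0-9_/\-.?&%=+]*)?))'
        // Reject the match if a quote or a closing anchor tag follows immediately.
        . '(?!["\']|</a>)'
        . '~i';

    return preg_replace($pattern, '<a href="http://$2">$2</a>', $text);
}

echo linkifyUnquoted('Go to http://www.example.com/page now, not "http://www.site.com"'), "\n";
```

    This only looks at what follows the URL, so link text such as "http://www.site.com (home)</a>" would still be wrapped a second time.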

  9. #9
    SitePoint Guru
    Join Date
    Oct 2006
    Location
    Queensland, Australia
    Posts
    852
    Is anyone able to help me further? Is there any way I can convert a plain-text link to HTML, but only if the string isn't next to a single or double quote, or between <a> and </a> tags? Is there absolutely no way to achieve this?
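
    For reference, a regex-only sketch using PCRE's backtracking control verbs: the first two alternatives match whole <a>...</a> elements and any other tag, then (*SKIP)(*F) discards them and resumes scanning after them, so only URLs outside markup reach the third alternative. The function name linkifySkippingTags() is hypothetical and the path is simplified to one optional group; this is an illustration, not a full HTML parser.

```php
<?php
// Hypothetical helper: link URLs that are neither inside a tag nor inside <a>...</a>.
function linkifySkippingTags(string $html): string
{
    $pattern = '~'
        // Existing links: match the whole element, then skip past it.
        . '<a\b[^>]*>.*?</a>(*SKIP)(*F)'
        // Any other tag, so URLs in attributes (src, href, ...) are left alone.
        . '|<[^>]*>(*SKIP)(*F)'
        // Whatever is left: the URL pattern from the first post.
        . '|(http://)?([a-zA-Z0-9\-.]+\.[a-zA-Z0-9\-]+(?:/[a-zA-Z0-9_/\-.?&%=+]*)?)'
        . '~is';

    return preg_replace($pattern, '<a href="http://$2">$2</a>', $html);
}

echo linkifySkippingTags('Visit http://www.example.com/page or <a href="http://www.site.com">http://www.site.com</a>'), "\n";
```

    Like the other approaches in this thread, it can be fooled by a literal ">" inside an attribute value.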

