  1. #1
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    Join Date
    Jun 2008
    Location
    Madison, WI
    Posts
    2,812
    Mentioned
    34 Post(s)
    Tagged
    2 Thread(s)

    targeting a tag with preg_replace...

    Regex is driving me nuts tonight.

    What I am trying to do (as illogical as this goal may sound) is to eliminate open/close tag pairs and their content. For example, in the following, I want to eliminate the <span class='sp'> ... </span> part:

    <b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>


    In this example I want to target the opening <span> tag, its class, its content, and the closing </span> tag, so that I get this:
    <b class='test another' id='x'>hey<i><em>
    (Sorry, I am being redundant.)

    I figured this was a job for preg_replace and a GOOD regular expression. This is what I have thus far:

    $string=preg_replace("/(^<(.+)\s*.*>).*(<\/(?(2)(.+))>)/","",$string);

    I was thinking that the regular expression I created means the following:
    (           look for a pattern
    ^<          that begins with "<" and is followed immediately by
    (.+)        a pattern containing one or more characters (captured pattern #2)
    \s*         maybe followed by whitespace
    .*          maybe followed by zero or more characters
    >)          and lastly a ">"
    .*          after that there may be zero or more characters
    (           then another pattern
    <\/         which starts with "</" and is followed by
    (?(2)(.+))  a pattern containing one or more characters which MATCH the characters of captured pattern #2, and is followed by
    >)/         and lastly a ">", end of search.


    somewhere I am off... I would appreciate any fresh perspective on this...


    thanks in advance
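
A note on why the pattern above misbehaves, as a minimal sketch rather than a definitive diagnosis: in PCRE, `(?(2)...)` is a conditional ("if group 2 took part in the match, then match ..."), not a back-reference. Reusing captured text is written `\2`. Combined with the `^` anchor and the greedy `.+` and `.*`, the posted pattern ends up matching the whole sample string. The variable names below are illustrative only.

```php
<?php
$html = "<b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>";

// The pattern from the post: "^" pins the match to the start of the string,
// and (?(2)(.+)) only checks that group 2 was set before matching ".+" again.
$original = '/(^<(.+)\s*.*>).*(<\/(?(2)(.+))>)/';
$swallowed = preg_replace($original, '', $html);
var_dump($swallowed); // string(0) "" : the entire input was matched and removed

// A back-reference (\1 here) requires the closing tag name to equal the captured one.
$backref = '#<([a-z]+)[^>]*>[^<]*</\1>#';
$cleaned = preg_replace($backref, '', $html);
var_dump($cleaned === "<b class='test another' id='x'>hey<i><em>"); // bool(true)
```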

  2. #2
    SitePoint Guru aamonkey's Avatar
    Join Date
    Sep 2004
    Location
    kansas
    Posts
    953
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    maybe something like this:

    Code PHP:
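    function removeTagPairs($html)
    {
        $c = 0;
        // Remove every <tag ...>text</tag> pair whose content contains no "<";
        // \1 must equal the captured tag name. $c receives the replacement count.
        $html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\1>#s', '', $html, -1, $c);
        // Removing inner pairs can expose outer ones, so repeat until nothing changes.
        return ($c > 0) ? removeTagPairs($html) : $html;
    }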

    It's recursive so that if you had something like
    Code HTML4Strict:
    <b class='test another' id='x'>hey<i><span class='sp'> some <b>more</b> stuff </span><em>
    it would first remove the inner <b>more</b> tags, then the outer <span> tags.

    It won't remove things like:
    <span class='sp'> some <b>more stuff </span>
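
To make the pass-by-pass behaviour described above visible, here is a small sketch that runs the same pattern in a loop instead of recursively and records the string after each pass. The loop form and the `$passes` array are assumptions added for illustration; the pattern is the one from the function above.

```php
<?php
$pattern = '#<([a-z]+)[^>]*>[^<]*</\1>#s';
$html = "<b class='test another' id='x'>hey<i><span class='sp'> some <b>more</b> stuff </span><em>";

$passes = array();
do {
    // $count receives the number of replacements made in this pass
    $html = preg_replace($pattern, '', $html, -1, $count);
    if ($count > 0) {
        $passes[] = $html;
    }
} while ($count > 0);

// Pass 1 removes <b>more</b>; pass 2 removes the <span> pair, which no longer contains a tag.
print_r($passes);

// An unclosed <b> inside the span blocks the match, so nothing is removed:
$unbalanced = "<span class='sp'> some <b>more stuff </span>";
var_dump(preg_replace($pattern, '', $unbalanced) === $unbalanced); // bool(true)
```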
    aaron-fisher.com - PHP articles and more

  3. #3
    SitePoint Guru bronze trophy TomB's Avatar
    Join Date
    Oct 2005
    Location
    Milton Keynes, UK
    Posts
    996
    Mentioned
    9 Post(s)
    Tagged
    2 Thread(s)
    That will break under two potentially valid conditions.

    1) if nesting occurs:

    HTML Code:
    <span class='sp'> 
    <span class='sp'> 
    test
    </span>
    </span>
    Will break it.

    2) Closing </span> tags which don't belong to the correct pair:

    HTML Code:
    <span class="sp"> 
    Will be removed
    <span class="foo">
    Will be removed
    </span>
    Should be removed but won't be.
    <span>
    The best solution? Don't use regex to alter HTML. Use DomDocument.
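
For comparison, a rough sketch of the DOM route, assuming the goal is to drop one specific kind of element (here a span with a given class). The helper name `removeSpansWithClass` and the wrapper div are hypothetical, not something from this thread.

```php
<?php
// Hypothetical helper: drops every <span> carrying the given class, with its content.
function removeSpansWithClass($html, $class)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrap the fragment so it can be found again after parsing
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';
    // Copy the node list into an array before removing nodes from the tree
    foreach (iterator_to_array($xpath->query($query)) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the children of the wrapper
    $out = '';
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = removeSpansWithClass("<p>keep <span class='sp'>drop <b>this</b></span> text</p>", 'sp');
echo $result, "\n";
```

One caveat: the parser repairs markup as it loads it, so a deliberately unclosed fragment may come back with its tags closed.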

  4. #4
    SitePoint Guru aamonkey's Avatar
    Hi Tom - I'm not arguing that DomDocument wouldn't be a safer bet, but to my eyes neither of your examples breaks the regex I posted - it handles nesting (as in your 1st example), and your second example is invalid html and as such shouldn't be removed if I'm understanding the OP's goal correctly.
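
A quick way to check both claims is to feed the two examples to the function. This is only a sketch: the strings are shortened versions of the examples from post #3, with line breaks written as "\n".

```php
<?php
function removeTagPairs($html)
{
    $c = 0;
    $html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\1>#s', '', $html, -1, $c);
    return ($c > 0) ? removeTagPairs($html) : $html;
}

// Example 1: properly nested spans. The inner pair goes first, then the outer one.
$nested = "<span class='sp'>\n<span class='sp'>\ntest\n</span>\n</span>";
$nestedResult = removeTagPairs($nested);
var_dump($nestedResult); // string(0) ""

// Example 2: unbalanced markup. Only the one complete pair is removed.
$unbalanced = "<span class='sp'>\nfirst\n<span class='foo'>\nsecond\n</span>\nthird\n<span>";
$unbalancedResult = removeTagPairs($unbalanced);
var_dump($unbalancedResult === "<span class='sp'>\nfirst\n\nthird\n<span>"); // bool(true)
```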

  5. #5
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    Well, I am not altering the DOM. I am writing a PHP script parser.
    The idea is kind of one-upping WordPress, in a way. The data you saw will be reversed and the tags closed.

    so the input is:
    "<b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>"


    will output :
    "</i></b></em> "

    and both of those will be wrapped around some other generated code...

    so as to complete a wrap around a script tag. I have got the whole thing working... except when there is an already closed tag pair as shown above... :/

    I am not sure if DOMDocument applies here


    Oh one more question cause I like your format...
    why did you use '#' instead of' / ' to open and close the expression?

    and is "/\1" how you reuse a captured pattern?

  6. #6
    SitePoint Guru aamonkey's Avatar
    Quote Originally Posted by dresden_phoenix View Post
    Oh one more question cause I like your format...
    why did you use '#' instead of' / ' to open and close the expression?
    I like to use symbols that are less likely to occur in the target string - it requires fewer escape characters and therefore becomes easier to read - /'s occur in html often, but not pound signs.
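
A small illustration of the difference, using a made-up sample string: with "/" as the delimiter every literal slash in the pattern needs a backslash, while with "#" the same pattern can be written as-is.

```php
<?php
$html = "<p>one</p><p>two</p>";

// "/" as delimiter: the slash inside </p> has to be escaped.
$withSlash = preg_replace('/<\/p>/', "</p>\n", $html);

// "#" as delimiter: no escaping needed for the same pattern.
$withHash = preg_replace('#</p>#', "</p>\n", $html);

var_dump($withSlash === $withHash); // bool(true)
```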

    Quote Originally Posted by dresden_phoenix View Post
    and is "/\1" how you reuse a captured pattern?
    almost - "\1" (without the quotes). the / is part of the closing tag
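
As a minimal sketch of what the back-reference does (sample strings are invented for the example): `\1` matches only the exact text captured by the first group, so a mismatched closing tag is rejected.

```php
<?php
$pattern = '#<([a-z]+)>[^<]*</\1>#';

// Group 1 captures "b", so \1 requires the closing tag to be </b>.
$matched = preg_match($pattern, '<b>bold</b>', $m);
var_dump($matched, $m[1]); // int(1) and string(1) "b"

// The closing tag name differs from the captured one, so there is no match.
$mismatched = preg_match($pattern, '<b>bold</i>');
var_dump($mismatched); // int(0)
```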

  7. #7
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    ok, different strategy.. is there a way to do a negate... as in any character BUT ">"

  8. #8
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    oops, strike that

  9. #9
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    I am confused for one thing... I have entered your function...

    function removeTagPairs($html)
    {
        $c = 0;
        $html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\1>#s', '', $html, -1, $c);
        if ($c > 0) { echo "more!!!"; }
        return ($c > 0) ? removeTagPairs($html) : $html;
    }


    and I call it with:
    $wraped=removeTagPairs($wraped);
    (I know execution reaches removeTagPairs, as I have tested this already)
    but the preg_replace always returns false...

    this now baffles me more as:

    1. I have OTHER working preg_s in my code
    2. I tested your expression here

    did I make some horrible typo...? do I need a version of PHP higher than 5.2.6 for this to work?!?!

  10. #10
    SitePoint Guru bronze trophy TomB's Avatar
    Quote Originally Posted by aamonkey View Post
    Hi Tom - I'm not arguing that DomDocument wouldn't be a safer bet, but to my eyes neither of your examples breaks the regex I posted - it handles nesting (as in your 1st example), and your second example is invalid html and as such shouldn't be removed if I'm understanding the OP's goal correctly.
    Ah I thought the OP wanted to remove a specific tag and its contents not *every* tag.

    dresden_phoenix: If your input length is long it's likely you're running into this bug: http://bugs.php.net/40846

    To get around it use

    Code:
    ini_set('pcre.backtrack_limit', 10000000);
    ini_set('pcre.recursion_limit', 10000000);
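
To tell whether one of these limits is actually being hit, it can help to check the result explicitly: on failure preg_replace() returns NULL, and preg_last_error() reports the reason. A hedged sketch follows; the name `removeTagPairsChecked` and the exception are additions for illustration, not part of the function posted earlier.

```php
<?php
function removeTagPairsChecked($html)
{
    $count = 0;
    $result = preg_replace('#<([a-z]+)[^>]*>[^<]*</\1>#s', '', $html, -1, $count);

    if ($result === null) {
        // NULL signals a PCRE failure; translate the two limit-related codes
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            $reason = 'backtrack limit reached';
        } elseif ($code === PREG_RECURSION_LIMIT_ERROR) {
            $reason = 'recursion limit reached';
        } else {
            $reason = 'error code ' . $code;
        }
        throw new RuntimeException('preg_replace() failed: ' . $reason);
    }

    return ($count > 0) ? removeTagPairsChecked($result) : $result;
}

$checked = removeTagPairsChecked("<i>keep<b>drop</b>");
echo $checked, "\n"; // <i>keep
```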

  11. #11
    Community Advisor silver trophybronze trophy
    dresden_phoenix's Avatar
    thanks guys.. actually it works very well .. the hitch was this

    '#<([a-z]+)[^>]*>[^<]*</\\1>#s'


    I had the same problem earlier, in JavaScript. I never knew (or keep forgetting) to double escape.
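
In PHP terms, the double escape matters mostly inside double-quoted strings, where "\1" is an octal escape for the character with code 1, so the back-reference never reaches the regex engine. The sketch below assumes the failing pattern was written in double quotes, which the thread does not actually show.

```php
<?php
$single  = '#<([a-z]+)>[^<]*</\1>#';   // single quotes: backslash kept as-is
$escaped = '#<([a-z]+)>[^<]*</\\1>#';  // doubled backslash: same resulting string
$double  = "#<([a-z]+)>[^<]*</\1>#";   // double quotes: \1 becomes chr(1)

var_dump($single === $escaped);             // bool(true)
var_dump(strpos($double, chr(1)) !== false); // bool(true)

$input = '<b>x</b>';
var_dump(preg_replace($single, '', $input)); // string(0) ""
var_dump(preg_replace($double, '', $input)); // string(8) "<b>x</b>" (nothing matched)
```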

