Targeting a tag with preg_replace

Regex is driving me nuts tonight.

What I am trying to do (as illogical as this goal may sound) is to eliminate open/close tag pairs and their content. For example, in the following, I want to eliminate the span element and its content:

<b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>

In this example I want to target the opening span tag, its class, its content and then its closing span tag, so that I get this:
<b class='test another' id='x'>hey<i><em>
(sorry, I am being redundant)

I figured this was a job for preg_replace and a GOOD regular expression. This is what I have thus far…

$string = preg_replace("/(^<(.+)\s*.*>).*(<\/(?(2)(.+))>)/", "", $string);

thinking that the regular expression I created means the following…
( look for a pattern
^< that begins with "<" and is followed immediately by
(.+) a pattern containing one or more characters (captured pattern #2)
\s? maybe followed by a space or no space
.* maybe followed by 0 or more characters
>) and lastly a ">"
.* after that there may be 0 or more characters
( then another pattern
</ which starts with "</" and is followed by
(?(2)(.+)) a pattern containing one or more characters which MATCH the characters of captured pattern #2 and is followed by
>) / and lastly a ">", end search…

somewhere I am off… I would appreciate any fresh perspective on this…

thanks in advance

maybe something like this:

function removeTagPairs($html)
{
    $c = 0;
    $html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\\1>#s', '', $html, -1, $c);
    return ($c > 0) ? removeTagPairs($html) : $html;
}

It’s recursive so that if you had something like

<b class='test another' id='x'>hey<i><span class='sp'> some <b>more</b> stuff </span><em>

it would first remove the inner <b>more</b> tags, then the outer <span> tags.
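
To make the two passes visible, here is a minimal sketch that runs the same pattern in a loop and records the string after each pass. The name removeTagPairsVerbose and the $passes parameter are made up for this illustration; only the pattern comes from the function above.

```php
<?php
// Hypothetical variant for illustration: same pattern, loop instead of recursion.
function removeTagPairsVerbose($html, array &$passes)
{
    do {
        $count = 0;
        // Remove every tag pair whose content contains no further "<".
        $html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\\1>#s', '', $html, -1, $count);
        if ($count > 0) {
            $passes[] = $html; // keep the string as it looks after this pass
        }
    } while ($count > 0);

    return $html;
}

$passes = array();
$input  = "<b class='test another' id='x'>hey<i><span class='sp'> some <b>more</b> stuff </span><em>";
$result = removeTagPairsVerbose($input, $passes);

echo $result, "\n"; // <b class='test another' id='x'>hey<i><em>
```

After the first pass only <b>more</b> is gone; the span no longer contains a "<", so the second pass can remove it.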

It won’t remove things like:
<span class='sp'> some <b>more stuff </span>

That will break under two potentially valid conditions.

  1. if nesting occurs:

<span class='sp'> 
<span class='sp'> 
test
</span>
</span>

Will break it.

  2. Closing </span> tags which don’t belong to the correct pair:

<span class="sp"> 
Will be removed
<span class="foo">
Will be removed
</span>
Should be removed but won't be.
<span>

The best solution? Don’t use regex to alter HTML. Use DomDocument.
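
For reference, a rough sketch of what the DomDocument route might look like for removing one specific tag and its contents. The helper name removeElementsByTag is hypothetical, and the sketch assumes the dom extension and PHP 5.4 or later (for the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags). Note that the parser closes any unclosed tags when it serializes, which may or may not be what you want.

```php
<?php
// Hypothetical helper for illustration; not code from this thread.
function removeElementsByTag($html, $tagName)
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so the parser sees a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    // XPath results are a static list, so removing nodes while looping is safe.
    foreach ($xpath->query('//' . $tagName) as $node) {
        $node->parentNode->removeChild($node);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->saveHTML();
}

$out = removeElementsByTag(
    "<b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>",
    'span'
);
echo $out;
```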

Hi Tom - I’m not arguing that DomDocument wouldn’t be a safer bet, but to my eyes neither of your examples breaks the regex I posted - it handles nesting (as in your 1st example), and your second example is invalid html and as such shouldn’t be removed if I’m understanding the OP’s goal correctly.

Well, I am not altering the DOM… I am writing a PHP script parser.
The idea is kind of one-upping WordPress, in a way. The data you saw will be reversed and the tags closed.

so the input is:
"<b class='test another' id='x'>hey<i><span class='sp'> some more stuff </span><em>"

will output :
"</i></b></em> "

and both of those will be wrapped around some other generated code…

so as to complete a wrap around a script tag. I have got the whole thing working… except when there is an already closed tag pair as shown above… :confused:

Am not sure if DOMDoc applies here

Oh, one more question, because I like your format…
why did you use '#' instead of '/' to open and close the expression?

and is "/\1" how you reuse a captured pattern?

I like to use symbols that are less likely to occur in the target string - it requires fewer escape characters and therefore becomes easier to read - /'s occur in html often, but not pound signs.

Almost - "\1" (without the quotes). The / is part of the closing tag.
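
A small side-by-side sketch of both points, using a made-up test string: the same pattern written with / and with # as the delimiter, and \1 matching whatever tag name the first group captured.

```php
<?php
$html = '<i>x</i> and <b>y</b>';

// "/" as delimiter: the slash in the closing tag has to be escaped.
$withSlash = preg_replace('/<([a-z]+)>[^<]*<\/\1>/', '', $html);

// "#" as delimiter: same pattern, nothing to escape.
$withHash = preg_replace('#<([a-z]+)>[^<]*</\1>#', '', $html);

var_dump($withSlash === $withHash); // bool(true)
```

Both calls remove <i>x</i> and <b>y</b>, because \1 has to repeat the name captured by ([a-z]+) in the opening tag.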

ok, different strategy… is there a way to do a negate… as in any character BUT “>”

oops, strike that

I am confused for one thing… I have entered your function…

function removeTagPairs($html){
$c = 0;
$html = preg_replace('#<([a-z]+)[^>]*>[^<]*</\1>#s', '', $html, -1, $c);
if ($c>0) {echo "more!!!";}
return ($c > 0) ? removeTagPairs($html) : $html;
}

call it from :
$wraped=removeTagPairs($wraped);
( I know it goes to remove tag pairs as I have tested this already)
but the preg_ always returns false…

this now baffles me more as:

  1. I have OTHER working preg_s in my code
  2. I tested your expression here

did I make some horrible typo…? do I need a version of PHP higher than 5.2.6 for this to work!!!

Ah I thought the OP wanted to remove a specific tag and its contents not every tag.

dresden_phoenix: If your input length is long it’s likely you’re running into this bug: http://bugs.php.net/40846

To get around it use


ini_set('pcre.backtrack_limit', 10000000);
ini_set('pcre.recursion_limit', 10000000);
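
If you want to confirm that a limit is really what you are hitting, one option is to check preg_last_error() whenever preg_replace returns null. A minimal sketch; the wrapper name safeReplace is made up for this example.

```php
<?php
// Hypothetical wrapper for illustration; not code from this thread.
function safeReplace($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        // preg_replace returns null on failure; ask PCRE what went wrong.
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            $reason = 'backtrack limit hit';
        } elseif ($code === PREG_RECURSION_LIMIT_ERROR) {
            $reason = 'recursion limit hit';
        } else {
            $reason = 'PCRE error code ' . $code;
        }
        throw new RuntimeException('preg_replace failed: ' . $reason);
    }

    return $result;
}

$clean = safeReplace(
    '#<([a-z]+)[^>]*>[^<]*</\\1>#s',
    '',
    "hey<span class='sp'> some more stuff </span>"
);
echo $clean, "\n"; // hey
```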

thanks guys… actually it works very well … the hitch was this

'#<([a-z]+)[^>]*>[^<]*</\\1>#s'

I had the same problem earlier, in some JavaScript. I never knew/keep forgetting to double escape.
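
For anyone who lands here later, a short sketch of why the extra backslash can matter. This assumes the failing pattern sat in a double-quoted string, which the thread does not confirm; in single quotes, \1 and \\1 come out the same.

```php
<?php
// Single quotes: both spellings give the two characters "\" and "1".
var_dump('\1' === '\\1');  // bool(true)

// Double quotes: "\1" is an octal escape, i.e. one byte with value 1.
var_dump(strlen("\1"));    // int(1)
var_dump("\\1" === '\1');  // bool(true)

$html = "hey<span class='sp'> some more stuff </span>";

// The regex engine receives the backreference \1, so the span is removed.
$ok = preg_replace("#<([a-z]+)[^>]*>[^<]*</\\1>#s", '', $html);

// The regex engine receives a literal control character, so nothing matches.
$broken = preg_replace("#<([a-z]+)[^>]*>[^<]*</\1>#s", '', $html);

var_dump($ok);               // string(3) "hey"
var_dump($broken === $html); // bool(true)
```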