Regex to remove empty paragraphs?

I’d like to remove paragraph tags that aren’t serving any good purpose; specifically, tags that contain an arbitrary amount of whitespace, encoded spaces, or a mixture of the two, such as:


<p>    </p>
<p> &nbsp;   </p>
<p>&nbsp;</p>
<p>&nbsp;&nbsp;&nbsp;</p>
<p>&nbsp;  &nbsp;  &nbsp;</p>

…should all be removed.

Thanks in advance for the sexy, regexy, awesomesauce!

:)

No need to use regex; simply try str_replace.

Can you provide an example? I didn’t think str_replace could be used here since it’s unclear what mix of hardcoded spaces and encoded spaces will be present at any given time.

Regex would be more efficient than the many replacements and checks you would need before you could be sure the content is clean. My main concern is that, to filter out an arbitrary whitespace-only paragraph, you would have to make several replacement calls and multiple validation checks before knowing the paragraph is fine, or else skip some steps and, in the process, mutilate valid paragraphs.

Or you could just do:

$Content = preg_replace('~\\s?<p>(\\s|&nbsp;)+</p>\\s?~', '', $Content);
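
For what it’s worth, here is a minimal sketch of the same idea run against the sample paragraphs from the question. The function name remove_empty_paragraphs() is made up for illustration, and the pattern is a slightly widened variant rather than the exact expression above: it uses * instead of + so that a completely empty <p></p> is removed too, and it assumes you would also want &#160; treated like &nbsp;.

```php
<?php
// Hypothetical helper (not a PHP built-in): removes paragraphs that hold
// nothing but whitespace and/or encoded non-breaking spaces.
function remove_empty_paragraphs(string $html): string
{
    // (?:\s|&nbsp;|&#160;)*  any mix of whitespace and encoded spaces, possibly none
    // \s*                    also swallow whitespace trailing the removed paragraph
    // i flag                 match <P>...</P> as well (an assumption, not in the original)
    return preg_replace('~<p>(?:\s|&nbsp;|&#160;)*</p>\s*~i', '', $html);
}

$content = "<p>Keep me</p>\n<p>    </p>\n<p> &nbsp;   </p>\n<p>&nbsp;</p>\n"
         . "<p>&nbsp;&nbsp;&nbsp;</p>\n<p>&nbsp;  &nbsp;  &nbsp;</p>\n"
         . "<p>Keep&nbsp;me too</p>";

// Only the two paragraphs with real text should survive.
echo remove_empty_paragraphs($content);
```

Note that this sketch only matches a bare <p> tag; paragraphs carrying attributes such as class or style would need a looser opening-tag pattern.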

Regular expressions are not as inefficient as people expect them to be. As far as I’m aware, PHP’s preg functions hand the work to the PCRE library, which is written in C, so the whole replacement runs in compiled code rather than in PHP userland. That means a loop of several PHP function calls would likely lag behind a single preg call.

Granted, for some jobs they’re a bit like hammering in a nail with a screwdriver. But for many things they’re much more efficient than basic string functions.
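
If you would rather measure than take my word for it, a rough timing sketch along these lines should do. The test data is made up (10,000 copies of a few sample paragraphs), the two approaches do not produce byte-identical output, and the numbers will vary with PHP version and input, so treat it as a starting point rather than a benchmark.

```php
<?php
// Made-up test data: a few sample paragraphs repeated 10,000 times.
$sample = "<p>    </p>\n<p> &nbsp;   </p>\n<p>&nbsp;</p>\n<p>text</p>\n";
$input  = str_repeat($sample, 10000);

// Approach 1: a single preg_replace call.
$start        = microtime(true);
$viaRegex     = preg_replace('~\s?<p>(\s|&nbsp;)+</p>\s?~', '', $input);
$regexSeconds = microtime(true) - $start;

// Approach 2: a chain of str_replace calls.
$start  = microtime(true);
$viaStr = str_replace('&nbsp;', '', $input);
while (strpos($viaStr, '  ') !== false) {
    $viaStr = str_replace('  ', ' ', $viaStr);
}
$viaStr     = str_replace(array('<p> </p>', '<p></p>'), '', $viaStr);
$strSeconds = microtime(true) - $start;

// Print both timings; no particular winner is assumed here.
printf("preg_replace: %.4fs, str_replace chain: %.4fs\n", $regexSeconds, $strSeconds);
```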

I believe an optimal str_replace method would be something like:

$Content = '<p>    </p>
<p> &nbsp;   </p>
<p>&nbsp;</p>
<p>&nbsp;&nbsp;&nbsp;</p>
<p>&nbsp;  &nbsp;  &nbsp;</p>';
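// Caution: this strips every &nbsp; and collapses every run of spaces in $Content,
// including those inside paragraphs that contain real text.
$Content = str_replace('&nbsp;', '', $Content);
while (strpos($Content, '  ') !== false) {
	// Collapse runs of 4, 3, and 2 spaces into one space; longest first, so fewer loop passes are needed.
	$Content = str_replace(array('    ', '   ', '  '), ' ', $Content);
}
// Remove the paragraphs that are now empty or hold a single space.
$Content = str_replace(array('<p> </p>', '<p></p>'), '', $Content);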
echo $Content;
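
One thing to watch with this approach, shown as a rough illustration: the first two steps run over the whole string, not just the empty paragraphs, so encoded spaces inside real text disappear as well. The sample string and variable names below are made up; the regex line reuses the expression from earlier for comparison.

```php
<?php
// Made-up sample: one paragraph with real text, one whitespace-only paragraph.
$original = '<p>Price:&nbsp;10&nbsp;EUR</p><p> &nbsp; </p>';

// str_replace chain: removes &nbsp; everywhere, then collapses spaces.
$viaStrReplace = str_replace('&nbsp;', '', $original);
while (strpos($viaStrReplace, '  ') !== false) {
    $viaStrReplace = str_replace('  ', ' ', $viaStrReplace);
}
$viaStrReplace = str_replace(array('<p> </p>', '<p></p>'), '', $viaStrReplace);

// Regex: only touches paragraphs made up entirely of whitespace and &nbsp;.
$viaRegex = preg_replace('~\s?<p>(\s|&nbsp;)+</p>\s?~', '', $original);

echo $viaStrReplace, "\n"; // <p>Price:10EUR</p>
echo $viaRegex, "\n";      // <p>Price:&nbsp;10&nbsp;EUR</p>
```

Both versions drop the empty paragraph, but only the regex leaves the non-breaking spaces in the first paragraph alone.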

Jake: yes, definitely; regex is only as (in)efficient as (s)he who writes the expression. I had a feeling it would be the most elegant and maintainable solution, given the right expression. Thanks!