I want to properly replace these characters: < > " & for output into an HTML document.
However, the string is allowed to contain HTML. For example, my string might look like:
<a href="http://www.google.com">Woohoo &><"</a>
and I want to convert that to:
<a href="http://www.google.com">Woohoo &amp;&gt;&lt;&quot;</a>
Notice how the actual HTML is unchanged, but special characters that are outside of HTML are replaced.
It’s almost like I need to strip_tags(), then do htmlspecialchars(), and then replace the tags that were stripped.
Any ideas?
Ok, I came up with a solution that seems to work so far:
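$string = '<a href="http://www.google.com">Woohoo &><"</a>';
// Split on anything that looks like a tag; what is left are the text chunks between tags.
// The i flag lets uppercase tags such as <A HREF="..."> match as well.
$non_html_chunks = preg_split('/<[a-z\\/][^>]*>/i', $string, -1, PREG_SPLIT_NO_EMPTY);
foreach ($non_html_chunks as $non_html_chunk) {
    // Escape each chunk where it sits between tags (or at the start/end of a line).
    // The final false stops htmlspecialchars() from double-encoding existing entities.
    $string = preg_replace('/(>|^)' . preg_quote($non_html_chunk, '/') . '(<|$)/m', '$1' . htmlspecialchars($non_html_chunk, ENT_COMPAT, 'UTF-8', false) . '$2', $string);
}
echo $string;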
Of course if you try it you’ll need to view source to see that the appropriate characters have been replaced.
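For what it's worth, the same idea can probably be done in one pass by letting preg_split() keep the tags it splits on (PREG_SPLIT_DELIM_CAPTURE) and escaping only the pieces in between. This is just a sketch: the function name escape_outside_tags() is made up for illustration, and it shares the limitation of the regex above (anything shaped like `<x...>` is treated as a tag, and a `>` inside an attribute value would confuse it).

```php
<?php
// Hypothetical helper, not part of PHP: escapes text between tags, leaves the tags alone.
function escape_outside_tags(string $html): string
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, the result alternates:
    // even indexes hold text between tags, odd indexes hold the captured tags.
    $parts = preg_split('/(<[a-z\/][^>]*>)/i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // false = do not double-encode entities that are already present.
            $parts[$i] = htmlspecialchars($part, ENT_COMPAT, 'UTF-8', false);
        }
    }

    return implode('', $parts);
}

echo escape_outside_tags('<a href="http://www.google.com">Woohoo &><"</a>'), "\n";
// prints: <a href="http://www.google.com">Woohoo &amp;&gt;&lt;&quot;</a>
```

Because nothing is substituted back through preg_replace(), text that happens to contain `$1` or a backslash is not misread as a backreference.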
<a href="http://www.google.com">Woohoo &><"</a>
This is incorrect HTML in the first place. My suggestion is that you use an HTML parser (a lax one) to parse the above string. The HTML parser might provide a "get HTML" method that returns the corrected HTML.
There is one at sourceforge.net called PHP Simple HTML DOM parser. I have used it, and I believe it can parse bad HTML pretty well. But run a couple of test cases before you decide whether or not to use it in your project.
I’m not sure what you’re trying to say. Yes, obviously this is incorrect HTML:
<a href="http://www.google.com">Woohoo &><"</a>
It is incorrect because of the “special” characters that are not encoded. That is why I need to use a function like htmlspecialchars() to turn it into valid HTML.
I’ll check out your link though, thanks.
Walk it with DOMDocument and run htmlspecialchars() on each contained text block?
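A rough sketch of the DOMDocument route, with some assumptions called out: the function name escape_text_with_dom() is made up, the `<div>` wrapper and the `<?xml encoding="UTF-8">` prefix are common workarounds rather than anything from this thread, and it needs the dom extension. One thing to watch: when DOMDocument serializes text nodes it escapes `&`, `<` and `>` itself, so calling htmlspecialchars() on each text node as well would likely double-encode. The sketch therefore just parses and re-serializes.

```php
<?php
// Hypothetical helper: lets the parser repair the fragment and re-serialize it.
function escape_text_with_dom(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about the malformed input instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The prefix hints UTF-8 to the parser; the <div> gives the fragment a single root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Our wrapper is the outermost <div>, so it comes first in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        // Serializing node by node returns only the original fragment, without the wrapper.
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo escape_text_with_dom('<a href="http://www.google.com">Fish & Chips</a>'), "\n";
```

How the parser recovers from something as broken as `&><"` depends on the libxml version, so test it with your real input before relying on it. As far as I know the serializer also leaves `"` inside text alone, which is still valid HTML there.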