If it can be stated very simply, is this the only difference between htmlspecialchars() and htmlentities() in PHP?
htmlspecialchars() will change
< > & " into &lt; &gt; &amp; &quot;
and it will change ' into &#039; when ENT_QUOTES is set.
On the other hand, htmlentities() will convert every character that has a named HTML entity into the &[something]; form whenever it can, such as é into &eacute;. That is mainly to deal with non-English characters.
And that’s it.
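To make the difference concrete, here is a minimal sketch. The sample string and variable names are just made up for illustration, and the charset is passed explicitly as "UTF-8" so the result should not depend on the PHP version's default.

```php
<?php
// Hypothetical sample: "café" in UTF-8, plus the characters both functions care about.
$s = "caf\xC3\xA9 <b> & \"q\" 'a'";

// Only & < > " are converted; the single quote and the é are left alone.
$special = htmlspecialchars($s, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES additionally converts the single quote.
$specialQuotes = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts é, because it has a named entity.
$entities = htmlentities($s, ENT_COMPAT, 'UTF-8');

echo $special, "\n";       // café &lt;b&gt; &amp; &quot;q&quot; 'a'
echo $specialQuotes, "\n"; // café &lt;b&gt; &amp; &quot;q&quot; &#039;a&#039;
echo $entities, "\n";      // caf&eacute; &lt;b&gt; &amp; &quot;q&quot; 'a'
```

If the page is served as UTF-8 anyway, the first two forms are usually enough, since the browser can display the é directly.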
One more thing to note is that, before PHP 5.4, the string is assumed to be in ISO-8859-1 by default, which is 1 byte per character (from PHP 5.4 on, the default is UTF-8).
If the string is actually in UTF-8, then maybe htmlspecialchars() and htmlentities() will behave the same when "UTF-8" is passed as the 3rd argument: convert just the plain
& < > " '
characters and not touch the international characters, since the browser already takes them to be UTF-8 characters.
My first guess was that changing the charset affects how a "<" byte is treated in a string (I mean a byte with the value 60, or 0x3C in hex):
When no charset is set, every "<" is changed to &lt;
When the charset is set to UTF-8, a 0x3C in the string might happen to be the second, third, or fourth byte of a UTF-8 character, and then the 0x3C would be ignored (not converted into &lt;).
Actually, I did an experiment. With "UTF-8" passed in, htmlentities() does also convert UTF-8 characters into HTML entities where possible, such as the math symbols:
test:
<?php
// Print the raw string, then each function with and without an explicit charset.
function foo($s) {
    echo "Hello world. the char is\n";
    echo $s;
    echo "\n";
    $s1 = htmlspecialchars($s, ENT_COMPAT, "UTF-8");
    var_dump($s1);
    $s1 = htmlspecialchars($s);   // no charset given: ISO-8859-1 before PHP 5.4
    var_dump($s1);
    $s1 = htmlentities($s, ENT_COMPAT, "UTF-8");
    var_dump($s1);
    $s1 = htmlentities($s);       // no charset given: ISO-8859-1 before PHP 5.4
    var_dump($s1);
}
header('Content-Type: text/html; charset=utf-8');
$my_string = chr(0xE2) . chr(0x89) . chr(0xA1); # UTF-8, the identical char
foo($my_string);
# Now another test
echo "\n\nNow another test\n\n";
$my_string = chr(0xCF) . chr(0x86); # UTF-8, the Phi char
foo($my_string);
?>
result viewed as source:
Hello world. the char is
≡
string(3) "≡"
string(3) "≡"
string(7) "&equiv;"
string(15) "&acirc;�&iexcl;"
Now another test
Hello world. the char is
φ
string(2) "φ"
string(2) "φ"
string(5) "&phi;"
string(7) "&Iuml;�"
So htmlentities() knows how to convert the UTF-8 characters for the equiv symbol and the phi symbol into &equiv; and &phi;
Note that if htmlentities() is called without "UTF-8", it assumes 1 byte per character (ISO-8859-1) and converts some of those bytes into HTML entities such as &acirc;, which is incorrect.
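So
htmlspecialchars($s, ENT_COMPAT, "UTF-8");
and
htmlspecialchars($s);
are really the same, since a 0x3C byte can never be part of a multi-byte UTF-8 character (every byte of such a character is 0x80 or above). So converting 0x3C to &lt; is safe.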
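A quick way to check this, as a rough sketch: the sample characters below are just ones I picked, and 'ISO-8859-1' is passed explicitly to stand in for the old default.

```php
<?php
// A few multi-byte UTF-8 characters: é, φ, ≡ and a 4-byte emoji.
$samples = array("\xC3\xA9", "\xCF\x86", "\xE2\x89\xA1", "\xF0\x9F\x98\x80");

// Check that no byte of a multi-byte character falls in the ASCII range,
// so a 0x3C byte can only ever be a real "<".
$allHigh = true;
foreach ($samples as $ch) {
    foreach (str_split($ch) as $byte) {
        if (ord($byte) < 0x80) {
            $allHigh = false;
        }
    }
}
var_dump($allHigh);

// For valid UTF-8 input, the charset argument makes no difference here.
$s = "1 < 2 \xE2\x89\xA1 true";
$asUtf8   = htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
$asLatin1 = htmlspecialchars($s, ENT_COMPAT, 'ISO-8859-1');
var_dump($asUtf8 === $asLatin1);
```

Both var_dump() calls should print bool(true). One caveat, as far as I know: if the input is not valid UTF-8, the "UTF-8" call can return an empty string instead, so the two calls are only the same for well-formed input.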
On the other hand,
htmlentities($s, ENT_COMPAT, "UTF-8");
and
htmlentities($s);
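are different, since htmlentities() can convert many characters to &[something]; and in ISO-8859-1 those characters are the bytes 0xA0 to 0xFF. Every byte of a multi-byte UTF-8 character is 0x80 or above (the first byte is binary 11xxxxxx, and the 2nd to 4th bytes are binary 10xxxxxx, that is 0x80 to 0xBF), so without "UTF-8" many of those bytes get converted one at a time.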
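To see which bytes get hit, here is a small sketch that runs the equiv symbol through htmlentities() one byte at a time. The variable names are made up, and 'ISO-8859-1' is passed explicitly to reproduce the old default.

```php
<?php
$equiv = "\xE2\x89\xA1"; // UTF-8 bytes of the identical-to sign

$perByte = array();
foreach (str_split($equiv) as $byte) {
    $out = htmlentities($byte, ENT_COMPAT, 'ISO-8859-1');
    // Bytes with no Latin-1 entity come back unchanged; show those as hex.
    $perByte[] = ($out === $byte) ? sprintf('\x%02X', ord($byte)) : $out;
}
echo implode(' ', $perByte), "\n"; // &acirc; \x89 &iexcl;

// Whole string: treated as three Latin-1 characters vs. one UTF-8 character.
$latin1 = htmlentities($equiv, ENT_COMPAT, 'ISO-8859-1');
$utf8   = htmlentities($equiv, ENT_COMPAT, 'UTF-8');
echo strlen($latin1), ' ', $utf8, "\n"; // 15 &equiv;
```

The 15 matches the string(15) in the experiment above: two 7-character entities plus the one 0x89 byte that has no entity and is left as is.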