The only difference between htmlspecialchars() and htmlentities() in PHP

If it can be stated very simply, is this the only difference between htmlspecialchars() and htmlentities() in PHP?

htmlspecialchars() will change

< > & " into &lt; &gt; &amp; &quot;

and it will change ' into &#039; when ENT_QUOTES is set.
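To make the effect of the quote flag concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and the flags and encoding are passed explicitly because their defaults differ between PHP versions.

```php
<?php
// Hypothetical sample input containing all five special characters.
$s = '<a href="x">Tom & Jerry\'s</a>';

// ENT_COMPAT: converts < > & and double quotes, leaves single quotes alone.
$compat = htmlspecialchars($s, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES: additionally converts single quotes.
$quotes = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

echo $compat, "\n";
echo $quotes, "\n";
```

The only difference between the two printed lines should be the single quote in "Jerry's".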

On the other hand, htmlentities() will convert every character that has a named HTML entity into &[something]; whenever it can, such as é into &eacute;. That is mainly to deal with non-English characters.
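A small side-by-side sketch of that difference, assuming the input is UTF-8 and saying so explicitly in the third argument. The sample text is made up; the é is written as its two UTF-8 bytes so the example does not depend on how the source file is saved.

```php
<?php
// "café & thé" with é spelled out as the UTF-8 bytes 0xC3 0xA9.
$s = "caf\xC3\xA9 & th\xC3\xA9";

// Only the & is special in HTML, so only it is converted.
$special = htmlspecialchars($s, ENT_COMPAT, 'UTF-8');

// Here é is converted as well, because it has a named entity.
$entities = htmlentities($s, ENT_COMPAT, 'UTF-8');

var_dump($special);
var_dump($entities);
```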

And that’s it.

One more thing to note is that the string is assumed to be in ISO-8859-1 by default (this was the default before PHP 5.4; later versions default to UTF-8), which is 1 byte per character.

If the string is actually in UTF-8, then maybe htmlspecialchars() and htmlentities() will behave the same when "UTF-8" is passed as the 3rd argument: they would convert just the plain

& < > " '

characters and not touch the international characters, since those are already treated as UTF-8 characters by the browser.

htmlspecialchars() does what its name implies: it only converts characters that have special meaning in HTML, namely & " ' < >

Changing the charset will not alter that; it only changes the way the function reads and writes the characters.

I think changing the charset will affect how a “<” byte is treated in a string (I mean a byte which has the value 60, or 0x3C in hex).

When no charset is set, every “<” is changed to &lt;

When the charset is set to UTF-8, then even if there is a 0x3C in the string, it might happen to be the second, third, or fourth byte of a UTF-8 character, and then the 0x3C is ignored (not converted into &lt;).

Actually, I did an experiment. When called with "UTF-8", htmlentities() will also convert UTF-8 characters into HTML entities where possible, such as the math symbols:

test:

<?php

// Print $s, then dump it after each escaping call,
// with and without an explicit "UTF-8" charset argument.
function foo($s) {
	echo "Hello world.  the char is\n";
	echo $s;
	echo "\n";

	$s1 = htmlspecialchars($s, ENT_COMPAT, "UTF-8");
	var_dump($s1);

	$s1 = htmlspecialchars($s);
	var_dump($s1);

	$s1 = htmlentities($s, ENT_COMPAT, "UTF-8");
	var_dump($s1);

	$s1 = htmlentities($s);
	var_dump($s1);
}

header('Content-Type: text/html; charset=utf-8');

$my_string = chr(0xE2) . chr(0x89) . chr(0xA1);		# UTF-8, the identical char
foo($my_string);

# Now another test

echo "\n\nNow another test\n\n";

$my_string = chr(0xCF) . chr(0x86);		# UTF-8, the Phi char
foo($my_string);

?>

result viewed as source:

Hello world. the char is
≡
string(3) "≡"
string(3) "≡"
string(7) "&equiv;"
string(15) "&acirc;�&iexcl;"

Now another test

Hello world. the char is
φ
string(2) "φ"
string(2) "φ"
string(5) "&phi;"
string(7) "&Iuml;�"

So htmlentities() knows how to convert the UTF-8 characters for the equivalence symbol and the phi symbol into &equiv; and &phi;

Note that if htmlentities() is called without "UTF-8", it assumes 1 byte per character (ISO-8859-1) and converts some of those bytes into HTML entities such as &acirc;, which is incorrect.

By the way, it seems that in UTF-8 the 2nd to 4th bytes of a character can never be 0x3C (“<”) or 0x3E (“>”), or any of & " '

(see http://en.wikipedia.org/wiki/UTF-8: the 2nd to 4th bytes always have “10” as the most significant bits)
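The bit pattern can be checked directly on the two characters from the test. This is only a sketch: trailingBytesAreContinuation() is a hypothetical helper written for this post, not a PHP built-in, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when every byte after the first has the
// bit pattern 10xxxxxx, i.e. lies in the range 0x80..0xBF.
function trailingBytesAreContinuation(string $char): bool
{
    for ($i = 1; $i < strlen($char); $i++) {
        if ((ord($char[$i]) & 0xC0) !== 0x80) {
            return false;
        }
    }
    return true;
}

$equiv = "\xE2\x89\xA1"; // the identical-to sign from the test
$phi   = "\xCF\x86";     // the phi character from the test

foreach ([$equiv, $phi] as $char) {
    // Show each byte in hex and binary so the leading "10" is visible.
    foreach (str_split($char) as $byte) {
        printf("0x%02X = %08b\n", ord($byte), ord($byte));
    }
    var_dump(trailingBytesAreContinuation($char));
}

// All five special characters are plain ASCII, so they sit below 0x80
// and can never collide with a continuation byte.
var_dump(max(array_map('ord', ['<', '>', '&', '"', "'"])) < 0x80);
```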

As a result,

htmlspecialchars($s, ENT_COMPAT, "UTF-8");

and

htmlspecialchars($s);

are really the same for valid UTF-8 input, since 0x3C can never be part of a multi-byte UTF-8 character. So converting 0x3C to &lt; is safe.
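A quick way to check this, as a sketch: since the default charset depends on the PHP version (ISO-8859-1 in the test above, UTF-8 from PHP 5.4 on), the single-byte case is spelled out as "ISO-8859-1" here instead of relying on the default. The sample string is made up and reuses the two characters from the test.

```php
<?php
// Valid UTF-8 input: phi and the identical-to sign, wrapped in markup.
$s = "<b>\xCF\x86 & \xE2\x89\xA1</b>";

$asUtf8   = htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
$asLatin1 = htmlspecialchars($s, ENT_COMPAT, 'ISO-8859-1');

// Both calls only touch < > & and leave the multi-byte characters alone.
var_dump($asUtf8 === $asLatin1);
echo $asUtf8, "\n";
```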

On the other hand,

htmlentities($s, ENT_COMPAT, "UTF-8");

and

htmlentities($s);

are different, since htmlentities() in ISO-8859-1 mode can convert many bytes of 0x80 and above to &[something]; and every byte of a multi-byte UTF-8 character falls in that range (the 2nd to 4th bytes are binary 10xxxxxx, i.e. 0x80 to 0xBF).
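The same kind of check for htmlentities(), again only a sketch with both charsets passed explicitly so it does not depend on the PHP version's default. It reuses the phi bytes from the test; the variable names are made up.

```php
<?php
$phi = "\xCF\x86"; // phi encoded as UTF-8

// Read as UTF-8: one character, which has a named entity.
$asUtf8 = htmlentities($phi, ENT_COMPAT, 'UTF-8');

// Read as ISO-8859-1: two separate one-byte characters, and the first
// byte (0xCF) is converted on its own. In the test output above this
// call produced a 7-byte string.
$asLatin1 = htmlentities($phi, ENT_COMPAT, 'ISO-8859-1');

var_dump($asUtf8);
var_dump($asLatin1);
var_dump($asUtf8 === $asLatin1);
```

So with htmlentities() the charset argument has to match the actual encoding of the string, otherwise the multi-byte characters get split up.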