Why would anyone want to store the HTML entity version of (for example) É in their database these days? If I understand things correctly, as long as my data is UTF-8 encoded both in my database and in how I retrieve it, and I specify UTF-8 as the charset in my meta tag:
<meta charset="UTF-8" />
then the É will be both saved correctly and rendered correctly in the browser. It would seem to me, then, that the HTML entity will just take up extra space in my database.
Is there any downside to not encoding the characters?
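To put a rough number on the "extra space" point, here is a minimal Python sketch (the variable names are only illustrative) comparing the byte length of the raw character in UTF-8 with its named and numeric entity forms:

```python
# Compare the storage cost of a raw UTF-8 character with its HTML entity forms.
char = "É"                   # U+00C9, stored directly
named_entity = "&Eacute;"    # named HTML entity
numeric_entity = "&#201;"    # decimal numeric character reference

utf8_bytes = len(char.encode("utf-8"))
named_bytes = len(named_entity.encode("utf-8"))
numeric_bytes = len(numeric_entity.encode("utf-8"))

print(utf8_bytes, named_bytes, numeric_bytes)  # prints: 2 8 6
```

So for this one character the entity costs three to four times as many bytes as the UTF-8 encoding.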
Storing entities in the database made sense back in the days when the choice of character sets was limited and Unicode could not be used. When the web site's encoding was, for example, ISO-8859-1, any characters outside that character set were sent as entities by the browser.
Now that you can use Unicode, I don't see any advantage to storing entities. Storing entities may create unnecessary problems for yourself. For example, one day you may want to use the data for some purpose other than sending it to a web browser: putting it in a plain text email, or in a PHP-generated Word/PDF/spreadsheet document, etc. Then you would need to convert the entities back to their original characters. With a Unicode character set you can use the data straight away.
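As an illustration of that extra conversion step, here is a small sketch in Python rather than PHP (the sample string is made up). Data stored with entities has to be decoded before it can go into anything that is not HTML:

```python
import html

# Hypothetical value as it would sit in the database if entities were stored.
stored_with_entities = "Caf&eacute; &Eacute;lys&eacute;e"

# Extra step needed before using the text in a plain text email, PDF, etc.
plain_text = html.unescape(stored_with_entities)

print(plain_text)  # prints: Café Élysée
```

If the column held the UTF-8 text in the first place, this decoding step would simply not exist.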
Keep in mind too that the purpose of an HTML entity - or any HTML, for that matter - is to render your text, that is, to show a visual representation of it.
That's not the concern of a database. The database's role is to store data as efficiently as possible - and to make it easy to update and to retrieve the data. It's perfectly possible to store, say, an accented letter in 8 or 16 bits. Provided you have an agreed coding system, you can easily translate between the stored characters and the HTML entity at the time you want to display it.
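A rough sketch of that "translate at display time" idea, in Python for brevity; `to_html_ascii` is a hypothetical helper name, and this is only one possible way to do it:

```python
import html

def to_html_ascii(text: str) -> str:
    """Turn stored Unicode text into ASCII-only HTML at output time."""
    # First escape the characters that are special in HTML (&, <, >, quotes).
    escaped = html.escape(text)
    # Then replace every non-ASCII character with a numeric character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

rendered = to_html_ascii("École <b>")
print(rendered)  # prints: &#201;cole &lt;b&gt;
```

With a page served as UTF-8 the second step is not needed at all; escaping the HTML special characters is enough. The point is that the conversion belongs in the output layer, not in the stored data.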
In fact, this is true even without Unicode. It's true that Unicode lets you store a much greater range of characters, but, if that's not a requirement, you can happily use an 8-bit code.
It's perfectly possible to store, say, an accented letter in 8 or 16 bits. Provided you have an agreed coding system, you can easily translate between the stored characters and the HTML entity at the time you want to display it.
While this is true, it's important to bear in mind that encoding characters into entities (or anything else) means losing database support for the character set that is used. This depends on the database, but MySQL has extensive support for many character sets, so if you use the proper character set (or Unicode) the database can easily do things like sorting alphabetically, changing character case, converting to other character sets, searching in a case-insensitive manner, or even performing relaxed searches where looking for the letter E will also find accented versions of it. Additionally, none of the MySQL string functions will work properly on entity-encoded strings. You may never need those features, but it's good to keep this in mind.
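This is not MySQL, but the same effect can be shown with ordinary string operations. A minimal Python sketch, where `strip_accents` is a hypothetical helper standing in for an accent-insensitive collation:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Rough stand-in for an accent-insensitive comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = "École"            # stored as real characters
entity = "&#201;cole"    # the same word stored with an entity

print(len(raw), len(entity))   # prints: 5 10
print(raw.lower())             # prints: école
print(entity.lower())          # prints: &#201;cole (the É is untouched)

# A relaxed search for words starting with "e" only works on the raw text.
raw_matches = strip_accents(raw).lower().startswith("e")
entity_matches = strip_accents(entity).lower().startswith("e")
print(raw_matches, entity_matches)  # prints: True False
```

Length, case conversion and relaxed matching all give the wrong answer on the entity-encoded string, which is the same kind of breakage you would see with the database's own string functions and collations.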
Originally Posted by kreut
Thank you for the additional input. If at some point I need something more exotic, such as Chinese, Unicode will still suffice, correct?
Yes, I don't think Chinese is very exotic for Unicode.