What character code is responsible?

Hey SP,

I think this will be a quick question.

Once in a while I get strange characters appearing on my client’s site. We mostly copy and paste the synopsis descriptions into the database from multiple places on the web. I have never personally seen this happen, but apparently the web manager copies content into our custom-built CMS, and when we view the content later, strange characters appear.

If you visit: http://www.corelearningresources.com/store/item_display.php?i=2783 you might see what I am referring to.

In the middle of Brendell’s middle name I see strange characters after the s and p and also the E in Metis seems to also be missing. Is this because I am not using the right character type on the column, table or database?

Or maybe it’s the websites you copy the data from that display those characters the wrong way? Did you check the source web site?

I can’t say for sure, but I would presume the characters look correct on the source site when they copy them.

They copy them? I thought you were copying the content of other websites?

I have done minimal data entry for the clients but it has happened from time to time. For the most part they are copying the descriptions from sites on the web. Apparently, the bizarre characters only show up in the body of copy text after they’ve been added into the database which is what leads me to believe this is a character encoding problem.

Ok, that seems to rule out my idea of bad characters at the source. What charset are you using? UTF-8 ?

For this specific column, “book_description”, which is a MEDIUMTEXT, phpMyAdmin reports the collation as “utf8_unicode_ci”, which appears to be the default for any table I create. Are you thinking that should be something different?

Utf8 sounds good. I’m not expert enough to help you out here. Let’s hope Rudy or someone else can step in.

Yes, UTF-8 is fine. Is your website also being sent in UTF-8?

Check the headers using the Live HTTP Headers add-on for Firefox; it should say something like Content-Type: text/html; charset=UTF-8

If it doesn’t, send it yourself:


header('Content-Type: text/html; charset=UTF-8');

Also, make sure the HTML head says to use UTF-8:


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This is for your website, i.e. the website where people paste, not the one they copy from.

The problem may be that the browser sees that encoding type and then borks up the post request.

Just switch everything to UTF-8 and you should be fine.
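To make the failure mode concrete, here is a minimal PHP sketch of what happens when UTF-8 bytes get read as ISO-8859-1 somewhere along the way. The sample string and variable names are only for illustration, and iconv() is standing in for whichever layer (browser, PHP, or database connection) does the misreading; I am not claiming this is exactly what happens on your site.

```php
<?php
// "Zoë" as UTF-8 bytes: the ë is the two-byte sequence C3 AB.
$utf8 = "Zo\xC3\xAB";

// Simulate a reader that wrongly assumes ISO-8859-1: every byte is treated
// as its own character, then re-encoded as UTF-8 for display.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($utf8), "\n";    // 5a6fc3ab
echo bin2hex($misread), "\n"; // 5a6fc383c2ab
echo $misread, "\n";          // shows as "ZoÃ«" on a UTF-8 terminal or page
```

One accented letter turning into two odd-looking characters is the usual sign that UTF-8 was decoded as a single-byte charset at some step.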

Awesome! I shall try that and report back.

^ Oops, I did a copy / paste from another website and forgot to change the charset; it should be:


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Well, I checked everything, and it appears I am using UTF-8 in the database, UTF-8 is detected by Firefox, and our meta tag also states UTF-8.

I had some success using PHP’s built-in htmlentities() function. Here is a comparison between the two.

Not using htmlentities() = http://www.corelearningresources.com/store/item_display.php?i=2783
Using htmlentities() = http://dev.corelearningresources.com/store/item_display.php?i=2783
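For anyone following along, a small sketch of how htmlentities() depends on the encoding it is told to expect. The sample strings are made up for illustration. Note that before PHP 5.4 the default encoding argument was ISO-8859-1, and since 5.4 it is UTF-8, so it is safer to pass it explicitly.

```php
<?php
// The same name in two encodings.
$utf8   = "Zo\xC3\xAB"; // ë as UTF-8 (two bytes)
$latin1 = "Zo\xEB";     // ë as ISO-8859-1 (one byte)

// When the declared encoding matches the bytes, both give the same entity.
$fromUtf8   = htmlentities($utf8, ENT_QUOTES, 'UTF-8');
$fromLatin1 = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
echo $fromUtf8, "\n";   // Zo&euml;
echo $fromLatin1, "\n"; // Zo&euml;

// Declaring UTF-8 for bytes that are not valid UTF-8 returns an empty
// string (unless ENT_SUBSTITUTE or ENT_IGNORE is passed).
$mismatch = htmlentities($latin1, ENT_QUOTES, 'UTF-8');
var_dump($mismatch);    // string(0) ""
```

So if htmlentities() "mostly works", that may simply mean the bytes coming out of the database are in a different encoding than the page declares.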

It appears to work for most of it, but there is still one very strange character in the third paragraph, after the word “threatened”:

Then her beloved daughter, Zoë, is threatened � and Brendell takes matters into her own hands. To save Zoë, Brendell searches for the stalker and confronts not just a depraved madman but her own fears and prejudices.

I have tried to find out exactly what that character is, but to no avail. I can’t seem to make sense of the vast scope of character encoding. I even read SitePoint’s article on it and it confused me even more, lol.

Here is a link (http://software.hixie.ch/utilities/cgi/unicode-decoder/character-identifier?characters=�) to a website I found that apparently tries to identify character encoding. I can’t seem to make sense of it but perhaps someone more skilled in the ways of Encoding-kung-fu will shed some light. Ps. The site says “(this script is currently broken)” so perhaps all of that data is garbage.
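One way to find out what a mystery character really is: look at the raw bytes instead of the rendered page. The � symbol is U+FFFD, the replacement character a browser shows when it hits bytes it cannot decode, so the page itself will not tell you what the original byte was. Below is a rough sketch; the byte 0x97 (an em dash in Windows-1252) is purely my guess at a plausible culprit for pasted text, not something confirmed from the actual data.

```php
<?php
// Hypothetical sample: 0x97 stands in for the unknown character.
// In Windows-1252 that byte is an em dash; alone it is not valid UTF-8.
$text = "is threatened \x97 and Brendell";

// preg_match() with the /u modifier returns false on invalid UTF-8.
$isValidUtf8 = preg_match('//u', $text) === 1;
var_dump($isValidUtf8); // bool(false)

// Hex dump to see the raw bytes; look for anything at 80 or above.
echo bin2hex($text), "\n";

// Convert, assuming the source bytes really were Windows-1252.
$fixed = iconv('CP1252', 'UTF-8', $text);
$fixedIsValid = preg_match('//u', $fixed) === 1;
var_dump($fixedIsValid); // bool(true)
```

Running bin2hex() on the value fetched from the database should show whether the stored bytes are UTF-8 or something else.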

One last thing. Do you tell MySQL you want to communicate with it in UTF-8? There are functions to set that, depending on what connection type you’re using (mysql, mysqli, PDO). Google knows.

If that doesn’t work, you can try iconv(), and if that doesn’t work I don’t know anymore …

What do you mean by telling MySQL to communicate in UTF-8? How would I go about setting that up? The field is set to utf8_unicode_ci… is there something more to it than that?

Also, I just realized that this particular body of text that is giving me grief actually appears correctly (strange characters and all) from phpmyadmin and also from our custom built cms. So I’m starting to think perhaps the font we are using doesn’t support these strange characters that are in the synopsis. Is that possible?

Yes, see mysql_set_charset(), mysqli_set_charset(), or this post for PDO (http://php.net/manual/en/book.pdo.php#98659). Depending on how you connect to the database, you need one of those.

Theoretically, yes. But I just tried switching your website to Arial (using chrome dev tools) and that doesn’t work either, so that’s not it.
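A rough sketch of what setting the connection charset could look like with mysqli and PDO. The function names, host, database name, and credentials are all placeholders I made up. I default to utf8mb4 here, which needs MySQL 5.5.3 or later; pass 'utf8' instead to match the utf8_unicode_ci column from this thread. (The old mysql_* functions, including mysql_set_charset(), were removed in PHP 7.)

```php
<?php
// Hypothetical helpers; host, database name and credentials are placeholders.

// Build a PDO DSN that asks MySQL for the given connection charset.
// The charset= DSN option is honoured by PDO MySQL since PHP 5.3.6.
function utf8_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$dbname};charset={$charset}";
}

// PDO variant (defined but not called here, since it needs a live server).
function connect_pdo(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(utf8_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// mysqli variant: set the charset right after connecting.
function connect_mysqli(
    string $host,
    string $dbname,
    string $user,
    string $pass,
    string $charset = 'utf8mb4'
): mysqli {
    $db = new mysqli($host, $user, $pass, $dbname);
    $db->set_charset($charset);
    return $db;
}

echo utf8_dsn('localhost', 'store'), "\n";
// mysql:host=localhost;dbname=store;charset=utf8mb4
```

The column collation only controls how the data is stored; the connection charset controls how the bytes are translated between PHP and MySQL, which is why both need to agree.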

Success! Kind of … I have set the charset to utf8 and the descriptions appear to be displaying correctly on the public site, but for some reason in the <textarea> boxes in the CMS they display even worse now… Any ideas? Before it was the other way around.

I first considered the different fonts. The front end uses Arial I believe and the back end uses Tahoma. I switched it to Arial and it still looked messed up. I’m sure the clients will be happier with the front end displaying correctly but I would really just like to know what is going on.

Thanks for your help so far, anyway.

Edit:

I checked the page encoding on the back end and it was set to ISOxxxxx (I dunno, some number). I presume this is some kind of default for Firefox: even though the database stores the data in utf8 and it’s delivered in utf8, Firefox still doesn’t choose utf8 by default. So I’m guessing the problem is fully solved now. Thank you so much for taking the time to explain that the data has to be set to utf8 at every stage of the game.

Rant:
In all my years of experience working with web programming, I can’t help but feel this is one of the silliest problems I’ve ever encountered. If utf8 is known to be the common best practice, wouldn’t all web browsers default to that? Am I missing something in my final conclusion?