Issues with charset and French characters

Hi all,

I’ve never dealt with the display of foreign languages before, or at least I have never had any issues with it.

I’m coding a French version of an existing website and am having issues with getting consistent display of special French characters.

For all content pulled from a MySQL database, the source code and the resulting page in the browser look great. However, for the non-dynamic content in a simple PHP file, the resulting HTML is all messed up with Ž’s, small ^’s and squares, and so is the page in a browser.

The charset is iso-8859-1, but if I change it to utf-8 it gets worse and both results show those diamonds with ? in them.

A bit confused - am I missing something simple?

Is ISO-8859-1 the charset of the file, or the output stream?

Charset of the file - what I have declared in the head of the HTML.

That’s not the same thing. There’s a difference between the file’s charset and the charset that you place in the META-tag.

Yes, sorry - I have just saved the file as Unicode UTF-8 No BOM, and if I change the meta declaration to UTF-8, the file works. However, the pages calling the dynamic content are now also set to UTF-8 via meta, and there I am getting the ? in diamonds.

Saving the pages that call dynamic content as Unicode UTF-8 No BOM does not solve it.

Which charset is the database using?

Seems to be latin1 as far as I can tell.

That could be the problem. The charset of the data (the MySQL posts) and the pages (the PHP files) has to match the charset declared in the HTTP header and the META tag.

To make life easier for yourself, you should use the same character encoding throughout the publishing chain. That means your source files with static text should use the same encoding as your database. Otherwise you’ll have to convert all text retrieved from the database before writing it to the output stream.

Then, of course, your server must declare the very same encoding as you saved your source files with, and which you use in your database. If you use PHP you can send the right HTTP header yourself, using something like this (before anything is written to the output stream):

<?php header('Content-Type: text/html; charset=iso-8859-1'); ?>

Remember that a <meta> tag in your source code is ignored if the server declares the encoding in its Content-Type HTTP header.
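
If the database has to stay in latin1 for a while, the conversion mentioned above could look something like this minimal sketch. It assumes the text comes back from the database as ISO-8859-1 bytes and that the page itself is served as UTF-8. latin1_to_utf8() is a made-up helper name for illustration; iconv() is the standard PHP function.

```php
<?php
// Hypothetical helper: convert a string from ISO-8859-1 to UTF-8.
function latin1_to_utf8(string $text): string
{
    $converted = iconv('ISO-8859-1', 'UTF-8', $text);
    // iconv() returns false on failure; fall back to the original text.
    return $converted === false ? $text : $converted;
}

// "M\xE1laga" is 'Málaga' as ISO-8859-1 bytes, standing in for a database value.
$fromDatabase = "M\xE1laga";
$forOutput = latin1_to_utf8($fromDatabase);

// The same word is now encoded as UTF-8 and safe to write to a UTF-8 page.
echo $forOutput, "\n";
```

Every string that reaches the page would need to pass through such a helper, which is why using one encoding everywhere is usually less work.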

Okay thanks both - seems like the easiest thing to do is to change the database charset as it’s a small DB with only a couple of tables and standardise everything else.

What should I be using, out of interest? UTF-8, iso-8859-1? All a bit Greek to me.

UTF-8 is by far the more versatile of the two. If you know that you will only be writing in Western European languages, one should be as good as the other.

Hi guys, while the thread is still new, I’d like to add a question. Now that we’ve settled the charset, we can retrieve the data from the database and display it perfectly on the site. But if my variable holds “Málaga” and I need that in my URL, I bet the browser would throw crazy errors, because it can’t allow that weird a in the URL.

Is there some simple PHP code I can use to convert these weird characters back to their standard characters?

‘Málaga’ in a URI would be encoded as ‘M%C3%A1laga’ with UTF-8 (or ‘M%E1laga’ with ISO 8859-1).

You can use urlencode() in PHP to URL-encode a string, but PHP doesn’t have native support for UTF-8, so it might not be that easy: urlencode() works on bytes, so the result depends on the encoding your string is actually in.
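
A small sketch of what urlencode() does with the same word in both encodings. The byte escapes are only there so the example does not depend on how the file is saved; the variable names are made up for illustration.

```php
<?php
// 'Málaga' spelled out as raw bytes in each encoding.
$utf8   = "M\xC3\xA1laga"; // á is two bytes in UTF-8
$latin1 = "M\xE1laga";     // á is one byte in ISO-8859-1

// urlencode() percent-encodes each non-ASCII byte it is given.
$encodedUtf8   = urlencode($utf8);
$encodedLatin1 = urlencode($latin1);

echo $encodedUtf8, "\n";   // M%C3%A1laga
echo $encodedLatin1, "\n"; // M%E1laga
```

So the function itself is encoding-agnostic; what matters is that the string is in the encoding the receiving page expects.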

If you’re content to use ‘Malaga’ without the accent in your URI, you can use strtr() to translate anticipated accented characters to the unaccented versions. Again, though, you probably can’t use its three-argument form with strings encoded with UTF-8, since that form assumes that each character is represented by a single octet.
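
The single-octet limitation applies when strtr() is given two strings of characters. When it is given an array of replacement pairs instead, it matches whole byte sequences, so it can be used on UTF-8 text. A minimal sketch, assuming the PHP file is saved as UTF-8; strip_accents() is a hypothetical helper name and the map is deliberately incomplete.

```php
<?php
// Hypothetical helper; this file is assumed to be saved as UTF-8.
function strip_accents(string $text): string
{
    // Only a few pairs are listed here; extend the map as needed.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
        'ç' => 'c', 'ñ' => 'n',
        'Á' => 'A', 'É' => 'E',
    ];
    // With an array, strtr() replaces whole keys (byte sequences), longest first.
    return strtr($text, $map);
}

echo strip_accents('Málaga'), "\n"; // Malaga
```

Characters that are not in the map pass through untouched, so the result is only accent-free for the letters that were anticipated.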

Hmm, I’m thinking of collecting all accented versions of “a” and replacing them with the standard “a”, doing the same for every letter with similar cases, and turning it into a function to be used for all links. But running through all letters and checking all their accented versions doesn’t seem feasible. Isn’t there a way, when I echo a string, to temporarily change the charset for that output alone, so that the accented “a” is echoed as “a”?

No, the encoding applies to the entire document.

You may be able to use HTML entities to solve this problem. Try searching the Internet for complete sets. As for Malaga, the a acute can be encoded as á or á (if this message is rendered as HTML, then it will simply display the a acute; if so, you can encode as &aacute; or &#225;)

Entities all begin with an ampersand and end with a semicolon and in between may use either a number or a name, such as aacute. If a number, it is preceded by a pound sign; for a acute, the number is 225.
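
In PHP the named form does not have to be typed by hand. A short sketch using the built-in htmlentities() and html_entity_decode() functions, assuming the file is saved as UTF-8; the variable names are made up for illustration.

```php
<?php
// This file is assumed to be saved as UTF-8.
$city = 'Málaga';

// Named entity, where one exists for the character.
$named = htmlentities($city, ENT_QUOTES, 'UTF-8');
echo $named, "\n"; // M&aacute;laga

// Both the named and the numeric form decode back to the same character.
$fromNamed   = html_entity_decode('M&aacute;laga', ENT_QUOTES, 'UTF-8');
$fromNumeric = html_entity_decode('M&#225;laga', ENT_QUOTES, 'UTF-8');
var_dump($fromNamed === $fromNumeric); // bool(true)
```

Entities only help inside the HTML itself; they do not replace percent-encoding in a URL.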

As for managing large documents, if you are hand coding on a Mac, it may be tiresome to keep typing an entity. Instead, I suggest using the keyboard version of the letter and then doing a search and replace to substitute the entities. To obtain the keyboard version on a Mac, hold down the option key while you type an e; then type a. Other letters, such as c with a cedilla, are a bit easier (option-c). If you are on a PC, the system is a bit more complicated and may be no easier than typing the entity: hold down the Alt key, then type 0225 on the numeric keyboard. You may also be able to use macros or, on the Mac, a useful little program called TypeIt4Me.

I hope that helps!

Well, I see that my previous post, at least as displayed on my Mac, interprets the numeric entity, but not the name one. Further, it interprets semicolon-close paren as a smiley. To put it into words, the numeric code for a acute is “ampersand-pound sign-225-semicolon”.

Thanks rosenvinge! This method would then need a complete list of all HTML entities for all special characters, right? I thought that if I compiled that list, the looping would take a huge amount of time and might not be a sound solution, particularly if my page has dozens of links that need this function.

For the moment, I’m making use of this code I found on the web. It might be useful for readers of the site. It’s written in PHP.

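function safeurl($str) {
	// Character N of $special maps to character N of $standard, so both lists must stay the same length.
	// This file must be saved as UTF-8 for the utf8_decode() calls below to work.
	$special = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ";
	$standard = "AAAAAAACEEEEIIIIDNOOOOOOUUUUYbsaaaaaaaceeeeiiiidnoooooouuuuyby";
	// Convert to ISO-8859-1 so each character is one byte, as the three-argument strtr() expects.
	// Note: utf8_decode() and utf8_encode() are deprecated as of PHP 8.2.
	$str = utf8_decode($str);
	$str = strtr($str, utf8_decode($special), $standard);
	return utf8_encode($str);
}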

I can’t help you with the server-side implications, because I’m a graphic designer hand coding small sites. I’m not sure whether by links you mean links as such or the documents they point to. For my own purposes, I don’t use accented characters in file names, only in the html files themselves. So a file named malaga-dot-htm might include many instances of Málaga.

Thanks, but believe me, if I could eliminate their use I would. But the data will not be coming from our end, which is why we have to support them. So if someone with a keyboard supporting accented a types it in our form, I can’t pass that via GET/URL unless I change the accented a to the usual a.
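
For what it’s worth, the accent may not need to be dropped just to get through GET, as long as the value is percent-encoded when the link is built. A rough sketch, assuming the value is UTF-8; search.php and the city parameter are hypothetical names, not anything from the site in question.

```php
<?php
// Hypothetical page and parameter names, for illustration only.
$params = ['city' => "M\xC3\xA1laga"]; // 'Málaga' as UTF-8 bytes

// http_build_query() percent-encodes the values for use in a query string.
$query = http_build_query($params);
$link  = 'search.php?' . $query;
echo $link, "\n"; // search.php?city=M%C3%A1laga

// On the receiving page PHP decodes $_GET automatically; parse_str() mimics that here.
parse_str($query, $received);
var_dump($received['city'] === $params['city']); // bool(true)
```

Stripping the accents is then only needed where a plain ASCII URL is wanted for its own sake.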