Results 1 to 25 of 38
  1. #1
    SitePoint Evangelist stef25's Avatar
    Join Date
    Nov 2004
    Location
    belgium
    Posts
    465
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    french characters

    this question has surely been asked before but i couldn't seem to find an answer. what's the best way to process and store characters like é and ü (only w-european, so german, french etc., not asian characters) properly in a db?

    they are entered as é and ü in a rich text editor and then sent to the db, and they should come back out displayed as é and ü respectively. along with these characters, html markup from the RTE is also sent to the db. this should be printed out literally on the page, not as entities (that would leave the markup displayed instead of rendered).

    i find myself going around and around in circles, trying out various php functions, but every one seems to produce worse results. the site is in utf-8. i posted this in php, not mysql, since afaik it's a string processing issue more than anything.

    thanks for any advice!
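    A minimal sketch of the kind of guard being asked about, assuming PHP's mbstring extension is available; the helper name is illustrative, not a standard API:

```php
<?php
// Hypothetical helper: verify a submitted string really is UTF-8 before it
// goes anywhere near the database; if not, assume it arrived as ISO-8859-1
// and transcode it once.
function ensure_utf8(string $s): string
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
    }
    return $s;
}

// "\xE9" is the single ISO-8859-1 byte for "é"; the helper returns valid UTF-8.
echo ensure_utf8("caf\xE9");
```

    With a guard like this in place, whatever ends up mangled later is a display-encoding problem, not a storage problem.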
    I need someone to protect me from
    all the measures they take in order to protect me

  2. #2
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    European accentuated character corruption is a stamp of UTF-8. Use ISO-8859-1 instead
    Quote Originally Posted by stef25 View Post
    (Thu 30 Apr 2009 12:57 GMT)
    site is in utf-8
    This may be the reason. Each time you see corruption happen on or around European accentuated characters, you know the site is encoded in UTF-8. The probable cause is that UTF-8 is way too complicated for most programmers, so it gets wrongly implemented in MS products, which corrupt 2 or 3 characters around each European accentuated character as soon as someone in the workflow touches the source in MS software, such as by trying to edit the HTML source in OE (Outlook Express), or in many other circumstances that are more mundane, thus more frequent, but also more difficult to identify.

    Due to Microsoft products' ubiquity this is very frequent, which is why people in Europe tend to revert from UTF-8 to ISO-8859-1 (don't use ISO-8859-15 either; it's also a fast-thought "solution", with a smaller drawback than UTF-8, but a smaller benefit as well). In the USA the reversion is slower because many people never see the damage UTF-8 causes, since they use no European characters. For Long URLs, Accentuated Chars, encode as Quoted-Printable, Western European (ISO), use "EUR" for Euro symbol explains how to do this in OE; it's very similar in most other mail clients.

    Versailles, Thu 30 Apr 2009 16:45:50 +0200, edited (added title) 16:48:45

  3. #3
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Can you post example output from the editor? I'm not entirely sure what you mean.

    @Michel Merlin: Uh, are you sure? No Microsoft program has given me trouble with UTF-8 yet. It sounds like you are confusing your problem with reading a file in the wrong encoding: most non-ASCII characters in UTF-8 are multi-byte, so they appear as "corruption" around accented characters if read in a single-byte encoding.

  4. #4
    SitePoint Evangelist stef25's Avatar
    Join Date
    Nov 2004
    Location
    belgium
    Posts
    465
    example:

    when i enter ééééééççç into the RTE it puts this in the db: Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§

    the only processing ive done on this is mysql_real_escape_string

    ---

    edit: i'll be damned. changing the character encoding of the page via a meta tag from utf-8 to ISO-8859-1 seems to fix it. i always thought utf-8 was the best all-round encoding to use?

  5. #5
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Did you change the meta tag on the editor page or the page that displays the text?

  6. #6
    SitePoint Evangelist stef25's Avatar
    Join Date
    Nov 2004
    Location
    belgium
    Posts
    465
    changed it on the page that inserts and displays the text; didn't touch any settings in Aptana

  7. #7
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169

    See the corruption happen (on UTF-8-encoded European Accentuated Characters)


    Quote Originally Posted by sk89q View Post
    (Thu 30 Apr 2009 16:15 GMT)
    No Microsoft program has given me trouble with UTF-8 yet.
    This is because you are probably American and accordingly use no EACs (read further), which ensures you remain totally unaffected.
    Quote Originally Posted by sk89q View Post
    It sounds like you are confusing... because most characters in UTF-8 are multi-byte and so they appear as "corruption" around accented characters if read in a single-byte encoding.
    This is precisely what I am addressing: UTF-8 is too complicated (e.g. variable character length makes it much harder to predict the place of a character in memory), which is way too difficult to understand and apply reliably for contemporary programmers. The actual result, in real life, is that many emails (and a few web pages) get botched along the workflow when encoded in UTF-8. Proof is in 2 below: many companies, when encoding in UTF-8, strip all EACs from their French text (rendering it hard and unpleasant to read). Of course you wouldn't see so many stripped emails if UTF-8 were without problems.
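    The "place of a character in memory" point can be made concrete in PHP: byte-oriented functions like substr() can split a UTF-8 character in half, while the multi-byte-aware mb_substr() (from the mbstring extension, assumed available) counts whole characters:

```php
<?php
// "été" in UTF-8 is five bytes: C3 A9 74 C3 A9.
$s = "été";

// Byte indexing grabs only the first byte of "é" - an invalid fragment.
$firstByte = substr($s, 0, 1);   // "\xC3", half a character

// Character indexing returns the whole two-byte "é".
$firstChar = mb_substr($s, 0, 1, 'UTF-8');

echo strlen($firstByte), " ", strlen($firstChar); // 1 2
```
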

    1) An example of OE systematically corrupting 2 characters around each European Accentuated Character when trying to edit the source of an HTML message in UTF-8 was given, with explanations and images, in my message "Please post successful test of source-editing UTF-8 European HTML" posted Sun 21 Jan 2007 16:39 GMT in the thread The Definitive Guide to Web Character Encoding. It can also help to read more of that 4-page thread (don't be impressed by the ones posting hot air when lacking time to research facts and arguments).

    2) Here's an example showing:
    • a message with EAC (European Accentuated Characters), properly prepared in ISO-8859-1, posted on a web site (no matter whether encoded in ISO-8859-1 or UTF-8):

    • the reply sent in an email encoded in UTF-8:


    These 2 screen dumps show the phrase, properly written on the site, and corrupted in the UTF-8 email reply. That example is quite representative of what happens daily in France, where of course we have to deal with a lot of EACs and a lot of subsidiaries of American companies writing in UTF-8 (using no EACs, they don't see the damage UTF-8 causes, hence they continue imposing UTF-8 onto their FR subsidiaries).

    In France everyone receives plenty of emails with EACs either corrupted or replaced with unaccented ASCII characters, like:
    • "dans des circonstances hélas très fréquentes, 2 caractères autour de chaque caractère accentué" being replaced with either:
    • "dans des circonstances hélas très fréquentes, 2 caractères autour de chaque caractère accentué", or:
    • "dans des circonstances helas tres frequentes, 2 caracteres autour de chaque caractere accentue".
    (This message is written in ISO-8859-1 and the above phrase, meaning "in circumstances that are alas very frequent, 2 characters around each accented character", should appear correctly, depending on your system. If not, please find it in the 2 images attached.)

    Each time you receive such a corrupted (or preventively stripped) email, it is in UTF-8, most often sent by a big company, often a subsidiary of an international one. Conversely, EACs in email messages are carried with no problem throughout the whole workflow as long as they remain encoded in ISO-8859-1 through all stages. This may be why so many large sites have been encoding in ISO-8859-1 for so long, especially when handling email (I just noticed, though, that Yahoo Mail has recently switched from ISO-8859-1 to UTF-8, but I am not sure this won't worsen their problems).

    Now this applies mostly to email and to the current situation. Many large sites are successfully encoding pages in all languages in UTF-8 (e.g. Wikipedia, but this forces them to fence visitors into a proprietary and inconvenient editor); and the problem would disappear either if Microsoft deigned to overhaul its HTML handling system (programs and DLLs, for web and email) until they really fixed this, or if the market sufficiently turned away from Microsoft products. Unfortunately this is NOT the reality yet, so everyone wanting to write emails that will remain properly written and easy to read, whatever path they follow in the workflow, so far has to encode in a charset specific to their contents (ISO-8859-1 for Western languages, or whatever suits others, like Cyrillic, Asian, Arabic, etc).

    Versailles, Thu 30 Apr 2009 22:54:25 +0200, edited (title, images) 23:05:10
    Attached Images

  8. #8
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Quote Originally Posted by Michel Merlin View Post
    2) Here's an example showing:
    * ...
    * the reply sent in an email encoded in UTF-8:
    ...
    I should have mentioned (some American readers may not know or guess it) that:
    • my OE being properly configured (as should already be apparent from my 1st post in this thread and its link), in "View > Encoding" the Reply when received is duly set initially to "Unicode (UTF-8)"; all the EACs in the part repeating my original post are botched, all others are OK;
    • if in "View > Encoding" I set it to "Western European (ISO)", then the EACs original to Amazon get botched as well (and the others remain botched of course, or even get further worsened);
    • similar exchanges (posting on a site, receiving a copy of the parent post in an email reply) are very frequent and carry NO corruption when the reply is encoded in ISO-8859-1 (which generally implies that the site is too).
    This confirms conclusions drawn from plenty of forum discussions, and from tests I have made over the years, from various email accounts and web mails, to other mail or web accounts, through different PCs set up in different ways: corruption arises most often through long paths and always involves UTF-8.

    Now I am open to more inquiry and tests (and will recontact my Amazon counterpart to do so if they accept).

    Versailles, Fri 1 May 2009 00:01:40 +0200

  9. #9
    SitePoint Evangelist stef25's Avatar
    Join Date
    Nov 2004
    Location
    belgium
    Posts
    465
    im pretty sure Michel Merlin is on the right track ...

  10. #10
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Quote Originally Posted by Michel Merlin View Post
    See the corruption happen (on UTF-8-encoded European Accentuated Characters)

    This is because you are probably American and accordingly use no EACs (read further) which ensures you to remain totally unaffected.
    Yeah, that would be true if I didn't run websites with international visitors who did not post things in just English.

    Quote Originally Posted by Michel Merlin View Post
    This is precisely what I am addressing: UTF-8 is too complicated (e.g. variable character length makes much harder to predict the place of a character in memory), which is way too difficult to understand and apply reliably for contemporary programmers.
    No, it's not. UTF-8 is very well designed with regard to multi-byte characters. You can very easily tell how many bytes an upcoming character takes up by looking at the leading (high) bits of the first byte. What's hard about implementing UTF-8 is displaying it, because you have things like control characters and character composition. However, in this situation, French is a simple language compared to others: it does not need control characters, and no one will bother with character composition. Arabic and Vietnamese are different cases.
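    That rule can be sketched in a few lines; note that it is the high (leading) bits of the first byte that encode the sequence length (0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4):

```php
<?php
// Return how many bytes the UTF-8 sequence starting with $firstByte uses,
// based only on the leading bits of that byte.
function utf8_char_len(string $firstByte): int
{
    $b = ord($firstByte);
    if ($b < 0x80)            return 1; // plain ASCII
    if (($b & 0xE0) === 0xC0) return 2;
    if (($b & 0xF0) === 0xE0) return 3;
    if (($b & 0xF8) === 0xF0) return 4;
    return 0;                           // continuation byte or invalid lead byte
}

// "é" is 0xC3 0xA9, so its lead byte announces a 2-byte sequence.
echo utf8_char_len("é"[0]); // 2
```
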

    I was able to receive an email in OE just fine right now. However, I cannot say more on the subject because I am currently on a slow RDP session, and I'm not going to bother reading other threads at the moment. What I did see from what you posted has no bearing, however. You need to post a hex dump, from a packet sniffer, of an email encoded in UTF-8 that is displaying improperly. Only then can you properly pinpoint the source of the issue.

    ----

    @stef25:

    When you changed the meta tag, did the old characters that were already in the database start showing correctly, or did you test characters that you inputted into the editor after you made the change to the meta tag?

  11. #11
    SitePoint Evangelist stef25's Avatar
    Join Date
    Nov 2004
    Location
    belgium
    Posts
    465
    @sk89q i was still in the testing phase so i just emptied the db and started from scratch. i don't think the iso character encoding would have displayed characters like ééééééççç properly.

  12. #12
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Well, in the original situation, "ééééééççç" was properly saved as UTF-8. It looked like "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" "in the DB" because whatever you were using to look at the DB was showing the characters using the wrong encoding. When it was printed back into the browser, the browser was using the wrong encoding too.

    Now, the odd thing is that you say that the page where you input the characters and the page that displays them are one and the same, so the text should come out the same as you inputted it. However, I think there's something wrong with the diagnosis here, so you should post a sample for us to see, because something isn't adding up from what you've told us.
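    That diagnosis can be reproduced in two lines, assuming the mbstring extension: take valid UTF-8 bytes and re-encode them as if they were ISO-8859-1, which is exactly what a misconfigured viewer does when it renders the raw bytes:

```php
<?php
// "é" in UTF-8 is the two bytes 0xC3 0xA9. Telling PHP those bytes are
// ISO-8859-1 and converting them for display yields the classic mojibake.
$utf8     = "é";
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $mojibake; // Ã©
```
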

    By the way, you should send an HTTP header to declare the encoding, not use a meta tag.
    PHP Code:
    header("Content-Type: text/html; charset=utf-8"); 

  13. #13
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169

    Request for help testing EACs in UTF-8 email

    Quote Originally Posted by sk89q View Post
    (Thu 30 Apr 2009 23:02 GMT)
    Yeah, that would be true if I didn't run websites with international visitors who did not post things in just English
    Do you reply to them in their European language and with UTF-8-encoded email? If so, it would be helpful if you told me the URLs for these forms, so I post, you reply, and we can better find where and how the corruption eventually occurs, and where and how the situation could be improved (be it on my system, on your site, in email servers, or elsewhere). TIA for this, which would be highly useful (I recall that many European companies avoid either UTF-8 or EACs due to this annoying issue)
    Quote Originally Posted by sk89q View Post
    UTF-8 is very well-designed... What's hard about implementing UTF-8 is displaying it...
    Precisely what I said: UTF-8 is beautiful in theory, hard to implement in reality - and, in too many cases, actually badly implemented and/or used.
    Quote Originally Posted by sk89q View Post
    French is a simple language compared to other languages, because it does not need control characters and no one will bother with character composition. Arabic and Vietnamese are different cases.
    I am not sure what you mean by "character composition", and not sure the double-accented chars used in Vietnamese need significantly more complicated composition than dead keys (used in FR and other Western European languages).

    Versailles, Mon 4 May 2009 18:49:40 +0200 (added title 18:56:50)

  14. #14
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    EACs seem to travel in stef25's DB unharmed with ISO-8859-1, corrupt with UTF-8
    Quote Originally Posted by stef25 View Post
    (Thu 30 Apr 2009 20:34 GMT)
    when i enter ééééééççç into the RTE it puts this in the db: Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§
    ..............
    changing the character encoding of the page via a meta tag from utf-8 to ISO-8859-1 seems to fix it. i always thought utf-8 was the best all round encoding to use?
    Indeed, while gurus infinitely pontificate that UTF-8 (or ISO-8859-15) is heaven, silent reality checks show that (in email at least) reliability (thus convenience to the audience) is often better achieved through ISO-8859-1.
    Quote Originally Posted by stef25 View Post
    (Thu 30 Apr 2009 23:14 GMT)
    i dont think the iso character encoding would have displayed characters like ééééééççç properly.
    I guess you meant « I think that, if UTF-8 had been replaced with ISO-8859-1 all along the process, then the "ééééééççç" characters would have been displayed as "ééééééççç", not as "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" ». Anyway I think, like you apparently, that if you have the time for it, continuing the test you appropriately started (replacing UTF-8 with ISO-8859-1 through the whole process) would teach us more than the various hypotheses we could all post.

    Versailles, Mon 4 May 2009 18:50:20 +0200

  15. #15
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Incoherences in propagation of UTF-8
    Quote Originally Posted by sk89q View Post
    (Fri 1 May 2009 00:13 GMT)
    in the original situation, "ééééééççç" was properly saved as UTF-8. It looked like "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" "in the DB" because whatever you were using to look at the DB was showing the characters using the wrong encoding. When it was printed back into the browser, the browser was using the wrong encoding too
    You are assuming that 2 errors were made, in the report tool and in the browser (independent or propagated), while zero errors were made from keyboard to DB or could be due to UTF-8. I don't see a reason to be so oriented and so sure so early. In fact, many very similar problems arise in many different situations, all involving UTF-8, and they seem to come from misunderstandings in or between all or part of the UTF designers, OS and application programmers, writers, and editors.
    Quote Originally Posted by sk89q View Post
    Now, the odd thing is that you say that both the page where you input the characters and the page that display the characters are one and the same, so it should come out the same as you had inputted it.
    This is precisely the problem: one that, while never arising with specific charsets (like ISO-8859-1), is widespread with UTF-8 (which has made many European companies avoid either UTF-8 or EACs), and of which I reported one case with details and images in the link in point 1 of my "See the corruption happen" post of Thu 30 Apr 2009 20:54 GMT above.
    Quote Originally Posted by sk89q View Post
    However, I think there's something wrong with the diagnosis here
    Everyone easily thinks the test is wrong when it brings back a result different from their initial pre-built idea.
    Quote Originally Posted by sk89q View Post
    By the way, you should send a HTTP header to declare the encoding, not use a meta tag.
    This is one more of the official stances that, while infinitely repeated, are actually never really thought through and checked. Personally I strongly think that the W3C should state officially that the charset should be set in one single place, and that the best place is inside the document and under the eyes of the user, i.e. in the META tag. Notice that MS silently changed this a few years ago: now at the user level, while composing in OE, you can set the charset either by "Edit pane > Format > Encoding" or by "Source pane > META", and it will propagate properly in either direction (it didn't earlier, inducing people into doubt and errors).

    Versailles, Mon 4 May 2009 18:51:15 +0200

  16. #16
    SitePoint Wizard spence_noodle's Avatar
    Join Date
    Jan 2004
    Location
    uk, Leeds (area)
    Posts
    1,264
    Try this to see if it helps: place this code, a mysql query, before the insert or update query:

    You need to specify which character set you are sending to the database because MySQL needs to know.

    Code MySQL:
    mysql_query("SET NAMES 'utf8'");

    More info is here. It's all to do with how mysql stores characters: a single character may be stored as more than one byte.

    A snippet from the page:
    "UTF-8 uses one or more 8-bit bytes to store a single character, unlike ASCII and friends which use only one byte per character...

    ...As an example, let's take a pound sign (a real pound sign for you non-British types who call a hash a pound). In ISO 8859-1, the £ character has an ordinal value of 163 (0xA3 in hex) and by coincidence (or not), its Unicode code point is U+00A3. However, UTF-8 cannot store values above 127 in a single byte -- the encoding demands we use two. Omitting the grizzly details of the actual encoding process, you end up with the 2-byte sequence 0xC2A3, which just happens to correspond to the string "£" when expressed in ISO 8859-1
    I tried the above query today and it worked out fine: I had a problem with the pound sign, where it would be added into the database as '£'. Since adding the query, the problem has been solved. I hope it will help you.
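    The pound-sign arithmetic quoted above can be checked directly (assuming the mbstring extension for the second line). As a side note, on the later mysqli API, mysqli_set_charset($link, 'utf8') is the usual replacement for issuing SET NAMES by hand:

```php
<?php
// "£" is U+00A3; UTF-8 stores it as the two bytes 0xC2 0xA3, which a
// Latin-1 viewer renders as the two characters "Â£".
echo bin2hex("£"), "\n"; // c2a3
echo mb_convert_encoding("£", 'UTF-8', 'ISO-8859-1'), "\n"; // Â£
```
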
    "Don't you just love it when you solve a programming bug only to create another."

  17. #17
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Quote Originally Posted by Michel Merlin View Post
    Do you reply them in their European language and with UTF-8-encoded email? If so, it would be helpful that you tell me the URLs for these forms, so I post, you reply, and we can find better where and how the corruption eventually occurs, and where and how the situation could be improved (be it on my system, on your site, in email servers, or else).
    I don't have anything you can test right now, but here is an email sent in PHP and received in OE6:
    http://img231.imageshack.us/img231/1208/oe6.png
    The original email was sent in UTF-8 and I replied to myself in UTF-8. I also replied in HTML a second time. It came out fine. I also sent the original in ISO-8859-1, converted the encoding to UTF-8 in OE6, and that worked as well.

    Quote Originally Posted by Michel Merlin View Post
    Precisely what I said: UTF-8 is beautiful in theory, hard to implement in reality - and, in too many cases, actually badly implemented and/or used.
    I am not sure what you mean "character composition", and not sure double-accentuated chars used in Vietnamese need significantly more complicated composition than dead keys (used in FR and other Western European languages).
    Implementing basic UTF-8 isn't that difficult. It's no different from any other multi-byte encoding. There are plenty of programs that don't correctly display the special features of UTF-8, but I have not seen one that fails to show regular characters.

    By character composition, I mean the combining characters in Unicode. Dead keys are a way of inputting characters, but as far as encoding them goes, you can either encode the diacritic as a separate character or combine the diacritic with the original character, resulting in a whole new character. Most encodings have stuck to the latter because it is so much simpler to implement. In the case of French, you don't have a ton of diacritic combinations where the use of combining characters becomes necessary.

    For example, "é" can be represented as either "é" or "e + ́ ". The " ́" is special, because it will be drawn over the previous character. That's what a combining character is, and a lot of programs do not support this feature of Unicode at all.
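    The two spellings of "é" can be compared directly in PHP (7+, for the \u{} escapes):

```php
<?php
// Precomposed: one code point, U+00E9, two UTF-8 bytes.
$precomposed = "\u{00E9}";
// Decomposed: "e" followed by U+0301 COMBINING ACUTE ACCENT, three bytes.
$combining = "e\u{0301}";

// They render alike but are different byte strings, which is why naive
// string comparison can miss visually identical text.
var_dump($precomposed === $combining);              // bool(false)
echo strlen($precomposed), " ", strlen($combining); // 2 3
```
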

    You can see a list of combining characters here:
    http://sk89q.therisenrealm.com/playg...odecombinding/
    The "X" and the diacritics are two entirely different characters on that page.

    Quote Originally Posted by Michel Merlin View Post
    EACs seem to travel in stef25's DB unharmed with ISO-8859-1, corrupt with UTF-8
    Indeed, while gurus infinitely pontificate that UTF-8 (or ISO-8859-15) is heaven, silent reality checks show that (in email at least) reliability (thus convenience to the audience) is often better achieved through ISO-8859-1. I guess you meant « I think that, if UTF-8 had been replaced with ISO-8859-1 all along the process, then the "ééééééççç" characters would have been displayed as "ééééééççç", not as "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" ». Anyway I think, like you apparently, that if you have the time for it, continuing the test you appropriately started (replacing UTF-8 with ISO-8859-1 through the whole process) would teach us more than the various hypotheses we could all post.

    Incoherences in propagation of UTF-8
    You are assuming that 2 errors were made, in the report tool and in the browser (independent or propagated), while zero errors were made from keyboard to DB or could be due to UTF-8. I don't see a reason to be so oriented and so sure so early. In fact, many very similar problems arise in many different situations, all involving UTF-8, and they seem to come from misunderstandings in or between all or part of the UTF designers, OS and application programmers, writers, and editors.
    Yes, I do make that assumption, because it's generally true around these parts. However, I asked the OP to post an example link, because trying to find the problem personally is much easier.
    Last edited by sk89q; May 4, 2009 at 11:24.

  18. #18
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Quote Originally Posted by sk89q View Post
    (Mon 4 May 2009 18:45 GMT)
    an email sent in PHP and received in OE6:
    http://img231.imageshack.us/img231/1208/oe6.png
    The original email was sent in UTF-8 and I replied to myself in UTF-8. I also replied in HTML a secondtime. It came out fine. I also sent the original in ISO-8859-1, converted the encoding to UTF-8 in OE6, and that worked as well.
    Thx sk89q. This, while involving few EACs, shows at least one case where input on a form, replied to in an email, conveys EACs in UTF-8 properly. Which BTW adds to the odds that a number of UTF-8 problems are linked to MSW (MicroSoftWare).
    Quote Originally Posted by sk89q View Post
    "ééééééççç" represented in UTF-8 consists of the same bytes as "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" represented in ISO-8859-1/CP1252
    Thx. I should have tried this earlier. Indeed, pasting (as you said) "Ã©Ã©Ã©Ã©Ã©Ã©Ã§Ã§Ã§" into an ISO-8859-1 message in OE6, saving it, then "View > Encoding > Unicode (UTF-8)" does show "ééééééççç" - which does in turn support your assumption that there was no problem until the text was saved in the DB.
    Quote Originally Posted by sk89q View Post
    Implementing basic UTF-8 isn't that difficult... I have not seen that fails to show regular characters.
    As I reported a couple of times, conveying text in UTF-8, first, causes corruption of EACs as soon as the workflow enters certain circumstances that unfortunately are very frequent, and second, never corrupts ASCII chars - whence Americans being mostly unaffected. So "regular characters" actually ARE affected if they include EACs, but are NOT if they are only ASCII.
    Quote Originally Posted by sk89q View Post
    Now I found it odd because...
    This is typical of the problems with UTF-8. The reasonable best in real life is to get rid of UTF-8 and use ISO-8859-1 instead. If OTOH one wants to go further, then tests such as the one you just added are useful, but they require much more precision than you just gave - which unfortunately is time-consuming, hence not often possible. Thx anyway for what you did.

    PS. You must think as I do on this last point, since I just saw that you removed it. That test would be useful though (if someone has the time to do it and report it with the precision required), so I leave my reply to it anyway.

    Versailles, Mon 4 May 2009 21:47:15 +0200

  19. #19
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    By "regular characters," I meant everything in the Unicode repertoire that can be printed as-is onto the screen, character after character. I say that because just parsing Unicode isn't too hard, and displaying those characters is no different from displaying ASCII.

    Pinpointing where the error in encoding occurs isn't difficult or time-consuming, but you do need to know what you are doing. Having the OP perform the tests would be more work for me and the OP than I would want, because a lot of things in the environment could affect the result that I couldn't control for (unless I spent time writing out very specific instructions, which I surely don't want to do). That's why I removed that part of my post. I hadn't seen your post when I edited mine, either.

    I don't find this to be a reason to avoid UTF-8. You wouldn't avoid a feature of PHP just because you haven't been educated on it. Plus, eventually you will have to accept UTF-8, because people from different countries will be visiting your website, and they may post things in different languages.

  20. #20
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169

    ISO-8859-1 conveys all Unicode chars flawlessly

    Quote Originally Posted by sk89q View Post
    (Mon 04 May 2009 22:28 GMT)
    ...just parsing Unicode isn't too hard, and displaying those characters is no different than displaying ASCII.
    On Thu 30 Apr 2009 23:02 GMT you said the contrary (see, in my post of Mon 04 May 2009 16:49 GMT, my 2nd quote of you: "What's hard about implementing UTF-8 is displaying it").
    Quote Originally Posted by sk89q View Post
    Pinpointing where the error in encoding occurs isn't difficult or time consuming, but you do need to know what you are doing
    The fact is that such a test IS difficult and time-consuming, as shown on many occasions, including by your own admission (difficult: "because a lot of things in the environment could be affecting the result that I couldn't control for"; time-consuming: "would be more work for me and the OP than I would want... unless I spend time...").
    Quote Originally Posted by sk89q View Post
    I don't find this to be a reason to avoid UTF-8
    Many big international companies do find it one, however, and get rid of non-ASCII char corruption by either replacing UTF-8 with ISO-8859-1 (or another charset, see below), or replacing EACs with downgraded ASCII equivalents.
    Quote Originally Posted by sk89q View Post
    you will have to accept UTF-8 because people from different countries will be visiting your website, and they may post things in different languages.
    This is again the endlessly repeated theory. In real life, replacing Unicode (UTF-8) with ISO-Western-European (ISO-8859-1, "the default character set in most browsers"), or with Cyrillic, Arabic, Asian, or another charset (depending on the language primarily used on the site involved), will remove this corruption for good, and will at the same time silently convey the few characters NOT belonging to the initial charset. That conveying is done by browsers and other agents all along the workflow, using HTML entities or, better, NCRs, which works flawlessly in ISO-8859-1 (or other fixed-length charsets), but not as reliably so far in UTF-8.
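The NCR (numeric character reference) mechanism mentioned above can be sketched concretely. Python is used here purely because it makes the substitution visible; the same idea underlies PHP's `htmlentities()` and `mb_encode_numericentity()`. This only illustrates the mechanism itself, not any claim about its reliability relative to UTF-8:

```python
# Characters the target charset covers pass through unchanged;
# anything outside it is replaced with an NCR of the form &#NNNN;.
text = "café — Ω"  # "é" is in ISO-8859-1; "—" (em dash) and "Ω" are not

# The 'xmlcharrefreplace' error handler performs the NCR substitution.
as_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# "é" becomes the single Latin-1 byte 0xE9; the em dash (U+2014)
# becomes &#8212; and omega (U+03A9) becomes &#937;.
print(as_latin1)  # b'caf\xe9 &#8212; &#937;'
```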

    More precisely, I recall that ISO-8859-1 suffices to directly represent practically all characters of the western European languages (English, Spanish, German, French, Italian, Portuguese). On a site in one of these languages, it is rare that a visitor posts in another language; when one does, his text, while usually carrying many characters not covered by ASCII, will have few not covered by ISO-8859-1. He will probably replace them instinctively (as standard behavior) with combinations of usual European characters that are covered (people even go as far as replacing "ß" or "æ", even though they are part of ISO-8859-1, with "ss" or "ae"), and even if he doesn't, those characters will be conveyed flawlessly and silently throughout the workflow by any non-UTF charset, as told above.

    As said, other fixed-length charsets will do too, but ISO-8859-1 is the one that will, globally in the real world, minimize the ponderous use of NCRs and, being the default, will cause the fewest problems. Others are often not as thoroughly thought out and tested; for instance the Euro typographical symbol "€", if conveyed using ISO-8859-15 (Latin-9), will too often get translated into "¤" and then eventually "$" down the workflow, sparking ambiguities and errors; which is why I prefer to use ISO-8859-1 and the Euro FINANCIAL symbol "EUR", both of which are understood, written, and read flawlessly and unambiguously by any person or program around the world.

    Now of course UTF-8 will be better when it is finished, well tuned and reliable, yet this is unfortunately not the case so far. Which is why, in real life and for now, we need IMO to temporarily switch back from UTF-8 to ISO-8859-1 (as said in the link in my 1st post above, EAC corruption is a stamp of UTF-8. Use ISO-8859-1 instead).

    Versailles, Tue 5 May 2009 18:39:50 +0200

  21. #21
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Replacing UTF-8 with ISO-8859-1 removes EAC corruption



    I notice that no one has been able to bring any test challenging my opinion (and test) that replacing UTF-8 with ISO-8859-1 removes EAC corruption:
    • this EAC corruption is widely spread, and always denotes usage of UTF-8; it never happens if the whole flow is handled in other charsets (see above)
    • you (sk89q) admitted (Mon 4 May 2009 18:45 GMT: "I don't have anything you can test right now") being unable to support your claim of Thu 30 Apr 2009 23:02 GMT (that you run websites with international visitors who did post in NOT just English); your "but here is an email sent in PHP and received in OE6: http://img231.imageshack.us/img231/1208/oe6.png", as you apparently admitted, cannot replace such a test - which, I recall, would just require that you give me a URL where I can post a few EACs on (one of) your site(s), then that you reply in your usual way (in UTF-8, including a copy of my EACs, adding a few EACs of yours, and taking no special precautions)
    • stef25 (the OP) reported that corruption seems to disappear if he replaces UTF-8 with ISO-8859-1 (see my quote on Mon 04 May 2009 16:50 GMT). We just need he extends that replacement throughout his system (input, DB, report, display, print), and reports us the result, so to better see if such switch away from UTF-8 actually and totally removes his problem or not
    • Amazon.fr, to whom I submitted on Mon 4 May 2009 16:13:20 +0200 the test I announced Thu 30 Apr 2009 22:01 GMT (last line), managed to elude the test (I am open should they change their minds)
    • A large part (but not the whole) of that corruption comes from MSW (MicroSoftWare); but MSW is an overwhelming majority on the market, whether we like it or not, hence among the many and unpredictable persons and systems through which our email will transit; hence we have to take it into account, and know that as long as UTF-8 is not cured (whether in its definition, construction, or implementation, in MSW and elsewhere, or in its usage), EACs are highly exposed to corruption whatever we do when sending them in a UTF-8-encoded email message. I do acknowledge that UTF-8 is promising, but we all have to recognize that UTF-8 does NOT fulfill that promise so far, and still needs improvement
    • I recall that UTF-8, when encoding ASCII, brings no drawback, but no benefit either; yet draws big risk of corruption as soon as EACs get involved.
    So in conclusion, I still have no reason so far (yet I remain open to any argument or test - or help, in case some of the errors I observe would be on my side) not to maintain and confirm: while waiting for UTF-8 to be made more reliable throughout the workflow, I recommend staying away from UTF-8 in email and replacing it with ISO-8859-1, as many European companies and individuals are doing.

    Versailles, Tue 5 May 2009 18:46:15 +0200

  22. #22
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Michel Merlin View Post
    ISO-8859-1 conveys all Unicode chars flawlessly
    On Thu 30 Apr 2009 23:02 GMT you said the contrary (see, in my post of Mon 04 May 2009 16:49 GMT, my 2nd quote of you: "What's hard about implementing UTF-8 is displaying it").
    No. Please re-read what I wrote. I emphasized a difference between "regular characters" and special features of Unicode (combining characters, control characters, etc.).

    Quote Originally Posted by Michel Merlin View Post
    The facts is that such test is difficult and time consuming, as shown in many occurrences, including your own admission (difficult: "because a lot of things in the environment could be affecting the result that I couldn't control for", time consuming: "would be more work for me and the OP than I would want... unless I spend time...").
    It may be difficult for you and most programmers, but it's not for me or anyone else who has a grasp of encodings. The things in the environment are easy to control if I'm the one doing the test, but in this case, I have to work through this forum and convey my troubleshooting instructions in words.

    Quote Originally Posted by Michel Merlin View Post
    you (sk89q) admitted (Mon 4 May 2009 18:45 GMT: "I don't have anything you can test right now") being unable to support your claim of Thu 30 Apr 2009 23:02 GMT (that you run websites with international visitors who did post in NOT just English); your "but here is an email sent in PHP and received in OE6: http://img231.imageshack.us/img231/1208/oe6.png", as you apparently admitted, cannot replace such a test - which, I recall, would just imply that you give me an URL where I can post a few EACs on (one of) your site(s), then that you reply your usual way (in UTF-8, with including a copy of my EACs, adding a few EACs of yours, and taking no special precautions)
    That's because I rarely have to send an email with non-ASCII characters. The content stays on my site, and email is just used for mailing alerts in English. The test I performed with OE6 was done using code I had left over for a mass mailing where I did have to use non-ASCII characters, but it was something written for the command line and so I wouldn't have a form for it.

    Quote Originally Posted by Michel Merlin View Post
    stef25 (the OP) reported that corruption seems to disappear if he replaces UTF-8 with ISO-8859-1 (see my quote on Mon 04 May 2009 16:50 GMT). We just need he extends that replacement throughout his system (input, DB, report, display, print), and reports us the result, so to better see if such switch away from UTF-8 actually and totally removes his problem or not
    Or we can actually fix the problem. The problem more or less is caused by the OP's (no offense here) inexperience with encodings.

    Frankly, I find it hard to believe you because you do not appear to have that firm grasp of how encodings work. The fact that you weren't able to recognize the garbage earlier as merely improperly displayed -- but not corrupt -- UTF-8 indicates this. Someone with a strong understanding of the workings of encodings would have been able to notice this easily.

    Now, I can set up a form and all for you to test it, but I do not trust if you can properly troubleshoot the situation. Yes, you can give me a ton of screenshots, but unless I can play with OE, run a packet sniffer, or do whatever else necessary, I cannot actually make any conclusion based on the fact that "it looks wrong for you." On the other hand, if I could receive a UTF-8 encoded email in OE that doesn't come out right, then perhaps I could believe you. Unfortunately, I was not able to reproduce the problem in OE6 as of yet, so until then, I can't say that I agree with you.

    As for the web, UTF-8 works completely fine. It's always the error of whoever designed the website if a page comes out wrong. Internet Explorer successfully handles UTF-8, and so does every other major browser. In fact, IE features pretty good Unicode support compared to most Unicode-aware software. The only issue with UTF-8 in IE (6) is a potential XSS vulnerability, but if you sanitize all data, or don't use PHP (and use something that is Unicode-aware and strict), then there's no issue. PHP doesn't have trouble working with UTF-8 data at all (in the respect that it won't corrupt it unless you try to use its string functions), because it treats all strings blindly as 8-bit streams of data. MySQL is Unicode and encoding-aware and has been for a while (and before that, it just treated everything like PHP does now). Everything else that touches the data, including the OS and Apache, does not care about encodings.

    And beyond using Unicode for other languages, there are a lot of typographical symbols and such that you can only find in Unicode, such as the em and en dashes. You can use HTML entities in regards to an HTML page, but that is not feasible if you have to do anything with the data outside the realm of a webpage. At that point, you will have to use UTF-8.
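The distinction drawn above between *improperly displayed* and *corrupt* UTF-8 can be shown at the byte level. Python is used here only because it makes the bytes visible; the behavior itself is language-independent:

```python
# "é" in UTF-8 is the two-byte sequence 0xC3 0xA9.
utf8_bytes = "é".encode("utf-8")
assert utf8_bytes == b"\xc3\xa9"

# A page (or mail client) that wrongly interprets those bytes as
# ISO-8859-1 shows the classic two-character garbage "Ã©" ...
misread = utf8_bytes.decode("iso-8859-1")
assert misread == "Ã©"

# ... but the underlying bytes are untouched: re-interpreting them
# as UTF-8 recovers the original character. The data was displayed
# wrongly, not corrupted.
assert misread.encode("iso-8859-1").decode("utf-8") == "é"
```

This is why someone who recognizes the "Ã©" pattern can diagnose a charset declaration mismatch on sight: the stored data is usually still intact.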

  23. #23
    SitePoint Addict
    Join Date
    Dec 2008
    Location
    Brussels
    Posts
    377
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I've been making a guestbook these last few days and had problems with French chars too.

    Tried lots of stuff with:
    - htmlentities
    - specialchars
    - strip_tags
    - mysql_real_escape_string
    - mysql_query(" SET NAMES 'uft8' ");
    - iconv("ISO-8859-1", "UTF-8", $string);

    None of them worked as I wanted and still gave the wrong characters.

    Then I just changed this one in the head of the page:
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

    to this one:
    <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />
    Now I don't have problems anymore.

    Check: http://bulevardi.be/gloom/guestbook.php or on the whole site:
    http://www.gloommusic.com
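A plausible byte-level explanation for why switching the meta tag "fixed" it (assuming, as the table dumps later in the thread suggest, that the data was being stored as Latin-1 bytes): declaring ISO-8859-1 simply matches the bytes already in the database, whereas a UTF-8 declaration makes them invalid. Python is used only to make the bytes visible:

```python
# "é" stored as a single ISO-8859-1 byte:
latin1_byte = "é".encode("iso-8859-1")
assert latin1_byte == b"\xe9"

# Served on a page declared UTF-8, that lone byte is an invalid
# sequence (a UTF-8 continuation byte with no lead byte), which is
# why browsers show "?" or the replacement character:
try:
    latin1_byte.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 alone is not valid UTF-8")

# Declaring the page ISO-8859-1 matches the stored bytes, so the
# characters render again - without the underlying mismatch being fixed.
assert latin1_byte.decode("iso-8859-1") == "é"
```

Note also that the list above contains `SET NAMES 'uft8'`: that misspelling of `utf8` would most likely have been rejected by MySQL with an "Unknown character set" error, so that attempt probably never took effect at all.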

  24. #24
    SitePoint Zealot Michel Merlin's Avatar
    Join Date
    Mar 2005
    Location
    Versailles (France)
    Posts
    169
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Can you test the same form in UTF-8 and ISO-8859-1?

    Quote Originally Posted by bulevardi View Post
    (Thu 7 May 2009 15:07 GMT)
    Then I just changed this one in the head of the page:
    <meta... content="...charset=utf-8" />

    to this one:
    <meta... content="... charset=ISO-8859-1" />
    Now I don't have problems anymore.

    Check: http://bulevardi.be/gloom/guestbook.php
    or on the whole site: www.gloommusic.com
    Thanks bulevardi for this very helpful test.

    I notice that while the 2nd link is in ISO-8859-1, it uses frames, so its form is in fact the one shown in the 1st link, so it is also in ISO-8859-1.

    Could you repost the initial form, in UTF-8, under an URL like http://bulevardi.be/gloom/guestbook-UTF8.php ? This would let us all post the same thing on both forms (UTF-8 and ISO-8859-1), hence add another very useful layer to your interesting test. TIA for anything you could do,

    Versailles, Thu 7 May 2009 20:12:40 +0200

  25. #25
    SitePoint Addict
    Join Date
    Dec 2008
    Location
    Brussels
    Posts
    377
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Michel Merlin View Post
    Could you repost the initial form, in UTF-8, under an URL like http://bulevardi.be/gloom/guestbook-UTF8.php ? This would let us all post the same thing on both forms (UTF-8 and ISO-8859-1), hence add another very useful layer to your interesting test. TIA for anything you could do,
    I made 2 identical guestbooks (don't mind the poor layout), just a quick copy.

    http://bulevardi.be/gbISO88591/guestbook.php

    http://bulevardi.be/gbUTF8/guestbook.php

    Maybe the outcome is different when using different browsers, in different places on earth, etc. Let's have a try.

    Don't know if the sql syntax is important to some of you?

    CREATE TABLE IF NOT EXISTS `gbISO88591` (
    `id` int(11) NOT NULL auto_increment,
    `ip` varchar(30) NOT NULL default '',
    `naam` varchar(50) NOT NULL default '',
    `email` varchar(50) NOT NULL default '',
    `website` varchar(50) default NULL,
    `bericht` text NOT NULL,
    `datum` varchar(12) NOT NULL default '',
    `tijd` varchar(10) NOT NULL default '',
    PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=47 ;



    CREATE TABLE IF NOT EXISTS `gbUFT8` (
    `id` int(11) NOT NULL auto_increment,
    `ip` varchar(30) NOT NULL default '',
    `naam` varchar(50) NOT NULL default '',
    `email` varchar(50) NOT NULL default '',
    `website` varchar(50) default NULL,
    `bericht` text NOT NULL,
    `datum` varchar(12) NOT NULL default '',
    `tijd` varchar(10) NOT NULL default '',
    PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=47 ;
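Worth noting: both dumps declare `DEFAULT CHARSET=latin1`, so as far as MySQL is concerned neither guestbook actually stores UTF-8; the "UTF-8" table just holds UTF-8 bytes in Latin-1 columns. A sketch (in Python, only to show the bytes) of why that mismatch can go unnoticed for a long time, and why it bites later:

```python
# UTF-8 bytes stored in a latin1 column are kept as-is; MySQL just
# sees two latin1 characters ("Ã©") where the application meant one ("é").
stored = "é".encode("utf-8").decode("iso-8859-1")
assert stored == "Ã©"

# As long as nothing transcodes, reading the bytes back out and
# letting a UTF-8 page interpret them round-trips cleanly:
assert stored.encode("iso-8859-1").decode("utf-8") == "é"

# But if the column is ever genuinely converted (e.g. ALTER TABLE ...
# CONVERT TO CHARACTER SET utf8), MySQL transcodes the two latin1
# characters, and the stored value becomes the mojibake itself:
double = stored.encode("utf-8")
assert double == b"\xc3\x83\xc2\xa9"   # four bytes now, "double encoded"
```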

