  1. #1
    SitePoint Member
    Join Date
    Jan 2004
    Location
    Beijing
    Posts
    12

    mb_convert_encoding and UTF-8 to GB2312 conversion

    I am currently developing a web application that displays all HTML pages in UTF-8 encoding. The application also contains an online form where users can enter a message and send it out as an email in GB2312 format. However, if I change the online form's encoding to GB2312 so that the text input by the user is encoded with GB2312, the UTF-8 encoded text in the HTML form gets garbled.

    Therefore, I decided to keep the online form encoded in UTF-8 and use iconv or mb_convert_encoding to convert the UTF-8 text into GB2312 (Simplified Chinese). It seems, however, that neither iconv nor mb_convert_encoding does a thorough job of the conversion. With iconv, certain special characters such as - or , do not get converted properly. And when iconv encounters a character it doesn't recognise, it tends to stop the conversion right there, so I only receive the converted text up to the point where the unrecognised character was found.
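
The truncation described above is consistent with how PHP's iconv() reacts to a character the target charset cannot represent: the call fails at that point. The following is a minimal sketch of one possible mitigation, not a confirmed fix. It assumes an iconv build (such as glibc's) that understands the //TRANSLIT and //IGNORE suffixes, and it assumes that the wider GBK charset is acceptable to the mail recipients. The helper name utf8_to_gbk_iconv is hypothetical, and the script file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: convert UTF-8 text to GBK with iconv, degrading
// gracefully instead of stopping at the first unconvertible character.
function utf8_to_gbk_iconv($utf8)
{
    // //TRANSLIT asks iconv to approximate characters missing from the
    // target charset; //IGNORE asks it to drop whatever is still left over.
    // Support for both suffixes depends on the iconv implementation.
    $converted = @iconv('UTF-8', 'GBK//TRANSLIT//IGNORE', $utf8);

    if ($converted === false) {
        // Some builds still report failure even with //IGNORE, so the
        // caller decides what to do rather than sending truncated text.
        return null;
    }

    return $converted;
}

$message = "你好，世界";
$gbk = utf8_to_gbk_iconv($message);

if ($gbk === null) {
    echo "Conversion failed\n";
} else {
    // Each of these five characters occupies two bytes in GBK.
    echo strlen($gbk), " bytes after conversion\n";
}
```

Returning null on failure keeps a half-converted message from being mailed out silently; whether dropping or approximating characters is acceptable depends on the application.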

    mb_convert_encoding also has problems recognising certain Chinese characters, and these characters get garbled during the conversion.
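
One possible explanation for the garbled characters is that GB2312 covers only a subset of Chinese characters, so anything outside it has no byte sequence to map to. The sketch below assumes the mbstring extension is loaded and that GBK (which mbstring calls CP936) is an acceptable target. The function name utf8_to_gbk_mb and the round-trip check are illustrative choices, not something taken from the PHP manual, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: convert UTF-8 to GBK and report whether any
// character was lost along the way.
function utf8_to_gbk_mb($utf8)
{
    // Reject input that is not valid UTF-8 to begin with.
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        return array(null, false);
    }

    // CP936 is mbstring's name for GBK, a superset of GB2312.
    $gbk = mb_convert_encoding($utf8, 'CP936', 'UTF-8');

    // Converting back and comparing exposes characters that were
    // replaced because the target charset had no equivalent.
    $roundTrip = mb_convert_encoding($gbk, 'UTF-8', 'CP936');

    return array($gbk, $roundTrip === $utf8);
}

list($gbk, $lossless) = utf8_to_gbk_mb("简体中文测试");
echo $lossless ? "All characters survived\n" : "Some characters were replaced\n";
```

If the mail has to be labelled as GB2312 exactly, running the same check with 'EUC-CN' (mbstring's name for the GB2312 encoding) in place of 'CP936' should show which inputs fall outside that charset.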

    I'm new to all this UTF-8 encoding stuff, so I was wondering if there is a way to provide mb_convert_encoding or iconv with the most up-to-date charsets in order to ensure all characters are translated correctly without being garbled. Actually, I'm not even sure if obtaining the latest charsets is the correct solution. Has anybody ever experienced this kind of problem with iconv or mb_convert_encoding? And if so, were you able to find a solution?

    Many thanks for your help.
    Last edited by grumpy_developer; Aug 19, 2004 at 01:13.

  2. #2
    lastcraft
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Hi.

    I think you've stumped us. I am about to start tackling this whole localisation subject myself. Can I ask, which resources are you currently using for information?

    I know this doesn't help you, but the more developers that are aware of character set issues, the more they will beat down Zend's door to properly support Unicode.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  3. #3
    SitePoint Addict
    Join Date
    Mar 2003
    Location
    Germany
    Posts
    216
    (off-topic)

    As it turns out, I may write my diploma thesis (translation studies) about i18n approaches for websites, maybe developing an XML (i.e. XLIFF) <-> database converter along the way.
    The more I delve into the subject, the more I'm intimidated, though. This is such an incredibly complex field.


    I just thought that maybe Harry could add a few i18n resources to the Resources sticky thread? I mean, more and more people are forced to tackle this, and most people are quite lost, because there's not much info available, at least in the PHP world.

    Besides, I think that most i18n approaches for websites are insufficient, because they are derived from the world of desktop apps, but not every website is a web application. But maybe more on that later.

  4. #4
    SitePoint Member
    Join Date
    Jan 2004
    Location
    Beijing
    Posts
    12
    Quote Originally Posted by lastcraft
    Hi.

    I think you've stumped us. I am about to start tackling this whole localisation subject myself. Can I ask, which resources are you currently using for information?

    I know this doesn't help you, but the more developers that are aware of character set issues, the more they will beat down Zend's door to properly support Unicode.

    yours, Marcus
    Unfortunately, I haven't been able to find many resources on the subject beyond the PHP manual's descriptions of the iconv and mb_convert_encoding functions and discussion forums such as Experts Exchange and SitePoint.

    To work around this conversion problem, I no longer use the iconv function at all. Instead, I use a pop-up window encoded in GB2312 for the user to input data. This way, the text enters the system directly as GB2312, eliminating the need to convert it from UTF-8. Not a perfect solution, but it will have to do for now until I learn more about all this localisation stuff.

