  1. #1
    boballoo (SitePoint Zealot)
    Join Date
    Dec 2001
    Posts
    113

    Unicode or specific language charset?

    I am the owner of http://www.translationhelp.com and I am debating whether to use Unicode on the pages. There are over 500 translators in the database (not yet fully registered, as the site is still in development). Each translator needs to create at least two "profile" web pages: one in their source language and one in their target language. When they register, they therefore need to enter their profile information in two separate forms, one per language, and the submitted content must then display properly in the user's browser ("user" here means a potential client looking for a translator).

    I have heard of some problems with Unicode and browser/computer configuration so I am not sure if that is the best solution. It also seems more complex and therefore more expensive and perhaps more prone to bugs. I am also not sure how to implement Unicode so that the translators can read/type their info into the form in their languages and the pages resulting from the form submission are displayed in Unicode.

    The other solution is to set the form pages to be displayed automatically using the specific language charset for that page and to display the web pages using the charset of that page's language.

    The problem is that I have not had the actual site pages translated yet (that is coming soon), and the common elements (menus etc.) are all in English. This means a translator's profile page with a Japanese charset (for example) will also have menu items in English. Combining the two languages is troublesome, and the only solutions I can come up with are to use Unicode, or to display these pages without menus, opening them in a new browser window with target="_blank" and using a couple of images (in English) to close the window or whatever.

    I have been thinking and wondering about this for some time and I could use any help or opinions to get me over the hump of indecision I am stuck at.
    EditFast
    Any Document --> Any Time!
    Web Site Copy Editing & Proofreading


  2. #2
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    You don't have a choice.

    Unicode (or, rather, ISO 10646) is the character repertoire used by HTML. What you are thinking about is which character encoding to use. (See the HTML FAQ for more info about this.)

    My recommendation would be to use UTF-8 for multi-lingual sites. Some really old browsers may have an issue with it, but it's not enough to worry about. Using specific encodings for different languages will be a pain. If you edit one of those pages and save it with the wrong encoding, it will be unreadable.
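To illustrate why a single encoding is simpler for a multi-lingual page, here is a minimal Python sketch. The sample string and the `can_encode` helper are made up for this example; the point is only that one page mixing several scripts fits in UTF-8, while a language-specific encoding rejects the characters it does not cover.

```python
# Hypothetical page text: English menu items plus Japanese and Arabic content.
page = "Home | About | 翻訳者のプロフィール | مترجم"

def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# UTF-8 covers all of Unicode; Shift_JIS has no Arabic, ISO 8859-6 has no Japanese.
for enc in ("utf-8", "shift_jis", "iso8859_6"):
    print(enc, can_encode(page, enc))
```

With per-language encodings, each page would have to be checked like this against its own charset; with UTF-8 the question does not arise.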

    Another option would be UTF-16, but I think there are problems with browser support for that, so even though UTF-16 might be more efficient for some languages, I'd still recommend UTF-8.
    Birnam wood is come to Dunsinane

  3. #3
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Quote Originally Posted by AutisticCuckoo
    Another option would be UTF-16, but I think there are problems with browser support for that, so even though UTF-16 might be more efficient for some languages, I'd still recommend UTF-8.
    As additional information:
    Quote Originally Posted by Henri Sivonen
    UTF-16 is more compact than UTF-8 only when the number of characters from the U+0800–U+FFFF range exceeds the number of characters from the ASCII range—and the latter includes markup whenever well-known XML vocabularies are used.
    -- http://hsivonen.iki.fi/producing-xml/#utf
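The size trade-off in the quote can be checked with a rough Python sketch. The two sample strings and the `sizes` helper are invented for illustration, and the byte counts ignore any byte-order mark.

```python
def sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical samples: the same Japanese word with and without ASCII markup.
markup_heavy = '<p class="profile">翻訳者</p>'
plain_cjk = "翻訳者のプロフィール"

for label, sample in (("markup-heavy", markup_heavy), ("plain CJK", plain_cjk)):
    u8, u16 = sizes(sample)
    print(f"{label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Each ASCII character costs one byte in UTF-8 and two in UTF-16, while each character in the U+0800 to U+FFFF range costs three and two respectively, so the markup around short runs of CJK text tends to tip the balance toward UTF-8.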
    Simon Pieters

  4. #4
    Raffles (I meant that to happen)
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    I had a look on the net but couldn't find an answer to my question. Firefox tells me my page is UTF-8, but I haven't saved the PHP file as UTF-8 in my text editor (the editor says it's ASCII) because I keep forgetting to. However, I'm using header() to declare that it's UTF-8.

    Is it really UTF-8 then? Does PHP change the encoding to whatever I tell it to, much like my text editor would?

  5. #5
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    The encoding information is sent by your web server, as part of the Content-Type header. It must match the encoding you used when saving the file, or you may get incorrectly displayed characters (depending on the real and the declared encoding, plus the characters you've used).

    If you have saved the file as ASCII, declaring UTF-8 causes no problems, because ASCII is a subset of UTF-8. If you save as ISO 8859-1, however, you can get into trouble if you declare UTF-8, because the bytes in the upper half of ISO 8859-1 (0x80 to 0xFF) are not valid UTF-8 on their own; in UTF-8, each of those characters is encoded as two bytes.
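A small Python sketch of that difference, using made-up sample words and a hypothetical `is_valid_utf8` helper: ASCII bytes decode cleanly as UTF-8, whereas the single ISO 8859-1 byte for "é" does not, because UTF-8 represents that character with two bytes.

```python
# Hypothetical file contents saved under three different encodings.
ascii_bytes = "cafe".encode("ascii")
latin1_bytes = "café".encode("iso-8859-1")  # "é" is the single byte 0xE9
utf8_bytes = "café".encode("utf-8")         # "é" is the two bytes 0xC3 0xA9

def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for name, data in (("ASCII", ascii_bytes),
                   ("ISO 8859-1", latin1_bytes),
                   ("UTF-8", utf8_bytes)):
    print(name, data, "valid as UTF-8:", is_valid_utf8(data))
```

A browser faced with the ISO 8859-1 bytes under a UTF-8 declaration typically shows a replacement character rather than raising an error, which is the "incorrectly displayed characters" symptom.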

    PHP does not, by itself, change either the actual encoding or the declared one. You can use PHP to send the Content-Type header (using header()), but PHP won't change anything by itself.
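The same principle, sketched in Python rather than PHP since the idea is not language-specific: the declared charset and the bytes actually sent are two separate things, and it is up to your code to keep them in step. The `app` function, the `CHARSET` constant and the sample markup are assumptions made for this illustration; it is a minimal WSGI application, not a recommendation for how the site should be built.

```python
CHARSET = "utf-8"  # single source of truth for both the header and the bytes

def app(environ, start_response):
    """Minimal WSGI app: encode the body with the same charset it declares."""
    html = "<p>Profile: 翻訳者</p>"
    body = html.encode(CHARSET)  # the actual encoding of the bytes sent
    headers = [
        # The declared encoding, the counterpart of PHP's header() call.
        ("Content-Type", f"text/html; charset={CHARSET}"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]

# To try it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

Deriving both the header and the `encode()` call from one constant avoids the mismatch described above, where the header says one thing and the bytes say another.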
    Birnam wood is come to Dunsinane

  6. #6
    Raffles (I meant that to happen)
    Join Date
    Sep 2005
    Location
    Tanzania
    Posts
    4,662
    Thanks, that cleared it up.

