  1. #1
    busy (Twitter: @CarlBeckel)
    Join Date
    May 2004
    Location
    Richmond, VA, USA
    Posts
    819
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    utf-8 for a perfect world?

    Note: This is a discussion thread - not a quick answer thread. I'm looking for insight, not a charset recommendation.

    I have always been slightly irritated by the multitude of charsets that are available. While I partially understand why they exist, it seems to me like the charset situation is convoluted and thrown together.

    My main beef is with ISO 8859-1 (Latin-1). As I understand it, utf-8 is the same as Latin 1 and also gives you tons of additional characters.

    Although there are issues concerning an optional BOM, why is Latin 1 still so prevalent? Is there any reason to prefer it over utf-8? Is it good to try to do everything in utf-8 (like I have been doing for the past year) for anything using the English language? Should the digital world be moving towards phasing out many of these charsets or do we really need them?

    Is there ever any reason to choose Latin-1 over utf-8 on purpose?

  2. #2
    Dan Schulz (In memoriam)
    Join Date
    May 2006
    Location
    Aurora, Illinois
    Posts
    15,478
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Have you read AutisticCuckoo's article on the subject by any chance? (I recall him answering many of the questions you asked here, but don't have the time to go through it and find them.)

    http://www.sitepoint.com/article/gui...acter-encoding

  3. #3
    SitePoint Addict
    Join Date
    Dec 2007
    Posts
    358
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    For me, there is no reason to use a non-Unicode encoding when Unicode is an option. There is a theoretical concern that a Unicode encoding may double the size of the content (UTF-16), but with UTF-8 and site content in Latin-1 there is no significant difference.

  4. #4
    busy
    Join Date
    May 2004
    Location
    Richmond, VA, USA
    Posts
    819
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Dan Schulz View Post
    Have you read AutisticCuckoo's article on the subject by any chance?
    I believe I have, and it's worth a re-read. That was probably one of the things that got me thinking about encodings in the first place.

    Edit: Actually, I hadn't read that one until now. Thanks for the link!

    Alex - I agree. Personally I've never even tried to use utf-16 for anything yet. Even with the size doubling, content is generally not where I run into size problems so it seems to me like a minor sacrifice to make for the sake of simplicity and standards. I'd love it if I only had to worry about utf-8 and utf-16 in circumstances that require it.
    Last edited by busy; Jul 2, 2008 at 13:51. Reason: Added to post

  5. #5
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by busy View Post
    I have always been slightly irritated by the multitude of charsets that are available. While I partially understand why they exist, it seems to me like the charset situation is convoluted and thrown together.
    It's a quagmire that has evolved over time. The first computers didn't need to speak to one another, so there were no standard encodings. Then you got ASCII and EBCDIC which catered to the needs of American computer engineers in the '60s.

    Quote Originally Posted by busy View Post
    My main beef is with ISO 8859-1 (Latin-1). As I understand it, utf-8 is the same as Latin 1 and also gives you tons of additional characters.
    Not quite. And it's a bit complicated.
    ISO 8859-1 is both a character repertoire and an encoding.
    Unicode is a character repertoire.
    UTF-8 is an encoding used with Unicode.

    The first 128 code positions in Unicode are identical with US-ASCII and ISO 8859-1. They are also encoded the same way in UTF-8 (using a single octet).
    The code positions from 128-255 in Unicode are identical with ISO 8859-1, but they are encoded using two octets in UTF-8.

    Unicode also contains more than a hundred thousand other characters, which are encoded with two, three or four octets in UTF-8. Those characters are unavailable in ISO 8859-1.
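
    The octet counts described above can be checked with a short Python sketch. This is an illustration added for clarity, not part of the original post; the sample characters and the helper name octet_counts are arbitrary choices.

```python
# Octet counts per character in ISO 8859-1 versus UTF-8.
# The sample characters are arbitrary illustrations.
samples = {
    "A": "U+0041, in the US-ASCII range",
    "\u00e9": "U+00E9, in the 128-255 range shared with ISO 8859-1",
    "\u20ac": "U+20AC (euro sign), not available in ISO 8859-1",
    "\U0001F600": "U+1F600, outside the Basic Multilingual Plane",
}


def octet_counts(ch):
    """Return (iso_8859_1_octets_or_None, utf8_octets) for one character."""
    try:
        latin1 = len(ch.encode("iso-8859-1"))
    except UnicodeEncodeError:
        latin1 = None  # the character is not in the ISO 8859-1 repertoire
    return latin1, len(ch.encode("utf-8"))


for ch, note in samples.items():
    print(repr(ch), octet_counts(ch), note)
```

    A result of None in the first position means the character cannot be represented in ISO 8859-1 at all.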

    Quote Originally Posted by busy View Post
    Although there are issues concerning an optional BOM, why is Latin 1 still so prevalent?
    Mainly for historical reasons. It's the default encoding in many text editors and point-and-click tools under Windows (GNU/Linux mostly uses UTF-8).

    There are also many other components in the publishing chain that default to ISO 8859-1: the Apache HTTP server, the Tomcat web container and the JBoss application server, for instance. It's not trivial to get an Apache/Tomcat/JBoss combo to handle UTF-8 properly (you need to apply an undocumented hack in a server XML config file buried about nine subdirectories deep).

    UTF-16 is not a very good idea for web pages. It's a waste of space, since you'll double the size of the markup (all HTML markup characters are in the US-ASCII range and use a single octet in UTF-8). There's also questionable support in user agents, and then you have the whole endian thing.
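
    As a rough illustration of the size and byte-order points (a sketch added for clarity, not from the original post; the markup string is a made-up sample), Python can encode the same ASCII-only markup both ways:

```python
# Hypothetical sample markup containing only US-ASCII characters.
markup = "<p>Hello, world!</p>"

utf8 = markup.encode("utf-8")
# An explicit byte order is named here, so Python adds no BOM.
utf16 = markup.encode("utf-16-le")

# For ASCII-only text, UTF-16 needs two octets per character.
print(len(utf8), len(utf16))

# The "endian thing": the same character yields different octet
# sequences depending on the byte order chosen.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
print(little, big)
```

    Text that is mostly outside the ASCII range would show a different ratio, so the doubling applies to the markup rather than to every possible page.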
    Birnam wood is come to Dunsinane

  6. #6
    SitePoint Member
    Join Date
    Oct 2008
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi.

    I'm having trouble with UTF-16 and UTF-8 when switching from a Windows to a Linux platform using JBoss Application Server 4.2.2.GA.

    It was mentioned that you need to apply an undocumented hack in a server XML config file buried about nine subdirectories deep.

    How do you do that and where?

    Nicolai

  7. #7
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by nicolai View Post
    How do you do that and where?
    The exact location depends on the JBoss version. On my machine the file is %jboss-root%\server\default\deploy\jboss-web.deployer\server.xml (where %jboss-root% is the JBoss root directory).

    Once you find the server.xml file, look for a <Connector> tag with the attribute port="8080" (or whichever port you use). Then add the attribute URIEncoding="UTF-8" to that tag. Make sure it's written exactly like that, since XML is case-sensitive.
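
    For anyone who would rather script the change, here is a minimal Python sketch of the same edit (added for illustration, not from the original post). The SAMPLE document is a hypothetical, heavily trimmed stand-in for a real server.xml, and add_uri_encoding is a made-up helper name. Note that ElementTree drops XML comments when parsing by default, so editing the real file by hand, after making a backup, is probably the safer route.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a real server.xml.
SAMPLE = """<Server>
  <Service name="jboss.web">
    <Connector port="8080" address="localhost"/>
    <Connector port="8009" protocol="AJP/1.3"/>
  </Service>
</Server>"""


def add_uri_encoding(xml_text, port="8080", encoding="UTF-8"):
    """Return xml_text with URIEncoding set on the Connector for `port`."""
    root = ET.fromstring(xml_text)
    for connector in root.iter("Connector"):
        # Attribute values are strings, so compare against a string port.
        if connector.get("port") == port:
            connector.set("URIEncoding", encoding)
    return ET.tostring(root, encoding="unicode")


patched = add_uri_encoding(SAMPLE)
print(patched)
```

    Only the connector whose port matches is touched; the other connector is left as it was.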

  8. #8
    SitePoint Member
    Join Date
    Oct 2008
    Posts
    2
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    thx for the reply.

    I'll give it a go.

    Nicolai

  9. #9
    Black Max (SitePoint Wizard)
    Join Date
    Apr 2007
    Posts
    4,029
    Mentioned
    12 Post(s)
    Tagged
    0 Thread(s)
    Off Topic:

    Busy, I have to say, from my own accessibility viewpoint, that blinking avatar is maddening. Just my opinion, but geesh.

  10. #10
    busy
    Join Date
    May 2004
    Location
    Richmond, VA, USA
    Posts
    819
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Black Max View Post
    Off Topic:

    Busy, I have to say, from my own accessibility viewpoint, that blinking avatar is maddening. Just my opinion, but geesh.
    Off Topic:

    I've been meaning to change it; I've certainly been catching some heat for it and it's become stale by now anyways.

  11. #11
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Off Topic:

    I find all blinking or moving content maddening, which is why I've disabled animation in my browser. Thus my first thought when I read Max's post was, 'what blinking avatar?'

  12. #12
    busy
    Join Date
    May 2004
    Location
    Richmond, VA, USA
    Posts
    819
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Off Topic:

    Well then, here's a shark!

