Thread: utf-8 for a perfect world?
-
Jul 2, 2008, 08:09 #1
- Join Date
- May 2004
- Location
- Richmond, VA, USA
- Posts
- 819
utf-8 for a perfect world?
Note: This is a discussion thread - not a quick answer thread. I'm looking for insight, not a charset recommendation.
I have always been slightly irritated by the multitude of charsets that are available. While I partially understand why they exist, it seems to me like the charset situation is convoluted and thrown together.
My main beef is with ISO 8859-1 (Latin-1). As I understand it, utf-8 is the same as Latin 1 and also gives you tons of additional characters.
Although there are issues concerning an optional BOM, why is Latin-1 still so prevalent? Is there any reason to prefer it over utf-8? Is it good to try to do everything in utf-8 (like I have been doing for the past year) for anything using the English language? Should the digital world be moving towards phasing out many of these charsets or do we really need them?
Is there ever any reason to choose Latin-1 over utf-8 on purpose?
-
Jul 2, 2008, 08:58 #2
- Join Date
- May 2006
- Location
- Aurora, Illinois
- Posts
- 15,476
Have you read AutisticCuckoo's article on the subject by any chance? (I recall him answering many of the questions you asked here, but don't have the time to go through it and find them.)
http://www.sitepoint.com/article/gui...acter-encoding
Save the Internet - Use Opera | May my mother rest in peace: 1943-2009
Dan Schulz - Design Team Advisor | Follow me on Twitter
-
Jul 2, 2008, 09:08 #3
- Join Date
- Dec 2007
- Posts
- 358
For me, there is no reason to use a non-Unicode encoding when Unicode is an option. One theoretical objection is that a Unicode encoding can double the size of the content (UTF-16 does this for text that is mostly ASCII), but with UTF-8 and site content in the Latin-1 range there is no significant difference.
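To put rough numbers on the size point, here is a minimal sketch, assuming Python 3 and a made-up snippet of markup (the sample string is purely illustrative, not taken from any real page):

```python
# Encode the same mostly-ASCII markup three ways and compare byte counts.
# The sample text is a hypothetical snippet chosen for illustration.
sample = '<p class="intro">Caf\u00e9 cr\u00e8me, na\u00efve r\u00e9sum\u00e9</p>'

sizes = {
    name: len(sample.encode(name))
    for name in ("iso-8859-1", "utf-8", "utf-16-le")
}

for name, size in sizes.items():
    print(f"{name:>10}: {size} bytes")
```

In this sample, UTF-8 costs one extra byte per accented letter compared with ISO 8859-1, while UTF-16 takes two bytes for every character, markup included.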
-
Jul 2, 2008, 10:26 #4
- Join Date
- May 2004
- Location
- Richmond, VA, USA
- Posts
- 819
I believe I have, and it's worth a re-read. That was probably one of the things that got me thinking about encodings in the first place.
Edit: Actually, I hadn't read that one until now. Thanks for the link!
Alex - I agree. Personally I've never even tried to use utf-16 for anything yet. Even with the size doubling, content is generally not where I run into size problems so it seems to me like a minor sacrifice to make for the sake of simplicity and standards. I'd love it if I only had to worry about utf-8 and utf-16 in circumstances that require it.
Last edited by busy; Jul 2, 2008 at 13:51. Reason: Added to post
-
Jul 2, 2008, 14:16 #5
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
It's a quagmire that has evolved over time. The first computers didn't need to speak to one another, so there were no standard encodings. Then you got ASCII and EBCDIC which catered to the needs of American computer engineers in the '60s.
Not quite. And it's a bit complicated.
ISO 8859-1 is both a character repertoire and an encoding.
Unicode is a character repertoire.
UTF-8 is an encoding used with Unicode.
The first 128 code positions in Unicode are identical with US-ASCII and ISO 8859-1. They are also encoded the same way in UTF-8 (using a single octet).
The code positions from 128-255 in Unicode are identical with ISO 8859-1, but they are encoded using two octets in UTF-8.
Unicode then has room for over a million further code positions, of which more than a hundred thousand are assigned to characters so far. They are encoded with two, three or four octets in UTF-8. Those characters are unavailable in ISO 8859-1.
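The octet counts described above can be checked with a short sketch, assuming Python 3. The four sample characters are arbitrary picks for illustration, one from each range mentioned:

```python
# Arbitrary sample characters, one per range discussed above.
samples = {
    "A": "US-ASCII range",
    "\u00e9": "ISO 8859-1 range 128-255 (e with acute)",
    "\u20ac": "not in ISO 8859-1 (euro sign)",
    "\U0001d11e": "beyond U+FFFF (musical G clef)",
}

utf8_lengths = {}
latin1_ok = {}
for char, note in samples.items():
    utf8 = char.encode("utf-8")
    utf8_lengths[char] = len(utf8)
    try:
        char.encode("iso-8859-1")
        latin1_ok[char] = True
    except UnicodeEncodeError:
        # The character has no code position in ISO 8859-1.
        latin1_ok[char] = False
    print(f"U+{ord(char):04X} {note}: {len(utf8)} octet(s) in UTF-8, "
          f"ISO 8859-1 available: {latin1_ok[char]}")
```

The first two characters exist in both repertoires, but only the first has the same byte representation in ISO 8859-1 and UTF-8.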
Mainly for historical reasons. It's the default encoding in many text editors and point-and-click tools under Windows (GNU/Linux mostly uses UTF-8).
There are also many other components in the publishing chain that default to ISO 8859-1: the Apache HTTP server, the Tomcat web container and the JBoss application server, for instance. It's not trivial to get an Apache/Tomcat/JBoss combo to handle UTF-8 properly (you need to apply an undocumented hack in a server XML config file buried about nine subdirectories deep).
UTF-16 is not a very good idea for web pages. It's a waste of space, since you'll double the size of the markup (all HTML markup characters are in the US-ASCII range and use a single octet in UTF-8). There's also questionable support in user agents, and then you have the whole endian thing.
Birnam wood is come to Dunsinane
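For anyone wondering what "the whole endian thing" looks like in practice, here is a small sketch, assuming Python 3. The three-character string is just a stand-in for real markup:

```python
import codecs

text = "<p>"  # stand-in for real markup

# The same characters, with the two octets of each code unit in opposite order.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")
print(big.hex(" "))
print(little.hex(" "))

# A byte order mark at the start tells the decoder which order was used.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")

# Guessing the wrong byte order does not raise an error here;
# it silently produces different characters.
wrong = little.decode("utf-16-be")
print(decoded == text, wrong == text)
```

Without a BOM or an explicit label, a consumer has to guess the byte order, which is one more thing that can go wrong compared with UTF-8.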
-
Oct 17, 2008, 02:28 #6
- Join Date
- Oct 2008
- Posts
- 2
Hi.
I'm having trouble with UTF-16 and UTF-8 when switching from a Windows to a Linux platform using JBoss Application Server 4.2.2.GA.
It was mentioned that you need to apply an undocumented hack in a server XML config file buried about nine subdirectories deep.
How do you do that, and where?
Nicolai
-
Oct 17, 2008, 05:12 #7
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
The exact location depends on the JBoss version. On my machine the file is %jboss-root%\server\default\deploy\jboss-web.deployer\server.xml (where %jboss-root% is the JBoss root directory).
Once you find the server.xml file, look for a <Connector> tag with the attribute port="8080" (or whichever port you use). Then add the attribute URIEncoding="UTF-8" to that tag. Make sure it's written exactly like that, since XML is case-sensitive.
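If it helps to see the change spelled out, here is a rough sketch, assuming Python 3 and its standard xml.etree.ElementTree module. The sample XML is a hypothetical, heavily trimmed stand-in for a real server.xml, and the helper name add_uri_encoding is made up for this example:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a real server.xml.
SAMPLE_SERVER_XML = """\
<Server>
  <Service name="jboss.web">
    <Connector port="8080" protocol="HTTP/1.1" />
    <Connector port="8009" protocol="AJP/1.3" />
  </Service>
</Server>
"""


def add_uri_encoding(xml_text, port="8080", encoding="UTF-8"):
    """Return xml_text with URIEncoding set on the Connector for the given port."""
    root = ET.fromstring(xml_text)
    for connector in root.iter("Connector"):
        if connector.get("port") == port:
            # Attribute name must match exactly; XML is case-sensitive.
            connector.set("URIEncoding", encoding)
    return ET.tostring(root, encoding="unicode")


updated = add_uri_encoding(SAMPLE_SERVER_XML)
print(updated)
```

For a real file, editing by hand as described above is probably simpler, since ElementTree drops comments by default and may change the formatting when it writes the document back out.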
-
Oct 17, 2008, 05:25 #8
- Join Date
- Oct 2008
- Posts
- 2
thx for the reply.
I'll give it a go.
Nicolai
-
Oct 20, 2008, 22:25 #9
Off Topic:
Busy, I have to say, from my own accessibility viewpoint, that blinking avatar is maddening. Just my opinion, but geesh.
-
Oct 21, 2008, 10:50 #10
- Join Date
- May 2004
- Location
- Richmond, VA, USA
- Posts
- 819
-
Oct 21, 2008, 11:40 #11
- Join Date
- Nov 2004
- Location
- Ankh-Morpork
- Posts
- 12,158
Off Topic:
I find all blinking or moving content maddening, which is why I've disabled animation in my browser. Thus my first thought when I read Max's post was, 'what blinking avatar?'
-
Oct 21, 2008, 16:14 #12
- Join Date
- May 2004
- Location
- Richmond, VA, USA
- Posts
- 819
Off Topic:
Well then, here's a shark!