  1. #1
    SitePoint Addict
    Join Date
    Feb 2004
    Location
    belfast
    Posts
    386

    Characters missing from Latin-1

    Hey all,

    I'm looking for a bit of help. If I have a UTF-8 file and I want to store it in a Latin-1 DB, what characters will I end up having problems with?

    Basically, what characters are missing from the Latin-1 character set?

    I found this link that lists 27 characters that are in UTF-8 and not in Latin-1.

    Am I correct in the way I have read this table?

  2. #2
    SitePoint Wizard chris_fuel
    Join Date
    May 2006
    Location
    Ventura, CA
    Posts
    2,750
    Well, I'm not sure missing characters are the worst of your problems. Latin-1 is a single-byte encoding, while UTF-8 is a multi-byte encoding. That means that when you store your data, the database is expecting 1 byte = 1 character, but with UTF-8 it's x bytes = 1 character (x being anywhere from 1 to 4). That's the main problem. If you're using MySQL, you can convert the database to a different character set though.
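
    To make the byte-count point concrete, here is a minimal Python sketch. The helper name `byte_lengths` and the sample characters are only illustrative, and it says nothing about how any particular database stores text.

```python
# Compare how many bytes a character needs in UTF-8 versus Latin-1.
samples = ["A", "é", "€", "日"]

def byte_lengths(ch):
    """Return (utf8_len, latin1_len); latin1_len is None if Latin-1 cannot encode ch."""
    utf8_len = len(ch.encode("utf-8"))
    try:
        latin1_len = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        # The character has no Latin-1 representation at all.
        latin1_len = None
    return utf8_len, latin1_len

for ch in samples:
    print(repr(ch), byte_lengths(ch))
```

    Plain ASCII takes one byte either way, an accented letter such as "é" takes two bytes in UTF-8 but one in Latin-1, and characters such as "€" have no Latin-1 form at all.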

  3. #3
    SitePoint Addict
    Join Date
    Feb 2004
    Location
    belfast
    Posts
    386
    Hi chris,

    Thanks for the reply. I'm using Sybase in this instance, not MySQL. Thanks for the technical info, much appreciated. I've a small problem in that I'm talking to management people, and most don't come from dev backgrounds. The way you explained it is fine for me, and in general it will be easy for them to understand. However, if I were to show them the chart in the link and say this is a list of the common characters that may prove problematic, would that cover most of my cases?

  4. #4
    SitePoint Wizard chris_fuel
    Join Date
    May 2006
    Location
    Ventura, CA
    Posts
    2,750
    I haven't really done anything with Sybase, to be honest. What I would do is create a test database, store the data in it, and see what data comes back. Then it's simply a matter of showing "here's the actual data, here's what the data becomes when stored in the database".
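
    Before a test database is available, the same before/after comparison can be roughed out in Python. This is only a simulation under assumptions: the function names are made up for illustration, and what Sybase actually does depends on its server and client character set configuration, so the real test is still needed.

```python
def store_as_latin1(text, errors="replace"):
    """Simulate text being converted to Latin-1 on the way in, then read back."""
    # With errors="replace", unencodable characters become "?".
    return text.encode("latin-1", errors=errors).decode("latin-1")

def misread_utf8_as_latin1(text):
    """Simulate raw UTF-8 bytes stored unconverted and later read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

original = "café costs 5€"
print("original:          ", original)
print("converted:         ", store_as_latin1(original))
print("misread UTF-8 bytes:", repr(misread_utf8_as_latin1(original)))
```

    The first case loses the characters Latin-1 lacks; the second keeps every byte but shows garbage such as "Ã©" in place of "é".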

  5. #5
    longneck
    Join Date
    Feb 2004
    Location
    Tampa, FL (US)
    Posts
    9,854
    if you found a list of 27 characters, then it's about 2^16 - 27 characters too short, and that's only counting the basic multilingual plane. latin1 has just 256 code points, while utf-8 can encode all of unicode.

    what you need to do is look at it from the other way around. look at the character listing for latin1. then look at your data and see if there are any characters outside of that listing. basically, if your text is all english, then you should be fairly safe just blindly converting it to latin1.
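
    That check can be sketched in Python as below. The function name `non_latin1_chars` and the sample string are illustrative, and the test assumes "latin1" means ISO-8859-1 exactly (code points 0 to 255). Note that curly quotes and the euro sign, which are common even in English text, belong to Windows-1252 but not to ISO-8859-1.

```python
from collections import Counter

def non_latin1_chars(text):
    """Count the characters in text that fall outside Latin-1 (code point > 0xFF)."""
    return Counter(ch for ch in text if ord(ch) > 0xFF)

sample = "Plain English, naïve café, “smart quotes” and 10€"
for ch, count in non_latin1_chars(sample).most_common():
    # Print the code point so the offending character can be looked up.
    print(f"U+{ord(ch):04X} {ch!r} x{count}")
```

    Running this over the real data gives a list of exactly the characters that would be lost, which is more useful to show than a generic chart.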

