  1. #1
    SitePoint Zealot
    Join Date
    May 2003
    Location
    Sarasota, FL
    Posts
    196

    Fixing content that mixes UTF-8 and ISO-8859-1 encoding?

    I have some content coming from the DB (originally entered into a WYSIWYG CMS editor) that appears to mix UTF-8 encoded text with ISO-8859-1 text. The pages themselves are served as UTF-8 and the database uses the utf8_general_ci collation. However, some content still shows up as � characters.

    I've tried various combinations of utf8_encode, utf8_decode and mb_convert_encoding, but in every case some data still appears as �, ™ or "?".

    Is there a magic function that will fix this, either in MySQL or PHP, or is there another solution short of rekeying all of the data using a single charset?
    Chris Bloom
    Web Application Developer

  2. #2
    kyberfabrikken (SitePoint Wizard)
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Originally posted by xangelusx:
    Is there a magic function that will fix this
    mb_detect_encoding can "magically" detect the charset. You can then convert everything to a common standard (UTF-8 would be a good candidate). Note that mb_detect_encoding only makes an educated guess, so it can misidentify the encoding, but it's reasonably reliable.
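
    A minimal sketch of that detect-then-convert step, assuming each string is entirely in one encoding and that the only candidates are UTF-8 and ISO-8859-1. The helper name to_utf8 is made up for illustration. UTF-8 is checked first because nearly any byte string is also "valid" ISO-8859-1, so a single-byte charset listed first would always win.

```php
<?php
// Hypothetical helper: guess ONE encoding for the whole string, then convert to UTF-8.
function to_utf8(string $text): string
{
    // Well-formed UTF-8 (including plain ASCII) is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Strict mode (third argument) rejects candidates for which the
    // string contains invalid byte sequences.
    $detected = mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1'], true);

    if ($detected === false) {
        return $text; // no candidate matched: leave the data alone
    }

    return mb_convert_encoding($text, 'UTF-8', $detected);
}

// "caf\xE9" is "café" in ISO-8859-1; the result is the UTF-8 form "caf\xC3\xA9".
echo bin2hex(to_utf8("caf\xE9")), "\n";
```

    Because the guess covers the whole string, this does nothing useful for a record that mixes encodings.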

    This of course presupposes that each record contains data in only one encoding, which is simply unknown. If you have data from different encodings mixed up in the same record, then I don't think there is any reliable way of fixing it.
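
    For the mixed case, the most one can do is a best-effort heuristic, not a reliable fix. The sketch below (the function name repair_mixed_utf8 is hypothetical) keeps every well-formed UTF-8 sequence as-is and treats any leftover byte as Windows-1252, which is an assumption about where the stray bytes came from. It will guess wrong if single-byte text happens to form a valid UTF-8 sequence, and it does not undo text that was already double-encoded.

```php
<?php
// Hypothetical helper: best-effort repair of a string that mixes UTF-8
// with single-byte text (ASSUMED to be Windows-1252).
function repair_mixed_utf8(string $bytes): string
{
    // Each alternative matches one well-formed UTF-8 sequence;
    // the final (.) captures any single byte that fits none of them.
    $pattern = '/
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
      | (.)
    /xs';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            if (!isset($m[1])) {
                return $m[0]; // already valid UTF-8: keep unchanged
            }
            // Stray byte: reinterpret it as Windows-1252 (assumption).
            return mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252');
        },
        $bytes
    );

    // preg_replace_callback returns null on failure; fall back to the input.
    return $result ?? $bytes;
}

// First "café" is UTF-8, second is single-byte; both come out as UTF-8.
echo bin2hex(repair_mixed_utf8("caf\xC3\xA9 caf\xE9")), "\n";
```

    Run something like this on a copy of the data and review the changed records by hand before writing anything back.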

  3. #3
    SitePoint Zealot
    Join Date
    May 2003
    Location
    Sarasota, FL
    Posts
    196
    At one point I was using a check that compared the original string with the result of running it through utf8_encode, the idea being that it would tell me whether anything got encoded and thus whether the string needed encoding in the first place. That didn't seem to have any effect, but I will try the function you suggest and see if there is a difference.
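
    A small sketch of why that comparison cannot tell the two cases apart. utf8_encode treats every input byte as ISO-8859-1, so any string containing a non-ASCII byte changes, whether it was already UTF-8 or not. The sketch uses mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode, and both function names are invented for illustration.

```php
<?php
// Hypothetical stand-in for the "did utf8_encode change it?" check.
function changed_by_latin1_to_utf8(string $text): bool
{
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1') !== $text;
}

// Hypothetical alternative: test whether the bytes are well-formed UTF-8.
function is_valid_utf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

$utf8   = "caf\xC3\xA9"; // "café" already in UTF-8
$latin1 = "caf\xE9";     // "café" in ISO-8859-1

var_dump(changed_by_latin1_to_utf8($utf8));   // bool(true)
var_dump(changed_by_latin1_to_utf8($latin1)); // bool(true): same answer for both
var_dump(is_valid_utf8($utf8));               // bool(true)
var_dump(is_valid_utf8($latin1));             // bool(false)
```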

    However, I'm pretty sure I'm dealing with content that is a mixed bag of encodings (per entry). I'm assuming this is the result of pasting content from various sources like Mac files, PDFs, Word documents, etc., and mixing it with content created in the CMS WYSIWYG editor itself...
    Chris Bloom
    Web Application Developer

  4. #4
    kyberfabrikken (SitePoint Wizard)
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Originally posted by xangelusx:
    However, I'm pretty sure I'm dealing with content that is a mixed bag of encodings (per entry). I'm assuming this is the result of pasting content from various sources like Mac files, PDFs, Word documents, etc., and mixing it with content created in the CMS WYSIWYG editor itself...
    In that case, you don't have much chance of fixing things up.

