  1. #1
    SitePoint Wizard Zaggs's Avatar
    Join Date
    Feb 2005
    Posts
    1,051

    XML encoding type to use

    Hi Guys,

    I have written a PHP application that parses an XML file and uploads the values to a database. However, when I try to upload text in another language, French for example, PHP outputs the errors shown below.

    Errors:
    PHP Code:
    Warning: DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x6E 0x6F 0x6D in Entity, line: 8 in C:\wamp\www\admin\classes\admin.class.php on line 2017
    XML:
    PHP Code:
    <?xml version="1.0" encoding="utf-8" ?>
    <product_release>
        <dt_phrases>
            <id>1</id>
            <language>french</language>
            <page_name>register</page_name>
            <array_key>page_title</array_key>
            <phrase>Nouvel enregistrement d'utilisateur Prénom</phrase>
        </dt_phrases>
    </product_release>
    I have tried changing the encoding, but still no luck.

    Questions:

    1) Which encoding should I be using in this XML document?

    2) Is there any special validation that I need to do before inserting the <phrase> value into the database or can it just be inserted as is?

    Thank you in advance.

  2. #2
    SitePoint Evangelist simshaun's Avatar
    Join Date
    Apr 2008
    Location
    North Carolina
    Posts
    438
    Since you are pulling from the database, make sure the encoding in the database (db, table, column) is set to UTF-8 as well.

  3. #3
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    75
    Quote Originally Posted by simshaun View Post
    Since you are pulling from the database, make sure the encoding in the database (db, table, column) is set to UTF-8 as well.
    Zaggs said he's reading an XML file, not pulling data from a database.

    Check the editor you are using to edit the XML file. Make sure it saves the file in the encoding that the XML declaration names (UTF-8 here).
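The bytes in the warning point the same way. As a minimal sketch, assuming the mbstring extension is available, the four bytes that libxml reported can be tested as UTF-8 and then decoded as ISO-8859-1, which is only a guess at what the editor used (Windows-1252 would give the same result for these bytes):

```php
<?php
// The four bytes quoted in the warning: 0xE9 0x6E 0x6F 0x6D.
$errorBytes = "\xE9\x6E\x6F\x6D";

// 0xE9 opens a three-byte UTF-8 sequence, but 0x6E is not a continuation byte,
// so the sequence is not well-formed UTF-8.
var_dump(mb_check_encoding($errorBytes, 'UTF-8')); // bool(false)

// Read as ISO-8859-1 (an assumption), the same bytes spell the end of "Prénom".
$decoded = mb_convert_encoding($errorBytes, 'UTF-8', 'ISO-8859-1');
echo $decoded, "\n"; // énom
```

If that reading is right, the file was saved in a single-byte encoding while its declaration says utf-8.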

  4. #4
    SitePoint Wizard Zaggs's Avatar
    Join Date
    Feb 2005
    Posts
    1,051
    Quote Originally Posted by STeeL_LT View Post
    Zaggs said he's reading an XML file, not pulling data from a database.

    Check the editor you are using to edit the XML file. Make sure it saves the file in the encoding that the XML declaration names (UTF-8 here).
    Sure, but should I be using UTF-8 for French characters?

  5. #5
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    A search on "Warning: DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8" found this advice:

    ... probability you have special characters in the xml string!
    you need to convert all for utf8..... use "utf8_encode()"
    Plenty of other replies turn up in that Google search.
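A note on the quoted advice: utf8_encode() only converts from ISO-8859-1 and is deprecated as of PHP 8.2. The sketch below does the same conversion with mb_convert_encoding() before parsing. It assumes the dom and mbstring extensions are loaded and that the input really is ISO-8859-1; running UTF-8 input through it would corrupt the text. The helper name loadLatin1Xml is made up for illustration.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8, then parses them.
// Assumes the input is ISO-8859-1; do not call it on text that is already UTF-8.
function loadLatin1Xml(string $latin1Xml): DOMDocument
{
    $utf8Xml = mb_convert_encoding($latin1Xml, 'UTF-8', 'ISO-8859-1');

    $doc = new DOMDocument();
    if (!$doc->loadXML($utf8Xml)) {
        throw new RuntimeException('XML could not be parsed');
    }
    return $doc;
}

// A shortened copy of the XML from the first post, with é stored as the
// single ISO-8859-1 byte 0xE9 even though the declaration says utf-8.
$xml = '<?xml version="1.0" encoding="utf-8" ?>'
     . "<product_release><dt_phrases>"
     . "<phrase>Nouvel enregistrement d'utilisateur Pr\xE9nom</phrase>"
     . "</dt_phrases></product_release>";

$doc = loadLatin1Xml($xml);

// DOM always hands text back as UTF-8, whatever the source encoding was.
$phrase = $doc->getElementsByTagName('phrase')->item(0)->textContent;
echo $phrase, "\n";
```

If the file has to stay in ISO-8859-1, changing its declaration to encoding="ISO-8859-1" should also let the parser read it without any conversion in PHP.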

  6. #6
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Sure, but should I be using UTF-8 for French characters?
    Yes, but you have to make sure you are using UTF-8 all the way through: in your text editor, in PHP, in the MySQL connection and tables, and in the charset declarations of the HTML pages you output. If you are serious about i18n, it seems the best way to go.
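For the output end of that chain, a minimal sketch could look like the following. The function name renderPage and the page layout are invented for illustration; the point is only that the escaping call, the meta tag and the HTTP header all name the same charset.

```php
<?php
// Hypothetical helper: wraps a UTF-8 phrase in a minimal HTML page that
// declares UTF-8, and escapes the phrase using the same charset.
function renderPage(string $phrase): string
{
    $safe = htmlspecialchars($phrase, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"utf-8\"><title>{$safe}</title></head>"
         . "<body><h1>{$safe}</h1></body></html>";
}

// When serving the page over HTTP, send the matching header before any output:
// header('Content-Type: text/html; charset=utf-8');
echo renderPage("Pr\u{E9}nom"), "\n";
```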

    http://www.phpwact.org/php/i18n/charsets
    http://www.nicknettleton.com/zine/ph...f-8-cheatsheet
    http://kore-nordmann.de/blog/php_cha...e-htmlentities
    http://kore-nordmann.de/blog/0082_ch..._encoding.html
    http://alandean.blogspot.com/2009/01...-patterns.html

    Here are some links about UTF-8 from my bookmarks, HTH.
    Last edited by Cups; Mar 18, 2009 at 03:32. Reason: first one was incorrect

  7. #7
    SitePoint Wizard Zaggs's Avatar
    Join Date
    Feb 2005
    Posts
    1,051
    Quote Originally Posted by Cups View Post
    Yes, but you have to make sure you are using UTF-8 all the way through: in your text editor, in PHP, in the MySQL connection and tables, and in the charset declarations of the HTML pages you output. If you are serious about i18n, it seems the best way to go.

    http://www.phpwact.org/php/i18n/charsets
    http://www.nicknettleton.com/zine/ph...f-8-cheatsheet
    http://kore-nordmann.de/blog/php_cha...e-htmlentities
    http://kore-nordmann.de/blog/0082_ch..._encoding.html
    http://alandean.blogspot.com/2009/01...-patterns.html

    Here are some links about UTF-8 from my bookmarks, HTH.
    Thank you for your help. The MySQL database I am using uses the "latin1_general_ci" collation. Obviously, I will need to change this to a UTF-8 collation if I am going to use UTF-8. Is there an alternative encoding for the XML file that would work with "latin1_general_ci"?

  8. #8
    SitePoint Evangelist simshaun's Avatar
    Join Date
    Apr 2008
    Location
    North Carolina
    Posts
    438
    UTF-8 contains all the chars in the latin1 character set (the one latin1_general_ci belongs to), afaik.
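One rough way to check that claim from PHP is to push every single-byte value through UTF-8 and back. This sketch assumes the mbstring extension and uses ISO-8859-1 as a stand-in for MySQL's latin1, which is closer to Windows-1252, so treat it as an illustration rather than a statement about MySQL itself. The function name is made up.

```php
<?php
// Hypothetical helper: converts every ISO-8859-1 byte (0x00 to 0xFF) to UTF-8
// and back, and returns the byte values that did not survive the round trip.
function latin1BytesLostInUtf8(): array
{
    $lost = [];
    for ($code = 0; $code <= 0xFF; $code++) {
        $latin1 = chr($code);
        $utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
        $back   = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
        if ($back !== $latin1) {
            $lost[] = $code;
        }
    }
    return $lost;
}

// An empty list means every ISO-8859-1 character has a UTF-8 representation.
echo count(latin1BytesLostInUtf8()), " bytes lost\n";
```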

  9. #9
    SitePoint Wizard silver trophybronze trophy Cups's Avatar
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    Agree with simshaun, all the chars you want are in utf-8.

    IIRC, collation is roughly how stuff gets encoded as you shunt it in and out of MySQL, not how you actually store it.

    The hidden secrets with utf-8 seem to be how it can mess with standard string functions, as relayed to me by kyber in this thread.

    Help from salathe re utf-8 and regexes too in this thread.
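A short sketch of the kind of surprise meant here, assuming the mbstring extension is loaded. PHP's plain string functions and PCRE without the /u modifier count bytes, while UTF-8 needs two bytes for é:

```php
<?php
$word = "Pr\u{E9}nom"; // "Prénom" encoded as UTF-8

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters
echo $byteCount, " bytes, ", $charCount, " characters\n"; // 7 bytes, 6 characters

// substr() can cut a multi-byte character in half; mb_substr() works on characters.
$byteCut = substr($word, 0, 3);
$charCut = mb_substr($word, 0, 3, 'UTF-8');
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
echo $charCut, "\n";                            // Pré

// Without the /u modifier "." matches one byte at a time.
$byteMatches = preg_match_all('/./', $word);
$charMatches = preg_match_all('/./u', $word);
echo $byteMatches, " vs ", $charMatches, "\n"; // 7 vs 6
```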

  10. #10
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Collation is an algorithm for comparing strings. It's mostly used for sorting correctly. For example, in German ö sorts in the middle of the alphabet, next to o, while in Swedish the same character is the last letter of the alphabet. The collation controls this. A collation only works for the character set it was designed for, which is why collation names start with the charset name (latin1_general_ci is a collation for latin1). The collation doesn't change how data goes in and out of the database.
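The German and Swedish example can be tried outside the database too. This sketch uses the Collator class from PHP's intl extension, not MySQL's own collations, so it only illustrates the idea that the same strings sort differently under different rules. The helper name sortForLocale is invented, and the demo is skipped when intl is not installed.

```php
<?php
// Hypothetical helper: sorts words using the collation rules of a locale.
// Relies on the Collator class provided by the intl extension.
function sortForLocale(array $words, string $locale): array
{
    $collator = new Collator($locale);
    $collator->sort($words);
    return array_values($words);
}

$words = ['z', "\u{F6}", 'o']; // "\u{F6}" is ö

if (extension_loaded('intl')) {
    echo implode(' ', sortForLocale($words, 'de_DE')), "\n"; // o ö z
    echo implode(' ', sortForLocale($words, 'sv_SE')), "\n"; // o z ö
}
```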

    MySQL is charset-aware, so you can set the charset on a per-connection basis. If you set the connection to UTF-8, MySQL will assume that you pass it UTF-8 encoded data. MySQL also knows how data is stored internally; you can set the storage charset globally, per database, per table, or per column. If the connection charset differs from the storage charset, MySQL converts on the way in and out. This means that if you pick a storage charset that doesn't support all the characters you use, you're in trouble. Therefore, if you use UTF-8 for the connection, it's a good idea to use UTF-8 for storage too.
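With PDO, the connection charset can be named in the DSN. The sketch below is one way to do it under a few assumptions: the host and database names are placeholders, the table and column names are guessed from the XML in the first post, and utf8mb4 is used because it is MySQL's complete UTF-8 charset on current versions. The prepared statement also speaks to the earlier question about validation, since the phrase is sent as data and needs no manual escaping.

```php
<?php
// Hypothetical helper: builds a PDO DSN that sets the connection charset.
function mysqlDsn(string $host, string $database, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$database};charset={$charset}";
}

// Hypothetical helper: stores a phrase through a prepared statement.
// Table and column names are assumptions based on the XML element names.
function insertPhrase(PDO $pdo, string $language, string $phrase): void
{
    $stmt = $pdo->prepare('INSERT INTO dt_phrases (language, phrase) VALUES (?, ?)');
    $stmt->execute([$language, $phrase]);
}

echo mysqlDsn('localhost', 'mydb'), "\n";
// mysql:host=localhost;dbname=mydb;charset=utf8mb4

// Usage, given real credentials:
// $pdo = new PDO(mysqlDsn('localhost', 'mydb'), $user, $password);
// insertPhrase($pdo, 'french', "Nouvel enregistrement d'utilisateur Pr\u{E9}nom");
```

On the storage side, a statement such as ALTER TABLE dt_phrases CONVERT TO CHARACTER SET utf8mb4 would bring the table in line with the connection, again assuming that table name.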

    Note that PHP isn't charset-aware: its strings are plain byte sequences. Thus it's your responsibility to make sure that the data you pass to MySQL is in the proper encoding. You can generally assume that input from browsers (e.g. $_GET and $_POST) will be encoded in the same charset as the page that presented the form. In Firefox, you can go to the menu View -> Character Encoding and see what is selected. See Character Sets / Character Encoding Issues for more details on this.
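Since PHP will not complain about badly encoded input on its own, one option is to check it explicitly before it goes anywhere near the database. A minimal sketch, assuming the mbstring extension; the helper name requireUtf8 is made up, and a literal stands in for a value from $_POST:

```php
<?php
// Hypothetical helper: returns the value unchanged if it is well-formed UTF-8
// and throws otherwise, so bad bytes are caught before they reach MySQL.
function requireUtf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $value;
}

// A literal in place of something like $_POST['phrase'].
echo requireUtf8("Pr\u{E9}nom"), "\n"; // Prénom
```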

