Thread: XML encoding type to use
-
Mar 17, 2009, 11:59 #1
Hi Guys,
I have written a PHP application that parses an XML file and uploads the values to a database. However, when I try to upload text in a different language, French for example, PHP outputs the error shown below.
Errors:
PHP Code:
Warning: DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8, indicate encoding ! Bytes: 0xE9 0x6E 0x6F 0x6D in Entity, line: 8 in C:\wamp\www\admin\classes\admin.class.php on line 2017
PHP Code:
<?xml version="1.0" encoding="utf-8" ?>
<product_release>
<dt_phrases>
<id>1</id>
<language>french</language>
<page_name>register</page_name>
<array_key>page_title</array_key>
<phrase>Nouvel enregistrement d'utilisateur Prénom</phrase>
</dt_phrases>
</product_release>
Questions:
1) Which encoding should I be using in this XML document?
2) Is there any special validation that I need to do before inserting the <phrase> value into the database or can it just be inserted as is?
Thank you in advance.
-
Mar 17, 2009, 14:47 #2
Since you are pulling from the database, make sure the encoding in the database (db, table, column) is set to UTF-8 as well.
-
Mar 17, 2009, 15:13 #3
-
Mar 18, 2009, 03:24 #4
-
Mar 18, 2009, 03:26 #5
A search on "Warning: DOMDocument::loadXML() [domdocument.loadxml]: Input is not proper UTF-8" found this advice:
... probability you have special characters in the xml string!
you need to convert all for utf8..... use "utf8_encode()"
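The bytes in the warning (0xE9 0x6E 0x6F 0x6D) spell "énom" in ISO-8859-1, which suggests the file was saved as Latin-1 while its prolog declares UTF-8. Below is a minimal sketch of the conversion, assuming the mbstring extension is loaded and that the source really is ISO-8859-1. The helper name `to_utf8_xml` is hypothetical. It uses `mb_convert_encoding()` rather than `utf8_encode()`, since the latter is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: make sure an XML string is valid UTF-8 before parsing.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1 (Latin-1).
function to_utf8_xml(string $xml): string
{
    if (mb_check_encoding($xml, 'UTF-8')) {
        return $xml; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($xml, 'UTF-8', 'ISO-8859-1');
}

// "Pr\xE9nom" is "Prénom" encoded as Latin-1: the 0xE9 byte from the warning.
$latin1Xml = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
           . "<phrase>Pr\xE9nom</phrase>";

$doc = new DOMDocument();
$loaded = $doc->loadXML(to_utf8_xml($latin1Xml));

// DOM hands text back as UTF-8, so é is now the two bytes 0xC3 0xA9.
$phrase = $doc->getElementsByTagName('phrase')->item(0)->textContent;
```

Another option is to leave the bytes alone and declare encoding="ISO-8859-1" in the XML prolog so the parser does the conversion itself. Either way, the declared encoding has to match the bytes actually on disk.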
-
Mar 18, 2009, 03:31 #6
Sure, but should I be using UTF-8 for french characters?
http://www.phpwact.org/php/i18n/charsets
http://www.nicknettleton.com/zine/ph...f-8-cheatsheet
http://kore-nordmann.de/blog/php_cha...e-htmlentities
http://kore-nordmann.de/blog/0082_ch..._encoding.html
http://alandean.blogspot.com/2009/01...-patterns.html
Here are some dumps about UTF-8 from my bookmarks, HTH.
Last edited by Cups; Mar 18, 2009 at 03:32. Reason: first one was incorrect
-
Mar 18, 2009, 03:38 #7
-
Mar 18, 2009, 07:54 #8
UTF-8 contains all the chars that are in latin1 (latin1_general_ci is a collation for that charset), afaik.
-
Mar 18, 2009, 08:24 #9
Agree with simshaun, all the chars you want are in utf-8.
IIRC, the definition of collation is roughly how MySQL encodes stuff as you shunt it in and out, not how it is actually stored.
The hidden secrets with UTF-8 seem to be how it can mess with standard string functions, as relayed to me by kyber in this thread.
Help from salathe re UTF-8 and regexes too in this thread.
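To illustrate how UTF-8 can trip up the standard string functions, here is a small sketch that assumes the mbstring extension is available. The byte-oriented functions count and cut bytes, while their `mb_` counterparts work on characters.

```php
<?php
// "Prénom" as UTF-8: the é takes two bytes (0xC3 0xA9).
$word = "Pr\xC3\xA9nom";

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// substr() can cut a multi-byte character in half; mb_substr() cuts on characters.
$brokenCut = substr($word, 0, 3);             // "Pr" plus a dangling 0xC3 byte
$cleanCut  = mb_substr($word, 0, 3, 'UTF-8'); // "Pré"
```

Here `$byteCount` is 7 and `$charCount` is 6, and `$brokenCut` is no longer valid UTF-8, which is the kind of string that later causes errors when it is parsed or stored.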
-
Mar 18, 2009, 10:23 #10
Collation is an algorithm for comparing strings. It's mostly used for sorting correctly. For example, in German ö sorts in the middle of the alphabet (alongside o), while the same character comes last in Swedish. The collation controls this. A collation only works for the encoding that it was intended for; that's why collations are named after their charset. The collation doesn't change how data comes in and out of the database.
MySQL is charset aware, so you can set the charset on a per-connection basis. If you set the connection to UTF-8, MySQL will assume that you pass it UTF-8 encoded data. MySQL is also aware of how data is stored internally; you can set this globally or per table. If the connection charset differs from the storage charset, MySQL will convert on the way in and out. This means that if you pick a storage charset that doesn't support all the characters you use, you're in trouble. Therefore, if you use UTF-8 for the connection, it's a good idea to use UTF-8 for storage too.
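One way to pin the connection charset from PHP is the `charset` parameter of the PDO MySQL DSN. The sketch below only builds the DSN string. The helper name `mysql_dsn`, the database name and the credentials are hypothetical, and `utf8mb4` assumes a MySQL version that supports it (5.5.3 or later).

```php
<?php
// Hypothetical helper: build a PDO DSN that fixes the connection charset.
// Assumption: 'utf8mb4' (MySQL's full UTF-8 charset) is available on the server.
function mysql_dsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = mysql_dsn('localhost', 'phrases_db');

// Usage with placeholder credentials (not run here, it needs a live server):
// $pdo = new PDO($dsn, 'db_user', 'db_password');
```

With the connection charset set this way and the table stored in a matching charset, MySQL has nothing to convert on the way in or out.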
Note that PHP isn't charset aware. Thus it's your responsibility to make sure that the data you pass to MySQL is in the proper encoding. You can generally assume that input from browsers (e.g. $_GET and $_POST) will be encoded in the same charset as the page that presented the form. In Firefox, you can go to the menu View -> Character Encoding and see what is selected. See Character Sets / Character Encoding Issues for more details on this.
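Since PHP will not check the encoding for you, a small guard before the data reaches MySQL can help. This is a minimal sketch, assuming the mbstring extension and a UTF-8 connection; the function name `utf8_or_null` is hypothetical.

```php
<?php
// Hypothetical guard: accept a request value only if it is valid UTF-8.
function utf8_or_null(?string $value): ?string
{
    if ($value === null || !mb_check_encoding($value, 'UTF-8')) {
        return null; // missing or not valid UTF-8: let the caller reject it
    }
    return $value;
}

$good = utf8_or_null("Pr\xC3\xA9nom"); // valid UTF-8, returned unchanged
$bad  = utf8_or_null("Pr\xE9nom");     // Latin-1 bytes, invalid as UTF-8
```

In a real form handler the argument would come from something like `$_POST['phrase'] ?? null`, and a `null` result would be treated as a validation error rather than silently converted.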