  1. #1
    SitePoint Enthusiast
    Join Date
    Aug 2008
    Posts
    31

    XML parsing to MySQL via PHP (character encoding errors)

    Problem: I need to parse XML files which contain a lot of Irish Gaelic names.

    Because the names are in Irish Gaelic, we now need to encode various extended characters for use on our servers (all Latin-1, ISO-8859-1).

    Here is the problem I am having:


    Text as sent to parser function (note É,í,é,á within the text)


    Processing : test.xml...

    Code PHP:
    $data = file_get_contents($file);

    ------ RESULT -------

    Code XML:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <all>
    	<item>
    		<type>Memorial</type>
    		<name>MCCOY Éadaoín</name>
    		<text>In memory of Éadaoín.
    Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
    		</text>
    		<date>2008-09-20</date>
    	</item>
    </all>
    Then character conversion is undertaken:

    Code PHP:
    $xmlstr = str_replace($characters, $entities, $data);

    ------ RESULT -------

    Code XML:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <all>
    	<item>
    		<type>Memorial</type>
    		<name>MCCOY Éadaoín</name>
    		<text>In memory of Éadaoín.
    Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
    		</text>
    		<date>2008-09-20</date>
    	</item>
    </all>
    So at this stage we are perfectly fine, character conversion has worked, the character set used in the XML file is correct.
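
    As a rough sketch of the conversion step described above: the real $characters and $entities arrays are not shown in this post, so the lookup tables below are hypothetical and cover only the four characters mentioned (É, í, é, á). The Latin-1 bytes are written as escapes so the example does not depend on the encoding of the script file itself.

```php
<?php
// Hypothetical lookup tables (assumption: the real arrays are not shown in the post).
$characters = ["\xC9", "\xED", "\xE9", "\xE1"];          // É, í, é, á as ISO-8859-1 bytes
$entities   = ['&#201;', '&#237;', '&#233;', '&#225;'];  // matching numeric character references

// Sample input with raw Latin-1 bytes for É and í.
$data   = "<name>MCCOY \xC9adao\xEDn</name>";

// Replace each extended character with its numeric reference.
$xmlstr = str_replace($characters, $entities, $data);

echo $xmlstr, "\n";
```

    In ISO-8859-1 the byte value equals the Unicode code point, which is why É (0xC9) maps to &#201; and so on.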

    Now we create a simpleXMLElement():

    Code PHP:
    $xml = new SimpleXMLElement($xmlstr);

    foreach ($xml->children() as $item) {
        // Cast each node to a plain string before using it.
        $type = (string) $item->type;
        $name = (string) $item->name;
        $text = (string) $item->text;
        $date = (string) $item->date;

        // ---- at this stage we write this information to a database for later use in several applications ----
    }


    And this gives:

    ------ RESULT ------

    Code XML:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <all>
    	<item>
    		<type>Memorial</type>
    		<name>MCCOY ÉadaoÃ*n</name>
    		<text>In memory of ÉadaoÃ*n.
    Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
    		</text>
    		<date>2008-09-20</date>
    	</item>
    </all>


    If we attempt to convert the text after simpleXMLElement() is declared (i.e. within the loop), we get:

    ------ RESULT ------

    Code XML:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <all>
    	<item>
    		<type>Memorial</type>
    		<name>MCCOY &amp;#195;‰adao&amp;#195;*n</name>
    		<text>In memory of &amp;#195;‰adao&amp;#195;*n.
    Remembered by Ronán, Isib&amp;#195;©al, Orla, Muireann and D&amp;#195;¡ire.
    		</text>
    		<date>2008-09-20</date>
    	</item>
    </all>
    ----------------------------------------------

    My guess is that simpleXMLElement is using a different character set, but I cannot find anything in the manual that tells me how to either override this or otherwise process the text.
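
    For what it's worth, the guess above can be checked with a small sketch. As far as I know, libxml (which SimpleXML is built on) converts the document to UTF-8 while parsing, using the encoding named in the XML declaration, and SimpleXML then hands back UTF-8 strings regardless of the source encoding. The sample document and variable names here are made up for illustration.

```php
<?php
// Hypothetical sample: a Latin-1 document built from raw bytes
// (\xC9 = É and \xED = í in ISO-8859-1).
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
           . "<all><item><name>MCCOY \xC9adao\xEDn</name></item></all>";

$xml  = new SimpleXMLElement($latin1Xml);
$name = (string) $xml->item->name;

// The string that comes back is UTF-8: each accented letter is now two bytes.
echo bin2hex($name), "\n";

// Converting back to ISO-8859-1 restores the original single bytes.
$latin1Name = iconv('UTF-8', 'ISO-8859-1', $name);
echo bin2hex($latin1Name), "\n";
```

    The first dump contains c389 and c3ad where the Latin-1 input had c9 and ed. Viewed as Latin-1, those two-byte sequences are what show up as the Ã pairs in the results posted here.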


    Up until now I have been using pretty much the same script to parse 150+ of these items per week with no problems. It is only when we introduce extended characters (acutes, graves, etc.) into the string that the parser has problems.


    Note that the database collation is Latin1 (the same character set as the XML), and that the conversion is OK prior to entering simpleXMLElement().

    Anyone have any ideas? This is new ground for me so I am totally lost at the moment.

  2. #2
    SitePoint Enthusiast
    Join Date
    Aug 2008
    Posts
    31
    Solved:

    For any of you who are interested in how I got this working, see below.

    It appears to be a feature or bug within libxml (which SimpleXML uses). To get around it I did the following:

    1. Changed the encoding within the XML file to UTF-8
    <?xml version="1.0" encoding="UTF-8" ?>

    2. Used utf8_encode on the text on the way into the function:
    PHP Code:
    $rawxml = file_get_contents($file);

    $xmlstr = utf8_encode(str_replace($chars, $replace, $rawxml));

    $xml = new SimpleXMLElement($xmlstr);
    3. Used utf8_decode on the text on the way out of the function:
    PHP Code:
    $name = utf8_decode(str_replace($characters, $entities, $item->name));
    $text = utf8_decode(str_replace($characters, $entities, $item->text));
    I'm aware that there are probably extra steps in this solution, but there are no performance issues to worry about (it runs twice a week as a cron job).
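
    A possible simplification, offered as a sketch rather than as the method used above: since libxml appears to do the conversion to UTF-8 itself based on the encoding declaration, the ISO-8859-1 file can probably be parsed unchanged, and only the values read out of it need converting back. utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, so this version uses iconv() instead. The helper name parseLatin1Items() and the sample document are hypothetical.

```php
<?php
/**
 * Hypothetical helper: parse a Latin-1 XML string and return each item's
 * fields re-encoded as ISO-8859-1, ready for a latin1 database column.
 */
function parseLatin1Items(string $xmlstr): array
{
    $xml  = new SimpleXMLElement($xmlstr);
    $rows = [];
    foreach ($xml->item as $item) {
        $row = [];
        foreach (['type', 'name', 'text', 'date'] as $field) {
            // SimpleXML yields UTF-8; convert back to the server's charset.
            $row[$field] = iconv('UTF-8', 'ISO-8859-1', (string) $item->$field);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Sample input (assumption): raw Latin-1 bytes, \xE1 = á and \xE9 = é.
$sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
        . "<all><item><type>Memorial</type><name>Ron\xE1n</name>"
        . "<text>Isib\xE9al</text><date>2008-09-20</date></item></all>";

$rows = parseLatin1Items($sample);
echo bin2hex($rows[0]['name']), "\n";
```

    One caveat: iconv() returns false when a character has no ISO-8859-1 equivalent, so input that may contain characters outside Latin-1 would need that case handled.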

  3. #3
    SitePoint Member
    Join Date
    Sep 2008
    Location
    london
    Posts
    13
    Hi there,

    I have a problem with an XML feed. Could someone please help me fix it?


    line 9, column 0: XML parsing error: <unknown>:9:0: unbound prefix

    Looking forward to your reply.

    Thanks,
    James Anthony
    www.discoverblack.com
    Buy black clothing fashion online

