Problem: I need to parse XML files which contain a lot of Irish Gaelic names.

Because of the Gaelic names, we now sometimes need to encode extended characters for use on our servers (all Latin-1, ISO-8859-1).

Here is the problem I am having:


Text as sent to the parser function (note the É, í, é, á within the text):


Processing : test.xml...

Code PHP:
$data = file_get_contents($file);

------ RESULT -------

Code XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
	<item>
		<type>Memorial</type>
		<name>MCCOY Éadaoín</name>
		<text>In memory of Éadaoín.
Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
		</text>
		<date>2008-09-20</date>
	</item>
</all>
Then character conversion is undertaken:

Code PHP:
$xmlstr = str_replace($characters, $entities, $data);

------ RESULT -------

Code XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
	<item>
		<type>Memorial</type>
		<name>MCCOY Éadaoín</name>
		<text>In memory of Éadaoín.
Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
		</text>
		<date>2008-09-20</date>
	</item>
</all>
So at this stage we are perfectly fine: the character conversion has worked and the character set declared in the XML file is correct.
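
The conversion step might look roughly like the sketch below. It assumes `str_replace` is the function doing the work and that `$characters` and `$entities` are parallel arrays. The array contents and the sample `$data` string are hypothetical and only cover the four characters mentioned above. Byte escapes are used so the example does not depend on the encoding of the script file itself.

```php
<?php
// Hypothetical parallel arrays: ISO-8859-1 bytes and their numeric entities.
$characters = ["\xC9", "\xE1", "\xE9", "\xED"];          // É, á, é, í in Latin-1
$entities   = ['&#201;', '&#225;', '&#233;', '&#237;'];  // same code points as numeric entities

// Stands in for file_get_contents($file); the bytes are Latin-1.
$data = "<name>MCCOY \xC9adao\xEDn</name>";

// Replace each Latin-1 byte with its numeric character reference.
$xmlstr = str_replace($characters, $entities, $data);
```

Numeric references such as `&#201;` are used here rather than named ones like `&Eacute;`, since named HTML entities are not defined in plain XML.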

Now we create a SimpleXMLElement:

Code PHP:
$xml = new SimpleXMLElement($xmlstr);

foreach ($xml->children() as $item) {
	// Cast each node to a plain string before storing it.
	$type = (string) $item->type;
	$name = (string) $item->name;
	$text = (string) $item->text;
	$date = (string) $item->date;

	// At this stage we write this information to a database for later use in several applications.
}


And this gives:

------ RESULT ------

Code XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
	<item>
		<type>Memorial</type>
		<name>MCCOY ÉadaoÃ*n</name>
		<text>In memory of ÉadaoÃ*n.
Remembered by Ronán, Isibéal, Orla, Muireann and Dáire.
		</text>
		<date>2008-09-20</date>
	</item>
</all>


If we attempt to convert the text after the SimpleXMLElement is created (i.e. within the loop), we get:

------ RESULT ------

Code XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<all>
	<item>
		<type>Memorial</type>
		<name>MCCOY &amp;#195;‰adao&amp;#195;*n</name>
		<text>In memory of &amp;#195;‰adao&amp;#195;*n.
Remembered by Ronán, Isib&amp;#195;©al, Orla, Muireann and D&amp;#195;¡ire.
		</text>
		<date>2008-09-20</date>
	</item>
</all>
----------------------------------------------

My guess is that simpleXMLElement is using a different character set, but I cannot find anything in the manual that tells me how to either override this or otherwise process the text.
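
That guess can be checked with a small sketch. SimpleXML is built on libxml2, which works in UTF-8 internally, so the strings it returns are UTF-8 whatever encoding the XML prolog declares. The example below assumes the mbstring extension is available and that the target is ISO-8859-1. The sample XML string and the variable names are made up for illustration.

```php
<?php
// Latin-1 encoded XML; \xC9 = É and \xED = í as single ISO-8859-1 bytes.
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
    . "<all><item><name>MCCOY \xC9adao\xEDn</name></item></all>";

$xml = new SimpleXMLElement($latin1Xml);

foreach ($xml->children() as $item) {
    // SimpleXML returns UTF-8 here: É becomes the two bytes C3 89, í becomes C3 AD.
    $utf8Name = (string) $item->name;

    // Convert back to ISO-8859-1 before writing to a Latin-1 database.
    $latin1Name = mb_convert_encoding($utf8Name, 'ISO-8859-1', 'UTF-8');
}
```

Viewed as Latin-1, the two-byte UTF-8 sequences show up as "Ã" followed by a second character, which matches the garbled output above. If mbstring is not installed, `iconv('UTF-8', 'ISO-8859-1', $utf8Name)` should be an equivalent conversion.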


Up until now I have been using pretty much the same script to parse 150+ of these items per week with no problems. It is only when we introduce extended characters (acutes, graves, etc.) into the string that the parser has problems.


Note that the database collation is Latin1 (as is the character set), and that the conversion is fine before the data reaches simpleXMLElement().

Anyone have any ideas? This is new ground for me so I am totally lost at the moment.