I just wrote a response to a very similar topic under my other alias, with a coded DOM example etc. I have dealt with a lot of flawed XML feeds, and not even my own ones: other companies' developers screw them up, and the people they work for chase me to tell them what their developers have buggered up. After a few phone calls on days off it gets a tad annoying. It is worthwhile trying to break-test your own feed if you are inquisitive; unfortunately a lot of developers cut corners, go "it works on my machine using abcde", run off and leave someone else to pick up the pieces.
Sorry if I seem narky, but data exchange is a very serious ball game.
Anyway the link is..
When dealing with XML you want to halt on unknowns and log fatal errors on your own end, not only in the other person's reader.
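A rough sketch of what "halt and log on your own end" can look like, in Python with the standard library parser. The function name parse_feed_strict and the logger name "feed" are made up for illustration; the point is only that a parse error gets logged and re-raised rather than swallowed.

```python
import logging
import xml.etree.ElementTree as ET

# "feed" is a hypothetical logger name; configure handlers to suit your setup.
log = logging.getLogger("feed")


def parse_feed_strict(raw: bytes) -> ET.Element:
    """Parse raw XML bytes, logging and re-raising any well-formedness error."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError as exc:
        # Record the failure on our side, then stop: no guessing, no repair.
        log.critical("XML feed rejected: %s", exc)
        raise
```

Feeding it a document declared as UTF-8 that contains a raw Windows-1252 byte, or an undefined entity such as &euro; with no DTD, should end in a logged ParseError instead of a half-read feed.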
The € (Euro) sign does not exist in Latin 1 (ISO-8859-1), so it will not get translated and will cause all readers in Latin 1 mode to fail. Windows uses 1252, and anything copied from Word may also fail, as it replaces some Latin 1 stuff with prettier Windows-1252 stuff. It is a minefield, especially in Latin 1 rather than UTF-8. As far as I am concerned every Latin 1 install is a problem waiting to happen (it also misses a few characters here and there in Western Europe just to make it fun, hence ISO-8859-15, and ISO-8859-14 just for the difficult Welsh).
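To see the gaps for yourself, here is a small check using Python's built-in codecs. The helper can_encode is just an illustrative name; the encoding names are the standard Python aliases.

```python
EURO = "\u20ac"         # the Euro sign
CURLY_QUOTE = "\u201c"  # left curly quote, the kind Word substitutes
W_CIRCUMFLEX = "\u0175" # Welsh w with circumflex


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(EURO, "iso-8859-1"))           # False
print(can_encode(EURO, "iso-8859-15"))          # True
print(can_encode(EURO, "cp1252"))               # True
print(can_encode(CURLY_QUOTE, "iso-8859-1"))    # False
print(can_encode(W_CIRCUMFLEX, "iso-8859-14"))  # True
print(EURO.encode("cp1252"))                    # b'\x80'
```

That last byte, 0x80, is the nasty one: read back as Latin 1 it is not a Euro at all but an invisible control character, which is how Windows-1252 text ends up poisoning a feed that claims to be Latin 1.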
If you find yourself hacking around something in XML you really are doing it wrong, and it can just cascade (html entities is a hack: it does not know right from wrong and will just carry on). People write iconv, XML libraries etc. for a very good reason, as this is stuff that can make most people cry when it starts getting dirty. It is not simple, but wrappers can be built around the DOM, iconv etc. to make it simple and relatively pain free; it just takes a bit more brains now.
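As a very rough idea of the kind of wrapper I mean, here is a sketch in Python rather than around iconv itself. The name to_utf8 and the rule of rejecting C1 control characters are my own assumptions for the example, not a standard; the idea is simply to convert strictly and refuse to carry on when the input does not match its declared encoding.

```python
def to_utf8(raw: bytes, declared: str) -> bytes:
    """Convert raw bytes from their declared encoding to UTF-8, or fail loudly.

    Hypothetical helper: decodes strictly, then rejects C1 control characters
    (U+0080 to U+009F), which usually mean Windows-1252 bytes were labelled
    as Latin 1.
    """
    # Strict decoding is the default: undefined bytes raise UnicodeDecodeError.
    text = raw.decode(declared)
    suspicious = [c for c in text if "\x80" <= c <= "\x9f"]
    if suspicious:
        raise ValueError(
            f"{len(suspicious)} C1 control character(s) found; "
            f"input is probably not really {declared}"
        )
    return text.encode("utf-8")
```

So to_utf8(b"\x80", "cp1252") gives back the UTF-8 bytes for a Euro sign, while the same byte declared as "iso-8859-1" raises ValueError instead of silently passing rubbish down the line.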
The joy of UTF-8 in browsers is that they will silently translate whatever source encoding is pasted in into UTF-8, saving a whole load of problems (such as people copying from Word documents).
Anyway, enough XML stuff for tonight. I think 4 hours is enough, and I'll end up cranky tomorrow, still thinking about it. I hurt; character encodings are like a big swallowing hole.