Character encoding dilemma

I don’t know if this is a PHP, MySQL, WordPress, SimplePie, or RSS feed authoring problem, but I guess PHP is as good a place as any to post this.

For quite some time I have been trying to figure out what has been causing unserialize() errors in my WordPress blog. WordPress uses the error suppressor, but not being content to merely accept the errors as unavoidable, I have been investigating the problem. By comparing the original RSS feed against its serialized cache value I noticed a few things, namely “strange characters” in place of “fancy characters”.

Note: everything on my end is UTF-8 and the feed encoding is specified as UTF-8, but the W3C feed validator reports:

line 55, column 67: title contains bad characters (16 occurrences)
<title>Press Release: CEO Cathy Zoi Named to Weather Channelâ\x80\x99s 2008 …

One example is the “curly apostrophe” above, decimal 8217, hex 2019. Using both var_dump() and phpMyAdmin, instead of seeing that character in the serialized <title> value, I saw question marks and a cent symbol.
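To make the byte arithmetic concrete, here is a minimal sketch of how code point 8217 becomes three UTF-8 bytes. The helper name utf8_three_bytes() is made up for illustration and only covers code points from U+0800 to U+FFFF.

```php
<?php
// Hypothetical helper (not part of PHP, WordPress or SimplePie):
// encode a code point in the range U+0800..U+FFFF as 3 UTF-8 bytes.
function utf8_three_bytes(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))            // lead byte: top 4 bits
         . chr(0x80 | (($cp >> 6) & 0x3F))    // continuation: middle 6 bits
         . chr(0x80 | ($cp & 0x3F));          // continuation: low 6 bits
}

$bytes = utf8_three_bytes(8217);

echo dechex(8217), "\n";    // 2019
echo bin2hex($bytes), "\n"; // e28099
echo strlen($bytes), "\n";  // 3
```

The byte 0xE2 is “â” when read as Latin-1, which lines up with the validator showing â\x80\x99 for this character.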

My head starts to swim with all the UTF-8 <-> binary conversions, but what seems to be happening is this:
1: “high” character typed directly into RSS feed content from text editor
[WordPress serializes feed content and uses SimplePie to cache data into MySQL]
2: feed serialized and caching stores single character in database as 3-byte UTF-8 code
[WordPress uses SimplePie to read cached feed data from MySQL and attempts to unserialize()]
3: read from database as 3-bytes
4: instead of going back to a single 3-byte UTF-8 character, each of the 3 bytes is changed to a 2-byte UTF-8 sequence
5: so the original single 3-byte character, when var_dump()ed, is rendered as three 2-byte characters
6: the extra bytes break unserialize(), presumably because the string length recorded in the serialized data no longer matches the actual number of bytes
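The steps above can be reproduced in plain PHP without MySQL or SimplePie. This is only a sketch of the suspected mechanism, not a confirmed diagnosis: the helper latin1_bytes_to_utf8() is hypothetical and stands in for whatever layer treats each byte as a Latin-1 character and re-encodes it as UTF-8.

```php
<?php
// Hypothetical helper: treat every byte as a Latin-1 character and
// re-encode it as UTF-8. Bytes >= 0x80 each become 2 bytes.
function latin1_bytes_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

$apostrophe = "\xE2\x80\x99";              // U+2019 as UTF-8, 3 bytes
$title      = "Channel" . $apostrophe . "s";
$serialized = serialize($title);           // length prefix counts bytes: s:11:"...";

// Simulate steps 3 to 5: each high byte is expanded to a 2-byte sequence.
$mangled = latin1_bytes_to_utf8($serialized);

// Step 6: the prefix still says 11, but the payload is now 14 bytes.
$roundTrip = unserialize($serialized);     // works, returns the original title
$broken    = @unserialize($mangled);       // fails, returns false

var_dump(strlen(latin1_bytes_to_utf8($apostrophe))); // int(6)
var_dump($roundTrip === $title);                     // bool(true)
var_dump($broken);                                   // bool(false)
```

If this matches what var_dump() shows for the cached value, the place to look is wherever the bytes get re-encoded, for example a mismatch between the connection character set and the column character set.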

I suspect WordPress may know what it’s doing by suppressing the errors. That is, poorly crafted feed content (#1) is outside the realm of control. On the other hand, I wonder if there might be a way to correct #4. Or maybe by making changes somewhere else in the process?

Ironically, one of’s own blog feeds also contains “high characters” which break unserialize(). The feed passes validation (except for an <embed>) so I think it’s all UTF-8 copacetic, but the serialize/unserialize still gets the number of bytes wrong. Maybe something MySQL does? PHP bug?

Any thoughts or ideas on how to approach this further, or just give up?


Well, I’ve done quite a bit of testing to see if it breaks or if I can break it on purpose.

If I just use PHP alone with hard-coded strings, some characters won’t show, but unserialize doesn’t break. Same with using the live feeds.

When I get MySQL involved, same thing.

Same for when I toss SimplePie as is into the mix.

So now I’m trying to dig my way through the WordPress classes that extend the SimplePie classes.

Sometimes I wish I wasn’t so stubborn :wink:

Is it possible to inject debugging code into different parts of your WordPress app?
Code that tries to unserialize and writes the result to a file?
I’d place such code in many places to see where it gets broken.
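Something along these lines might do as a starting point. It is a rough sketch: the function name debug_unserialize(), the log format, and the log file location are all assumptions, and where to call it inside WordPress or SimplePie is left open.

```php
<?php
// Hypothetical debugging helper: try to unserialize $data and append
// the outcome to a log file, so calls at different points can be compared.
function debug_unserialize(string $label, string $data, string $logFile): bool
{
    $value = @unserialize($data);

    // unserialize() returns false on failure, but also for a serialized
    // boolean false ("b:0;"), so that case is treated as a success.
    $ok = ($value !== false) || ($data === 'b:0;');

    $line = sprintf(
        "[%s] %s: %s, %d bytes, md5=%s\n",
        date('c'),
        $label,
        $ok ? 'OK' : 'FAIL',
        strlen($data),
        md5($data)
    );
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);

    return $ok;
}

// Example call; the label and path are placeholders.
$logFile = sys_get_temp_dir() . '/unserialize-debug.log';
debug_unserialize('before caching', serialize("Channel\xE2\x80\x99s"), $logFile);
```

Comparing the byte count and md5 between two checkpoints shows whether the data changed in between, and the first checkpoint that logs FAIL narrows down where it breaks.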