Character Encoding Issue

How did you deduce this from the base64 encoding, so I can test it too? I think I’ve narrowed it down to the import/data source, because I can update the row in the database with the correct encoding and it works properly. So the problem must be occurring before (or while) the data is put in the database.

Edit:
It must have something to do with the way I imported the CSV file. I opened the CSV file in Notepad and found the letter that later appears double encoded. I then copied the character from Notepad into a PHP script and printed ord('ü'). The output was 195. As long as there is nothing wrong with my test method, I’m assuming that the error occurred during the import. Any ideas what the error might be?

Basically you walk through the string, one byte at a time and pass it to ord().


for ($i = 0; $i < strlen($str); $i++)
    echo ord($str[$i]), ', ';

Accessing a string by its index always returns a single byte. substr() too, unless you’re overloading the function via the multibyte extension.

A “character” might be represented by multiple consecutive bytes. If you pass a string that contains more than 1 byte to ord(), it examines only the first byte. The manual says ord() inspects the first character, but PHP doesn’t understand multibyte characters. It treats a character as always being 1 byte. Most PHP string functions are like this.
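To make that concrete, here is a small sketch. It writes “ü” as explicit UTF-8 bytes so the result does not depend on how the source file is saved. The mb_strlen() call is guarded because the multibyte extension may not be loaded.

```php
<?php
// "ü" as explicit UTF-8 bytes (0xC3 0xBC), independent of the file's encoding.
$u = "\xC3\xBC";

echo strlen($u), "\n";   // 2: strlen() counts bytes
echo ord($u), "\n";      // 195: only the first byte (0xC3) is examined
echo ord($u[1]), "\n";   // 188: the second byte (0xBC)

// The multibyte extension, if available, counts characters instead of bytes.
if (function_exists('mb_strlen')) {
    echo mb_strlen($u, 'UTF-8'), "\n";   // 1
}
```

So an ord() result of 195 for a pasted “ü” only tells you the string starts with the byte 0xC3, which is what you would see if the script file itself was saved as UTF-8.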

Be careful pasting non-ASCII characters into a PHP source file. The result can be confusing, depending on the encoding your editor uses to save the file. You can’t visually tell the difference by looking at it, but you might not get what you expect. Inspecting the string as above, and also using strlen() (which measures bytes, not characters), should help you get a grip on what is actually in your strings. This is why I used chr() to produce the UTF-8 version of that character, which is 2 bytes.
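As a rough illustration of what to look for, the sketch below prints the byte values of “ü” in three forms: UTF-8, ISO-8859-1, and double encoded (UTF-8 bytes misread as ISO-8859-1 and converted to UTF-8 again). byteList() is just a helper name made up for this example.

```php
<?php
// Return the byte values of a string as a comma-separated list.
function byteList($str)
{
    $bytes = array();
    for ($i = 0; $i < strlen($str); $i++) {
        $bytes[] = ord($str[$i]);
    }
    return implode(', ', $bytes);
}

$utf8   = chr(195) . chr(188);   // "ü" in UTF-8
$latin1 = chr(252);              // "ü" in ISO-8859-1
$double = "\xC3\x83\xC2\xBC";    // UTF-8 "ü" read as ISO-8859-1, then encoded as UTF-8 again

echo byteList($utf8), "\n";      // 195, 188
echo byteList($latin1), "\n";    // 252
echo byteList($double), "\n";    // 195, 131, 194, 188
```

If a value read back from the database shows the four-byte pattern, it was converted once too often somewhere between the file and the table.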

I’m under pressure to get this fixed so I wrote up a quick script to get it fixed temporarily. The purpose of the script was to locate all characters that were messed up and replace them with the correct character.

However, when I run the script the problem persists. I can, however, go into the mysql command line and edit the value directly, and it works fine. But when I use the script below the encoding issue persists:

        public function UpdateEncoding()
        {
                global $db;

                $encode = array("/&#195;&#194;&#169;/","/&#195;&#188;/","/&#195;&#164;/","/&#195;&#182;/","/&#195;&#8211;/","/&#195;&#161;/","/&#195;&#179;/","/&#195;&#169;/","/&#195;&#168;/","/&#195;&#180;/","/&#195;&#167;/","/&#195;&#174;/","/&#195;&#162;/","/&#195;/");
                //$replace = array("&#233;","&#252;","&#228;","&#246;","&#214;","&#225;","&#243;","&#225;","&#233;","&#232;","&#244;","&#231;","&#238;","&#226;","&#237;");
                $replace = array(chr(195).chr(169),chr(195).chr(188),chr(195).chr(164),chr(195).chr(182),chr(195).chr(150),chr(195).chr(161),chr(195).chr(179),chr(195).chr(161),chr(195).chr(169),chr(195).chr(168),chr(195).chr(180),chr(195).chr(167),chr(195).chr(174),chr(195).chr(162),chr(195).chr(173));
                $data = array();

                $sql = "SELECT id FROM entry ORDER BY id ASC";
                $result = $db->query($sql);

                while($row = $result->fetch_assoc())
                {
                        $sql2 = "SELECT * FROM entry WHERE id =" . $row['id'];
                        $entryData = $db->query($sql2);
                        $entry = $entryData->fetch_assoc();

                        $cols = array('fname','lname','title','dept','organizaition','address','expertise');

                        foreach($cols as $col){
                                for($x=0;$x<count($encode);$x++){
                                        $data[$col] = preg_replace($encode[$x], $replace[$x], $entry[$col]);
                                        $entry[$col] = $data[$col];
                                }

                        }

                        $sql3 = "UPDATE entry SET ";
                        $sql3 .= 'fname="' . $data['fname'] . '"';
                        $sql3 .= ' AND lname="' . $data['lname'] . '"';
                        $sql3 .= ' AND title="' . $data['title'] . '"';
                        $sql3 .= ' AND dept="' . $data['dept'] . '"';
                        $sql3 .= ' AND organization="' . $data['organization'] . '"';
                        $sql3 .= ' AND address="' . $data['address'] . '"';
                        $sql3 .= ' AND expertise="' . $data['expertise'] . '"';
                        $sql3 .= " WHERE id = " . $entry['id'];

                        $db->query($sql3);
                }
        }

Do you check your query for errors?
There can be some, because you don’t escape your data.
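A sketch of one way to sidestep both points, assuming $db is a mysqli connection: build the statement with placeholders and check the result of prepare(). buildUpdateSql() is a hypothetical helper, not part of the script above. Note also that MySQL separates SET assignments with commas. Joined with AND, everything after the first "=" is evaluated as one boolean expression and assigned to the first column, so that may be worth checking in the posted UPDATE as well.

```php
<?php
// Build "UPDATE `table` SET `col` = ?, ... WHERE id = ?" for the given columns.
// Hypothetical helper; table and column names are trusted constants here,
// never user input.
function buildUpdateSql($table, array $cols)
{
    $assignments = array();
    foreach ($cols as $col) {
        $assignments[] = "`$col` = ?";
    }
    // Assignments are joined with commas, not AND.
    return "UPDATE `$table` SET " . implode(', ', $assignments) . " WHERE id = ?";
}

$cols = array('fname', 'lname', 'title', 'dept', 'organization', 'address', 'expertise');
$sql  = buildUpdateSql('entry', $cols);
echo $sql, "\n";

// With a mysqli connection in $db, the statement could then be prepared
// and any error inspected instead of being silently ignored:
// $stmt = $db->prepare($sql);
// if ($stmt === false) { die($db->error); }
```

Because the values are bound rather than pasted into the SQL string, quotes or backslashes in the data cannot break the query.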

Yes, the queries all work: I printed them out and then entered them at the mysql command line. However, even when I enter them via the command line they don’t get saved with the correct letter. It reverts to the letters with the messed-up encoding.
I didn’t really see any reason to escape the data since I know what all the data is and this function will be removed before the site goes live.

Hi bar338, I had exactly the same issue. I think the thread where we discussed it might help you. The link is

http://www.sitepoint.com/forums/showthread.php?t=631407&highlight=co.ador

However, even when I input them in via the command line they don’t get saved with the correct letter.

Did you set proper encoding for the console client?
Does your console support proper encoding?
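The same question applies to the PHP connection. A minimal sketch, assuming $db is a mysqli connection and the data is meant to be UTF-8; useUtf8Connection() is a made-up helper name:

```php
<?php
// Hypothetical helper: $db is expected to be a mysqli connection (or any
// object exposing set_charset() and an error property in the same way).
function useUtf8Connection($db)
{
    // Tell the server which character set this client sends and expects.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    return true;
}

// In the mysql console the rough equivalent is:
//   SET NAMES utf8;
// or starting the client with --default-character-set=utf8.
```

If the client announces one character set but actually sends another, the server converts bytes that were already correct, which is one common way to end up with double-encoded text.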

I didn’t really see any reason to escape the data since I know… blah blah blah

So, you can lose some of your data then.
Because you have no idea what escaping is for.

Good point. Don’t confuse what escaping can be used for with what it does. It’s true that escaping is used to make database queries safer. But how does it do this? By signifying that the character following is to be considered a “literal” character. That is, if the character has some other significance besides its appearance, escaping nullifies it.

In fact, depending on how many and what type of layers the data gets passed through, you may even need to escape the escapes.
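A toy illustration of layered escaping, using addslashes() and stripslashes() purely to show the mechanics (for real queries, use the database’s own escaping or prepared statements):

```php
<?php
$raw = "O'Reilly";

$once  = addslashes($raw);    // one layer of escaping
$twice = addslashes($once);   // the escapes themselves are now escaped

echo $once, "\n";    // O\'Reilly
echo $twice, "\n";   // O\\\'Reilly

// Each layer that interprets the escapes removes exactly one level.
echo stripslashes($twice), "\n";                 // O\'Reilly
echo stripslashes(stripslashes($twice)), "\n";   // O'Reilly
```

The data only comes out intact if the number of escaping steps matches the number of layers that interpret them.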