  1. #1
    SitePoint Zealot nicc9's Avatar

    weird character encoding prob using AJAX

    hi, all.

    I built a little app that uses Ajax/PHP.

    from a main page, one can send some text to the DB, using an Ajax request.

    now, the main page has character encoding ISO-8859-1, set both in the meta tags and with PHP, as header('Content-type ... etc.

    same thing in the destination file (the one that does the processing in the background): the character encoding is set using PHP, and is ISO-8859-1.

    well, the problem is that the text ends up in the DB in UTF-8 (I believe), or anyway in the wrong charset.

    what could be the cause?

    thanks!

  2. #2
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    XMLHttpRequest sends data as UTF-8 by default, no matter what encoding the calling page is in. You have to set the encoding explicitly.
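
To make the point above concrete, here is a minimal sketch, written for Node.js, of what the same character looks like in each encoding. The helper names `utf8Bytes` and `latin1Bytes` are made up for illustration; `encodeURIComponent` and `Buffer` are standard.

```javascript
// Sketch: compare the bytes of one accented character in UTF-8 and ISO-8859-1.
// utf8Bytes and latin1Bytes are illustrative helper names, not a library API.
function utf8Bytes(text) {
  return Array.from(Buffer.from(text, "utf8"));
}

function latin1Bytes(text) {
  return Array.from(Buffer.from(text, "latin1"));
}

const sample = "\u00E9"; // e with acute accent

// encodeURIComponent always percent-encodes the UTF-8 bytes,
// regardless of the encoding the page itself was served in.
console.log(encodeURIComponent(sample)); // %C3%A9
console.log(utf8Bytes(sample));          // two bytes: 0xC3, 0xA9
console.log(latin1Bytes(sample));        // one byte: 0xE9
```

A server-side script that expects the single byte 0xE9 but receives 0xC3 0xA9 will store two wrong-looking characters, which matches the symptom described in the first post.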

  3. #3
    SitePoint Guru Chroniclemaster1's Avatar
    And this isn't the only place that crops up. More and more, UTF-8 is emerging as the standard character encoding. Switching all your references from iso-8859-1 to utf-8 is the better way to go, as long as you convert the stored data and every declaration together. The plain ASCII characters are byte-for-byte identical in both encodings; the accented Latin-1 characters keep the same code points but take two bytes each in UTF-8. UTF-8 is just a WHOLE LOT bigger and includes many more characters.

  4. #4
    SitePoint Zealot nicc9's Avatar
    Quote Originally Posted by kyberfabrikken View Post
    XMLHttpRequest sends data as UTF-8 by default, no matter what encoding the calling page is in. You have to set the encoding explicitly.
    right, so you mean I have to set it from within Javascript, and not the page (PHP), right?

    is there like a function, or is it part of the XMLHttpRequest object itself?

    thanks!!!

  5. #5
    SitePoint Zealot nicc9's Avatar
    Quote Originally Posted by Chroniclemaster1 View Post
    And this isn't the only place that crops up. More and more, UTF-8 is emerging as the standard character encoding. Switching all your references from iso-8859-1 to utf-8 is the better way to go, as long as you convert the stored data and every declaration together. The plain ASCII characters are byte-for-byte identical in both encodings; the accented Latin-1 characters keep the same code points but take two bytes each in UTF-8. UTF-8 is just a WHOLE LOT bigger and includes many more characters.
    right. I know what a char encoding is, and I actually read a few articles about it, but I THINK I'm missing a few important points.

    what I care about are accented letters.

    now, I know that UTF-8 has those.

    however, what tells the back-end what char encoding the input from the user is? does it depend on the operating system or the page's char encoding?


    I'd have no prob using UTF-8 as default character encoding for all my stuff, it's just that I thought users would send the back-end text encoded in 8859-1, because of the OS.

    also, many times I have to copy and paste text from Word into clients' sites, and that's always iso-8859-1.

    kind of confusing...

  6. #6
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote Originally Posted by nicc9 View Post
    right, so you mean I have to set it from within Javascript, and not the page (PHP), right?
    Yep, that's what I meant.

    Quote Originally Posted by nicc9 View Post
    is there like a function, or is it part of the XMLHttpRequest object itself?
    I thought there was. However, I just checked, and apparently there is no way to specify the encoding of the data in the request. So it seems you need to use UTF-8.

    Quote Originally Posted by nicc9 View Post
    however, what tells the back-end what char encoding the input from the user is? does it depend on the operating system or the page's char encoding?
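    PHP doesn't detect the encoding at all. Its strings are plain byte sequences and the built-in string functions treat every byte as one character, so in practice PHP behaves as if everything were ISO-8859-1. If the transfer encoding is UTF-8 and the rest of your application is ISO-8859-1, you have to decode UTF-8 to ISO-8859-1 manually. You can use the function utf8_decode for this.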

    Quote Originally Posted by nicc9 View Post
    also, many times I have to copy and paste text from Word into clients' sites, and that's always iso-8859-1.
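    Actually, it's probably CP-1252 (aka Windows-1252). ISO-8859-1 and CP-1252 are very similar: they differ only in the byte range 0x80-0x9F, where ISO-8859-1 has control codes and CP-1252 has 27 printable characters, such as curly quotes, dashes, the ellipsis and the euro sign. Those are exactly the characters Word likes to insert, so that is where pasted text usually breaks.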
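
The CP-1252 versus ISO-8859-1 difference can be sketched in a few lines of Node.js. The table below is deliberately partial (only a handful of the CP-1252 extras), and `CP1252_EXTRAS`, `decodeLatin1` and `decodeCp1252Subset` are illustrative names, not an existing API.

```javascript
// Sketch: the same bytes read as ISO-8859-1 and as (a subset of) CP-1252.
// Only a few CP-1252 entries from the 0x80-0x9F range are listed here.
const CP1252_EXTRAS = new Map([
  [0x80, "\u20AC"], // euro sign
  [0x85, "\u2026"], // horizontal ellipsis
  [0x91, "\u2018"], // left single quotation mark
  [0x92, "\u2019"], // right single quotation mark
  [0x93, "\u201C"], // left double quotation mark
  [0x94, "\u201D"], // right double quotation mark
  [0x96, "\u2013"], // en dash
  [0x97, "\u2014"], // em dash
]);

// ISO-8859-1: every byte maps to the code point with the same number.
function decodeLatin1(bytes) {
  return Buffer.from(bytes).toString("latin1");
}

// CP-1252 (subset): same as ISO-8859-1 except for the listed extras.
function decodeCp1252Subset(bytes) {
  return Array.from(bytes, (b) =>
    CP1252_EXTRAS.has(b) ? CP1252_EXTRAS.get(b) : String.fromCharCode(b)
  ).join("");
}

// A word wrapped in Word-style curly quotes, as CP-1252 bytes.
const pasted = [0x93, 0x63, 0x61, 0x66, 0xE9, 0x94];
console.log(JSON.stringify(decodeCp1252Subset(pasted))); // curly quotes around the word
console.log(JSON.stringify(decodeLatin1(pasted)));       // control codes U+0093 and U+0094 instead
```

The accented letter (0xE9) comes out the same either way; only the quote bytes change meaning.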

  7. #7
    SitePoint Zealot nicc9's Avatar
    thanks for your help!

    that's weird that PHP assumes 8859-1... I didn't know that... so no matter what encoding the page is, like if I put UTF-8 in the meta tags, PHP will always treat data as if it were 8859-1?

    but anyway - basically, when the data leaves the page it's UTF-8, because JS has encoded it that way. then I have to decode it when it gets to the PHP script, and then put it in the DB.

    did I get it right?

  8. #8
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote Originally Posted by nicc9 View Post
    that's weird that PHP assumes 8859-1... I didn't know that... so no matter what encoding the page is, like if I put UTF-8 in the meta tags, PHP will always treat data as if it were 8859-1?
    Yeah, blame it on Eurocentrism. ISO-8859-1 is an 8-bit encoding. Dealing with strings as streams of 8-bit characters is simpler than treating them as streams of variable-length characters (UTF-8), but it limits the number of possible characters. The major change planned from PHP 5 to PHP 6 is native Unicode support throughout the code base. That should make dealing with UTF-8 much easier. Until then, you need to be very aware of which encodings are in play, and where.

    Quote Originally Posted by nicc9 View Post
    but anyway - basically, when the data leaves the page it's UTF-8, because JS has encoded it that way. then I have to decode it when it gets to the PHP script, and then put it in the DB.

    did I get it right?
    Exactly.

    PHP will always treat input as if it were ISO-8859-1. Browsers generally default to this too, but if your page is in a different encoding (e.g. UTF-8), the browser assumes that forms should be submitted in that same encoding, and does so. That is, unless you have specified an accept-charset attribute on the form, in which case it will generally obey that. I'm not entirely sure how reliable that is, though. XMLHttpRequest, however, always sends data as UTF-8, presumably because UTF-8 is the default encoding for XML documents and XMLHttpRequest was originally meant for sending and receiving XML documents (hence the name).
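
The decode step described above (UTF-8 in, ISO-8859-1 out) can be sketched as follows. This is JavaScript for Node.js rather than PHP, and `utf8ToLatin1` is a hypothetical helper that mimics the idea behind utf8_decode, assuming characters outside ISO-8859-1 are replaced with a question mark. It is not a byte-exact reimplementation, and malformed input may be handled differently.

```javascript
// Sketch: convert UTF-8 bytes to ISO-8859-1 bytes, in the spirit of
// PHP's utf8_decode. utf8ToLatin1 is an illustrative name.
function utf8ToLatin1(utf8Bytes) {
  const text = Buffer.from(utf8Bytes).toString("utf8");
  const out = [];
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    // ISO-8859-1 covers code points 0x00-0xFF only; anything else becomes "?".
    out.push(codePoint <= 0xFF ? codePoint : 0x3F);
  }
  return Buffer.from(out);
}

// What an Ajax request delivers for a word with an accented letter.
const received = Buffer.from("caf\u00E9", "utf8"); // 5 bytes
const stored = utf8ToLatin1(received);             // 4 bytes, last one is 0xE9
console.log(received.length, stored.length);
```

Anything outside ISO-8859-1 (Cyrillic, for instance) is lost in this conversion, so it only suits sites whose content stays within Western European characters.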

  9. #9
    SitePoint Zealot nicc9's Avatar
    Quote Originally Posted by kyberfabrikken View Post
    Browsers generally default to this too, but if your page is in a different encoding (e.g. UTF-8), the browser assumes that forms should be submitted in that same encoding, and does so. That is, unless you have specified an accept-charset attribute on the form, in which case it will generally obey that. I'm not entirely sure how reliable that is, though. XMLHttpRequest, however, always sends data as UTF-8, presumably because UTF-8 is the default encoding for XML documents and XMLHttpRequest was originally meant for sending and receiving XML documents (hence the name).
    sorry, I'm not sure I've got that.

    are you saying that if I set UTF-8 as char encoding using the HTML metatag the form data will be encoded in UTF-8? or will PHP still treat it as 8859-1?

    how do you deal with this?

    if UTF-8 is the standard for most languages and tools, I'd probably want to have everything in that char encoding... if all I have to do is set the meta tags then that'd be easy. but if PHP assumes 8859-1 no matter what encoding is specified in the meta tags, do I have to utf8_encode everything?

    thanks for your help!!!

  10. #10
    SitePoint Guru
    Quote Originally Posted by nicc9 View Post
    if UTF-8 is the standard for most languages and tools, I'd probably want to have everything in that char encoding... if all I have to do is set the meta tags then that'd be easy. but if PHP assumes 8859-1 no matter what encoding is specified in the meta tags, do I have to utf8_encode everything?
    In my experience, to get everything to work right I needed to switch everything to UTF-8: database collation, table collation, table charset, HTML meta tag, the SET NAMES query for MySQL, PHP encoding tag, XML encoding tag. Only after that did everything start to work fine.

    I never used the utf8_encode/utf8_decode functions, as everything seems to work fine without them. But I may be wrong.

  11. #11
    SitePoint Zealot nicc9's Avatar
    hey al9, what is the difference between table collation and table char set? and what is SET NAMES?

  12. #12
    SitePoint Guru
    Quote Originally Posted by nicc9 View Post
    hey al9, what is the difference between table collation and table char set? and what is SET NAMES?
    If I understand this correctly, collation is how data is given to external software, and charset is how data is stored in the database. I don't know exactly what SET NAMES does, but it fixed a few UTF-8 related problems in my PHP/MySQL/XML application.

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Quote Originally Posted by nicc9 View Post
    sorry, I'm not sure I've got that.

    are you saying that if I set UTF-8 as char encoding using the HTML metatag the form data will be encoded in UTF-8?
    If the page is served as UTF-8, the browser will send UTF-8 back when submitting forms. The meta tag is a bit tricky, since it's generally unreliable: the browser only honours it if the web server doesn't send a header specifying the encoding. Usually the server will, and usually it will specify ISO-8859-1. So look out for that. If you're in doubt, try selecting View > Character Encoding from the browser's (Firefox) menu; this will show you which charset the page is interpreted as.

    Quote Originally Posted by nicc9 View Post
    hey al9, what is the difference between table collation and table char set? and what is SET NAMES?
    Collation is a culture setting in the database. It governs how strings are sorted. For example, in German, Ö comes after O, but in Swedish, it comes last in the alphabet.

    Quote Originally Posted by nicc9 View Post
    or will PHP still treat it as 8859-1?

    how do you deal with this?
    PHP treats all strings as if they were ISO-8859-1. If your page is served as UTF-8, you must expect the browser to send UTF-8 back. Thus you have to manually transform the data from UTF-8 to ISO-8859-1 before you can use it in PHP. You can use the PHP function utf8_decode for this. There is a limitation to this strategy: UTF-8 can represent characters that simply don't exist in ISO-8859-1, for example Arabic or Cyrillic characters, and utf8_decode turns those into question marks. If you don't need such characters, I recommend this approach.

    SET NAMES changes the encoding of the database connection, i.e. the connection between the database and PHP. So even if your tables are stored in one charset, they can be exposed to PHP in a different charset.

    Assuming that you need the full Unicode range in your content, you can't use ISO-8859-1 internally in PHP. You can then choose to let the database send UTF-8 data to PHP. This is the strategy that al9 has used.
    It saves you from encoding/decoding all input from and output to the browser, and it allows you to use the full range of UTF-8. The downside of this approach is that all of PHP's internal string manipulation functions work on single bytes, as if the string were ISO-8859-1, so they will not work correctly. For example, strlen() will return the wrong result if you have non-ASCII characters in the string, since these are encoded as multiple bytes in UTF-8 but as single bytes in ISO-8859-1.
    PHP has an extension, mbstring, which can overload string functions to work with UTF-8. If you want to use UTF-8 internally in PHP, you should probably try to use that.
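
The strlen() pitfall mentioned above comes down to counting bytes instead of characters. A minimal Node.js sketch of the same effect, using only the standard `Buffer` API (the variable names are illustrative):

```javascript
// Sketch: byte count versus character count for a UTF-8 string.
const word = "caf\u00E9"; // four characters, the last one accented

const characters = word.length;                     // 4
const utf8Length = Buffer.byteLength(word, "utf8"); // 5: the accented letter takes two bytes
const latin1Length = Buffer.byteLength(word, "latin1"); // 4: one byte per character

console.log(characters, utf8Length, latin1Length);

// Cutting a UTF-8 string at a byte offset can split a character in half,
// which is what a byte-oriented substring function would do.
const cut = Buffer.from(word, "utf8").subarray(0, 4).toString("utf8");
console.log(JSON.stringify(cut)); // "caf" followed by the replacement character U+FFFD
```

A byte-oriented length function reports 5 for this word when it is stored as UTF-8, and a byte-oriented truncation can leave half a character behind. Those are the problems a multibyte-aware string library is meant to avoid.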

  14. #14
    SitePoint Zealot nicc9's Avatar
    kyberfabrikken, thanks a thousand for your help and explanations, you rock!

    I still haven't made up my mind about what to use, but I think I'll go for 8859-1 if that's PHP's default and other char sets can cause probs.

    guess it's easier just decoding data that comes from AJAX requests...

