  1. #1
    SitePoint Enthusiast Mounty's Avatar
    Join Date
    Mar 2008
    Location
    UK
    Posts
    90

    Disabling multibyte character conversion?

    My PHP script (and MySQL database) use the default latin1 charset, but I've found that if I enter any multibyte text (such as Japanese, Hindi, etc.) into my contact form script, PHP automatically converts the text into HTML entities.

    I looked up how to disable the conversion: setting 'mbstring.encoding_translation' to 'off' and 'mbstring.http_input' to 'pass' was supposed to do it.
    But according to my phpinfo() those values were already set to off and pass, yet PHP is still converting the multibyte text to HTML entities!


    It may seem a small thing, but I like to get to the bottom of every strange problem I encounter. Could this be a bug or am I missing something? This is under php 5.2.8

    cheers

  2. #2
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,580
    Under normal configuration PHP does not do anything to alter your text. Only an explicit call to htmlentities() or htmlspecialchars() or similar would convert characters into entities. You're sure this isn't happening anywhere in your code?

  3. #3
    SitePoint Enthusiast Mounty's Avatar
    Join Date
    Mar 2008
    Location
    UK
    Posts
    90
    Aye, I've set up a basic test script using the data straight from $_REQUEST. I just double-checked and there are no calls to those functions. By the time my script inserts into the database, the multibyte text has already been converted to HTML entities.

  4. #4
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    The browser is doing it. Make sure that your pages (the ones with the forms) are served in a Unicode encoding (UTF-8, etc.).

  5. #5
    SitePoint Enthusiast Mounty's Avatar
    Join Date
    Mar 2008
    Location
    UK
    Posts
    90

    Hi there! You're right it was the browser, a real tricky problem that was.

    I have another one though. I found that if I declare the web page content as UTF-8 in the meta data and store text sent from the form in a latin1 database, the stored string looks completely garbled and unreadable. But if I then retrieve the string and display it within another UTF-8 page, it appears as the original text.

    How does that work? If latin1 isn't supposed to have the capacity for multibyte characters, is it somehow automatically using multiple bytes in the latin1 database?

  6. #6
    Follow Me On Twitter: @djg gold trophysilver trophybronze trophy Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,580
    The sequence of bytes that appeared in your INSERT query is the sequence of bytes stored.

    The character set you specify for the table only tells MySQL how to handle the values when you do things like apply string manipulation functions in the query. Otherwise it doesn't care what's in your string columns; it's just binary data you've told it happens to be text in some encoding.

    If you interpret the sequence of bytes as UTF-8, you see the characters that were input. If you interpret it as Latin-1, you get junk wherever the two encodings differ. But you're looking at the same 0s and 1s in both cases. As long as you don't modify the data between the user input, the insert, the retrieval and the display, none of those settings (at the database, web server or client level) actually change the data.
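
    A minimal sketch of the point above, assuming Node.js (the Buffer API used here is Node-specific, and the sample string is only an illustration). It encodes a string as UTF-8, then reads the same bytes back once as Latin-1 and once as UTF-8:

```javascript
// Same bytes, two interpretations (Node.js Buffer API).
const original = "日本語";

// What a UTF-8 page submits: 3 bytes per character here, 9 bytes in total.
const bytes = Buffer.from(original, "utf8");

// Reading those bytes as Latin-1 yields one junk character per byte.
const asLatin1 = bytes.toString("latin1");

// Reading the very same bytes as UTF-8 recovers the input.
const asUtf8 = bytes.toString("utf8");

// Writing the "junk" string back out as Latin-1 reproduces the identical bytes.
const roundTrip = Buffer.from(asLatin1, "latin1");

console.log(bytes.length, asLatin1.length, asUtf8 === original, roundTrip.equals(bytes));
```

    The garbled Latin-1 view and the readable UTF-8 view come from the same nine bytes, which is why the text survives the trip through a latin1 column as long as nothing converts it along the way.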

  7. #7
    SitePoint Enthusiast Mounty's Avatar
    Join Date
    Mar 2008
    Location
    UK
    Posts
    90
    Ahh, right, so it's storing the same byte sequence regardless of the charset.

    Of course! It makes total sense!

    thanks Dan

  8. #8
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    6
    One year since this discussion happened, and I've hit the same roadblock in my application now.

    I have a form with text fields for address lines, and it's only meant for US addresses, which obviously don't need full-width characters. Unfortunately, one of my users, a Japanese guy, entered English letters with full-width input enabled. This messed up the data in the database.

    Is there a way I can disable input of full-width characters in the form?
    Or maybe do some checks in the code to see if it has been entered in full-width?

    My application is Java/JSP, and I know this isn't the right forum for this post, but the discussion seemed so relevant that I couldn't resist.

    Any help will be greatly appreciated.

  9. #9
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by toyoyo View Post
    One year since this discussion happened, and I've hit the same roadblock in my application now.
    Actually, it's more like a month ago.

    Quote Originally Posted by toyoyo View Post
    Unfortunately, one of my users, a Japanese guy, entered English letters with full-width input enabled.
    UTF-8 and ISO-8859-1 are compatible for basic English characters (ASCII), so that can't be all there is to it. Are you sure that this Japanese guy hasn't entered a name which includes non-English characters, like, say, Japanese characters?

    Quote Originally Posted by toyoyo View Post
    Is there a way I can disable input of full-width characters in the form?
    Or maybe do some checks in the code to see if it has been entered in full-width?
    The best choice is to use utf-8. All your problems will go away then. Alternatively, you will have to validate user input. I'm not sure which libraries are available on the Java platform though.

  10. #10
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    6

    Quote Originally Posted by kyberfabrikken View Post
    Actually, it's more like a month ago.
    My bad. I have this bad habit of looking at the join date of the person who wrote the post and thinking it's the post date!

    Quote Originally Posted by kyberfabrikken View Post
    UTF-8 and ISO-8859-1 are compatible for basic English characters (ASCII), so that can't be all there is to it.
    Well, the page is set to use UTF-8 as the encoding, but then does that restrict what the user can enter in the textboxes?

    Quote Originally Posted by kyberfabrikken View Post
    Are you sure that this Japanese guy hasn't entered a name which includes non-English characters, like, say, Japanese characters?
    Yes, I am sure, because it is a US Postal Address that was entered. The page itself showed English text "P.O.BOX ...", but the encoding was in DBCS and the text in the application logs showed something like this:

    Code:
    P.O. BOX
    Quote Originally Posted by kyberfabrikken View Post
    The best choice is to use utf-8. All your problems will go away then. Alternatively, you will have to validate user input. I'm not sure which libraries are available on the Java platform though.
    Ok, I will do some searching for that.
    Is there any way we can validate this in Javascript, instead of doing it on the server side?

  11. #11
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by toyoyo View Post
    Well, the page is set to use UTF-8 as the encoding, but then does that restrict what the user can enter in the textboxes?
    People can enter anything into the text boxes, no matter what encoding the page is in. However, if your page is in something other than UTF-8, the browser doesn't have a meaningful way of sending characters that the page encoding cannot represent. What it does in this case is encode the characters as HTML entities. That's what's happening for you as far as I can tell, which means that your page is probably served as ISO-8859-1 and not UTF-8.
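
    A rough sketch of that browser behaviour: any character the form's encoding cannot represent is submitted as a numeric character reference. The function name simulateLatin1Submit is hypothetical, and the U+00FF cutoff is a simplification (real browsers treat the iso-8859-1 label as windows-1252, which covers a few more characters):

```javascript
// Hypothetical helper: approximates what a browser submits from a form
// whose encoding is ISO-8859-1. Characters above U+00FF cannot be
// represented, so they are replaced by numeric character references.
function simulateLatin1Submit(text) {
    let out = "";
    for (const ch of text) {              // iterates by code point
        const cp = ch.codePointAt(0);
        out += cp <= 0xFF ? ch : "&#" + cp + ";";
    }
    return out;
}

console.log(simulateLatin1Submit("P.O. BOX")); // ASCII passes through unchanged
console.log(simulateLatin1Submit("ＰＯ"));      // full-width letters become entities
```

    This is why entity-looking text can show up on the server even though no server-side code ever calls an entity-encoding function.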

    Quote Originally Posted by toyoyo View Post
    Is there any way we can validate this in Javascript, instead of doing it on the server side?
    You could use a regular expression to check the contents of your text-field. I think the following would work on all browsers (although I only tested in Firefox):
    Code:
    /^[\x00-\x7f]*$/
    Hook it up to the onsubmit handler of your form element.

    You shouldn't rely solely on client side validation though, since it's quite brittle.
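
    One way the regex might be wired up, as a sketch only. The form id "address-form" and the field name "address1" are placeholders, not taken from the actual application, and the DOM part is guarded so the snippet also runs outside a browser:

```javascript
// The regex from the post: true only if every character is ASCII.
function isAsciiOnly(value) {
    return /^[\x00-\x7f]*$/.test(value);
}

// Browser-only wiring; the id "address-form" and the field name
// "address1" are placeholders for illustration.
if (typeof document !== "undefined") {
    const form = document.getElementById("address-form");
    if (form) {
        form.addEventListener("submit", function (event) {
            if (!isAsciiOnly(form.elements["address1"].value)) {
                event.preventDefault();   // block the submit
                alert("Please use standard (half-width) characters only.");
            }
        });
    }
}
```

    The same check should be repeated on the server, since anyone can bypass or disable the script.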

  12. #12
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    6
    Quote Originally Posted by kyberfabrikken View Post
    What it does in this case is encode the characters as HTML entities. That's what's happening for you as far as I can tell, which means that your page is probably served as ISO-8859-1 and not UTF-8.
    Well, I checked the HTTP headers for one of my pages, and it showed that the server is serving the page as UTF-8.

    I was also doing some digging.. and noticed the "accept-charset" attribute for the HTML form tag. I was wondering if this can be used to restrict the charset in this case.

    Quote Originally Posted by kyberfabrikken View Post
    You could use a regular expression to check the contents of your text-field.
    Thanks for the useful regex.. I tried a couple of tests with this and it works. Will try integrating that when I get back to work tomorrow.

  13. #13
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Quote Originally Posted by toyoyo View Post
    Well, I checked the HTTP headers for one of my pages, and it showed that the server is serving the page as UTF-8.
    The HTTP headers take precedence over any meta tag in the page, so if they say UTF-8, then the page is UTF-8.

    Quote Originally Posted by toyoyo View Post
    I was also doing some digging.. and noticed the "accept-charset" attribute for the HTML form tag. I was wondering if this can be used to restrict the charset in this case.
    If accept-charset is set, the form will be submitted in that encoding, rather than the encoding of the page. It's rarely used.
    In your case, restricting the charset won't help you; you need to restrict the characters, not the encoding of them.

  14. #14
    SitePoint Wizard
    Join Date
    Mar 2008
    Posts
    1,149
    You can always convert the full width characters to half width:
    Code js:
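    // Shift full-width ASCII variants (U+FF01-U+FF5E) down to U+0021-U+007E.
    str = str.replace(/[\uFF01-\uFF5E]/g, function(m) {
        return String.fromCharCode(m.charCodeAt(0) - 0xFEE0); // 0xFEE0 = 65248
    });
    Then check for any remaining non-ASCII characters.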

    http://unicode.org/charts/PDF/UFF00.pdf -> http://unicode.org/charts/PDF/U0000.pdf
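
    A sketch that combines both steps, folding full-width forms to ASCII and then checking whether anything non-ASCII is left. Instead of the manual offset it uses the built-in String.prototype.normalize("NFKC"), which is a swapped-in alternative rather than the method from the post. NFKC also folds other compatibility characters (for example the ideographic space U+3000 becomes a plain space), which may or may not be what you want. The function name toHalfWidthAscii is made up for illustration:

```javascript
// Illustrative helper (name is made up): fold full-width forms to ASCII,
// then report whether anything non-ASCII is left.
function toHalfWidthAscii(input) {
    // NFKC compatibility normalization maps U+FF01-U+FF5E to U+0021-U+007E
    // and the ideographic space U+3000 to a plain space.
    const folded = input.normalize("NFKC");
    return { value: folded, ascii: /^[\x00-\x7f]*$/.test(folded) };
}

const result = toHalfWidthAscii("Ｐ．Ｏ．　ＢＯＸ　１２３");
console.log(result.value, result.ascii);
```

    Input that still contains non-ASCII characters after folding (actual Japanese text, for instance) comes back with ascii set to false, so it can be rejected or flagged.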

  15. #15
    SitePoint Member
    Join Date
    Apr 2009
    Posts
    6
    Thanks guys, I almost got it working now.

    Now for some real testing.. where do I find a Japanese guy? :P
    Just kidding. My application isn't on the internet yet, so I have to test it myself.

