  1. #1
    Dan Grossman (Twitter: @djg)
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,580
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)

    What character encoding will JS use?

    I use JavaScript to grab the title of the webpage a script runs on and send it to a server for recording. The script writes out an image tag with the page title in the URL; the "image" is actually the remote script that records it.

    Something like this:
    Code javascript:
    document.write('<img src="http://www.example.com/image.php?title=' 
      + escape(document.title) 
      + '" />');

    Do I know anything about the encoding of that text when I get it? Since every page the script resides on can have a different encoding, is there any way to reliably record the webpage title without bungling characters outside the US-ASCII range?

  2. #2
    kyberfabrikken (SitePoint Wizard)
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Do you need to use document.write()?

    I'm not 100% sure what happens when you use document.write(), but otherwise JavaScript uses Unicode internally (I think that goes for all browsers), so I think it would use UTF-8 for non-ASCII characters.

  3. #3
    Dan Grossman
    Quote Originally Posted by kyberfabrikken View Post
    Do you need to use document.write()?
    Unfortunately, yes

  4. #4
    kyberfabrikken
    Come to think of it, I don't think document.write() matters at all. Everything in JavaScript's internal memory is Unicode, and escape() should return the Unicode code point for each character. Characters beyond the range of %FF are encoded as %uXXXX, where XXXX is the Unicode code point in hexadecimal.
    There may be issues if the document doesn't specify its encoding (i.e. the browser is forced to guess which encoding the document is in), but that would also be a problem for regular use.

    Whether your server side language is able to grok %uXXXX is another story.
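
To make the %XX versus %uXXXX distinction concrete, here is a minimal sketch of what the legacy escape() global returns for one Latin-1 character and one Greek character. The variable names are only illustrative, and the sketch assumes an environment that still exposes escape() (browsers and Node.js do).

```javascript
// escape() is a legacy global; it is used here only to illustrate its output format.
const latin1Escaped = escape('\u00E9'); // "é", value 0xE9, fits in %XX
const greekEscaped = escape('\u03B4');  // "δ", value 0x03B4, needs %uXXXX

console.log(latin1Escaped); // %E9
console.log(greekEscaped);  // %u03B4
```

The %uXXXX form is not standard percent-encoding, which is why the server side may not decode it without extra work.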

  5. #5
    Dan Grossman
    That's good. I'll have to revisit all my code and see where things may go wrong in the process. I may also have some DB tables using latin1 collation instead of utf8 still.

    Some good notes on handling unicode escape()'d stuff from JavaScript on the PHP end when it needs to be un-encoded: http://us2.php.net/urldecode

  6. #6
    kyberfabrikken
    Quote Originally Posted by Dan Grossman View Post
    That's good. I'll have to revisit all my code and see where things may go wrong in the process. I may also have some DB tables using latin1 collation instead of utf8 still.
    That's probably a typo on your part, but I just want to make sure that you don't confuse collation with encoding; they are two separate matters.

  7. #7
    Dan Grossman
    Quote Originally Posted by kyberfabrikken View Post
    That's probably a typo on your part, but I just want to make sure that you don't confuse collation with encoding; they are two separate matters.
    Sorry, you're right, I meant that I have some tables where the character set is latin1 (the server default when none was specified in the CREATE TABLE query).

    Although .. if I start storing the text while still url-encoded by JavaScript's escape(), it's probably safe in a latin1 table as long as I'm dealing with utf8 on the page displaying it.

  8. #8
    felgall (Programming Since 1978)
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,810
    Mentioned
    25 Post(s)
    Tagged
    1 Thread(s)
    escape() is deprecated because it does NOT support Unicode - it only supports ASCII. The encodeURI() and encodeURIComponent() functions are the replacements that do support Unicode.
    Stephen J Chapman

    javascriptexample.net, Book Reviews, follow me on Twitter
    HTML Help, CSS Help, JavaScript Help, PHP/mySQL Help, blog
    <input name="html5" type="text" required pattern="^$">

  9. #9
    Dan Grossman
    Quote Originally Posted by felgall View Post
    escape() is deprecated because it does NOT support Unicode - it only supports ASCII. The encodeURI() and encodeURIComponent() functions are the replacements that do support Unicode.
    Awesome tip. I'll be updating a bunch of code to use encodeURIComponent instead after your advice brought me here:

    http://xkr.us/articles/javascript/encode-compare/
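
For comparison with escape(), a small sketch of what encodeURIComponent() produces: it percent-encodes the UTF-8 bytes of each character, regardless of the encoding of the page the script runs on. The sample strings and variable names are only illustrative.

```javascript
// encodeURIComponent() always emits UTF-8 bytes as %XX sequences.
const encodedDelta = encodeURIComponent('\u03B4'); // "δ" is two bytes in UTF-8
const encodedRaquo = encodeURIComponent('\u00BB'); // "»" is also two bytes in UTF-8
const encodedTitle = encodeURIComponent('Blog \u00BB Archive');

console.log(encodedDelta); // %CE%B4
console.log(encodedRaquo); // %C2%BB
console.log(encodedTitle); // Blog%20%C2%BB%20Archive
```

Because the output is plain %XX sequences, a server that percent-decodes the query string ends up with the raw UTF-8 bytes of the title.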

  10. #10
    Dan Grossman
    This seems to work pretty well:

    - In the JS, check if encodeURIComponent is available; if so, use it, else fall back to escape

    - On the receiving end, convert to HTML entities:
    PHP Code:
    htmlentities($string, ENT_NOQUOTES, 'UTF-8');
    - The database doesn't need to have a utf8 character set since the upper range characters are encoded as HTML entities

    - On display, page character set is UTF8 and the entities render correctly

    Seems to be working fine now, see any flaws?
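
The client-side step in the list above could be sketched roughly like this. The helper names encodeTitle and buildTrackingUrl are hypothetical, and the URL is the example.com placeholder from the first post, not a real endpoint.

```javascript
// Hypothetical helper: prefer encodeURIComponent(), fall back to legacy escape().
function encodeTitle(title) {
  if (typeof encodeURIComponent === 'function') {
    return encodeURIComponent(title); // UTF-8 percent-encoding
  }
  return escape(title); // very old browsers only; yields %XX / %uXXXX instead
}

// Hypothetical helper: build the image URL that carries the page title.
function buildTrackingUrl(title) {
  return 'http://www.example.com/image.php?title=' + encodeTitle(title);
}

const trackingUrl = buildTrackingUrl('Caf\u00E9 \u03B4');
console.log(trackingUrl);
// http://www.example.com/image.php?title=Caf%C3%A9%20%CE%B4
```

In a page, the result would be passed to document.write() inside the img tag exactly as in the original snippet. Note that the two branches produce different formats, so the server would need to cope with both if the fallback is ever hit.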

  11. #11
    kyberfabrikken
    Quote Originally Posted by felgall View Post
    The encodeURI() and encodeURIComponent() functions are the replacements that do support Unicode.
    Thanks - That's good information.

    Quote Originally Posted by Dan Grossman View Post
    - On the receiving end, convert to HTML entities:
    PHP Code:
    htmlentities($string, ENT_NOQUOTES, 'UTF-8');
    Seems to be working fine now, see any flaws?
    If you decode the string in PHP, it will be temporarily represented in the internal charset, before you call htmlentities. Since PHP uses ISO-8859-1 internally, you'd be messing up anything outside this range.

  12. #12
    Dan Grossman
    Quote Originally Posted by kyberfabrikken View Post
    Thanks - That's good information.



    If you decode the string in PHP, it will be temporarily represented in the internal charset, before you call htmlentities. Since PHP uses ISO-8859-1 internally, you'd be messing up anything outside this range.
    Gah, that's strange. And rude of PHP! It shouldn't convert strings before I use them

    Now thinking about passing a reference so the string doesn't get converted before I can deal with it. Something like...
    PHP Code:
    call_user_func_array('mb_convert_encoding', array(&$_GET['string'], 'HTML-ENTITIES', 'UTF-8'));
    Last edited by Dan Grossman; May 22, 2007 at 10:58.

  13. #13
    kyberfabrikken
    The problem is that PHP automatically decodes the input (the URI-encoded string) to the internal charset. That's why you can access it through $_GET, and that's why you can treat the values as strings inside PHP. You'll need some way to go directly from the URI-encoded string to HTML entities, so that it is never decoded into a PHP string.
    From what I can deduce from the PHP manual, urldecode doesn't recognise Unicode escapes (%uXXXX) at all, so I would guess that they simply pass through as such.
    Since the %uXXXX format holds the Unicode code point in hexadecimal, you can simply replace %uXXXX with &#xXXXX; (note the x, which marks a hexadecimal character reference) to get the corresponding HTML entities. Of course, you'll still have the problem that you can't really manipulate these strings (or sort them properly), since they are encoded. But at least they'll be displayed correctly.
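
A rough sketch of that replacement, written in JavaScript to match the rest of the thread (the same regular expression should carry over to a server-side language). The function name percentUToEntities is hypothetical. Since the digits in %uXXXX are hexadecimal, the sketch emits hexadecimal character references of the form &#x...; and leaves ordinary %XX sequences untouched.

```javascript
// Hypothetical helper: turn escape()-style %uXXXX sequences into HTML
// hexadecimal character references. Plain %XX sequences are not changed.
function percentUToEntities(s) {
  return s.replace(/%u([0-9A-Fa-f]{4})/g, '&#x$1;');
}

const converted = percentUToEntities('Now%20Online%20%u03B4');
console.log(converted); // Now%20Online%20&#x03B4;
```

The remaining %20 sequences would still need normal URL decoding afterwards.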

  14. #14
    Dan Grossman
    Hrm.. it seems to be handling it ok.. feel free to smack me if I'm wrong

    Am I correct that the Greek characters are outside the 8859-1 range but are within UTF-8?

    I added one (copied a Greek delta from a Unicode table) onto the end of a page title and tested. This is what comes through $_GET after encodeURIComponent:

    W3Counter Blog &#194;&#187; Blog Archive &#194;&#187; W3Counter 4 Now Online &#206;&#180;

    Garbage. Here's what I get after simply calling htmlentities with utf-8 encoding parameter directly on the same exact $_GET string:

    W3Counter Blog &raquo; Blog Archive &raquo; W3Counter 4 Now Online &delta;

    Perfect. No loss. Stores in a latin1 table without losing the characters since the HTML entity coding will fall within latin1.

    Am I missing the problem?
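
One possible reading of the "garbage" string above, offered as an assumption rather than a confirmed diagnosis: the numbers 194, 187 and 206, 180 are exactly the UTF-8 bytes of » and δ, each shown as if it were a character of its own. The sketch below reproduces those numbers with TextEncoder (available in modern browsers and Node.js); the variable names are only illustrative.

```javascript
// UTF-8 bytes of "» δ", then each non-ASCII byte written as its own
// decimal character reference, mimicking a byte-by-byte display.
const utf8Bytes = Array.from(new TextEncoder().encode('\u00BB \u03B4'));
const byteWiseEntities = utf8Bytes
  .map((b) => (b > 127 ? '&#' + b + ';' : String.fromCharCode(b)))
  .join('');

console.log(utf8Bytes);        // [ 194, 187, 32, 206, 180 ]
console.log(byteWiseEntities); // &#194;&#187; &#206;&#180;
```

If that reading is right, the bytes received were valid UTF-8 all along, which would explain why telling htmlentities the input is UTF-8 gives the expected &raquo; and &delta;.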

    Off Topic:

    BTW, about manipulating/sorting... this is all just display. It's about making reports prettier by using the webpage title when linking to a page instead of just the bare URL.

  15. #15
    kyberfabrikken
    Ah - Looks like PHP is smart enough to decode unicode to a UTF-8 encoded string, for those characters outside the ISO-8859-1 range. If that's the case, it should work fine if you pass it through htmlentities, and tell that you're feeding it UTF-8. I guess I was just jumping to conclusions -- Sorry for the confusion there.

  16. #16
    Dan Grossman
    Off Topic:

    Man, Google is fast. This thread is now #2 on Google for "javascript internal unicode". Now I'm kinda happy Google sees member signatures again.

