-
May 21, 2007, 11:52 #1
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
What character encoding will JS use?
I use JavaScript to grab the title of the webpage a script runs on and send it to a server for recording by outputting an image with the webpage name in the URL (and the image is the remote script that records it).
Something like this:
Code (JavaScript):
document.write('<img src="http://www.example.com/image.php?title=' + escape(document.title) + '" />');
Do I know anything about the encoding of that text when I get it? Since every page the script resides on can have a different encoding, is there any way to reliably record the webpage title without bungling characters outside the US-ASCII range?
Try Improvely, your online marketing dashboard.
→ Conversion tracking, click fraud detection, A/B testing and more
-
May 21, 2007, 13:44 #2
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
Do you need to use document.write()?
I'm not 100% sure what happens when you use document.write(), but otherwise JavaScript uses Unicode internally (I think that goes for all browsers), so I think it would use UTF-8 for non-ASCII characters.
-
May 21, 2007, 13:52 #3
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
-
May 21, 2007, 14:07 #4
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
Come to think of it, I don't think document.write() matters at all. Everything in JavaScript's internal memory is Unicode, and escape() should return the Unicode code point for characters. Characters beyond the range of %FF will be encoded as %uXXXX, where XXXX is the Unicode code point in hexadecimal.
There may be issues if the document doesn't specify its encoding (i.e. the browser is forced to guess which encoding the document is in), but that would also be a problem for regular use.
Whether your server side language is able to grok %uXXXX is another story.
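To make the %XX versus %uXXXX split concrete, here is a minimal sketch. The sample characters are picked purely for illustration, and it assumes an environment where the legacy global escape() still exists (browsers and Node.js both provide it).

```javascript
// escape() is a legacy global function; it is used here only to show its output format.
const escapeSamples = ["A", "é", "δ"]; // U+0041, U+00E9, U+03B4

// Plain ASCII letters pass through, characters up to U+00FF become %XX,
// and anything above that becomes the non-standard %uXXXX form.
const escapeResults = escapeSamples.map((ch) => escape(ch));

console.log(escapeResults.join(" ")); // A %E9 %u03B4
```

Note that %E9 is the code point of "é", not its UTF-8 byte sequence, so a server that expects UTF-8 percent-encoding will not read either form the way it was intended.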
-
May 21, 2007, 14:18 #5
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
That's good. I'll have to revisit all my code and see where things may go wrong in the process. I may also have some DB tables using latin1 collation instead of utf8 still.
Some good notes on handling Unicode escape()'d text from JavaScript on the PHP end, when it needs to be decoded: http://us2.php.net/urldecode
-
May 21, 2007, 14:35 #6
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
-
May 21, 2007, 15:36 #7
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
Sorry, you're right, I meant that I have some tables where the character set is latin1 (the server default when none was specified in the CREATE TABLE query).
Although... if I store the text while it is still URL-encoded by JavaScript's escape(), it's probably safe in a latin1 table, as long as I'm dealing with UTF-8 on the page displaying it.
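The reason the still-encoded text is safe in a latin1 column is that the escaped string contains nothing but ASCII. A small sketch to check that, using a made-up page title and a hypothetical helper name (isPureAscii):

```javascript
// Hypothetical helper: true when every character is in the US-ASCII range.
function isPureAscii(text) {
  return /^[\x00-\x7F]*$/.test(text);
}

// Made-up title containing a Latin-1 character (») and a Greek one (δ).
const rawTitle = "W3Counter Blog » Blog Archive δ";
const storedTitle = escape(rawTitle);

console.log(storedTitle);              // W3Counter%20Blog%20%BB%20Blog%20Archive%20%u03B4
console.log(isPureAscii(rawTitle));    // false
console.log(isPureAscii(storedTitle)); // true
```

The cost is that whatever reads the column later has to know how to decode both the %XX and the %uXXXX forms.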
-
May 21, 2007, 15:36 #8
- Join Date
- Sep 2005
- Location
- Sydney, NSW, Australia
- Posts
- 16,875
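escape() is deprecated because it does NOT handle Unicode in a standard way: anything above %FF comes out in the non-standard %uXXXX form, which is not valid URL encoding. The encodeURI() and encodeURIComponent() functions are the replacements that do support Unicode; they percent-encode characters as UTF-8 bytes.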
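A quick side-by-side, as a sketch rather than a full comparison: the same Greek delta run through both functions, plus a check of what the standard decoder makes of each result. The variable names are only for illustration.

```javascript
const greekDelta = "δ"; // U+03B4

const legacyForm = escape(greekDelta);               // non-standard %uXXXX form
const standardForm = encodeURIComponent(greekDelta); // UTF-8 bytes, percent-encoded

console.log(legacyForm);   // %u03B4
console.log(standardForm); // %CE%B4

// The standard decoder round-trips the UTF-8 form...
const roundTrip = decodeURIComponent(standardForm);

// ...but rejects the %uXXXX form, because "%u0" is not a valid %XX escape.
let legacyRejected = false;
try {
  decodeURIComponent(legacyForm);
} catch (err) {
  legacyRejected = err instanceof URIError;
}

console.log(roundTrip === greekDelta); // true
console.log(legacyRejected);           // true
```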
Stephen J Chapman
javascriptexample.net, Book Reviews, follow me on Twitter
HTML Help, CSS Help, JavaScript Help, PHP/mySQL Help, blog
<input name="html5" type="text" required pattern="^$">
-
May 21, 2007, 15:41 #9
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
Awesome tip. I'll be updating a bunch of code to use encodeURIComponent instead after your advice brought me here:
http://xkr.us/articles/javascript/encode-compare/
-
May 22, 2007, 00:09 #10
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
This seems to work pretty well:
- In the JS, check whether encodeURIComponent is available; if so, use it, otherwise fall back to escape
- On the receiving end, convert to HTML entities:
PHP Code:
htmlentities($string, ENT_NOQUOTES, 'UTF-8');
- On display, the page character set is UTF-8 and the entities render correctly
Seems to be working fine now, see any flaws?
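The JavaScript side of those steps could look roughly like this. It is a sketch, not the code actually deployed: the function names encodeTitle and trackingUrl are invented for the example, and the URL is the placeholder from the first post.

```javascript
// Prefer encodeURIComponent (UTF-8 percent-encoding); fall back to the
// legacy escape() only where encodeURIComponent does not exist.
function encodeTitle(title) {
  return typeof encodeURIComponent === "function"
    ? encodeURIComponent(title)
    : escape(title);
}

// Build the image URL that carries the page title to the recording script.
function trackingUrl(title) {
  return "http://www.example.com/image.php?title=" + encodeTitle(title);
}

console.log(trackingUrl("Now Online δ"));
// http://www.example.com/image.php?title=Now%20Online%20%CE%B4
```

In a page this would be called as trackingUrl(document.title). The fallback branch produces a different format (%uXXXX), so the server has to cope with both if very old browsers matter.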
-
May 22, 2007, 06:38 #11
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
Thanks - That's good information.
If you decode the string in PHP, it will be temporarily represented in the internal charset, before you call htmlentities. Since PHP uses ISO-8859-1 internally, you'd be messing up anything outside this range.
-
May 22, 2007, 10:24 #12
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
Gah, that's strange. And rude of PHP! It shouldn't convert strings before I use them.
Now I'm thinking about passing a reference so the string doesn't get converted before I can deal with it. Something like...
PHP Code:
call_user_func_array('mb_convert_encoding', array(&$_GET['string'], 'HTML-ENTITIES', 'UTF-8'));
Last edited by Dan Grossman; May 22, 2007 at 10:58.
-
May 22, 2007, 11:18 #13
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
The problem is that PHP automatically decodes the input (the URI-encoded string) to the internal charset. That's why you can access it through $_GET, and why you can treat the values as strings inside PHP. You'll need some way to go directly from the URI-encoded string to HTML entities, so that it is never decoded into a PHP string.
From what I can deduce from the PHP manual, urldecode doesn't recognise Unicode escapes (%uXXXX) at all, so I would guess that they simply pass through as-is.
Since the format %uXXXX holds the Unicode code point in hexadecimal, you can simply replace %uXXXX with &#xXXXX; (note the x, which marks the number as hexadecimal) to get the corresponding HTML entities. Of course, you'll still have the problem that you can't really manipulate these strings (or sort them properly), since they are encoded. But at least they'll be displayed correctly.
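The substitution described above is a one-line regular expression. Here it is sketched in JavaScript to keep the thread's examples in one language; the function name percentUToEntities is made up, and the same pattern should carry over to a server-side regex replace.

```javascript
// Turn each %uXXXX escape into a hexadecimal numeric character reference.
// Ordinary %XX escapes are left untouched for the normal URL decoder.
// Limitation: characters outside the Basic Multilingual Plane arrive from
// escape() as two surrogate %uXXXX escapes, which this sketch does not recombine.
function percentUToEntities(text) {
  return text.replace(/%u([0-9a-fA-F]{4})/g, function (match, hex) {
    return "&#x" + hex.toUpperCase() + ";";
  });
}

console.log(percentUToEntities("Now%20Online%20%u03B4"));
// Now%20Online%20&#x03B4;
```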
-
May 22, 2007, 11:40 #14
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
Hrm.. it seems to be handling it ok.. feel free to smack me if I'm wrong
Am I correct that the Greek characters are outside the 8859-1 range but are within UTF-8?
I added one (copied a Greek delta from a Unicode table) onto the end of a page title and tested. This is what comes through $_GET after encodeURIComponent:
W3Counter Blog Â» Blog Archive Â» W3Counter 4 Now Online Î´
Garbage. Here's what I get after simply calling htmlentities with utf-8 encoding parameter directly on the same exact $_GET string:
W3Counter Blog &raquo; Blog Archive &raquo; W3Counter 4 Now Online &delta;
Perfect. No loss. Stores in a latin1 table without losing the characters since the HTML entity coding will fall within latin1.
Am I missing the problem?
Off Topic:
BTW, about manipulating/sorting... this is all just display. It's about making reports prettier by using the webpage title when linking to a page instead of just the bare URL.
-
May 22, 2007, 11:57 #15
- Join Date
- Jun 2004
- Location
- Copenhagen, Denmark
- Posts
- 6,157
Ah - Looks like PHP is smart enough to decode Unicode to a UTF-8 encoded string for those characters outside the ISO-8859-1 range. If that's the case, it should work fine if you pass it through htmlentities and tell it that you're feeding it UTF-8. I guess I was just jumping to conclusions -- sorry for the confusion there.
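One likely explanation for the "garbage" is that encodeURIComponent sends the title as UTF-8 bytes, so the server ends up holding those bytes, and they only look wrong when something reads them as ISO-8859-1. The sketch below reproduces that in JavaScript, assuming a runtime with the standard TextEncoder and TextDecoder globals (modern browsers and Node.js); the variable names are illustrative.

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of the character.
const wireForm = encodeURIComponent("δ");
console.log(wireForm); // %CE%B4

// The same two bytes, as the server would hold them after URL-decoding.
const deltaBytes = new TextEncoder().encode("δ"); // bytes 0xCE, 0xB4

// Read as ISO-8859-1 (one byte = one character): the "garbage" view.
const readAsLatin1 = Array.from(deltaBytes, function (byte) {
  return String.fromCharCode(byte);
}).join("");
console.log(readAsLatin1); // Î´

// Read as UTF-8: the original character, with nothing lost.
const readAsUtf8 = new TextDecoder("utf-8").decode(deltaBytes);
console.log(readAsUtf8); // δ
```

That would be why telling htmlentities the input is UTF-8 gives a clean result: the bytes were intact all along and only the interpretation was off.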
-
May 22, 2007, 13:39 #16
- Join Date
- Aug 2000
- Location
- Philadelphia, PA
- Posts
- 20,578
Off Topic:
Man, Google is fast. This thread is now #2 on Google for "javascript internal unicode". Now I'm kinda happy Google sees member signatures again.