<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Character Encoding: Issues with Cultural Integration</title>
	<atom:link href="http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/</link>
	<description>News, opinion, and fresh thinking for web developers and designers. The official podcast of sitepoint.com.</description>
	<lastBuildDate>Sat, 07 Nov 2009 23:35:20 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Troels Knak-Nielsen</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-795389</link>
		<dc:creator>Troels Knak-Nielsen</dc:creator>
		<pubDate>Tue, 16 Sep 2008 07:59:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-795389</guid>
		<description>And that&#039;s why, in the following paragraph, I say: &quot;So in these places, we would have to explicitly access the &#039;raw&#039; string, through an alternate mechanism.&quot;.

Perhaps I wasn&#039;t clear about what this meant, so let me illustrate with a concrete example. Assume that your legacy, latin1-only application uses the superglobals (e.g. $_GET, $_POST, etc.) directly. To ensure BC, you would have to put something like this at the top of your script:


$_POST = array_map(&#039;utf8_decode&#039;, $_POST);


Obviously the above implementation is naïve, but I hope it conveys my point.

Now, since we&#039;re adding a piece of UTF-8 aware code to the application, that code needs to be able to retrieve the raw, un-decoded input in those places. So let&#039;s add this:


$GLOBALS[&#039;POST_UTF8&#039;] = $_POST;
$_POST = array_map(&#039;utf8_decode&#039;, $_POST);


Now the UTF-8 aware code can use $GLOBALS[&#039;POST_UTF8&#039;], while the legacy code keeps full BC (since $_POST will only contain latin1).</description>
		<content:encoded><![CDATA[<p>And that&#8217;s why, in the following paragraph, I say: &#8220;So in these places, we would have to explicitly access the &#8216;raw&#8217; string, through an alternate mechanism.&#8221;.</p>
<p>Perhaps I wasn&#8217;t clear about what this meant, so let me illustrate with a concrete example. Assume that your legacy, latin1-only application uses the superglobals (e.g. $_GET, $_POST, etc.) directly. To ensure BC, you would have to put something like this at the top of your script:</p>
<p>$_POST = array_map('utf8_decode', $_POST);</p>
<p>Obviously the above implementation is naïve, but I hope it conveys my point.</p>
<p>Now, since we&#8217;re adding a piece of UTF-8 aware code to the application, that code needs to be able to retrieve the raw, un-decoded input in those places. So let&#8217;s add this:</p>
<p>$GLOBALS['POST_UTF8'] = $_POST;<br />
$_POST = array_map('utf8_decode', $_POST);</p>
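<p>As a less naïve sketch of the same idea: array_map with utf8_decode does not cope with a form field that is itself an array (for example a field named tags[]), so the conversion has to recurse. The helper name decode_utf8_to_latin1 below is hypothetical, and mb_convert_encoding (which assumes the mbstring extension is available) stands in for utf8_decode purely for illustration.</p>

```php
<?php
// Hypothetical helper (not from the original comment): recursively convert
// every string in a request array from UTF-8 to Latin-1.
function decode_utf8_to_latin1(array $input): array
{
    $output = [];
    foreach ($input as $key => $value) {
        if (is_array($value)) {
            // Nested form fields such as tags[] arrive as arrays.
            $output[$key] = decode_utf8_to_latin1($value);
        } elseif (is_string($value)) {
            $output[$key] = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
        } else {
            $output[$key] = $value;
        }
    }
    return $output;
}

// Keep the raw UTF-8 input for the UTF-8 aware code, then decode for legacy code.
$GLOBALS['POST_UTF8'] = $_POST;
$_POST = decode_utf8_to_latin1($_POST);
```

<p>Characters with no Latin-1 equivalent cannot survive the conversion, which is why the untouched copy is kept in $GLOBALS['POST_UTF8'].</p>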
<p>Now the UTF-8 aware code can use $GLOBALS['POST_UTF8'], while the legacy code keeps full BC (since $_POST will only contain latin1).</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Tom</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-795166</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Mon, 15 Sep 2008 12:16:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-795166</guid>
		<description>&quot;So all input must be decoded from utf-8 to latin1.&quot; - That means you are not able to accept input in other charsets.</description>
		<content:encoded><![CDATA[<p>&#8220;So all input must be decoded from utf-8 to latin1.&#8221; &#8211; That means you are not able to accept input in other charsets.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Troels Knak-Nielsen</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-794096</link>
		<dc:creator>Troels Knak-Nielsen</dc:creator>
		<pubDate>Thu, 11 Sep 2008 09:23:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-794096</guid>
		<description>&lt;blockquote&gt;Converting UTF-8 to Latin-1 means that you lose all chars not in Latin-1. That way you’re not able to handle Cyrillic, Greek, Chinese or … chars.&lt;/blockquote&gt;
I&#039;m not sure whether this comment was meant for me, but if so, I think you missed the whole point. The idea of embedding utf-8 encoded strings within a sea of latin1 is precisely to preserve the full range of Unicode that these strings have. There is no conversion to latin1 in this recipe.</description>
		<content:encoded><![CDATA[<blockquote><p>Converting UTF-8 to Latin-1 means that you lose all chars not in Latin-1. That way you’re not able to handle Cyrillic, Greek, Chinese or … chars.</p></blockquote>
<p>I&#8217;m not sure whether this comment was meant for me, but if so, I think you missed the whole point. The idea of embedding utf-8 encoded strings within a sea of latin1 is precisely to preserve the full range of Unicode that these strings have. There is no conversion to latin1 in this recipe.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Troels Knak-Nielsen</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-794082</link>
		<dc:creator>Troels Knak-Nielsen</dc:creator>
		<pubDate>Thu, 11 Sep 2008 08:46:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-794082</guid>
		<description>&lt;blockquote&gt;
By using accept-charset=&quot;utf-8&quot; you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.
&lt;/blockquote&gt;
Yes. In this case it&#039;s redundant, though, since browsers are also pretty consistent about sending data back in the same encoding they received. It wouldn&#039;t hurt, and it&#039;s better to be safe than sorry.

&lt;blockquote&gt;
You should just ensure you use the same encoding throughout your application (mind the backend connections) and there won’t be any real problems.
&lt;/blockquote&gt;
You&#039;re absolutely right. The premise, however, is that I have a legacy application in latin1 and now need to use utf-8. Porting the entire application to utf-8 is a major undertaking, so it&#039;s out of the question. I suspect that a lot of people are in a similar situation. What I described is a technique for coping with an imperfect world.

&lt;blockquote&gt;
You should not try to parse such potentially recursive markup with regular expressions - and even using it in examples encourages people to follow this example.
&lt;/blockquote&gt;
Good point. If someone actually uses the delimiter as part of the data, the parser would choke on it. This risk can be reduced by choosing a more distinctive delimiter, but it can never be eliminated completely. With a sufficiently distinctive delimiter, the risk is very low, so I think I&#039;ll be bold and brush this off as an academic issue; thanks for pointing it out, though. If the problem does arise, there is a single place in the application where the delimiter can be changed to something better than &lt;code&gt;charset:utf8&lt;/code&gt;, which - arguably - isn&#039;t very distinctive.</description>
		<content:encoded><![CDATA[<blockquote><p>
By using accept-charset="utf-8" you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.
</p></blockquote>
<p>Yes. In this case it&#8217;s redundant, though, since browsers are also pretty consistent about sending data back in the same encoding they received. It wouldn&#8217;t hurt, and it&#8217;s better to be safe than sorry.</p>
<blockquote><p>
You should just ensure you use the same encoding throughout your application (mind the backend connections) and there won’t be any real problems.
</p></blockquote>
<p>You&#8217;re absolutely right. The premise, however, is that I have a legacy application in latin1 and now need to use utf-8. Porting the entire application to utf-8 is a major undertaking, so it&#8217;s out of the question. I suspect that a lot of people are in a similar situation. What I described is a technique for coping with an imperfect world.</p>
<blockquote><p>
You should not try to parse such potentially recursive markup with regular expressions &#8211; and even using it in examples encourages people to follow this example.
</p></blockquote>
<p>Good point. If someone actually uses the delimiter as part of the data, the parser would choke on it. This risk can be reduced by choosing a more distinctive delimiter, but it can never be eliminated completely. With a sufficiently distinctive delimiter, the risk is very low, so I think I&#8217;ll be bold and brush this off as an academic issue; thanks for pointing it out, though. If the problem does arise, there is a single place in the application where the delimiter can be changed to something better than <code>charset:utf8</code>, which &#8211; arguably &#8211; isn&#8217;t very distinctive.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Tom</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-794055</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Thu, 11 Sep 2008 07:41:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-794055</guid>
		<description>Converting UTF-8 to Latin-1 means that you lose all chars not in Latin-1. That way you&#039;re not able to handle Cyrillic, Greek, Chinese or ... chars.

I suggest using UTF-8 internally. If you&#039;re fetching data from old parts of the application that still use Latin-1, you can convert it to UTF-8 without losing information.

On input, you can check which charset was used and convert that to UTF-8, too.</description>
		<content:encoded><![CDATA[<p>Converting UTF-8 to Latin-1 means that you lose all chars not in Latin-1. That way you&#8217;re not able to handle Cyrillic, Greek, Chinese or &#8230; chars.</p>
<p>I suggest using UTF-8 internally. If you&#8217;re fetching data from old parts of the application that still use Latin-1, you can convert it to UTF-8 without losing information.</p>
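<p>A minimal sketch of that conversion, assuming the mbstring extension is available. The helper name legacy_to_utf8 is hypothetical, and treating every string that is not valid UTF-8 as Latin-1 is an assumption rather than real charset detection.</p>

```php
<?php
// Hypothetical helper, not from the original comment: return $bytes as UTF-8.
// Assumption: anything that is not already valid UTF-8 is Latin-1.
function legacy_to_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (pure ASCII also lands here)
    }
    // Every Latin-1 byte maps to exactly one Unicode code point, so nothing is lost.
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$fromLegacy  = legacy_to_utf8("na\xEFve");     // Latin-1 bytes for "naïve"
$alreadyUtf8 = legacy_to_utf8("na\xC3\xAFve"); // UTF-8 bytes for "naïve"
var_dump($fromLegacy === $alreadyUtf8);        // bool(true)
```

<p>One caveat with this heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be passed through unchanged, so it is safer to convert at the boundary where the source encoding is known.</p>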
<p>On input, you can check which charset was used and convert that to UTF-8, too.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: kore</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-794047</link>
		<dc:creator>kore</dc:creator>
		<pubDate>Thu, 11 Sep 2008 07:07:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-794047</guid>
		<description>AutisticCuckoo:

An encoding maps the characters of a specific character set to a sequence of bytes. Of course, strings in PHP (&lt;6) may contain text in any encoding and any character set. But no character set is associated with them, and all string functions in PHP (&lt;6) just operate on &lt;em&gt;bytes&lt;/em&gt; - which do not actually differ much from characters in single-byte encodings. There is *no* information about whether the bytes are Latin1, another ISO-8859-* encoding, or something similar.</description>
		<content:encoded><![CDATA[<p>AutisticCuckoo:</p>
<p>An encoding maps the characters of a specific character set to a sequence of bytes. Of course, strings in PHP (&lt;6) may contain text in any encoding and any character set. But no character set is associated with them, and all string functions in PHP (&lt;6) just operate on <em>bytes</em> &#8211; which do not actually differ much from characters in single-byte encodings. There is *no* information about whether the bytes are Latin1, another ISO-8859-* encoding, or something similar.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: AutisticCuckoo</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-794038</link>
		<dc:creator>AutisticCuckoo</dc:creator>
		<pubDate>Thu, 11 Sep 2008 06:39:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-794038</guid>
		<description>&lt;blockquote&gt;Strings in PHP do not have any charset or encoding information associated. They are just binary, like described here.&lt;/blockquote&gt;
LOL. The binary representation of characters is exactly what an encoding &lt;em&gt;is&lt;/em&gt;. The problem with PHP is that the string functions in the standard library assume one byte per character. There are multi-byte string functions available, but then you have to choose which encoding to use.

Java and JavaScript, on the other hand, internally use 16 bits to represent each character. That means they can at least handle the BMP (basic multilingual plane) in Unicode.</description>
		<content:encoded><![CDATA[<blockquote><p>Strings in PHP do not have any charset or encoding information associated. They are just binary, like described here.</p></blockquote>
<p>LOL. The binary representation of characters is exactly what an encoding <em>is</em>. The problem with PHP is that the string functions in the standard library assume one byte per character. There are multi-byte string functions available, but then you have to choose which encoding to use.</p>
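<p>To make the byte-versus-character distinction concrete, here is a small sketch comparing strlen with mb_strlen on the same bytes. It assumes the mbstring extension is loaded; the variable names are illustrative.</p>

```php
<?php
// "naïve" encoded as UTF-8: the ï takes two bytes (0xC3 0xAF).
$word = "na\xC3\xAFve";

$byteCount     = strlen($word);                  // 6: counts bytes
$charCount     = mb_strlen($word, 'UTF-8');      // 5: counts UTF-8 characters
$charsIfLatin1 = mb_strlen($word, 'ISO-8859-1'); // 6: same bytes read as Latin-1

printf("%d bytes, %d characters as UTF-8, %d as Latin-1\n",
    $byteCount, $charCount, $charsIfLatin1);
```

<p>The string itself never changes; only the encoding passed to the multi-byte function decides how its bytes are grouped into characters.</p>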
<p>Java and JavaScript, on the other hand, internally use 16 bits to represent each character. That means they can at least handle the BMP (basic multilingual plane) in Unicode.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: kore</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-793844</link>
		<dc:creator>kore</dc:creator>
		<pubDate>Wed, 10 Sep 2008 20:19:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-793844</guid>
		<description>Hi,

a) Strings in PHP do not have &lt;em&gt;any&lt;/em&gt; charset or encoding information associated. They are just binary, like described &lt;a href=&quot;http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php&quot; rel=&quot;nofollow&quot;&gt;here.&lt;/a&gt;

b) Why do you want to convert to Latin1 anyway? It might only be relevant if you need to process strings character-wise, which should not be necessary in &quot;normal&quot; applications. If you need to do so, take a look &lt;a href=&quot;http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-iterate-characterwise-over-a-string&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;.

c) For charset and encoding conversions you might want to use the iconv() function. It not only handles more encodings, but can also handle character set incompatibilities between encodings (transliteration, ignore).

d) The charsets/encodings browsers send can be influenced either by the encoding information given in the Content-Type headers (HTTP, HTML meta tags) or by the form attribute already mentioned in another comment. Not to mention that any client may, of course, send garbage, which needs to be sanitized.

e) You should not try to parse such potentially recursive markup with regular expressions - and even using it in examples encourages people to follow this example. This &lt;a href=&quot;http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html&quot; rel=&quot;nofollow&quot;&gt;will never work&lt;/a&gt;.

You should just ensure you use the same encoding throughout your application (mind the backend connections) and there won&#039;t be any real problems.</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>a) Strings in PHP do not have <em>any</em> charset or encoding information associated. They are just binary, like described <a href="http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php" rel="nofollow">here.</a></p>
<p>b) Why do you want to convert to Latin1 anyway? It might only be relevant if you need to process strings character-wise, which should not be necessary in &#8220;normal&#8221; applications. If you need to do so, take a look <a href="http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#how-do-i-iterate-characterwise-over-a-string" rel="nofollow">here</a>.</p>
<p>c) For charset and encoding conversions you might want to use the iconv() function. It not only handles more encodings, but can also handle character set incompatibilities between encodings (transliteration, ignore).</p>
<p>d) The charsets/encodings browsers send can be influenced either by the encoding information given in the Content-Type headers (HTTP, HTML meta tags) or by the form attribute already mentioned in another comment. Not to mention that any client may, of course, send garbage, which needs to be sanitized.</p>
<p>e) You should not try to parse such potentially recursive markup with regular expressions &#8211; and even using it in examples encourages people to follow this example. This <a href="http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html" rel="nofollow">will never work</a>.</p>
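<p>As a rough sketch of point c), the //TRANSLIT and //IGNORE suffixes of iconv() can be used as follows. The sample string is made up for illustration, and the exact result depends on the iconv implementation PHP was built against, so no output is shown here; the return value should always be checked for false.</p>

```php
<?php
// "Café €5" in UTF-8; the euro sign has no Latin-1 equivalent.
$utf8 = "Caf\xC3\xA9 \xE2\x82\xAC5";

// //TRANSLIT asks iconv to approximate characters the target charset lacks.
$translit = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);

// //IGNORE asks iconv to drop such characters instead.
$ignored = iconv('UTF-8', 'ISO-8859-1//IGNORE', $utf8);

// Either call may return false (with a notice) on some platforms, so check.
foreach (['TRANSLIT' => $translit, 'IGNORE' => $ignored] as $mode => $result) {
    echo $mode, ': ', $result === false ? 'conversion failed' : bin2hex($result), PHP_EOL;
}
```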
<p>You should just ensure you use the same encoding throughout your application (mind the backend connections) and there won&#8217;t be any real problems.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: AutisticCuckoo</title>
		<link>http://www.sitepoint.com/blogs/2008/09/10/issues-with-cultural-integration/comment-page-1/#comment-793686</link>
		<dc:creator>AutisticCuckoo</dc:creator>
		<pubDate>Wed, 10 Sep 2008 13:02:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=2969#comment-793686</guid>
		<description>&lt;blockquote&gt;You may or may not know this, but when submitting a form, browsers send back data in the same encoding as the page was served.&lt;/blockquote&gt;
This is the default behaviour if the &lt;code&gt;accept-charset&lt;/code&gt; attribute is omitted from the &lt;code&gt;&lt;form&gt;&lt;/code&gt; tag. By using &lt;code&gt;accept-charset=&quot;utf-8&quot;&lt;/code&gt; you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.</description>
		<content:encoded><![CDATA[<blockquote><p>You may or may not know this, but when submitting a form, browsers send back data in the same encoding as the page was served.</p></blockquote>
<p>This is the default behaviour if the <code>accept-charset</code> attribute is omitted from the <code>&lt;form&gt;</code> tag. By using <code>accept-charset="utf-8"</code> you can instruct browsers to send the data encoded with UTF-8. As far as I know, browser support for this is quite decent.</p>]]></content:encoded>
	</item>
</channel>
</rss>
