<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: UTF-8 Survival at webtuesday.ch</title>
	<atom:link href="http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/</link>
	<description></description>
	<pubDate>Sun, 07 Sep 2008 02:55:54 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: Nerd Fish &#187; Blog Archive &#187; UTF-8 Email in PHP with eZ Components</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-45726</link>
		<dc:creator>Nerd Fish &#187; Blog Archive &#187; UTF-8 Email in PHP with eZ Components</dc:creator>
		<pubDate>Tue, 15 Aug 2006 04:55:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-45726</guid>
		<description>[...] One of the subjects I brushed over last week was how you handle UTF-8 in email, because I don&#8217;t have a full picture on the best way to solve this. The fundamental problem is summarized nicely on Wikipedia&#8217;s discussion of MIME;  The basic Internet e-mail transmission protocol, SMTP, supports only 7-bit ASCII characters [&#8230;]. This effectively limits Internet e-mail to messages which, when transmitted, include only the characters sufficient for writing a small number of languages, primarily English. Other languages based on the Latin alphabet typically include diacritics not supported in 7-bit ASCII, meaning text in these languages cannot be correctly represented in basic e-mail. [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] One of the subjects I brushed over last week was how you handle UTF-8 in email, because I don&#8217;t have a full picture on the best way to solve this. The fundamental problem is summarized nicely on Wikipedia&#8217;s discussion of MIME;  The basic Internet e-mail transmission protocol, SMTP, supports only 7-bit ASCII characters [&#8230;]. This effectively limits Internet e-mail to messages which, when transmitted, include only the characters sufficient for writing a small number of languages, primarily English. Other languages based on the Latin alphabet typically include diacritics not supported in 7-bit ASCII, meaning text in these languages cannot be correctly represented in basic e-mail. [&#8230;]</p>]]></content:encoded>
	</item>
	<item>
		<title>By: SitePoint Blogs &#187; UTF-8 Email in PHP with eZ Components</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-45616</link>
		<dc:creator>SitePoint Blogs &#187; UTF-8 Email in PHP with eZ Components</dc:creator>
		<pubDate>Mon, 14 Aug 2006 14:06:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-45616</guid>
		<description>[...] One of the subjects I brushed over last week was how you handle UTF-8 in email, because I don&#8217;t have a full picture on the best way to solve this. The fundamental problem is summarized nicely on Wikipedia&#8217;s discussion of MIME;  The basic Internet e-mail transmission protocol, SMTP, supports only 7-bit ASCII characters [&#8230;]. This effectively limits Internet e-mail to messages which, when transmitted, include only the characters sufficient for writing a small number of languages, primarily English. Other languages based on the Latin alphabet typically include diacritics not supported in 7-bit ASCII, meaning text in these languages cannot be correctly represented in basic e-mail. [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] One of the subjects I brushed over last week was how you handle UTF-8 in email, because I don&#8217;t have a full picture on the best way to solve this. The fundamental problem is summarized nicely on Wikipedia&#8217;s discussion of MIME;  The basic Internet e-mail transmission protocol, SMTP, supports only 7-bit ASCII characters [&#8230;]. This effectively limits Internet e-mail to messages which, when transmitted, include only the characters sufficient for writing a small number of languages, primarily English. Other languages based on the Latin alphabet typically include diacritics not supported in 7-bit ASCII, meaning text in these languages cannot be correctly represented in basic e-mail. [&#8230;]</p>]]></content:encoded>
	</item>
	<item>
		<title>By: HarryF</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-44009</link>
		<dc:creator>HarryF</dc:creator>
		<pubDate>Wed, 09 Aug 2006 13:08:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-44009</guid>
		<description>&lt;blockquote&gt;
Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera &#38; Firefox). With it, the browser explicitly tells the server which charset was used to encode the form payload.
&lt;/blockquote&gt;

That's a good point (in fact I did mention it in the talk but it's not in the slides) - to date that's something I haven't played with first hand, just read about. What also interests me is the full story on the conditions (if any) under which browsers would ignore the form's accept-charset="utf-8" attribute.</description>
		<content:encoded><![CDATA[<blockquote><p>
Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera &amp; Firefox). With it, the browser explicitly tells the server which charset was used to encode the form payload.
</p></blockquote>
<p>That&#8217;s a good point (in fact I did mention it in the talk but it&#8217;s not in the slides) - to date that&#8217;s something I haven&#8217;t played with first hand, just read about. What also interests me is the full story on the conditions (if any) under which browsers would ignore the form&#8217;s accept-charset="utf-8" attribute.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Ren</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-44004</link>
		<dc:creator>Ren</dc:creator>
		<pubDate>Wed, 09 Aug 2006 12:49:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-44004</guid>
		<description>Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera &#38; Firefox). With it, the browser explicitly tells the server which charset was used to encode the form payload.

https://bugzilla.mozilla.org/show_bug.cgi?id=18643
http://whatwg.org/specs/web-forms/current-work/#the-charset</description>
		<content:encoded><![CDATA[<p>Just reading the presentation, it seems to be missing information on _charset_, which most recent browsers now support (IE, Opera &amp; Firefox). With it, the browser explicitly tells the server which charset was used to encode the form payload.</p>
<p><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=18643" rel="nofollow">https://bugzilla.mozilla.org/show_bug.cgi?id=18643</a><br />
<a href="http://whatwg.org/specs/web-forms/current-work/#the-charset" rel="nofollow">http://whatwg.org/specs/web-forms/current-work/#the-charset</a></p>]]></content:encoded>
	</item>
	<item>
		<title>By: SitePoint Blogs &#187; Scripters UTF-8 Survival Guide (slides)</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43860</link>
		<dc:creator>SitePoint Blogs &#187; Scripters UTF-8 Survival Guide (slides)</dc:creator>
		<pubDate>Tue, 08 Aug 2006 23:30:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43860</guid>
		<description>[...] Blog Post   Blogs &#187;  PHP &#187; Scripters UTF-8 Survival Guide (slides)   &#171; UTF-8 Survival at webtuesday.ch   &#160; [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Blog Post   Blogs &#187;  PHP &#187; Scripters UTF-8 Survival Guide (slides)   &laquo; UTF-8 Survival at webtuesday.ch   &nbsp; [&#8230;]</p>]]></content:encoded>
	</item>
	<item>
		<title>By: silvanm</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43848</link>
		<dc:creator>silvanm</dc:creator>
		<pubDate>Tue, 08 Aug 2006 22:39:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43848</guid>
		<description>Great talk. Thanks, Harry.</description>
		<content:encoded><![CDATA[<p>Great talk. Thanks, Harry.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Jason Batten</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43562</link>
		<dc:creator>Jason Batten</dc:creator>
		<pubDate>Tue, 08 Aug 2006 14:03:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43562</guid>
		<description>&lt;a href="http://www.amazon.com/gp/product/0131867164/104-6815758-1867124?v=glance&#38;n=283155" rel="nofollow"&gt;Core Web Application Development with PHP and MySQL&lt;/a&gt; by Marc Wandschneider has a great chapter called "Strings and Characters of the World". The whole book is great, as it focuses on development using UTF-8.</description>
		<content:encoded><![CDATA[<p><a href="http://www.amazon.com/gp/product/0131867164/104-6815758-1867124?v=glance&amp;n=283155" rel="nofollow">Core Web Application Development with PHP and MySQL</a> by Marc Wandschneider has a great chapter called &#8220;Strings and Characters of the World&#8221;. The whole book is great, as it focuses on development using UTF-8.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: HarryF</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43558</link>
		<dc:creator>HarryF</dc:creator>
		<pubDate>Tue, 08 Aug 2006 13:41:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43558</guid>
		<description>&lt;blockquote&gt;
I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?
&lt;/blockquote&gt;

Initial experiments suggest not. It is tempting to consider an alternative, stripped down C implementation though.</description>
		<content:encoded><![CDATA[<blockquote><p>
I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?
</p></blockquote>
<p>Initial experiments suggest not. It is tempting to consider an alternative, stripped down C implementation though.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: HarryF</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43555</link>
		<dc:creator>HarryF</dc:creator>
		<pubDate>Tue, 08 Aug 2006 13:18:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43555</guid>
		<description>&lt;blockquote&gt;
You might need to use numeric Unicode character entities instead of the actual UTF-8 characters in HTML e-mails, though: some web-based e-mail clients will display the e-mail embedded within an ISO-8859-1 web page (ignoring your e-mail’s character set). In this environment, Unicode characters will look broken, but numeric character entities seem to work fine.
&lt;/blockquote&gt;

Have yet to entirely figure out what the perfect-world solution is for when you've got UTF-8 and want to use it in an email. Aside from HTML email, when you want to place UTF-8 in the subject / body of a text email, do you use base64 or quoted-printable, combined with the right MIME headers? Or perhaps convert to UTF-7 (for mail servers that support only 7-bit encoding)? Will be cunningly skipping over that tonight ;)

&lt;blockquote&gt;
What about “less than” and “more than” chars? So I think you generally still need to do some kind of encoding.
&lt;/blockquote&gt;

You're right - should have been more explicit with wording - for the "special five" that are part of XML / HTML markup, you still need htmlspecialchars() - mentioned &lt;a href="http://www.phpwact.org/php/i18n/charsets#entities" rel="nofollow"&gt;here&lt;/a&gt;.

Actually that brings up an interesting side note - was browsing the &lt;a href="http://cvs.php.net/viewvc.cgi/php-src/ext/standard/html.c?view=markup" rel="nofollow"&gt;PHP source&lt;/a&gt; that implements htmlspecialchars() and htmlentities(), trying to figure out whether htmlspecialchars() would really be OK with UTF-8 without explicitly declaring the charset.

In short, both functions are wrappers around the same underlying code and there's a ton of stuff happening here (hash table lookups, locale checks etc. etc.).

Given that htmlspecialchars() is a function that tends to get used a lot and that it's offering pretty simple functionality, I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?</description>
		<content:encoded><![CDATA[<blockquote><p>
You might need to use numeric Unicode character entities instead of the actual UTF-8 characters in HTML e-mails, though: some web-based e-mail clients will display the e-mail embedded within an ISO-8859-1 web page (ignoring your e-mail’s character set). In this environment, Unicode characters will look broken, but numeric character entities seem to work fine.
</p></blockquote>
<p>Have yet to entirely figure out what the perfect-world solution is for when you&#8217;ve got UTF-8 and want to use it in an email. Aside from HTML email, when you want to place UTF-8 in the subject / body of a text email, do you use base64 or quoted-printable, combined with the right MIME headers? Or perhaps convert to UTF-7 (for mail servers that support only 7-bit encoding)? Will be cunningly skipping over that tonight ;)</p>
<blockquote><p>
What about “less than” and “more than” chars? So I think you generally still need to do some kind of encoding.
</p></blockquote>
<p>You&#8217;re right - should have been more explicit with wording - for the &#8220;special five&#8221; that are part of XML / HTML markup, you still need htmlspecialchars() - mentioned <a href="http://www.phpwact.org/php/i18n/charsets#entities" rel="nofollow">here</a>.</p>
<p>Actually that brings up an interesting side note - was browsing the <a href="http://cvs.php.net/viewvc.cgi/php-src/ext/standard/html.c?view=markup" rel="nofollow">PHP source</a> that implements htmlspecialchars() and htmlentities(), trying to figure out whether htmlspecialchars() would really be OK with UTF-8 without explicitly declaring the charset.</p>
<p>In short, both functions are wrappers around the same underlying code and there&#8217;s a ton of stuff happening here (hash table lookups, locale checks etc. etc.).</p>
<p>Given that htmlspecialchars() is a function that tends to get used a lot and that it&#8217;s offering pretty simple functionality, I wonder what the performance overhead is here, and whether it could be improved on by a userland PHP function?</p>]]></content:encoded>
	</item>
	<item>
		<title>By: R. U. Serious</title>
		<link>http://www.sitepoint.com/blogs/2006/08/08/utf-8-survival-at-webtuesdaych/#comment-43551</link>
		<dc:creator>R. U. Serious</dc:creator>
		<pubDate>Tue, 08 Aug 2006 12:48:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.sitepoint.com/blogs/?p=1671#comment-43551</guid>
		<description>You need to _escape_ three characters: ampersand, less-than and greater-than. This has nothing to do with encoding, Unicode or character sets.

I agree with Harry. The only problem I've encountered is half-knowledge among a few users of the software, who were conditioned to think that if non-ASCII is not represented with HTML entities then it must be broken (this double negative actually made sense).

The "vi" problem is easily fixed by using an up-to-date distribution with Unicode support. ;)</description>
		<content:encoded><![CDATA[<p>You need to _escape_ three characters: ampersand, less-than and greater-than. This has nothing to do with encoding, Unicode or character sets.</p>
<p>I agree with Harry. The only problem I&#8217;ve encountered is half-knowledge among a few users of the software, who were conditioned to think that if non-ASCII is not represented with HTML entities then it must be broken (this double negative actually made sense).</p>
<p>The &#8220;vi&#8221; problem is easily fixed by using an up-to-date distribution with Unicode support. ;)</p>]]></content:encoded>
	</item>
</channel>
</rss>
