Scripters UTF-8 Survival Guide (slides)


Following on from the earlier post, the presentation slides are now available as a PDF. A related recent discovery is TCPDF: basically a fork of FPDF (a pure-PHP PDF creator), but one that comes bundled with fonts that can handle a significant chunk of Unicode.

Anyway – many thanks to the local.ch team for hosting us.

Written By:

Harry Fuecks

Harry has been working in corporate IT since 1994, with everything from start-ups to Fortune 100 companies. Outside of office hours he runs phpPatterns: a site dedicated to software design with PHP that aims to raise standards of PHP development. He also maintains Dynamically Typed: SitePoint's PHP blog.

 

{ 9 comments }


daniel August 12, 2006 at 2:44 am

Putting this in your .htaccess file should fix UTF-8 errors with funny characters and get UTF-8 displaying properly:

php_value output_buffering on
php_value output_handler mb_output_handler
php_value mbstring.http_output UTF-8
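
Where .htaccess overrides are not available, roughly the same effect can be requested from inside a script. This is only a sketch: the helper name start_utf8_output() is made up for illustration, it assumes the mbstring extension is loaded, and it has to run before any output is sent.

```php
<?php
// Hypothetical helper: ask mbstring to handle output encoding at runtime.
function start_utf8_output(): bool
{
    if (!function_exists('mb_output_handler')) {
        return false; // mbstring is not loaded, nothing to do
    }
    // Runtime counterpart of "mbstring.http_output UTF-8".
    mb_http_output('UTF-8');
    // Runtime counterpart of "output_buffering" plus "output_handler".
    return ob_start('mb_output_handler');
}
```

Calling start_utf8_output() at the top of a front controller or a shared include would be the obvious place for it.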

HarryF August 10, 2006 at 12:29 am

OK – I’m blind ;)

Sorccu August 9, 2006 at 11:44 pm

Alright! That badly needs documenting in fact although now you mention it ..

http://www.php.net/manual/en/function.iconv.php

If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can’t be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character.
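
A small sketch for trying out the three behaviours the manual describes, converting UTF-8 to ASCII so that some characters cannot be represented. The helper name compare_iconv_modes() is made up for illustration, and what //IGNORE and //TRANSLIT actually return depends on the iconv implementation and locale PHP was built against, so no expected output is shown.

```php
<?php
// Hypothetical helper: run one UTF-8 string through the three iconv modes.
// Results differ between iconv implementations, so inspect them on your
// own system rather than relying on a particular outcome.
function compare_iconv_modes(string $utf8): array
{
    return [
        // No suffix: conversion stops at the first character that cannot
        // be represented; PHP raises a notice (suppressed here with @).
        'plain'    => @iconv('UTF-8', 'ASCII', $utf8),
        // //IGNORE: unrepresentable characters are meant to be discarded.
        'ignore'   => @iconv('UTF-8', 'ASCII//IGNORE', $utf8),
        // //TRANSLIT: unrepresentable characters are approximated.
        'translit' => @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8),
    ];
}

// "Zürich" written with an escape so the file encoding does not matter.
var_dump(compare_iconv_modes("Z\u{00FC}rich"));
```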

HarryF August 9, 2006 at 10:57 pm

you can clean it with iconv the following way:

$t = iconv("UTF-8", "UTF-8//IGNORE", $t);

Alright! That badly needs documenting. In fact, now you mention it, it’s documented here: http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html (i.e. $ man iconv_open ). Interesting – need to try that //TRANSLIT flag …

chregu August 9, 2006 at 9:49 pm

you can clean it with iconv the following way:

$t = iconv("UTF-8", "UTF-8//IGNORE", $t);

From http://blog.bitflux.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html

:)

HarryF August 9, 2006 at 6:48 pm

Think Patrice’s tip on UTF-8 validation needs repeating – nice “hack” I hadn’t thought of.

If you want to make sure incoming UTF-8 is valid UTF-8, use iconv to convert it from UTF-8 to UTF-8. You can also potentially use iconv to clean the input.

PHP’s iconv extension raises an error notice if the input is invalid and returns only the portion of the input up to the first invalid (non UTF-8) byte it finds. Sadly there doesn’t seem to be a way to put it into “cleaning” mode, so it can only be used for validation. An example:


// Strict comparison: iconv() gives back false or a truncated string on bad input
if ( $input !== @iconv("UTF-8", "UTF-8", $input) ) {
    die("Bad utf-8\n");
}
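
The round-trip check above can be wrapped in a small reusable function. A minimal sketch; the name is_valid_utf8() is made up for illustration and it assumes the iconv extension is available.

```php
<?php
// Hypothetical helper wrapping the UTF-8 to UTF-8 round trip.
function is_valid_utf8(string $input): bool
{
    // On invalid input iconv() returns false (or, as described above,
    // only the part before the first bad byte), so anything other than
    // an identical string means the input was not clean UTF-8.
    return $input === @iconv('UTF-8', 'UTF-8', $input);
}

// Example: reject a request parameter that is not valid UTF-8.
// if (!is_valid_utf8($_GET['q'] ?? '')) { die("Bad utf-8\n"); }
```

If mbstring is installed, mb_check_encoding($input, 'UTF-8') is another way to ask the same question.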

Meanwhile, the command line interface to iconv allows you to enable “cleaning” – iconv silently drops any bad bytes it finds. E.g.


$ iconv -c -f UTF-8 -t UTF-8 some_utf-8_encoded_file.txt
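
For doing the same kind of cleaning from PHP, here is a hedged sketch built on the //IGNORE tip from the comments. The helper name clean_utf8() is made up for illustration. Some PHP builds are reported to return false for //IGNORE on bad input rather than the cleaned string, so the sketch falls back to mbstring with the substitute character switched off; that fallback is an assumption of this example, not something from the talk.

```php
<?php
// Hypothetical helper: drop invalid bytes from a string meant to be UTF-8.
function clean_utf8(string $input): string
{
    // First choice: iconv with //IGNORE, which should discard bad bytes.
    $clean = @iconv('UTF-8', 'UTF-8//IGNORE', $input);
    if ($clean !== false) {
        return $clean;
    }

    // Fallback: mbstring drops invalid sequences when the substitute
    // character is set to "none". Restore the old setting afterwards.
    if (function_exists('mb_convert_encoding')) {
        $previous = mb_substitute_character();
        mb_substitute_character('none');
        $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
        mb_substitute_character($previous);
        return (string) $clean;
    }

    throw new RuntimeException('No usable UTF-8 cleaning method found');
}
```

Silently dropping bytes changes the data, so for user input it is often safer to validate and reject than to clean.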

Patrice August 9, 2006 at 5:27 pm

Thank you Harry for doing the presentation. Was really superb!
