Scripters UTF-8 Survival Guide (slides)

by Harry Fuecks

Following on from here, the presentation is now available here (PDF). A related recent discovery is TCPDF: basically a fork of FPDF (a pure-PHP PDF creator), but one that comes bundled with fonts that can handle a significant chunk of Unicode.

Anyway - many thanks to the local.ch team for hosting us.


This post has 8 responses so far

  1. Thank you Harry for doing the presentation. Was really superb!

     
  2. Think Patrice’s tip on UTF-8 validation needs repeating - nice “hack” I hadn’t thought of.

    If you want to make sure incoming UTF-8 is valid UTF-8, use iconv to convert it from UTF-8 to UTF-8. You can also potentially use iconv to clean the input.

    PHP’s iconv extension raises an error notice if the input is invalid, and returns only the portion of the input up to the first invalid (non UTF-8) byte it finds. Sadly there doesn’t seem to be a way to put it into “cleaning” mode, so it can only be used for validation. An example:

    if ( $input !== @iconv("UTF-8", "UTF-8", $input) ) { die("Bad utf-8\n"); }

    Meanwhile, the command line interface to iconv allows you to enable “cleaning” - iconv silently drops any bad bytes it finds. E.g.

    $ iconv -c -f UTF-8 -t UTF-8 some_utf-8_encoded_file.txt
     
  3. you can clean it with iconv the following way:

    $t = iconv("UTF-8", "UTF-8//IGNORE", $t);

    From http://blog.bitflux.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html

    :)

     
  4. you can clean it with iconv the following way:

    $t = iconv("UTF-8", "UTF-8//IGNORE", $t);

    Alright! That badly needs documenting in fact, although now you mention it, it’s documented here: http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html (i.e. $ man iconv_open ). Interesting - need to try that //TRANSLIT flag …

     
  5. Alright! That badly needs documenting in fact although now you mention it ..

    http://www.php.net/manual/en/function.iconv.php

    If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can’t be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character.

     
  6. OK - I’m blind ;)

     
  7. […] As a result of all the noise about UTF-8, got an email from Marek Gayer with some very smart tips on handling UTF-8. What follows is a discussion illustrating what happens when you get obsessed with performance and optimizations (be warned—may be boring, depending on your perspective). […]

     
  8. Putting this in your .htaccess file should help with garbled (“funny”) characters and get UTF-8 output displayed properly:

    php_value output_buffering on
    php_value output_handler mb_output_handler
    php_value mbstring.http_output UTF-8
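
The iconv tips from responses 2 to 5 can be pulled together in a short sketch. This is only an illustration, not code from the presentation: the helper names `is_valid_utf8` and `strip_invalid_utf8` are hypothetical, and the cleaning helper assumes that `//IGNORE` may still make `iconv()` return false on some PHP and libc combinations, so it checks the result instead of trusting it.

```php
<?php
// Hypothetical helper names; neither function appears in the original post.

/** Validate by round-tripping through iconv, as suggested in response 2. */
function is_valid_utf8(string $input): bool
{
    if ($input === '') {
        return true; // nothing to check
    }
    // @ suppresses the notice iconv raises on an illegal byte sequence.
    $roundTrip = @iconv('UTF-8', 'UTF-8', $input);
    // Bad input yields false or a truncated string; both fail this comparison.
    return $roundTrip === $input;
}

/** Clean with //IGNORE, as in response 3; null means iconv reported failure. */
function strip_invalid_utf8(string $input): ?string
{
    $cleaned = @iconv('UTF-8', 'UTF-8//IGNORE', $input);
    // Assumption: some PHP/libc combinations return false even with //IGNORE,
    // so the result is checked rather than trusted.
    return $cleaned === false ? null : $cleaned;
}

$dirty = "caf\xC3\xA9 \xFF ok"; // "café", a stray 0xFF byte, then " ok"
var_dump(is_valid_utf8($dirty));
$clean = strip_invalid_utf8($dirty);
echo $clean === null ? "iconv could not clean the input\n" : $clean . "\n";
```

Validating first and cleaning only when validation fails keeps well-formed input untouched. If the cleaning helper returns null, rejecting the input outright (as the `die()` example in response 2 does) is the conservative fallback.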

     
