Blog Post RSS ?

Blogs » PHP » US-ASCII transliterations of Unicode text
 

US-ASCII transliterations of Unicode text

by Harry Fuecks

Following on from this release, the this turned out to be easier than I thought – ported Text::Unidecode to PHP – code available here or track down the utf8_to_ascii package from the main page – released it seperately to keep with the original (Perl artistic) license while the rest of the stuff is under LGPL.

Now I first need to point out that what I’ve done is easy, compared to the amazing job Sean M. Burke has done with Text::Unidecode. You really need to read the docs to understand what it does and it’s limitations but, in short, it keeps a “database” of unicode characters and corresponding sensible US-ASCII equivalents. For example, a simple transformation would be “Zürich” to “Zuerich”, “ue” being a common replacement for “ü” in Germanic languages.

Really only came to understand how good a job Sean has done on passing this UTF-8 sampler through the PHP version – at a rough guess it did “something” for 85%+ of the non-ASCII characters in that document. Here’s some snippets to give you a feeling of before and after;

Before: *Sanskrit* /(standard transcription):/ kācaṃ śaknomyattum; nopahinasti mām.
After:  *Sanskrit* /(standard transcription):/ kaca- saknomyattum; nopahinasti mam.

Before: *Greek*: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
After: *Greek*: Mporo na phao spasmena gualia khoris na patho tipota.

Before: *Anglo-Saxon* /(Latin):/ Ic mæg glæs eotan ond hit ne hearmiað me.
After: *Anglo-Saxon* /(Latin):/ Ic maeg glaes eotan ond hit ne hearmiad me.

Before: *Soenderjysk*: Æ ka æe glass uhen at det go mæ naue.
After: *Soenderjysk*: AE ka aee glass uhen at det go mae naue.

Before: *Ukrainian*: Я можу їсти шкло, й воно мені не пошкодить.
After: *Ukrainian*: Ia mozhu yisti shklo, i vono mieni nie poshkodit'.

Before: *Farsi / Persian*: .من می توانم بدونِ احساس درد شيشه بخورم
After: *Farsi / Persian*: .mn my twnm bdwni Hss drd shyshh bkhwrm

Whether all of those actually make sense to a native speaker, I can’t say (feedback appreciated). I guess it depends partly on the language e.g. it’s easier to do with Greek than with Farsi. It should also be pointed out that the Text::Unidecode database (which I ported 1 to 1 – in fact there’s a script to automate it) isn’t entirely complete – for some characters and languages it has no data.

That said, if those languages are not relevant to your site, this can be a big help when you need ASCII not UTF-8. You might use this for filenames or critical “identifers” like a userid, for example, where you don’t want any risk of “phishing” or the overhead of processing UTF-8 characters. You might also consider it for search engine friendly URLs – although modern browsers largely support UTF-8 in URLs, phishing is again an issue and it may be smarter not to.

Anyway – the first PHP version “works” although no doubt it could get faster (although I doubt very much is will get as fast as the Perl version). This could also be readily ported to other languages like Python and Ruby.

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • Ping.fm
  • Twitthis

Related posts:

  1. 7 Dummy Text Generators: Which Is Your Favorite? Text generators allow you to create filler text for your...
  2. Two Quick Illustrator Text Effects Jennifer shows you how to create two text effects in...
  3. 18 Free Text Editors To Clean Up Your Code Have you ever considered using a free text editor for...
  4. Adobe Releases Beta Text Layout Framework Adobe's open source project Text Layout Framework offers greater control...
  5. The Future Of Books? iPhone App Brings Text, Audio & Video Together Nick Cave's new book "The Death of Bunny Munro" is...

This post has 13 responses so far

Sponsored Links

SitePoint Marketplace

Buy and sell Websites, templates, domain names, hosting, graphics and more.

Follow SitePoint on...