US-ASCII transliterations of Unicode text

Following on from this release, this turned out to be easier than I thought: I've ported Text::Unidecode to PHP. The code is available here, or track down the utf8_to_ascii package from the main page. I released it separately to keep the original (Perl Artistic) license, while the rest of the stuff is under the LGPL.

Now I first need to point out that what I’ve done is easy compared to the amazing job Sean M. Burke has done with Text::Unidecode. You really need to read the docs to understand what it does and its limitations but, in short, it keeps a “database” of Unicode characters and corresponding sensible US-ASCII equivalents. For example, a simple transformation would be “Zürich” to “Zuerich”, “ue” being a common replacement for “ü” in Germanic languages.
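
To make the lookup idea concrete, here is a minimal sketch in Python rather than the PHP port's actual code. The two-entry `TABLE` and the name `to_ascii` are made up for illustration, and the "ue" entry simply follows the example above; it is not a claim about what the real database contains.

```python
# Hypothetical, tiny stand-in for the character database (illustration only).
TABLE = {
    "\u00fc": "ue",  # u with diaeresis, following the example in the text
    "\u00e9": "e",   # e with acute accent
}


def to_ascii(text: str) -> str:
    """Replace each non-ASCII character with its table entry.

    The replacement depends only on the character itself,
    never on its neighbours.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through untouched
        else:
            out.append(TABLE.get(ch, "?"))  # "?" when the table has no entry
    return "".join(out)


print(to_ascii("Z\u00fcrich"))
```

The real work is in filling the table for tens of thousands of characters, which is the part Sean did.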

I only really came to understand how good a job Sean has done when passing this UTF-8 sampler through the PHP version: at a rough guess it did “something” for 85%+ of the non-ASCII characters in that document. Here are some snippets to give you a feeling for the before and after:

Before: *Sanskrit* /(standard transcription):/ k?ca? ?aknomyattum; nopahinasti m?m.
After:  *Sanskrit* /(standard transcription):/ kaca- saknomyattum; nopahinasti mam.

Before: *Greek*: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
After: *Greek*: Mporo na phao spasmena gualia khoris na patho tipota.

Before: *Anglo-Saxon* /(Latin):/ Ic mæg glæs eotan ond hit ne hearmiað me.
After: *Anglo-Saxon* /(Latin):/ Ic maeg glaes eotan ond hit ne hearmiad me.

Before: *Soenderjysk*: Æ ka æe glass uhen at det go mæ naue.
After: *Soenderjysk*: AE ka aee glass uhen at det go mae naue.

Before: *Ukrainian*: Я можу їсти шкло, й воно мені не пошкодить.
After: *Ukrainian*: Ia mozhu yisti shklo, i vono mieni nie poshkodit'.

Before: *Farsi / Persian*: .?? ?? ????? ????? ????? ??? ???? ?????
After: *Farsi / Persian*: .mn my twnm bdwni Hss drd shyshh bkhwrm

Whether all of those actually make sense to a native speaker, I can’t say (feedback appreciated). I guess it depends partly on the language e.g. it’s easier to do with Greek than with Farsi. It should also be pointed out that the Text::Unidecode database (which I ported 1 to 1 – in fact there’s a script to automate it) isn’t entirely complete – for some characters and languages it has no data.
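
As a rough sketch of what "no data" can look like in practice, assume the database is split into banks of 256 entries, one per block of code points. Everything below is hypothetical: the `BANKS` contents, the `lookup` and `transliterate` names, and the `"[?]"` placeholder are assumptions for illustration, not the port's real layout or API.

```python
# Hypothetical banks: the key is the code point's high bits (cp >> 8), the value
# is a 256-slot list indexed by the low byte (cp & 0xFF). None means "no data".
BANKS = {
    0x00: [None] * 256,
    0x04: [None] * 256,
}
BANKS[0x00][0xE6] = "ae"  # U+00E6 LATIN SMALL LETTER AE
BANKS[0x04][0x48] = "sh"  # U+0448 CYRILLIC SMALL LETTER SHA

PLACEHOLDER = "[?]"  # assumed marker for characters without a mapping


def lookup(ch: str) -> str:
    """Return the ASCII replacement for one character, or the placeholder."""
    cp = ord(ch)
    if cp < 0x80:
        return ch  # already ASCII
    bank = BANKS.get(cp >> 8)
    if bank is None:
        return PLACEHOLDER  # whole block missing from the database
    entry = bank[cp & 0xFF]
    return PLACEHOLDER if entry is None else entry


def transliterate(text: str) -> str:
    return "".join(lookup(ch) for ch in text)


print(transliterate("m\u00e6"))
```

A visible placeholder at least makes gaps easy to spot when you run a sample text through it.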

That said, if those languages are not relevant to your site, this can be a big help when you need ASCII rather than UTF-8. You might use this for filenames or critical “identifiers” like a userid, for example, where you don’t want any risk of “phishing” or the overhead of processing UTF-8 characters. You might also consider it for search engine friendly URLs – although modern browsers largely support UTF-8 in URLs, phishing is again an issue and it may be smarter not to.
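
For the filename and URL case, a sketch of the final step might look like this in Python. Note that it swaps in Unicode NFKD decomposition from the standard library as a crude stand-in for a real transliteration table: it only strips accents and silently drops anything it cannot decompose (Greek, Cyrillic and so on), which is exactly the gap Text::Unidecode fills. The name `ascii_slug` is made up for this example.

```python
import re
import unicodedata


def ascii_slug(text: str) -> str:
    """Build a lowercase, hyphen-separated ASCII slug from arbitrary text."""
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII (including the combining marks).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything other than a-z and 0-9 into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


print(ascii_slug("Z\u00fcrich Hauptbahnhof"))
```

With a proper transliteration function in place of the first two steps, the same slug logic would also cope with non-Latin scripts.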

Anyway – the first PHP version “works”, although no doubt it could get faster (I doubt very much it will get as fast as the Perl version, though). This could also be readily ported to other languages like Python and Ruby.

  • Ammar Ibrahim

    Anything for Arabic? Arabic almost uses the same “Letters” as Farsi does, how easy/hard would it be to have transliteration for arabic?

  • http://www.phppatterns.com HarryF

    Anything for Arabic?

    Will have to get back to you on that (or you could try it). If not, it would be relatively easy for an Arabic speaker to add the relevant mappings – more on that later.

  • Derick

    For people who are not aware of this, there is also a (much faster) PHP extension for this already:
    http://derickrethans.nl/translit.php

  • Ammar Ibrahim

    I’m an Arabic speaker Harry, I’d help you if that’s possible

  • Markus Wolff

    Cool, should’ve had that a few weeks ago (or simply remembered the translit extension…grrr :-)), when I wrote a really ugly function for Postgres to translate German Umlauts to plain ASCII for use with Levenshtein and the like… ah well, time to redo things again ;-)

  • http://www.phppatterns.com HarryF

    OK – an Arabic example (you can see these for yourself BTW – extract the download somewhere to your webserver and point your browser at the “test” subdirectory).

    Before: *Arabic*: أنا قادر على أكل الزجاج و هذا لا يؤلمني.
    After: *Arabic*: ‘n qdr `l~ ‘kl lzjj w hdh l yw’lmny.

    Would be interested on your opinion of how good that is.

    For people who are not aware of this, there is also a (much faster) PHP extension for this already:
    http://derickrethans.nl/translit.php

    Derick – have you thought of doing a pure PHP interface to your database files? I think a problem for many is hosts where they can’t install new extensions (plus, for those writing PHP apps for mass deployment, extension dependencies tend to be a problem).

  • Ammar Ibrahim

    Before: *Arabic*: أنا قادر على أكل الزجاج و هذا لا يؤلمني.
    After: *Arabic*: ‘n qdr `l~ ‘kl lzjj w hdh l yw’lmny.

    I think it’s not good, not even a single word :) It’s not correct to map a single letter in arabic to a letter in english, it’s quite hard to explain. but for example the sentence you provided could be something like this:
    “Ana qader ala akel alzojaj wa hatha la yo’limony”

    As you can see the sound of a basic letter is almost always accompanied with vowel.

  • Ammar Ibrahim

    and oh do you know what the sentence means, it’s weird it means “I can eat glass, and that doesn’t hurt me” wondering where you got that from

  • Anonymous

    Great stuff Harry. I can use this on a little project of mine :)

  • http://www.phppatterns.com HarryF

    and oh do you know what the sentence means, it’s weird it means “I can eat glass, and that doesn’t hurt me” wondering where you got that from

    Hmmm – not such a nice thing to teach beginners. It comes from here: http://www.columbia.edu/kermit/utf8.html

  • http://www.phppatterns.com HarryF

    I think it’s not good, not even a single word :) It’s not correct to map a single letter in arabic to a letter in english, it’s quite hard to explain. but for example the sentence you provided could be something like this:
    “Ana qader ala akel alzojaj wa hatha la yo’limony”

    As you can see the sound of a basic letter is almost always accompanied with vowel.

    That’s what I’d feared. For languages with a closer relationship to the Roman alphabet, seems to do a good job. Sean notes the limitations here;

    Text::Unidecode is meant to be a transliterator-of-last resort, to be used once you’ve decided that you can’t just display the Unicode data as is, and once you’ve decided you don’t have a more clever, language-specific transliterator available. It transliterates context-insensitively — that is, a given character is replaced with the same US-ASCII (7-bit ASCII) character or characters, no matter what the surrounding character are.

  • dusoft

    I don’t speak Ukrainian, but could understand some. The transcribed cyrillics looks OK, although I would like to hear Ukrainian on this.

  • Anonymous