By Harry Fuecks

US-ASCII transliterations of Unicode text


Following on from this release, this turned out to be easier than I thought: I've ported Text::Unidecode to PHP. The code is available here, or track down the utf8_to_ascii package from the main page. I released it separately to keep the original (Perl Artistic) license, while the rest of the stuff is under the LGPL.

Now I first need to point out that what I’ve done is easy compared to the amazing job Sean M. Burke has done with Text::Unidecode. You really need to read the docs to understand what it does and its limitations but, in short, it keeps a “database” of Unicode characters and corresponding sensible US-ASCII equivalents. For example, a simple transformation would be “Zürich” to “Zuerich”, “ue” being a common replacement for “ü” in Germanic languages.
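To make the “database” idea concrete, here is a minimal sketch in Python. It is not the utf8_to_ascii code or the Text::Unidecode tables: `SAMPLE_TABLE`, `to_ascii` and the `unknown` parameter are hypothetical names, and the table holds only a handful of entries chosen to match the examples in this post.

```python
# Hypothetical, tiny stand-in for the real transliteration database.
# The real tables cover far more of Unicode than this.
SAMPLE_TABLE = {
    "ü": "ue",  # matches the "Zürich" example in the text
    "æ": "ae",
    "Æ": "AE",
    "ð": "d",
}


def to_ascii(text, unknown="?"):
    """Replace each non-ASCII character with its table entry.

    Characters with no entry become `unknown` (an assumption of this
    sketch, not documented behaviour of any real library).
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already US-ASCII, keep as-is
        else:
            out.append(SAMPLE_TABLE.get(ch, unknown))
    return "".join(out)


print(to_ascii("Zürich"))
print(to_ascii("Ic mæg glæs eotan ond hit ne hearmiað me."))
```

The point is only that the transformation is a per-character lookup. All of the hard work lives in choosing good entries for the table.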

I really only came to understand how good a job Sean has done on passing this UTF-8 sampler through the PHP version. At a rough guess, it did “something” for 85%+ of the non-ASCII characters in that document. Here are some snippets to give you a feeling of before and after:

Before: *Sanskrit* /(standard transcription):/ kāca? śaknomyattum; nopahinasti mām.
After:  *Sanskrit* /(standard transcription):/ kaca- saknomyattum; nopahinasti mam.

Before: *Greek*: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
After: *Greek*: Mporo na phao spasmena gualia khoris na patho tipota.

Before: *Anglo-Saxon* /(Latin):/ Ic mæg glæs eotan ond hit ne hearmiað me.
After: *Anglo-Saxon* /(Latin):/ Ic maeg glaes eotan ond hit ne hearmiad me.

Before: *Soenderjysk*: Æ ka æe glass uhen at det go mæ naue.
After: *Soenderjysk*: AE ka aee glass uhen at det go mae naue.

Before: *Ukrainian*: Я можу їсти шкло, й воно мені не пошкодить.
After: *Ukrainian*: Ia mozhu yisti shklo, i vono mieni nie poshkodit'.

Before: *Farsi / Persian*: .من می توانم بدونِ احساس درد شيشه بخورم
After: *Farsi / Persian*: .mn my twnm bdwni Hss drd shyshh bkhwrm

Whether all of those actually make sense to a native speaker, I can’t say (feedback appreciated). I guess it depends partly on the language; e.g. it’s easier to do with Greek than with Farsi. It should also be pointed out that the Text::Unidecode database (which I ported one-to-one; in fact there’s a script to automate it) isn’t entirely complete: for some characters and languages it has no data.
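If you want something better than a rough guess at how much of a document the database covers, you could count it. The following Python sketch assumes the database can be treated as a plain mapping from character to replacement; `coverage` and the sample table are hypothetical names for illustration, not part of the ported code.

```python
def coverage(text, table):
    """Return the fraction of non-ASCII characters in `text` that have
    an entry in `table` (a dict of character -> ASCII replacement)."""
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 1.0  # nothing to transliterate, so nothing is missing
    mapped = sum(1 for ch in non_ascii if ch in table)
    return mapped / len(non_ascii)


# Hypothetical two-entry table, just to exercise the function.
sample_table = {"ü": "ue", "æ": "ae"}
print(coverage("Zürich mæg 日本", sample_table))
```

Run against a real table and a real document, a number like this would show which languages are poorly served before you rely on the output.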

That said, if those languages are not relevant to your site, this can be a big help when you need ASCII rather than UTF-8. You might use this for filenames or critical “identifiers” like a userid, for example, where you don’t want any risk of “phishing” or the overhead of processing UTF-8 characters. You might also consider it for search engine friendly URLs. Although modern browsers largely support UTF-8 in URLs, phishing is again an issue and it may be smarter not to use it.
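As an illustration of the URL case, here is a sketch of a “slug” function in Python. It assumes a transliteration step like the one described above; `slugify` and `SAMPLE_TABLE` are hypothetical names, and the two-entry table is a placeholder for the real database. In this sketch, characters with no entry are simply dropped.

```python
import re

# Hypothetical placeholder for the real transliteration database.
SAMPLE_TABLE = {"ü": "ue", "æ": "ae"}


def slugify(text, table=SAMPLE_TABLE):
    """Build an ASCII-only, URL-friendly slug from `text`."""
    # Step 1: transliterate; unmapped non-ASCII characters are dropped.
    ascii_text = "".join(
        ch if ord(ch) < 128 else table.get(ch, "") for ch in text
    )
    # Step 2: lowercase, then collapse anything that is not a letter
    # or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    # Step 3: trim hyphens left over at either end.
    return slug.strip("-")


print(slugify("Zürich: Ein Führer"))
```

Since the result contains only `a-z`, `0-9` and hyphens, two visually similar Unicode strings cannot produce lookalike URLs, which is the phishing concern mentioned above. They may instead collide on the same slug, so you would still need to check for uniqueness.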

Anyway, the first PHP version “works”, although no doubt it could get faster (I doubt very much it will get as fast as the Perl version, though). This could also be readily ported to other languages like Python and Ruby.
