Localizing PHP Applications “The Right Way”, Part 2

This entry is part 2 of 5 in the series Localizing PHP Applications "The Right Way"

Welcome back to this series of articles, which teaches you how to localize your PHP applications using gettext and its PHP extension. In Part 1 you took your first steps towards this by installing gettext and Poedit, creating a translation file, and writing a Hello World script. In this part you’ll learn about each of the functions used in the script, and dive deeper into the gettext library and its usage.

The “Hello World” Script

To review, Part 1 showed you the following script as TestI18N/test-locale.php:

<?php
// I18N support information here
$language = "en_US";
putenv("LANG=" . $language); 
setlocale(LC_ALL, $language);

// Set the text domain as "messages" to 
// use Locale/en_US/LC_MESSAGES/messages.mo
$domain = "messages";
bindtextdomain($domain, "Locale"); 
bind_textdomain_codeset($domain, "UTF-8");

// Use the messages domain
textdomain($domain);

echo _("HELLO_WORLD");

Calling putenv() and setting the LANG environment variable instructs gettext which locale it will be using for this session. en_US is the identifier for English as used in the United States. The first part of the locale is a two-letter lowercase abbreviation for the language according to the ISO 639-1 specification, and the second part is a two-letter uppercase country code according to the ISO 3166-1 alpha-2 specification. setlocale() specifies the locale used in the application and affects how PHP sorts strings, understands date and time formatting, and formats numeric values.
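To make the two-part structure of a locale identifier concrete, here is a minimal sketch that splits an identifier such as en_US into its language and country parts. The parseLocale() function is a hypothetical helper written for this illustration, not part of gettext or PHP, and it only handles the simple language_COUNTRY form described above.

```php
<?php
// Hypothetical helper (not part of gettext or PHP): split a locale
// identifier such as "en_US" into its language and country parts.
function parseLocale(string $locale): ?array
{
    // Two lowercase letters, an underscore, then two uppercase letters.
    // The D modifier stops "$" from matching before a trailing newline.
    if (preg_match('/^([a-z]{2})_([A-Z]{2})$/D', $locale, $m) !== 1) {
        return null; // not in the simple language_COUNTRY form
    }

    return ['language' => $m[1], 'country' => $m[2]];
}

// Prints the array with keys "language" and "country".
print_r(parseLocale("en_US"));
```

A value that doesn't follow the pattern, such as "english", makes the helper return null, which is a convenient place to fall back to a default locale.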

gettext calls the catalog file used to store the translation messages (the MO file) a domain. The bindtextdomain() function tells gettext where to find the domain to use; the first parameter is the catalog name without the .mo extension, and the second parameter is the path to the parent directory in which the en_US/LC_MESSAGES subpath resides (which in turn is where the translation file resides). If you’re wondering where the subpath en_US/LC_MESSAGES comes from, it is constructed by gettext using the values of the LANG variable you specified using putenv() and the locale category LC_MESSAGES. You can call bindtextdomain() several times to bind as many domains as you want, in the event you’ve split your translations up throughout multiple files.
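The path construction described above can be sketched in a few lines. The catalogPath() function is a hypothetical helper, not a gettext call; it only mirrors the directory layout explained in this article so you can check whether the file you expect gettext to load actually exists.

```php
<?php
// Hypothetical helper: rebuild the catalog path from the directory passed
// to bindtextdomain(), the locale, the LC_MESSAGES category, and the domain.
function catalogPath(string $baseDir, string $locale, string $domain): string
{
    return rtrim($baseDir, '/') . '/' . $locale . '/LC_MESSAGES/' . $domain . '.mo';
}

$path = catalogPath("Locale", "en_US", "messages");
echo $path, PHP_EOL; // Locale/en_US/LC_MESSAGES/messages.mo

// Quick sanity check before blaming gettext for a missing translation.
echo file_exists($path) ? "catalog found" : "catalog missing", PHP_EOL;
```

A check like this is handy when a translation silently fails to show up, since a wrong directory or domain name is a common cause.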

Calling bind_textdomain_codeset() is very important because not doing so can lead to unexpected characters in your output when using non-ASCII letters. Since the catalog messages are encoded in UTF-8, that is what the example code sets as the codeset. I always recommend using UTF-8 as it is the most widely supported Unicode encoding. Don’t use other, lesser-known encodings unless you know exactly what you are doing; you will encounter serious problems, especially on the web.

The call to textdomain() tells gettext which domain to use for any subsequent calls to gettext(), its shorthand alias _(), or its plural form lookup method ngettext(). I’ll talk about dealing with plural forms in the next installment, but for now you should know that all three of these methods look up messages in the current domain specified with textdomain().

Lastly, the script calls _(), which looks up the msgid HELLO_WORLD in the messages.mo file and returns the msgstr associated with it, the text Hello World!

Missing Translation Strings

Now that you have a basic understanding of how this simple script looks up replacements for translations, try changing the domain.

<?php
$language = "en_US";
putenv("LANG=" . $language); 
setlocale(LC_ALL, $language);

$domain = "foo";
bindtextdomain($domain, "Locale"); 
bind_textdomain_codeset($domain, "UTF-8");
// ...

gettext will try to look up the catalog Locale/en_US/LC_MESSAGES/foo.mo, which shouldn’t exist.

When you view the script’s output you’ll see HELLO_WORLD instead of Hello World! gettext can’t perform a translation because there isn’t a valid catalog. The same thing happens when the given msgid doesn’t exist in any catalog registered with gettext: in both cases gettext is smart enough to fall back to the original string you supplied.
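The fallback behavior can be pictured with a plain PHP array standing in for a loaded catalog. This is only an emulation for illustration: the $catalog array and the translate() function are hypothetical and have nothing to do with how gettext is implemented internally.

```php
<?php
// Hypothetical stand-in for a loaded catalog: msgid => msgstr.
$catalog = ['HELLO_WORLD' => 'Hello World!'];

// Emulates the fallback described above; this is not gettext's own code.
function translate(array $catalog, string $msgid): string
{
    // Return the msgstr when the msgid is known, otherwise the msgid itself.
    return $catalog[$msgid] ?? $msgid;
}

echo translate($catalog, 'HELLO_WORLD'), PHP_EOL; // Hello World!
echo translate($catalog, 'GOODBYE'), PHP_EOL;     // GOODBYE (msgid not in the catalog)
echo translate([], 'HELLO_WORLD'), PHP_EOL;       // HELLO_WORLD (no catalog at all)
```

The last two calls correspond to the two failure scenarios: a msgid that is missing from the catalog, and a catalog that could not be found.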

Targeting Multiple Locales

In a real-world application, you will typically use the strings of one of your target languages as the IDs throughout your code. This makes the code a bit clearer and makes the fallback after a translation failure more user friendly. For example, if your application targets English and French, you can use the English strings as the IDs and then create French catalogs to replace the English.

In the same TestI18N/Locale directory, create a new directory named fr_FR containing another LC_MESSAGES directory, and use the procedures outlined in Part 1 to create a new catalog for French. When you’re finished, you should have the following hierarchy:

en_US and fr_FR directories

When you specify the catalog settings in Poedit, remember to set French as the language and France as the country.

Poedit settings window for French

My French messages.po will look like this when opened in a text editor:

msgid ""
msgstr ""
"Project-Id-Version: TestProject\n"
"POT-Creation-Date: \n"
"PO-Revision-Date: \n"
"Last-Translator: FIRSTNAME LASTNAME <email@example.com>\n"
"Language-Team: MyTeam <team@example.com>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"X-Poedit-Language: French\n"
"X-Poedit-Country: FRANCE\n"
"X-Poedit-SourceCharset: utf-8\n"

#Test token 1
msgid "HELLO_WORLD"
msgstr "Bonjour tout le monde!"

#Test token 2
msgid "TEST_TRANSLATION"
msgstr "Test de traduction..."

Most of the header lines of the file are self-explanatory, so I’ll skip right to the actual translation lines, which start with the first msgid after the headers. Notice that there are two strings for each phrase to be translated: the msgid, which is the ID string in your code that gettext will look up, and the msgstr, which is the translated message that gettext will substitute for the ID. The first definition instructs gettext to use Bonjour tout le monde! whenever it sees HELLO_WORLD. The second instructs gettext to use Test de traduction... for TEST_TRANSLATION.
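To see the msgid/msgstr pairing in action, the following sketch reads the pairs out of PO text with plain PHP. Keep in mind that this is a simplified illustration, not how gettext works: gettext reads the compiled MO file, and the hypothetical parsePoPairs() function below assumes single-line entries and ignores comments, plural forms, and everything else a real PO parser has to handle.

```php
<?php
// Simplified sketch: collect msgid/msgstr pairs from PO text.
// Assumes single-line entries; ignores comments, plurals, and contexts.
function parsePoPairs(string $po): array
{
    $pairs = [];
    $msgid = null;

    foreach (preg_split('/\R/', $po) as $line) {
        $line = trim($line);
        if (preg_match('/^msgid "(.*)"$/', $line, $m)) {
            $msgid = $m[1]; // remember the ID until its msgstr arrives
        } elseif ($msgid !== null && preg_match('/^msgstr "(.*)"$/', $line, $m)) {
            if ($msgid !== '') { // the entry with an empty msgid is the header
                $pairs[$msgid] = $m[1];
            }
            $msgid = null;
        }
    }

    return $pairs;
}

$po = <<<'PO'
#Test token 1
msgid "HELLO_WORLD"
msgstr "Bonjour tout le monde!"

#Test token 2
msgid "TEST_TRANSLATION"
msgstr "Test de traduction..."
PO;

// Prints an array mapping each msgid to its French msgstr.
print_r(parsePoPairs($po));
```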

Open the catalog file again in Poedit and click the Save Catalog entry in the icon bar to save and compile it. Then modify the PHP script to use fr_FR instead of en_US. When you run it, you’ll see the output in your browser is now French!

Summary

In this part you learned what each function call does in the Hello World script introduced in Part 1. In terms of its API, gettext isn’t really a large library. There are only a handful of functions, most of which you will only use once in your entire application. The most frequently used will be gettext(), or its shorthand alias _(), and its plural form equivalent ngettext(). You also learned how to target multiple locales (en_US and fr_FR in our example), and how gettext falls back to the msgid when it’s missing a translation.

In the next part you’ll see how to start doing real-world localization by organizing the directories, switching between languages, choosing a fallback language, and overriding the currently selected messages domain.

  • David Runion

    I found the gettext workflow to be inefficient and opted instead to go for a fully custom solution. My solution encapsulates translatable bits of text in a translation function like _(“translate me”) just like gettext does, but enables translating directly on the website by storing translations in a database.

    I’m considering turning the database table into static translation files now (probably much faster) but I definitely think editing static files is a step back because I require others to do the translation, and I don’t want to (a) give them access to modify files directly on the website, and (b) make myself the middle man, updating files that are sent to me.

  • dplehati

    Please check the official PHP documentation because your tutorial is misleading and doesn’t mention this:

    http://php.net/manual/en/function.setlocale.php
    The locale information is maintained per process, not per thread. If you are running PHP on a multithreaded server API like IIS or Apache on Windows, you may experience sudden changes in locale settings while a script is running, though the script itself never called setlocale(). This happens due to other scripts running in different threads of the same process at the same time, changing the process-wide locale using setlocale().

    • Abdullah Abouzekry

      @David, can you share your experiences with us, why did you find it inefficient? The same question to @Jiri as well? Thanks for your comments.

      @dplehati, thanks for your comment. You are right about me not pointing out that detail, but I don’t think it’s misleading as you will not get such an issue running FastCGI which is now becoming the de-facto standard for PHP hosting, even for IIS on Windows, please check http://php.iis.net/ for more info.

  • Jiri Fornous

    Hi David,
    we also found gettext to be a very inefficient solution, and we went with a DB solution as well. We use SQLite rather than the main DB because of easy version control integration, and we use a custom parser for translated text. To serve content faster, we cache the whole translation DB in an APC hash.

  • http://www.do-my-site.net/ Peter

    That may work for the html input side but it does nothing for the database side. Secondly, what happens if the browser is set for a foreign language but the user actually uses a different language?
    Good to see the article because most English speaking programmers simply assume that English is everything.

    • David Runion

      Hi Peter, we actually use different top-level domains for each of the countries that we have translations for. This is a costly and inefficient solution (and many countries require that you be a resident of the country in order to purchase a TLD there, although sometimes you can get around this for additional cost with a “local presence service”) but it’s what we do. Each website is translated by the local sales partner for our company. Based on the domain, we use different [analytics codes, language settings, custom CSS file, different contact information, etc] and the partner in that country (who is essentially a sales rep and often non-technical) is responsible for translation.

      Some countries like Switzerland support multiple languages, so we have a language switching feature on each page.

      When we added Spain, we started with the existing Spanish translation provided by our Mexican partner and they changed here and there where needed. Similar for Swiss French and Swiss German.

      For Hebrew, I added a few dir=rtl attributes here and text-direction: rtl styles there and, in a custom CSS file, I changed text-align: left to text-align: right in a bunch of places and fixed padding on containers where I only added padding-left and no corresponding padding-right.

      Japanese wasn’t too hard, but the text is more compact so I changed font sizes and added padding here and there to compensate.

      I would call it a workable solution…