November 14, 2005

Internationalization (i18n)

Filed under: PHP, Web development — Dimitris Giannitsaros @ 10:42

To offer localized versions of an application, there are two things to take care of:

  • Internationalization (i18n): this is more development centric and it’s all about designing your application to support translations, foreign character sets, timezones, different number / currency / date formats etc.
  • Localization (L10n): using the mechanism provided by i18n, this includes the actual translation and settings for a new language.

In this article I focus mainly on the translation part of the i18n process, although several other subjects are touched on.

There are two approaches for translating strings:

  • Using a function to wrap your strings in your code. I call this the “gettext” approach (also check PHP gettext support). Of course it’s not the only tool that offers this functionality, but it supports many languages and it’s open source.
  • Using constants or variables instead of strings in your code. One or more language files contain the actual strings and these files are included by your application.
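To make the difference concrete, here is a minimal sketch of both approaches side by side. The function t(), the $greekCatalog array and the variable $lc_save are hypothetical names invented for this illustration; t() only stands in for a real lookup such as PHP's gettext().

```php
<?php
// Hypothetical sketch: both approaches side by side. All names are illustrative.

// 1) "gettext" style: wrap the source string in a function call.
//    t() is a stand-in for a real lookup such as PHP's gettext().
function t(string $text, array $catalog): string
{
    // Fall back to the source string when no translation exists.
    return $catalog[$text] ?? $text;
}

$greekCatalog = ['Save' => 'Αποθήκευση'];
$wrapped = t('Save', $greekCatalog);
$missing = t('Cancel', $greekCatalog); // no entry, so the source string comes back

// 2) "language file" style: the code only refers to a variable.
//    The string itself would normally be set by an included language file.
$lc_save = 'Αποθήκευση';
$fromFile = $lc_save;
```

In the first style the English text stays visible in the code; in the second the code only sees a name, and everything readable lives in the language file.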

Let’s look at some general facts about these two approaches:

The “gettext” approach:

  • Is more complex.
  • Is suitable for large projects with thousands of strings.
  • Allows strings to be stored in a database or an efficient data structure.
  • May utilize a special translation program.
  • May offer better management of deprecated and changed strings.
  • Offers better domain management (think of domains as logical areas of the application, e.g. the administrator’s area, the user’s area etc).

The “language file” approach:

  • Is significantly simpler.
  • Is better for smaller projects.
  • Stores strings in text files, so editing is easy for anyone.
  • Doesn’t force any good habits upon you, so you have to be careful.

I’ve used both approaches in web and desktop applications (most notably Cheez, a free image cataloguing tool, currently translated into 17 languages). The “language file” approach is my favorite, so what follows is some advice on using it:


Language scope

For multiuser applications, you must consider whether all users will share the same language (so language is a system setting) or each user will choose their own language (so language is a user setting).

The same rules apply to both cases, but you’ll need to make some design decisions based on that.
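One possible way to support both cases is to treat the user setting as optional and fall back to the system setting. The sketch below assumes a hypothetical helper resolve_language() and made-up language names; it is not taken from any particular application.

```php
<?php
// Hypothetical helper: decide which language file to load for a request.
// $userLanguage is null when the user has not chosen a language
// (or when language is purely a system setting).
function resolve_language(?string $userLanguage, string $systemLanguage, array $available): string
{
    // Only honor the user's choice if a language file actually exists for it.
    if ($userLanguage !== null && in_array($userLanguage, $available, true)) {
        return $userLanguage;
    }
    return $systemLanguage;
}

$available = ['english', 'greek', 'german'];

$forGreekUser   = resolve_language('greek', 'english', $available);
$forNewUser     = resolve_language(null, 'english', $available);
$forUnknownLang = resolve_language('klingon', 'english', $available);
```

With this shape, switching from "language is a system setting" to "language is a user setting" only changes where the first argument comes from.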

Single file vs. many files

It’s better to use a single file instead of many files. If that file is becoming really big, maybe you should consider the “gettext” approach. Exception: if your application supports plug-ins (e.g. user-created stuff), make sure you provide a mechanism where each plug-in has its own language file (preferably a single file per plug-in). You don’t want to litter your core language file with strings specific to plug-ins.

For an example of how multiple files can get out of hand, look at osCommerce, which uses about 80 files per language, spread over many directories. Moreover, each plug-in is allowed to have many language files, making things even worse. Note: Other than this inconvenience, osCommerce is a very good and popular shopping cart solution.

Images

Handling localized images can be a bit tricky. You have at least 3 options:

  • All images have the same name, but are placed in different directories (/english/, /greek/, /german/). Your code uses the right directory based on a string included in the language file.
  • Images have different names, using a standard prefix / suffix (icon_delete_ENGLISH.png, icon_delete_GREEK.png).
  • Images can have any name as long as it’s defined in the language file with the rest of the resource strings. Your code loads images using the appropriate string. Of course it’s a good idea to use a naming convention for images (e.g. a suffix).

Personally I prefer the 3rd option. This way your code doesn’t do anything different than for normal strings, plus translators know which images must be changed just by looking at the language file.
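A minimal sketch of the 3rd option, assuming a hypothetical image_tag() helper, an /images/ directory and two made-up resource strings ($lc_delete, $lc_img_delete):

```php
<?php
// Entries as they might appear in a Greek language file (hypothetical names).
$lc_delete     = 'Διαγραφή';               // label, used here as alt text
$lc_img_delete = 'icon_delete_GREEK.png';  // localized image, named like any other string

// Hypothetical helper: build an <img> tag from two resource strings.
// The /images/ directory is an assumption made for this example.
function image_tag(string $file, string $alt): string
{
    return sprintf(
        '<img src="/images/%s" alt="%s">',
        rawurlencode($file),
        htmlspecialchars($alt, ENT_QUOTES, 'UTF-8')
    );
}

$html = image_tag($lc_img_delete, $lc_delete);
```

The code never needs to know which language is active: it uses whatever file name the language file provides.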

Other i18n settings

Be careful what i18n settings go into the language file. This can be a problem especially for web applications, where users can be anywhere in the world.

Many projects I’ve seen put things like the date / number / currency formats and timezone in the language file. It’s much better to have these as user settings: just because a user prefers a specific language e.g. Greek or German, it doesn’t mean he’s currently based in Greece or Germany.

Of course, which settings should go in the language file and which are made available as a user setting depends a lot on what your application does, what kind of users it has etc.
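As a rough illustration, the sketch below keeps formats and timezone in a per-user settings array, separate from the language. The array keys and the format_for_user() function are hypothetical; only DateTimeImmutable, DateTimeZone and number_format() are standard PHP.

```php
<?php
// Hypothetical per-user settings, stored separately from the language file.
// This user reads the interface in Greek but lives in New York.
$user = [
    'language'      => 'greek',
    'timezone'      => 'America/New_York',
    'date_format'   => 'm/d/Y H:i',
    'decimal_point' => '.',
    'thousands_sep' => ',',
];

// Format a UTC timestamp and an amount according to the user's settings.
function format_for_user(int $timestamp, float $amount, array $user): array
{
    $date = (new DateTimeImmutable('@' . $timestamp))
        ->setTimezone(new DateTimeZone($user['timezone']));

    return [
        'date'   => $date->format($user['date_format']),
        'amount' => number_format($amount, 2, $user['decimal_point'], $user['thousands_sep']),
    ];
}

$out = format_for_user(0, 1234.5, $user);
```

Nothing in format_for_user() looks at the language, which is exactly the point: changing the interface language leaves dates and numbers alone.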


Charset

If your application supports Unicode you probably need only one charset (e.g. UTF-8), so you can skip this paragraph.

Charset is very important for two reasons: a) It allows users to correctly view localized characters. b) It allows users to correctly enter (and store) localized characters. The first thing that comes to mind is to put charset in the language file.

This is usually good enough, but has a small problem: Imagine your web application currently supports English. Inside your language file you’ve set a variable for the charset (e.g. $charset = "ISO-8859-1") which you use for defining the charset of the HTML files. Imagine a Greek user installs your software and tries to use it. Although he knows English, he would also like to enter data in Greek. Since you have tied the charset (ISO-8859-1) to the interface language (English), he can’t! If he could change the charset to “ISO-8859-7” he would be able to enter and view Greek text (and of course the English strings of the UI would still be displayed correctly).

I am not arguing that it’s always the right thing to offer charset as a user setting. Just remember that the language file’s purpose is to have the localized resource strings, without interfering with the way the application works.
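If you do decide to offer charset as a setting, one simple arrangement is to let the setting override a default. The function content_type_header() below is a hypothetical helper written for this illustration; the string it returns is what you would pass to PHP's header().

```php
<?php
// Hypothetical helper: build the Content-Type header for HTML pages.
// $userCharset is null when the user has not overridden the default.
function content_type_header(?string $userCharset, string $defaultCharset): string
{
    $charset = $userCharset ?? $defaultCharset;
    return 'Content-Type: text/html; charset=' . $charset;
}

// English interface with the default charset:
$default = content_type_header(null, 'ISO-8859-1');

// English interface, but the user wants to enter Greek text:
$greekData = content_type_header('ISO-8859-7', 'ISO-8859-1');

// In a real page you would then call: header($greekData);
```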

Good resource strings

Some general guidelines for creating good resource strings:

  • Use complete sentences as resource strings. Keep concatenation to a minimum and format strings with the language’s formatting functions (printf(), sprintf() in PHP).

    So instead of

    $location . " contains " . $count . " files";

    which needs 2 resource strings and never shows the translator a complete sentence, use

    sprintf("%s contains %d files", $location, $count);

    which needs 1 resource string and actually makes sense to the translator.

  • If support for argument ordering is available, use it (PHP has it). So the above example would become:

    sprintf("%1\$s contains %2\$d files", $location, $count);

    which needs 1 resource string and the translator can change the argument order e.g.

    "%2\$d files are contained by %1\$s"

  • Try to keep related sentences in one resource string. “Command failed. Abort or retry?” should be one resource string, not two.
  • Sometimes it’s best to use different resource strings for the same word / phrase. This is hard to get right, because as a developer you don’t know which words have many different meanings in other languages. One solution is to use a different resource string for every occurrence. So if your application uses the word “Save” 28 times, then you have 28 different resource strings for “Save”. This can be extra work, both for you and the translator, but it makes a better translation quality possible.

    The best solution is somewhere in the middle. This way simple words (“yes”, “no”) can be mapped to a single resource string, while more complex words (“execute”) have a separate resource string for each occurrence.
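The argument-ordering guideline can be tried out with a short sketch. The variable names ($lc_contains, $lc_contains_reordered) and the sample values are made up for the example. Note the single quotes: inside double quotes PHP would treat $s and $d as variables unless the dollar signs are escaped.

```php
<?php
$location = '/tmp';
$count    = 3;

// Original resource string, with numbered arguments.
$lc_contains = '%1$s contains %2$d files';

// A translation is free to use the arguments in a different order.
$lc_contains_reordered = '%2$d files are contained by %1$s';

// The calling code is identical in both cases.
$original  = sprintf($lc_contains, $location, $count);
$reordered = sprintf($lc_contains_reordered, $location, $count);
```

Here $original is "/tmp contains 3 files" and $reordered is "3 files are contained by /tmp", without any change to the code that calls sprintf().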


Resource strings naming convention

Obviously it’s a good idea to have a common prefix for resource strings (e.g. lc_). The rest of the name can be either an increasing number or a description:

$lc_res1
$lc_res2
$lc_res3
$lc_res4

or

$lc_yes
$lc_no
$lc_execute1
$lc_execute2

Although the 2nd group seems much clearer, after about 1000 strings it becomes difficult to think of good descriptive names and you end up with things like

$lc_warn_user_after_failed_sql_execution_offer_to_retry

Domains

If you want to logically separate the resource strings based on different areas / parts of your application you may be tempted to use multiple files. I believe it’s always better to keep to one file and just use some comments for domain separation. So you can have:

// Admin area

$lc_admin_res1 = "";
$lc_admin_res2 = "";

// User area

$lc_user_res1 = "";
$lc_user_res2 = "";

Versioning

This is the single most important piece of advice in this article.

After you release your first public version, you must never again change a resource string. Even if you find a typo or something bad you wrote about your boss or wife.

Both new and changed resource strings go at the end of the file. Moreover it’s good to keep a comment about each version:

// Version 1.0

$lc_yes = "Yed";
$lc_no = "No";

// Version 2.0

$lc_yes = "Yes"; // correction
$lc_new = "New";

// Version 3.0
// (v3.0 resource strings will go here)

Note that the $lc_yes resource string was corrected in the new version, while the old resource string stayed the same.

There are a number of reasons to uphold this policy:

  1. Translators have a much easier job with new versions. They just go to the end of the file to find new / changed strings. Changed strings are marked with “correction”, so they can find the old translation by searching. There is no need to use tools like diff to find out what has changed between versions.
  2. You can have a single file per language that works for all versions. If you translate the v2.0 file, you can send it to someone using v1.0 and he’ll have no problem.
  3. When you release a new version (e.g. v3.0) you may not want to wait for translators to translate the new strings. So you just copy-paste everything under “Version 3.0” from the original language file to all other language files and you’re good to go (so for foreign languages, old strings will remain translated while new strings will be untranslated).
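The policy works because PHP runs the language file from top to bottom, so a later assignment simply overrides an earlier one. The sketch below writes the example file from this section to a temporary file and includes it; the temporary-file handling is only there to keep the example self-contained.

```php
<?php
// Contents of the example language file, stored in a nowdoc so that
// nothing inside it is interpreted before the file is included.
$source = <<<'EOT'
<?php
// Version 1.0
$lc_yes = "Yed";
$lc_no = "No";

// Version 2.0
$lc_yes = "Yes"; // correction
$lc_new = "New";
EOT;

// Write the file somewhere temporary and load it like a normal language file.
$path = tempnam(sys_get_temp_dir(), 'lang');
file_put_contents($path, $source);
include $path;   // defines $lc_yes, $lc_no and $lc_new in the current scope
unlink($path);

// $lc_yes now holds the corrected v2.0 value; the v1.0 line is still in the file.
$result = [$lc_yes, $lc_no, $lc_new];
```

An application still at v1.0 never reads $lc_new, so it can load the same file without noticing the extra strings.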

Used images are under a CC license. See here:
1st image, 2nd image, 3rd image, 4th image

I was writing an article on PHP and timezones (promised here), but then I changed my mind and decided to write and publish this one first. The one about timezones will be next.

2 Comments

  1. Pure English = on
    Maybe out-of-topic = on

    It is a greek localization prob, having to do with gettext, upcasing, accents etc but not with translation-related issues.

    When you upcase a greek letter it should be considered as a completely different thing than upcasing words alltogether (even one-letter ones).

    e.g:
    Letter η becomes Η (standalone case).
    Letter ή must become Η too (since caps are not accented) (standalone case)
    Letter ή must become Ή (if it is the one-letter greek word meaning OR).
    Letter η in a word cannot be upcased with the same way every time. It has to do with the position of it (first in word, accented or not etc)…

    Same (and worse) goes with other vowels like iota (that can have accents and dialytika too. Sometimes they have to stay, sometimes the have to go…). Try to upcase:
    παιδάκια and παϊδάκια
    to see the prob…

    And the inverse: Try it with s-final (Σ). Should it become, σ or ς ?

    Even it is standalone (and not at the end of a word).
    like: Σ’ ΑΓΑΠΩ

    greek gettext (and based functions like upcasing ones sucks). And that makes ispell, aspell based dictionary checks suck too…

    Comment by evris — November 17, 2005 @ 15:24

  2. Now that i look my published comment i have an urge to make a correction. Change:

    Pure english = ON
    with
    Poor english = DONE

    Comment by evris — November 17, 2005 @ 15:27
