The New Horror of L10n and i18n - for web designers (maybe also for web developers)

So, now one can have domain names written in a language and alphabet with characters other than Latin (Latin, not English, as many would say).

But the question is: how do you manually write an HTML document in a language and alphabet other than your faithful Latin?

The basics

First, you need to write an HTML document. You usually do that in a text editor. For characters from another language you need L10n: localization of your keyboard.

Then you open a simple text editor, not a rich text editor or a word processor. You type using the localized keyboard layout, guessing at first where the keys have moved (unless you already have a localized hardware keyboard).

You finish writing your HTML document and save it. You open it again and, maybe, instead of the localized characters you were expecting, you now have question marks (annoying, isn’t it?). What went wrong?

You also need i18n: internationalization. That is, you have to choose a character set (an encoding) that has in its “list” of characters the ones you want.

The bigger story

When you type at your keyboard, the stream of characters you see on display is, in fact, a stream of bytes for your computer. When this stream of bytes is saved to a file, it is stored using an encoding for a character set. Usually, if you don’t choose one when saving or while typing, that encoding is ANSI (the name Windows editors give to the system’s default 8-bit code page, such as Windows-1252). That means that even if you got a localized character to display (by typing it), it may not be saved properly, because it’s not in the “list” of the character set used for saving it (it’s not the right street; what you’re looking for does not “live” there).
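Here is a small Python sketch of those question marks. It assumes that "ANSI" on your machine means Windows-1252 (it could be another code page), and the sample text is just something I made up:

```python
# Hypothetical sample text: three Greek letters.
text = "αβγ"

# "Save" as ANSI, assumed here to be Windows-1252.
# errors="replace" mimics an editor that silently swaps in "?"
# for every character missing from the code page.
saved_as_ansi = text.encode("cp1252", errors="replace")
reopened_ansi = saved_as_ansi.decode("cp1252")

# "Save" the same text as UTF-8 instead.
saved_as_utf8 = text.encode("utf-8")
reopened_utf8 = saved_as_utf8.decode("utf-8")

print(reopened_ansi)  # the Greek letters are gone
print(reopened_utf8)  # the Greek letters survive
```

The point is that the damage happens at save time: once the question marks are in the file, no choice of encoding on reopening brings the letters back.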

What is ANSI? It’s ASCII’s bigger, but still little, brother. What is ASCII? A set of 128 symbols: 0-9, a-z, A-Z, and some other special symbols like the colon, the comma etc. (like numbered houses on a specific street, with different families living in them). ASCII is in fact ASCIIs, as there are different variants of it, often loosely called extensions (that means Character Town has more than one street called ASCII). What does this mean? Think of different tables of 128 characters that have a subset of characters in common (0-9, a-z, A-Z etc.) (like many IDENTICAL twin families living on different streets), but that differ in some places, having “written” in the rest of the “list” characters specific to other languages’ alphabets (like those twin families each having different neighbours). Why 128? It started out as a seven-bit code for teleprinters, and seven bits give 2^7 = 128 values.
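If you want to see the 128 houses for yourself, here is a rough Python sketch. The helper name `fits_in_ascii` is mine, made up for illustration:

```python
# All 128 ASCII code points, 0 through 127, decoded into one string.
ascii_table = bytes(range(128)).decode("ascii")


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text has a house on ASCII street."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(len(ascii_table))                   # number of ASCII characters
print(fits_in_ascii("Hello, world: 42"))  # plain Latin letters, digits, punctuation
print(fits_in_ascii("γ"))                 # a Greek letter
```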

Later came ANSI, to take full advantage of the eight-bit byte. So now we have a set, a “list”, of 256 characters. In fact, as with ASCII, we have ANSIs: many code pages (bigger streets now in Character City, more houses, more families). Why? Well, there are too many languages with lots of characters. One set of 256 could not store them all.

Now one asks: why am I limited to 256? Why don’t I write two-byte, or multi-byte, letters or symbols and put them all in one set (one big populated highway, with houses that can have more than one family in them)? OK, but wait: how do I tell where the bytes of one character end and the next begin (remember, symbols are in fact bytes for my computer)? Well, let’s make rules for how to turn streams of characters into streams of bytes and vice versa (like two or more family members who are Siamese twins: they only come together).

And they made it: the set is called the Universal Character Set (UCS) and the rules for it are called Unicode Transformation Formats (UTF), such as UTF-8 and UTF-16.
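To make the "many ANSIs" idea concrete, here is a small Python sketch that reads the very same byte through three Windows code pages. The three I picked (Western European, Greek, Cyrillic) are just examples:

```python
# One byte above the ASCII range, and one byte inside it.
high_byte = bytes([0xE3])
low_byte = bytes([0x41])

# Three of the Windows "ANSI" code pages, chosen only as examples.
code_pages = ["cp1252", "cp1253", "cp1251"]

# Same house number, different family, depending on the street.
high_decoded = {cp: high_byte.decode(cp) for cp in code_pages}

# The shared ASCII part reads the same everywhere.
low_decoded = {cp: low_byte.decode(cp) for cp in code_pages}

for cp in code_pages:
    print(cp, high_decoded[cp], low_decoded[cp])
```

This is why a file saved under one code page shows the wrong letters when opened under another: the bytes are unchanged, only the "list" used to read them is different.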
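For the curious, here is a rough Python sketch of how one of those rules, UTF-8, answers the "where does a character end" question. The helper `count_characters` is a name I made up, and it assumes the bytes are valid UTF-8:

```python
def count_characters(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Continuation bytes always look like 10xxxxxx in binary, so every
    byte that does NOT match that pattern starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["a", "γ", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# The two bytes of gamma in binary: a lead byte (110xxxxx)
# followed by one continuation byte (10xxxxxx).
gamma_bits = [format(b, "08b") for b in "γ".encode("utf-8")]

print(byte_lengths)
print(gamma_bits)
print(count_characters("aγ€".encode("utf-8")))
```

So the siamese family members are marked: the first byte says how many follow, and the followers all carry the same 10 prefix.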

Decisive step

When you write an HTML document that uses characters other than Latin, you’ll probably want to choose the UTF-8 encoding (an almost standard encoding that people have grown fond of) instead of ANSI when saving the file.
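If you generate the file from a script rather than an editor, the same choice looks roughly like this in Python. The file name, the page content and the temporary folder are all made up for the sketch; the `<meta charset="utf-8">` line is there so the browser is told which encoding was used:

```python
import os
import tempfile

# A tiny hypothetical page with Greek text in it.
html_page = """<!DOCTYPE html>
<html lang="el">
<head>
<meta charset="utf-8">
<title>γ</title>
</head>
<body><p>αβγ</p></body>
</html>
"""

# Write into a throwaway directory; the file name is an assumption.
path = os.path.join(tempfile.mkdtemp(), "page.html")

# Name the encoding explicitly instead of trusting the system default.
with open(path, "w", encoding="utf-8") as f:
    f.write(html_page)

# Read the raw bytes, and then the decoded text, back in.
with open(path, "rb") as f:
    raw = f.read()
with open(path, encoding="utf-8") as f:
    reopened = f.read()

print(reopened == html_page)
```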

Final chapter: &gamma; OR (&#947; OR &#x3B3;) OR γ

When writing HTML documents you can get some help from the user agent. It’s going to do extra work for you if, let’s say, lazy you doesn’t want to localize your keyboard just to write down the Greek letter gamma. You can reference that letter in two ways: a character entity (&gamma;) or a numeric character reference (which can also be written two ways: decimal, &#947;, or hexadecimal, &#x3B3;).

Even though this can be helpful, you can’t rely on character entities and numeric references when writing fully localized HTML documents; you need to localize your keyboard, then learn and use the new layout for producing localized content.
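You can check that the three spellings land on the same letter without opening a browser. This little Python sketch uses the standard library's `html.unescape`, which resolves references much like a user agent would:

```python
import html

# The three ways of referencing gamma: named, decimal, hexadecimal.
references = ["&gamma;", "&#947;", "&#x3B3;"]

# Resolve each reference to the character it stands for.
resolved = [html.unescape(ref) for ref in references]

for ref, char in zip(references, resolved):
    print(ref, "->", char, "U+%04X" % ord(char))
```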

The HORROR :)

One thing I’ve discovered, though: this can make room for funny hybrids. In an article here on SitePoint, specifically about i18n, if you look at the source of the HTML page, you’ll see a mix of techniques for writing and referencing symbols. Although the Arabic characters in the links to domains that use letters other than Latin are written as they should be, with a localized keyboard (or by copy-paste :) ), some ASCII characters, like those in the alt text (that little tooltip some browsers show when you hover over a link) for WEB TECH (beside the author’s name), are written using character entities.

Also, a lot of content has it&#8217;s for it’s, that&#8217;s for that’s, and Ok&#8230; for Ok…
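I don't know how that article was produced, but one plausible way to end up with this kind of markup is a tool that pushes text through an ASCII-only step. A Python sketch of that guess (the helper name `to_ascii_html` is made up):

```python
def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Typographic apostrophe and ellipsis, as a word processor would insert them.
samples = ["it’s", "that’s", "Ok…"]
escaped = [to_ascii_html(s) for s in samples]

for before, after in zip(samples, escaped):
    print(before, "->", after)
```

The curly apostrophe and the ellipsis are not ASCII, so they come out as references, while text that skips such a step keeps its real characters. That would explain the hybrid.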

For me that’s funny. What say you?