About two-thirds of people online don't speak English. [Building Scalable Web Sites]
Thus, it often makes sense to design one's web-site to reach this audience.
Today, we will look at what's involved in doing this.
Buzz-words: i18n and L10n
Two commonly used terms with regard to making your web-site work in other languages are:
Internationalization -- adding to an application the ability to input, process, and output international text.
Localization -- making a customized application available to a specific locale.
These are often abbreviated as i18n (for the 18 characters between the i and n in internationalization) and L10n (or l10n, for the 10 characters between the L and n in localization).
A locale is a set of localization preferences which can include language, region, time zone, time and date formats, number formats (a period rather than a comma as the decimal separator, and how digits are grouped in long numbers, such as 1,000 versus 1.000), and currency.
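As a quick illustration of how the locale affects formatting, here is a minimal sketch using PHP's intl extension (an assumption: ext/intl must be installed):

<?php
// Minimal sketch of locale-sensitive formatting with PHP's intl extension.
$n = 1234567.89;
$us = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$de = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);
echo $us->format($n), "\n"; // 1,234,567.89
echo $de->format($n), "\n"; // 1.234.567,89
// Dates are locale-sensitive, too.
$fmt = new IntlDateFormatter('fr_FR', IntlDateFormatter::LONG,
    IntlDateFormatter::NONE, 'Europe/Paris');
echo $fmt->format(time()), "\n"; // e.g., 1 juin 2015
?>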
For a web application, the locale that should be used for a given user is often stored in a session variable.
For a web application, i18n often involves making sure your program can operate correctly on different character sets.
Localization often involves making sure that correct translations of web pages appear so as to customize the appearance of the web-site to a given locale.
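As a sketch of the session idea just mentioned (the supported-locale list and the 'en_US' default are made-up values for illustration), a PHP application might negotiate and remember a locale like this:

<?php
// Sketch: pick a locale once and remember it in the session.
// The supported list and the 'en_US' default are illustrative assumptions.
session_start();
$supported = array('en_US', 'fr_FR', 'de_DE');
if (isset($_GET['locale']) && in_array($_GET['locale'], $supported)) {
    $_SESSION['locale'] = $_GET['locale']; // explicit user choice wins
} else if (!isset($_SESSION['locale'])) {
    // Otherwise fall back to the browser's Accept-Language header
    // (locale_accept_from_http() is part of the intl extension).
    $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ?
        $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
    $best = locale_accept_from_http($header);
    $_SESSION['locale'] = $best ? $best : 'en_US';
}
$locale = $_SESSION['locale']; // use this to pick translations, formats, etc.
?>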
Unicode
Unicode is the industry-standard way to deal with texts from multiple languages.
Unicode development started in 1987 as a project of Joe Becker (Xerox), Lee Collins, and Mark Davis (Apple).
The original encoding was 16 bits, the aim being to be able to encode texts in all modern languages.
Currently, there are several standardized Unicode encodings, the most popular of which are UTF-8, UTF-16, and UTF-32.
Glyphs, Character Sets, Encodings
A pattern of pixels on the screen representing an agreed-on letter shape is called a glyph.
For example, the letter "a" might be drawn with more than one glyph: compare the double-story "a" of most print fonts with the single-story form used in many handwriting fonts.
When a sequence of characters is stored digitally, we usually store only a representation of the sequence of characters, not the actual glyphs. We use things like the CSS font-family property to say, for a given character, what glyph to actually draw.
The process of specifying a character involves two mappings:
A character set which is a mapping from abstract "characters" to numbers called code points.
Then we have an encoding which maps these numbers into a binary representation that can be stored in the computer.
For English, one common character set and encoding is ASCII.
In ASCII, the code point for an "a" is 0x61. Its encoding is also the single byte 0x61.
In Unicode, the code point for "a" is likewise 0x61; however, in UTF-16 its encoding would be the two bytes 0x00 0x61.
In Unicode, code points are often written as U+ followed by a sequence of hex digits, as in U+0061.
In UTF-8, the ASCII characters (including the unaccented Roman letters) are encoded as single bytes identical to their ASCII encodings; this backward compatibility is a large part of why UTF-8 is the most popular Unicode encoding.
UTF-32 is a fixed-width encoding -- all code points are encoded with the same number of bits.
UTF-8 and UTF-16 are variable-width encodings -- some characters are encoded with more bits than others.
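To see the difference between a code point and its encodings concretely, here is a small sketch (assuming PHP's mbstring extension and a UTF-8 source file):

<?php
// Sketch: one code point, several encodings. bin2hex() shows the raw bytes.
$a = "a"; // U+0061
echo bin2hex($a), "\n"; // 61 -- 1 byte in UTF-8 (same as ASCII)
echo bin2hex(mb_convert_encoding($a, 'UTF-16BE', 'UTF-8')), "\n"; // 0061
echo bin2hex(mb_convert_encoding($a, 'UTF-32BE', 'UTF-8')), "\n"; // 00000061
$atilde = "ã"; // U+00E3
echo bin2hex($atilde), "\n"; // c3a3 -- 2 bytes in UTF-8
?>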
More on Code Points, Characters, Glyphs, and Graphemes
A character in Unicode does not necessarily map to what a person thinks of as a character.
For instance, ã can be represented as one or two code points: either the precomposed U+00E3, or an "a" (U+0061) followed by a combining tilde (U+0303).
Either way, the single user-perceived character is called a grapheme.
You can also have ligatures for some pairs of characters. For example, fi is a single code point, U+FB01, but represents two characters -- f and i.
Where this causes headaches is in measuring the length of a Unicode string.
You can't just count bytes or code points -- you have to understand what kind of code point each one is (every code point belongs to a general category, such as Lu (Letter, uppercase) or Lm (Letter, modifier)) and whether it combines with its neighbors.
In general, if you are working with Unicode, you want to use your language's facilities for working with Unicode to get these kinds of computations to happen magically, as in the sketch below.
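For example, in PHP one can get byte, code point, and grapheme counts for the two-code-point ã from above (a sketch assuming the mbstring and intl extensions):

<?php
// Sketch: three different "lengths" for the same user-perceived character.
$s = "a\xCC\x83"; // "a" (U+0061) followed by a combining tilde (U+0303)
echo strlen($s), "\n"; // 3 -- bytes (U+0303 takes 2 bytes in UTF-8)
echo mb_strlen($s, 'UTF-8'), "\n"; // 2 -- code points
echo grapheme_strlen($s), "\n"; // 1 -- graphemes (what a user would count)
?>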
Byte-order Mark
A byte order mark (BOM) is a sequence of bytes at the beginning of a Unicode stream used to signal the encoding and its byte order.
For example,
UTF-16 big endian: FE FF
UTF-16 little endian: FF FE
UTF-32 big endian: 00 00 FE FF
UTF-32 little endian: FF FE 00 00
UTF-8 (byte order is irrelevant): EF BB BF
Some document editors store Unicode with this BOM; however, the BOM can also confuse browsers, so it should be stripped out and avoided for HTML, XML, and PHP documents.
You should instead use the Content-Type header to specify the character encoding.
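As a sketch of cleaning up such files in PHP (the file name below is a placeholder), one could strip a UTF-8 BOM like this:

<?php
// Sketch: detect and remove a UTF-8 BOM (EF BB BF) from a string.
function strip_utf8_bom($text)
{
    if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
        return substr($text, 3); // drop the three BOM bytes
    }
    return $text;
}
$clean = strip_utf8_bom(file_get_contents("some_page.html")); // name assumed
?>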
UTF-8 Encoding
So how does UTF-8 work?
It uses the lead bits of each byte to signal how many bytes make up the encoding of a code point.
For example,
A 1-byte code point is signaled by the lead bit of the byte being 0.
A 2-byte code point is signaled by the lead bits of the first byte being 110 and the lead bits of the next byte being 10.
Similar techniques are used for code points that take more bytes: every continuation byte begins with 10, and the current standard allows sequences of at most 4 bytes.
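To make these bit patterns concrete, here is a sketch of encoding a code point by hand in PHP (not how you would normally do it -- mbstring handles this for you):

<?php
// Sketch: encode a code point as UTF-8 following the patterns above.
function codepoint_to_utf8($cp)
{
    if ($cp <= 0x7F) { // 0xxxxxxx
        return chr($cp);
    } else if ($cp <= 0x7FF) { // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    } else if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) .
            chr(0x80 | ($cp & 0x3F));
    } // 11110xxx followed by three 10xxxxxx continuation bytes
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F)) .
        chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}
echo bin2hex(codepoint_to_utf8(0xE3)), "\n"; // c3a3, the UTF-8 bytes of ã
?>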
Specifying and Handling UTF-8
In HTTP to say that a file is UTF-8, you can use the header:
Content-Type: text/html; charset=utf-8
In Apache you can make it so this header is automatically served for all files of a given extension by using a line like:
AddCharset UTF-8 .php
For static HTML where you don't have control of the server, you can still signal the browser about the charset using a meta tag in the document head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
When using string manipulation functions in PHP, if you want them to work with UTF-8, you need to use the multi-byte version of the given function.
For example, use mb_substr() rather than substr().
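Putting these together, here is a sketch of serving UTF-8 from PHP and slicing a UTF-8 string safely (assumes the mbstring extension):

<?php
// Sketch: send the UTF-8 Content-Type header, then compare byte-oriented
// substr() with character-oriented mb_substr().
header('Content-Type: text/html; charset=utf-8');
$s = "héllo"; // é is the 2-byte sequence C3 A9 in UTF-8
echo substr($s, 0, 2), "\n"; // "h" plus half of é -- a broken byte sequence
echo mb_substr($s, 0, 2, 'UTF-8'), "\n"; // "hé" -- counts characters
?>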
MySQL has supported UTF-8 since version 4.1; for older versions you would need to use BLOB or BINARY column types rather than CHAR, VARCHAR, or TEXT.
MySQL's utf8 character set only supports UTF-8 sequences of up to 3 bytes; versions 5.5 and later offer utf8mb4 for the full 4-byte range.
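On the PHP side, it is also worth making sure the database connection itself talks UTF-8; a sketch with mysqli (the host, user, password, and database names are placeholders):

<?php
// Sketch: ask MySQL to exchange data with PHP in UTF-8.
$db = new mysqli('localhost', 'some_user', 'some_password', 'some_db');
// MySQL's 'utf8' tops out at 3-byte sequences; on MySQL 5.5+ prefer
// 'utf8mb4' to cover the full 4-byte range of UTF-8.
$db->set_charset('utf8');
?>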
For JavaScript, do not use escape(), which does not handle UTF-8 correctly; use encodeURIComponent() instead, which percent-encodes a string as UTF-8. To inspect a string, String.prototype.charCodeAt() (there is no getCodeAt()) returns the UTF-16 code unit at a given position.