Outline
- Unicode
- Char-grams
- Language-Specific Processing
- In-Class Exercise
Characters
- Tokenizing raw text requires an understanding of the characters it encodes.
- So far we have assumed we were using ASCII, which is adequate for English, but is inadequate for most of the world's languages.
- On the web one might encounter documents in a variety of encodings.
- For a particular language, or for the different regions in which a language is used, there may be multiple competing encodings.
- Even for English, pre-Unicode, IBM's EBCDIC format was a strong competitor to ASCII.
- Since the early 1990s, Unicode has been available and can be used to express text from virtually any well-known human language.
- A first step in processing text is therefore often to convert it to Unicode if it isn't in Unicode already.
Understanding Unicode
- Unicode assigns a unique value, called a codepoint, to each character, but does not specify how these values are represented as raw text.
- A codepoint is written in the form U+nnnn where nnnn indicates the value of the codepoint in hexadecimal. For example β is represented by the codepoint U+03B2.
- UTF-8 is the most popular way to represent these codepoints as raw text. It is backward compatible with ASCII.
- UTF-8 represents each codepoint with one to four bytes.
- Each character in ASCII is encoded as a single byte in UTF-8 with the same value.
- The high-order bits of the first byte of a UTF-8 character indicate its length: a leading 0 indicates one byte, 110 indicates two bytes, 1110 indicates three bytes, and 11110 indicates four bytes.
- In a multi-byte character, each byte signals whether it starts an encoding: the first byte begins with 11, while every continuation byte begins with 10.
- For example, the character for eight in Chinese, 八 (bā), is U+516B, in binary 01010001 01101011, which UTF-8 encodes as 11100101 10000101 10101011.
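- To make these byte rules concrete, below is a short PHP sketch (PHP, since the exercise later in these notes uses PHP) that hand-packs a codepoint into UTF-8 bytes; in practice PHP's built-in mb_chr() does this for you.

    <?php
    // Pack a Unicode codepoint into UTF-8 bytes using the length rules above.
    function toUtf8(int $cp): string
    {
        if ($cp < 0x80) {          // 0xxxxxxx (ASCII, one byte)
            return chr($cp);
        } elseif ($cp < 0x800) {   // 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp < 0x10000) { // 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        }                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
            . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    echo bin2hex(toUtf8(0x516B)), "\n"; // prints e585ab, the bytes for 八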
Character n-grams
- So far the method we have described for translating a sequence of characters into a sequence of tokens for indexing is language specific.
- That is, normalization and whether and how to stem are language specific.
- Wouldn't it be great if there were a generic approach that worked for any language?
- That is the idea that motivates using character n-grams.
- With character n-grams, we treat all overlapping sequences of n characters as tokens (a code sketch follows this list).
- For example, if n=5, the word "orienteering" would be split into the following 5-grams:
.orie orien rient iente entee nteer teeri eerin ering ring.
- A word of three or fewer letters is turned into 5-grams by putting a dot at both ends: .the.
- Using n-grams tends to make the index larger and therefore query response times slower.
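- The following PHP sketch implements this tokenization: pad the word with boundary dots, then slide a window of width n across it.

    <?php
    // Split a word into overlapping character n-grams, dot-padded at the ends.
    function charNgrams(string $word, int $n = 5): array
    {
        $padded = "." . $word . ".";
        $len = mb_strlen($padded, 'UTF-8');
        $grams = [];
        for ($i = 0; $i + $n <= $len; $i++) {
            $grams[] = mb_substr($padded, $i, $n, 'UTF-8');
        }
        // Very short words produce no full window; keep the padded word itself.
        return $grams ?: [$padded];
    }
    print_r(charNgrams("orienteering"));
    // .orie orien rient iente entee nteer teeri eerin ering ring.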
European Languages
- European languages fall into several distinct categories
- For example, French, Spanish, and Italian are Romance languages; Dutch and German are Germanic languages; Russian and Polish are Slavic languages.
- Finnish and Hungarian belong to a fourth group, the Uralic family; Irish and Scottish Gaelic, which are Celtic languages, represent a fifth.
- Within a group, the rules by which one writes down the language, the orthography, are similar.
- Each group uses an alphabet with upper and lower-case letters.
- Punctuation provides structure.
- Unlike English, where they appear only in borrowed words like naïve, most of these languages make regular use of diacritical marks on characters.
- These marks are often omitted in queries to the search engine even though the text being searched for has them.
- Often, as part of processing, these marks either need to be removed or the term needs to be double indexed, with and without the marks (a sketch of removal follows this list).
- Some languages such as German and Dutch allow one to dynamically create compound words. For example, fietswiel for "bicycle wheel".
- For these languages, a segmenter might be needed to try to split such compounds into base words.
- If you are indexing multiple languages, you also have the issue that a content word in one language might look like a stopword in another: for example, French thé (tea) might get indexed as the and dropped.
- There are stemmers for most of these languages.
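- As a sketch of diacritic removal, the PHP below uses the intl extension's Normalizer (an assumption: your PHP build has intl enabled) to decompose each character, strip the combining marks, and recompose; an index could then store a term both with and without its marks.

    <?php
    // Fold diacritics: NFD decomposition, drop combining marks (\p{Mn}), recompose.
    function foldDiacritics(string $term): string
    {
        $decomposed = Normalizer::normalize($term, Normalizer::FORM_D);
        $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
        return Normalizer::normalize($stripped, Normalizer::FORM_C);
    }
    $term = "naïve";
    print_r(array_unique([$term, foldDiacritics($term)]));
    // [naïve, naive] -- double index so queries with or without marks match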
CJK(V) Languages
- Chinese, Japanese, Korean, and (historically) Vietnamese are called the CJK(V) languages.
- These languages tend to share orthographic conventions, deriving from their common history, even though they are not members of a single language family.
- A typical Chinese newspaper contains thousands of distinct characters.
- In Chinese and Japanese, words are typically not separated by spaces, so segmentation is more important for these languages.
- Japanese uses three main scripts: two syllabaries (hiragana and katakana) and Chinese characters (kanji).
- There are simplified and traditional forms of many Chinese characters.
- A Chinese character can be broken into components, among them one of about 214 radicals; these components convey either partial meanings for the character or how the character traditionally sounded.
- Although there are some single-character function words in Chinese, most words are two or more characters, so a 2-gram approach works well for Chinese (a sketch follows this list).
- Repeating the same character twice is common in Chinese and can alter the intended meaning; this can confuse bag-of-words models.
- Each of these languages has one or more standard conventions for transliterating it into the Latin alphabet (for example, pinyin for Chinese).
- Since such romanizations may be typed into a search engine, conversion back to characters might be needed.
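- Here is a PHP sketch of the bigram approach for Chinese mentioned above (mb_str_split requires PHP 7.4 or later).

    <?php
    // Tokenize a Chinese string into overlapping character 2-grams.
    function chineseBigrams(string $text): array
    {
        $chars = mb_str_split($text, 1, 'UTF-8');
        $grams = [];
        for ($i = 0; $i + 2 <= count($chars); $i++) {
            $grams[] = $chars[$i] . $chars[$i + 1];
        }
        return $grams;
    }
    print_r(chineseBigrams("信息检索")); // "information retrieval": 信息 息检 检索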
In-Class Exercise
- Install composer.
- Create a composer project.
- Write a short program, test.php, in your composer folder that uses the hyphenateEntities($phrase, $locale) static method in PhraseParser to determine the entities in the phrase: "For many years after their introduction, vacuum cleaners remained a luxury item, but after the Second World War they became common among the middle classes". Your program should then print these entities.
- Your program should then run the result through a stemmer and output what the stemmer produces (a starting-point sketch appears below).
- Post your solution to the Mar 10 In-class Exercise Thread.
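- Below is a starting-point sketch for test.php. The composer package name, the namespace, the locale tag, and the stemming call are all assumptions to verify against the course materials; only the hyphenateEntities($phrase, $locale) static method is named in the exercise above.

    <?php
    // Sketch only: the package/namespace, the "en-US" locale string, and
    // stemTerms (a hypothetical stand-in for the actual stemming call)
    // should be checked against the Yioop/PhraseParser version used in class.
    require_once "vendor/autoload.php";

    use seekquarry\yioop\library\PhraseParser;

    $phrase = "For many years after their introduction, vacuum cleaners " .
        "remained a luxury item, but after the Second World War they " .
        "became common among the middle classes";
    $entities = PhraseParser::hyphenateEntities($phrase, "en-US");
    print_r($entities); // print the entities found in the phrase

    $stemmed = PhraseParser::stemTerms(implode(" ", (array)$entities), "en-US");
    print_r($stemmed);  // print the stemmed result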