Outline
- Unicode
- Char-grams
- Language-Specific Processing
- In-Class Exercise
Characters
- Tokenizing raw text requires an understanding of the characters it encodes.
- So far we have assumed we were using ASCII, which is adequate for English, but is inadequate for most of the world's languages.
- On the web one might encounter documents in a variety of encodings.
- For a particular language, or for the different regions in which a language is used, there may be multiple competing encodings.
- Even for English, pre-Unicode, IBM's EBCDIC format was a strong competitor to ASCII.
- Since the early 1990s, Unicode has been available and can be used to express text from virtually any well-known human language.
- A first step in processing text is therefore often to convert it to Unicode if it isn't in Unicode already.
Understanding Unicode
- Unicode assigns a unique value, called a codepoint, to each character, but does not specify how these values are represented as raw text.
- A codepoint is written in the form U+nnnn where nnnn indicates the value of the codepoint in hexadecimal. For example β is represented by the codepoint U+03B2.
- UTF-8 is the most popular way to represent these codepoints as raw text. It is backward compatible with ASCII.
- UTF-8 represents each codepoint with one to four bytes.
- Each character in ASCII is encoded as a single byte in UTF-8 with the same value.
- The high-order bits of the first byte of a UTF-8 character indicate its length: a leading 0 indicates one byte, 110 indicates two bytes, 1110 indicates three bytes, and 11110 indicates four bytes.
- In a multi-byte character, each byte signals whether it starts an encoding: the first byte begins with 11, while every continuation byte begins with 10.
- For example, the character for eight in Chinese, 八 (bā), is U+516B, in binary 01010001 01101011, which UTF-8 encodes as 11100101 10000101 10101011.
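- To make these byte rules concrete, below is a short PHP sketch (PHP, since the exercise later in these notes uses PHP) that hand-packs a codepoint into UTF-8 bytes; in practice PHP's built-in mb_chr() does this for you.

    <?php
    // Pack a Unicode codepoint into UTF-8 bytes using the length rules above.
    function toUtf8(int $cp): string
    {
        if ($cp < 0x80) {          // 0xxxxxxx (ASCII, one byte)
            return chr($cp);
        } elseif ($cp < 0x800) {   // 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp < 0x10000) { // 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        }                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
            . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    echo bin2hex(toUtf8(0x516B)), "\n"; // prints e585ab, the bytes for 八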
Character n-grams
- So far the method we have described for translating a sequence of characters into a sequence of tokens for indexing is language specific.
- That is, normalization and whether and how to stem are language specific.
- Wouldn't it be great if there were a generic approach that worked for any language?
- That is the idea that motivates using character n-grams.
- With character n-grams, we treat all overlapping sequences of n characters as tokens (a code sketch follows this list).
- For example, if n=5, the word "orienteering" would be split into the following 5-grams:
.orie orien rient iente entee nteer teeri eerin ering ring.
- A word of three or fewer letters is turned into 5-grams by putting a dot at both ends: .the.
- Using n-grams tends to make the index larger and therefore query response times slower.
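- The following PHP sketch implements this tokenization: pad the word with boundary dots, then slide a window of width n across it.

    <?php
    // Split a word into overlapping character n-grams, dot-padded at the ends.
    function charNgrams(string $word, int $n = 5): array
    {
        $padded = "." . $word . ".";
        $len = mb_strlen($padded, 'UTF-8');
        $grams = [];
        for ($i = 0; $i + $n <= $len; $i++) {
            $grams[] = mb_substr($padded, $i, $n, 'UTF-8');
        }
        // Very short words produce no full window; keep the padded word itself.
        return $grams ?: [$padded];
    }
    print_r(charNgrams("orienteering"));
    // .orie orien rient iente entee nteer teeri eerin ering ring.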
European Languages
- European languages fall into several distinct categories
- For example, French, Spanish, and Italian are Romance languages; Dutch and German are Germanic languages; Russian and Polish are Slavic languages.
- Finnish and Hungarian belong to a fourth group, the Uralic family; Irish and Scottish Gaelic, which are Celtic languages, represent a fifth.
- Within a group, the rules by which one writes down the language, the orthography, are similar.
- Each group uses an alphabet with upper and lower-case letters.
- Punctuation provides structure.
- Unlike English, where they appear only in borrowed words like naïve, most of these languages make regular use of diacritical marks on characters.
- These marks are often omitted in queries to the search engine even though the text being searched for has them.
- Often, as part of processing, these marks either need to be removed or the term needs to be double indexed, with and without the marks (a sketch of removal follows this list).
- Some languages such as German and Dutch allow one to dynamically create compound words. For example, fietswiel for "bicycle wheel".
- For these languages, a segmenter might be needed to try to split such compounds into base words.
- If you are indexing multiple languages, you also have the issue that a content word in one language might look like a stopword in another: for example, French thé (tea) might get indexed as the and dropped.
- There are stemmers for most of these languages.
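- As a sketch of diacritic removal, the PHP below uses the intl extension's Normalizer (an assumption: your PHP build has intl enabled) to decompose each character, strip the combining marks, and recompose; an index could then store a term both with and without its marks.

    <?php
    // Fold diacritics: NFD decomposition, drop combining marks (\p{Mn}), recompose.
    function foldDiacritics(string $term): string
    {
        $decomposed = Normalizer::normalize($term, Normalizer::FORM_D);
        $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
        return Normalizer::normalize($stripped, Normalizer::FORM_C);
    }
    $term = "naïve";
    print_r(array_unique([$term, foldDiacritics($term)]));
    // [naïve, naive] -- double index so queries with or without marks match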
CJK(V) Languages
- Chinese, Japanese, Korean, and (historically) Vietnamese are called the CJK(V) languages.
- These languages tend to share orthographic conventions, deriving from their common history, even though they are not members of a single language family.
- A typical Chinese newspaper contains thousands of distinct characters.
- In Chinese and Japanese, words are typically not separated by spaces, so segmentation is more important for these languages.
- Japanese uses three main scripts: two syllabaries (hiragana and katakana) and Chinese characters (kanji).
- There are simplified and traditional forms of many Chinese characters.
- A Chinese character can be broken into components, among them one of about 214 radicals; these components convey either partial meanings for the character or how the character traditionally sounded.
- Although there are some single-character function words in Chinese, most words are two or more characters, so a 2-gram approach works well for Chinese (a sketch follows this list).
- Repeating the same character twice is common in Chinese and can alter the intended meaning; this can confuse bag-of-words models.
- Each of these languages has one or more standard conventions for transliterating it into the Latin alphabet (for example, pinyin for Chinese).
- Since such romanizations may be typed into a search engine, conversion back to characters might be needed.
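- Here is a PHP sketch of the bigram approach for Chinese mentioned above (mb_str_split requires PHP 7.4 or later).

    <?php
    // Tokenize a Chinese string into overlapping character 2-grams.
    function chineseBigrams(string $text): array
    {
        $chars = mb_str_split($text, 1, 'UTF-8');
        $grams = [];
        for ($i = 0; $i + 2 <= count($chars); $i++) {
            $grams[] = $chars[$i] . $chars[$i + 1];
        }
        return $grams;
    }
    print_r(chineseBigrams("信息检索")); // "information retrieval": 信息 息检 检索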
In-Class Exercise
- Install composer.
- Create a composer project.
- Write a short program, test.php, in your composer folder that uses the hyphenateEntities($phrase, $locale) static method in PhraseParser to determine the entities in the phrase: "For many years after their introduction, vacuum cleaners remained a luxury item, but after the Second World War they became common among the middle classes". Your program should then print these entities.
- Your program should then run the result through a stemmer and output what the stemmer produces (a starting-point sketch appears below).
- Post your solution to the Mar 10 In-class Exercise Thread.
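- Below is a starting-point sketch for test.php. The composer package name, the namespace, the locale tag, and the stemming call are all assumptions to verify against the course materials; only the hyphenateEntities($phrase, $locale) static method is named in the exercise above.

    <?php
    // Sketch only: the package/namespace, the "en-US" locale string, and
    // stemTerms (a hypothetical stand-in for the actual stemming call)
    // should be checked against the Yioop/PhraseParser version used in class.
    require_once "vendor/autoload.php";

    use seekquarry\yioop\library\PhraseParser;

    $phrase = "For many years after their introduction, vacuum cleaners " .
        "remained a luxury item, but after the Second World War they " .
        "became common among the middle classes";
    $entities = PhraseParser::hyphenateEntities($phrase, "en-US");
    print_r($entities); // print the entities found in the phrase

    $stemmed = PhraseParser::stemTerms(implode(" ", (array)$entities), "en-US");
    print_r($stemmed);  // print the stemmed result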