Stopping, Character n-grams, non-English Languages
CS267
Chris Pollett
Oct. 1, 2012
Outline
Finish up Stemming
Stopping
Quiz
Character n-grams
non-English Languages
Finishing up Stemming
Before my trip, we had started to talk about stemming as something we might want to do during the process
of finding the terms to index in a document.
We mentioned that for English, one of the most famous stemmers is called the Porter stemmer due to Martin Porter.
The above gives an example of text stemmed with this stemmer.
As you can see, it operates only on suffixes and tends to be slightly aggressive. For instance, it would stem "orienteering" and
"orienteers" both down to "orient", conflating the word orient with these terms.
The stemmer also doesn't handle irregular forms well. For example, "ran" stems to "ran" rather than "run". "mouse" stems to "mous"; whereas, "mice" stems to "mice".
The Porter Stemmer operates by applying a sequence of rewrites against the current suffix stem. For example,
sses -> ss
ies -> i
ss -> ss
s ->
All the rules are applied in sequence; if a rule causes a change, the rules are reapplied until no further change occurs. The result is the stemmed term.
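The suffix rules above can be sketched as follows. This is a minimal illustration of the rewrite idea, not the full Porter stemmer (which has several steps and measure conditions):

```python
# Suffix-rewrite rules in the style of the example above.
# Rules are tried in order; the first matching suffix fires.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_step(term):
    """Apply the first matching suffix rewrite to a term."""
    for suffix, replacement in RULES:
        if term.endswith(suffix):
            return term[:len(term) - len(suffix)] + replacement
    return term
```

For example, apply_step("caresses") yields "caress", while apply_step("ponies") yields "poni".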
As a rule of thumb, stemming tends to improve recall, but reduce precision.
Stopping
Function words are words that have no well-defined meanings in and of themselves; rather, they modify words or indicate grammatical relationships.
In English, these include words which are prepositions, articles, pronouns, and conjunctions.
Function words are usually among the most frequently occurring words in any language.
When documents are viewed as unstructured "bags of words", as in the vector space model, the inclusion of function words may be unnecessary.
Even in the proximity model, the close proximity of a particular function word to other query terms may convey very little information.
For these reasons, many IR systems define stopwords, which are stripped from the query before doing an index look-up. These stopwords often include function words.
Stopwords might also include single letters, digits, and other common terms, such as state-of-being verbs.
Eliminating these terms from the index also has the side benefit that it will often speed up query response times as the index size is reduced.
Unfortunately, eliminating these stopwords can make queries such as "to be or not to be" impossible to answer.
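Stopword stripping can be sketched as a simple filter over query terms. The stopword list here is a tiny illustrative sample, not a real production list:

```python
# Sketch: strip stopwords from a query before index lookup.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "in"}

def strip_stopwords(query):
    """Return the query terms that survive stopword elimination."""
    return [t for t in query.lower().split() if t not in STOPWORDS]
```

Note that strip_stopwords("to be or not to be") returns the empty list, which is exactly the problem mentioned above.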
Quiz
Which of the following is true?
For our Boolean document retrieval algorithms the notion of candidate solution and an actual solution are the same.
Query response time is the average number of queries processed in a given amount of time.
Lemmatization is roughly the process whereby a term is reduced to a word in the sense of a dictionary entry.
Characters
Tokenizing raw text requires an understanding of the characters it encodes.
So far we have assumed we were using ASCII, which is adequate for English, but is inadequate for most of the world's languages.
On the web one might experience documents in a variety of encodings.
For particular languages, or regions where a language is used, there might be multiple competing encodings.
Even for English, pre-Unicode, IBM's format EBCDIC was a strong competitor of ASCII.
Since the early 1990s, Unicode has been available and can be used to express text from virtually any well-known human language.
A first step then in processing text is often to convert to Unicode if the text isn't already in Unicode.
Understanding Unicode
Unicode assigns a unique value, called a codepoint, to each character, but does not specify how these values
are represented as raw text.
A codepoint is written in the form U+nnnn where nnnn indicates the value of the codepoint in hexadecimal. For example β is represented by the codepoint U+03B2.
UTF-8 is the most popular way to represent these codepoints as raw text. It is backward compatible with ASCII.
UTF-8 represents each codepoint with one to four bytes.
Each character in ASCII is encoded as a single byte in UTF-8 with the same value.
The high-order bits of the first byte of a UTF-8 character indicate its length: 0 indicates one byte, 110 indicates two bytes, 1110 indicates three bytes, and 11110 indicates four bytes.
In a multi-byte character, the leading byte begins with 11 and every continuation byte begins with 10, so the start of each character can always be recognized.
For example, the character for eight 八 (bā ) in Chinese is U+516B, in binary 01010001 01101011, which in UTF-8 is encoded as 11100101 10000101 10101011.
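We can check this encoding directly, since Python's str.encode produces the UTF-8 bytes of a codepoint:

```python
# Verify the UTF-8 encoding of U+516B (the character for eight).
ba = "\u516b"
byte_patterns = [format(b, "08b") for b in ba.encode("utf-8")]
print(byte_patterns)
# prints ['11100101', '10000101', '10101011']
```

The leading byte begins with 1110 (a three-byte character) and the two continuation bytes begin with 10, as described above.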
Character n-grams
So far the method we have described for translating a sequence of characters into a sequence of tokens for indexing is language specific.
I.e., normalization and whether and how to stem is language specific.
Wouldn't it be great if there were a generic approach that worked for any language?
That is the idea which motivates using character n-grams.
In the character n-gram approach, we treat any overlapping sequence of n characters as a token.
For example, if n=5, the word "orienteering" would be split into the following 5-grams:
.orie orien rient iente entee nteer teeri eerin ering ring.
A word of three or fewer letters is turned into a single 5-gram by putting a dot at both ends: .the.
Using n-grams tends to make the index larger and therefore query response time slower.
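The 5-gram example above can be reproduced with a short sketch, using "." to mark word boundaries as in the slides:

```python
def char_ngrams(word, n=5):
    """Overlapping character n-grams, with '.' marking word boundaries."""
    padded = "." + word + "."
    if len(padded) <= n:
        return [padded]  # short words become a single n-gram, e.g. ".the."
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For instance, char_ngrams("orienteering") produces the ten 5-grams listed above, and char_ngrams("the") produces [".the."].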
European Languages
European languages fall into several distinct categories
For example, French, Spanish, Italian, are Romance languages; Dutch and German are Germanic languages; Russian and Polish are Slavic languages.
Finnish and Hungarian belong to a fourth family, the Finno-Ugric (Uralic) languages; Irish and Scottish Gaelic represent a fifth, the Celtic languages.
Within a group, the rules by which one writes down the language, the orthography, are similar.
Each group uses an alphabet with upper and lower-case letters.
Punctuation provides structure.
Unlike English, where diacritics appear only in a few borrowed words like naïve, most of these languages make regular use of diacritical marks on characters.
These marks are often omitted in queries to the search engine even though the text being searched for has them.
Often as part of processing, these marks need either to be removed or the term needs to be double indexed.
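Removing diacritical marks can be sketched with Unicode normalization: decompose each character into a base character plus combining marks, then drop the marks. This is one common approach, not the only one:

```python
import unicodedata

def strip_diacritics(term):
    """Drop combining marks after NFD (canonical) decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

For example, strip_diacritics("naïve") yields "naive" and strip_diacritics("thé") yields "the", which illustrates the cross-language stopword collision mentioned below.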
Some languages such as German and Dutch allow one to dynamically create compound words. For example, fietswiel for "bicycle wheel".
For these languages a segmenter might need to be used to try to split these into base words.
If you are indexing multiple languages, you also have the issue that stopwords for one language might be less common in another: for example, thé in French might get indexed as the and dropped.
There are stemmers for most of these languages.
CJK(V) Languages
Chinese, Japanese, Korean, and old Vietnamese are called CJK(V) languages.
These languages tend to share orthographic conventions, deriving from their common history, even though they are not members
of a single language family.
A typical Chinese newspaper contains thousands of distinct characters.
In Chinese and Japanese, words are not often separated by spaces, so segmentation is more important for these languages.
Japanese uses three main scripts: two syllabaries and Chinese characters (Kanji).
In China, there are simplified and traditional forms of most characters.
A given Chinese character may be split into radicals (of which there are about 214), where these radicals convey either partial meanings for the character or how the character traditionally sounded.
Although there are some single-character function words in Chinese, most words are two or more characters, so a 2-gram approach works well for Chinese.
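Since Python strings iterate by codepoint, the same overlapping n-gram idea applies directly to Chinese text with n=2. A minimal sketch, using the word 搜索引擎 ("search engine") as the example:

```python
def bigrams(text):
    """Overlapping character 2-grams of a string of codepoints."""
    return [text[i:i + 2] for i in range(len(text) - 1)]
```

Here bigrams("搜索引擎") yields ["搜索", "索引", "引擎"], two of which are real words ("search" and "engine").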
Repeating the same character twice is common in Chinese and can alter the intended meaning. This can confuse bag-of-words models.
Each of these languages has one or more standard conventions for transliterating them into a Latin alphabet.
Since these transliterations may be entered into a search engine, conversion back to the original characters might need to be done.