Char-gramming, Language Processing, Static Inverted Indices
CS267
Chris Pollett
Sep 25, 2019
Outline
Finish Stemming, Stopping
Char-grams
Language Specific Processing
In-Class Exercise
Index Components
The Dictionary
Posting Lists
Finishing up Stemming
We now return from our discussion of Yioop to our earlier discussion of preprocessing before building an inverted index.
We have already talked about stemming as something we might want to do during the process
of finding the terms to index in a document.
We mentioned that for English, one of the most famous stemmers is called the Porter stemmer due to Martin Porter.
As an example of its behavior, it operates only on suffixes and tends to be slightly aggressive. For instance, it stems both "orienteering" and
"orienteers" down to "orient", conflating the word orient with these terms.
The stemmer also doesn't handle irregular forms well. For example, "ran" stems to "ran" rather than "run", and "mouse" stems to "mous", whereas "mice" stems to "mice".
The Porter Stemmer operates by applying a sequence of rewrite rules to the suffix of the current term. For example, its Step 1a rules are:
sses -> ss
ies -> i
ss -> ss
s ->
The rule groups are applied in sequence; within a group, the rule matching the longest suffix fires. The output after the final group is the stemmed term.
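To make the rule mechanism concrete, here is a minimal PHP sketch (the function name is ours, and it implements only the Step 1a group above, not Porter's full algorithm; str_ends_with requires PHP 8):

    <?php
    // Minimal sketch of one Porter-style rewrite group (Step 1a only).
    // The full Porter stemmer has several more groups, plus conditions
    // on what remains of the word after a suffix is removed.
    function porterStep1a(string $term): string
    {
        // Rules are tried longest-suffix first; at most one fires.
        $rules = [["sses", "ss"], ["ies", "i"], ["ss", "ss"], ["s", ""]];
        foreach ($rules as [$suffix, $replacement]) {
            if (str_ends_with($term, $suffix)) {
                return substr($term, 0, strlen($term) - strlen($suffix))
                    . $replacement;
            }
        }
        return $term;
    }
    echo porterStep1a("caresses"), "\n"; // caress
    echo porterStep1a("ponies"), "\n";   // poni
    echo porterStep1a("cats"), "\n";     // cat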
As a rule of thumb, stemming tends to improve recall, but reduce precision.
Stopping
Function words are words that have no well-defined meanings in and of themselves; rather, they modify words or indicate grammatical relationships.
In English, these include prepositions, articles, pronouns, and conjunctions.
Function words are usually among the most frequently occurring words in any language.
When documents are viewed as unstructured "bags of words", as in the vector space model, the inclusion of function words may be unnecessary.
Even in the proximity model, the close proximity of a particular function word to other query terms may convey very little information.
For these reasons, many IR systems define stopwords, which are stripped from the query before doing an index look-up. These stopwords often include function words.
Stopwords might also include single letters, digits, and other common terms, such as state-of-being verbs.
Eliminating these terms from the index also has the side benefit that it will often speed up query response times as the index size is reduced.
Unfortunately, eliminating these stopwords can make queries such as "to be or not to be" impossible to answer.
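To make stopping concrete, here is a minimal PHP sketch of query-time stopword removal (the ten-word stopword list is illustrative; real lists are much longer):

    <?php
    // Toy stopword list; production lists contain hundreds of terms.
    $stopwords = ["a", "an", "and", "be", "is", "not", "of", "or", "the", "to"];

    function stripStopwords(array $terms, array $stopwords): array
    {
        return array_values(array_diff($terms, $stopwords));
    }

    // The query "to be or not to be" loses every term, illustrating
    // the problem mentioned above.
    print_r(stripStopwords(["to", "be", "or", "not", "to", "be"], $stopwords));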
Characters
Tokenizing raw text requires an understanding of the characters it encodes.
So far we have assumed we were using ASCII, which is adequate for English, but is inadequate for most of the world's languages.
On the web one encounters documents in a variety of encodings.
For a particular language, or for the different regions in which a language is used, there may be multiple competing encodings.
Even for English, pre-Unicode, IBM's EBCDIC format was a strong competitor of ASCII.
Since the early 1990s, Unicode has been available and can be used to express text from virtually any well-known human language.
A first step then in processing text is often to convert to Unicode if the text isn't already in Unicode.
Understanding Unicode
Unicode assigns a unique value, called a codepoint, to each character, but does not specify how these values
are represented as raw text.
A codepoint is written in the form U+nnnn where nnnn indicates the value of the codepoint in hexadecimal. For example β is represented by the codepoint U+03B2.
UTF-8 is the most popular way to represent these codepoints as raw text. It is backward compatible with ASCII.
UTF-8 represents each codepoint with one to four bytes.
Each character in ASCII is encoded as a single byte in UTF-8 with the same value.
The high-order bits of the first byte of a UTF-8 character indicate its length: a leading 0 indicates one byte, 110 two bytes, 1110 three bytes, and 11110 four bytes.
In a multi-byte character, every byte signals whether it starts a character: the first byte begins with 11, while each continuation byte begins with 10.
For example, the character for eight, 八 (bā ), in Chinese is U+516B, in binary 01010001 01101011, which in UTF-8 is encoded as 11100101 10000101 10101011.
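To see where the codepoint bits go, here is a minimal PHP sketch that hand-encodes a codepoint as UTF-8 (the function name is ours, and there are no checks for surrogates or out-of-range values; in practice one would use a library):

    <?php
    // Encode a single Unicode codepoint as a UTF-8 byte string.
    function codepointToUtf8(int $cp): string
    {
        if ($cp < 0x80) {          // 1 byte:  0xxxxxxx
            return chr($cp);
        } elseif ($cp < 0x800) {   // 2 bytes: 110xxxxx 10xxxxxx
            return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        } elseif ($cp < 0x10000) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        }                          // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
            . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }

    echo bin2hex(codepointToUtf8(0x516B)), "\n"; // e585ab, the bytes above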
Character n-grams
So far the method we have described for translating a sequence of characters into a sequence of tokens for indexing is language specific.
That is, normalization, and whether and how to stem, are language specific.
Wouldn't it be great if there were a generic approach that worked for any language?
That is the idea which motivates using character n-grams.
In character n-grams, we treat each overlapping sequence of n characters as a token.
For example, if n=5, the word "orienteering" would be split into the following 5-grams:
.orie orien rient iente entee nteer teeri eerin ering ring.
A word of three or fewer letters yields a single gram by putting a dot at both ends: .the.
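The splitting just described can be sketched in a few lines of PHP; the multi-byte (mb_) string functions are used so that the same code works on any UTF-8 text, not just ASCII (the function name is ours):

    <?php
    // Split a word into overlapping character n-grams, padding with dots
    // so word boundaries are visible, as in the "orienteering" example.
    function charNgrams(string $word, int $n = 5): array
    {
        $padded = "." . $word . ".";
        $len = mb_strlen($padded, "UTF-8");
        if ($len <= $n) {
            return [$padded]; // short words yield one gram, e.g. ".the."
        }
        $grams = [];
        for ($i = 0; $i + $n <= $len; $i++) {
            $grams[] = mb_substr($padded, $i, $n, "UTF-8");
        }
        return $grams;
    }

    print_r(charNgrams("orienteering")); // .orie orien ... ering ring.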
Using n-grams tends to make the index larger and therefore query response time slower.
European Languages
European languages fall into several distinct categories.
For example, French, Spanish, Italian, are Romance languages; Dutch and German are Germanic languages; Russian and Polish are Slavic languages.
Finnish and Hungarian belong to a fourth family, the Uralic languages; Irish and Gaelic represent a fifth, the Celtic languages.
Within a group, the rules by which one writes down the language, the orthography, are similar.
Each group uses an alphabet with upper and lower-case letters.
Punctuation provides structure.
Unlike English, which uses diacritics only in borrowed words like naïve, most of these languages make regular use of diacritical marks on characters.
These marks are often omitted in queries to the search engine even though the text being searched for has them.
Often, as part of processing, these marks either need to be removed or the term needs to be indexed both with and without them (double indexing).
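One way to sketch diacritic removal in PHP is to decompose each character and delete the combining marks; this assumes the intl extension for the Normalizer class (Yioop has its own routines for this):

    <?php
    // Remove diacritics: decompose to NFD, then strip nonspacing marks.
    function stripDiacritics(string $text): string
    {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
        return preg_replace('/\p{Mn}/u', '', $decomposed);
    }

    echo stripDiacritics("naïve thé"), "\n"; // naive the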
Some languages such as German and Dutch allow one to dynamically create compound words. For example, the Dutch fietswiel for "bicycle wheel".
For these languages, a segmenter might be needed to try to split such compounds into base words.
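A naive segmenter can be sketched as greedy longest-match against a lexicon (the two-word lexicon here is illustrative; real splitters also handle linking elements between the parts):

    <?php
    // Greedily cover a compound with the longest known base words.
    function splitCompound(string $word, array $lexicon): array
    {
        $parts = [];
        $pos = 0;
        $len = mb_strlen($word, "UTF-8");
        while ($pos < $len) {
            $matched = false;
            for ($take = $len - $pos; $take > 0; $take--) {
                $piece = mb_substr($word, $pos, $take, "UTF-8");
                if (in_array($piece, $lexicon, true)) {
                    $parts[] = $piece;
                    $pos += $take;
                    $matched = true;
                    break;
                }
            }
            if (!$matched) {
                return [$word]; // give up: index the compound unsplit
            }
        }
        return $parts;
    }

    print_r(splitCompound("fietswiel", ["fiets", "wiel"])); // fiets, wiel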
If you are indexing multiple languages, you also have the issue that a term in one language might collide with a stopword in another: for example, the French thé (tea), after diacritic removal, might get indexed as the and dropped.
There are stemmers for most of these languages.
CJK(V) Languages
Chinese, Japanese, Korean, and old Vietnamese are called CJK(V) languages.
These languages tend to share orthographic conventions, deriving from their common history, even though they are not members
of a single language family.
A typical Chinese newspaper contains thousands of distinct characters.
In Chinese and Japanese, words are often not separated by spaces, so segmentation is more important for these languages.
Japanese uses three main scripts: two syllabaries (hiragana and katakana) and Chinese characters (kanji).
In China, there are simplified and traditional forms of most characters.
A given Chinese character may be split into radicals (of which there are about 214 in the traditional Kangxi system); these radicals convey either a partial meaning for the character or how the character traditionally sounded.
Although there are some single-character function words in Chinese, most words are two or more characters, so a 2-gram approach works well for Chinese.
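Assuming the charNgrams sketch from the character n-grams slide, overlapping Chinese bigrams come essentially for free, since the mb_ functions count codepoints rather than bytes:

    print_r(charNgrams("八十八", 2)); // .八 八十 十八 八.  (88 in Chinese)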
Repeating the same character twice is common in Chinese and can alter the intended meaning. This can confuse bag-of-words models.
Each of these languages has one or more standard conventions for transliteration into the Latin alphabet.
As transliterated text may be entered into a search engine, conversion back to the original characters might need to be done.
In-Class Exercise
Install composer.
Create a composer project.
Write a short program, test.php, in your composer folder that uses the hyphenateEntities($phrase, $locale) static method in PhraseParser to determine the entities in the phrase: "For many years after their introduction, vacuum cleaners remained a luxury item, but after the Second World War they became common among the middle classes". Your program should then print these entities.
Your program should then run the result through a stemmer and output the stemmed result.
Index Components
We are now going to look at inverted indexes in more detail.
So far we have been assuming that the index completely fits in RAM. We are now going to drop that assumption because it is often the case that an inverted index is too large to economically fit into memory.
We have already mentioned that an inverted index contains two main components: the dictionary and the postings lists.
Before we talk about disk-based versions of these we briefly mention that another common component of an inverted index is the document map.
A document map contains, for each document in the index, information such as the document's URL, its length, and its PageRank.
We now look at these components in the case of a static inverted index. This is an index built for a never-changing text collection.
The life-cycle of such an index is relatively straightforward: (1) Index construction -- process the data in the collection one token at a time and build the postings lists and dictionary (indexing time); (2) Query Processing -- after the index is built handle queries for documents (query time).
The Dictionary
The dictionary provides a mapping from the set of index terms to the location of their posting lists.
At query time, one of the first operations you do is use the dictionary to find the postings lists of each term in the query.
At indexing time, dictionary lookup is used to quickly find the end of the postings list of each incoming term so new postings can be appended.
A simple ADT for a search engine dictionary might support three operations:
1. Insert a new entry for term `T`.
2. Find and return the entry for term `T` (if present).
3. Find and return the entries for all terms that start with a given prefix `P`.
Here (1) and (2) are often used at indexing time, and (2) and (3) at query time.
Operation (3) is not strictly necessary, but it allows the engine to support prefix queries, i.e., queries like "inform*", which would match informal, informational, etc.
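This ADT might be sketched as a PHP interface as follows; the method names and the use of an integer file offset to locate a postings list are illustrative assumptions:

    <?php
    // Sketch of the dictionary ADT. An entry pairs a term with the
    // location (here, a file offset) of its postings list.
    interface DictionaryADT
    {
        // (1) Insert a new entry recording where $term's postings list lives.
        public function insert(string $term, int $postingsOffset): void;

        // (2) Find and return the entry for $term, or null if absent.
        public function find(string $term): ?int;

        // (3) Return entries for all terms beginning with $prefix,
        //     supporting queries like "inform*".
        public function findPrefix(string $prefix): array;
    }

A sort-based implementation can support (3) with two binary searches that locate the first and last terms in the prefix range.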