Char-gramming, Language Processing, Static Inverted Indices
CS267
Chris Pollett
Oct 8, 2018
Outline
Finish Stemming, Stopping
Char-grams
Language Specific Processing
Quiz
Index Components
The Dictionary
Posting Lists
Finishing up Stemming
We now return from our discussion of Yioop, to our earlier discussion of preprocessing before building an inverted index.
We have already talked about stemming as something we might want to do during the process
of finding the terms to index in a document.
We mentioned that for English, one of the most famous stemmers is called the Porter stemmer due to Martin Porter.
The Porter stemmer operates only on suffixes and tends to be slightly aggressive. For instance, it stems both "orienteering" and
"orienteers" down to "orient", conflating these terms with the word orient.
The stemmer also doesn't handle irregular forms well. For example, "ran" stems to "ran" rather than "run"; "mouse" stems to "mous", whereas "mice" stems to "mice".
The Porter stemmer operates by applying a sequence of rewrite rules to the suffix of the current stem. For example,
sses -> ss
ies -> i
ss -> ss
s ->
The rules are applied in sequence; if any rule changed the term, the sequence is reapplied until no further change occurs. The result is the stemmed term.
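The four suffix rules above can be sketched in a few lines. This is a toy fragment, not the full Porter algorithm, which has several more steps and conditions; the rules are ordered so longer suffixes are tried first.

```python
# Toy fragment of Porter-style suffix rewriting (the four rules above only).
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(term):
    # Apply the first rule whose suffix matches; ordering ensures
    # longer suffixes win over shorter ones.
    for suffix, replacement in RULES:
        if term.endswith(suffix):
            return term[:len(term) - len(suffix)] + replacement
    return term
```

For example, `step1a("caresses")` gives `"caress"`, while `step1a("ponies")` gives `"poni"`; note the `ss -> ss` rule exists precisely to stop `s ->` from firing on words like "caress".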
As a rule of thumb, stemming tends to improve recall, but reduce precision.
Stopping
Function words are words that have no well-defined meanings in and of themselves; rather, they modify words or indicate grammatical relationships.
In English, these include words which are prepositions, articles, pronouns, and conjunctions.
Function words are usually among the most frequently occurring words in any language.
When documents are viewed as unstructured "bags of words", as in the vector space model, the inclusion of function words may be unnecessary.
Even in the proximity model, the close proximity of a particular function word to other query terms may convey very little information.
For these reasons, many IR systems define stopwords, which are stripped from the query before doing an index look-up. These stopwords often include function words.
Stopwords might also include single letters, digits, and other common terms, such as state-of-being verbs.
Eliminating these terms from the index also has the side benefit that it will often speed up query response times as the index size is reduced.
Unfortunately, eliminating these stopwords can make queries such as "to be or not to be" impossible to answer.
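Query-time stopping can be sketched as below. The tiny stopword list is an assumption for illustration; real lists are much longer and language specific. One common guard against the "to be or not to be" problem is to fall back to the original query when every term is a stopword:

```python
# Hypothetical mini stopword list; production lists are far larger.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "in"}

def strip_stopwords(query_terms):
    kept = [t for t in query_terms if t.lower() not in STOPWORDS]
    # If every term was a stopword, keep the original query rather
    # than issuing an empty one.
    return kept if kept else query_terms

strip_stopwords(["the", "porter", "stemmer"])   # -> ["porter", "stemmer"]
```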
Characters
Tokenizing raw text requires an understanding of the characters it encodes.
So far we have assumed we were using ASCII, which is adequate for English, but is inadequate for most of the world's languages.
On the web one might experience documents in a variety of encodings.
For a particular language, or for different regions in which that language is used, there might be multiple competing encodings.
Even for English, pre-Unicode, IBM's format EBCDIC was a strong competitor of ASCII.
Since the 1980s, Unicode has been available and can be used to express text from virtually any well-known human language.
A first step then in processing text is often to convert to Unicode if the text isn't already in Unicode.
Understanding Unicode
Unicode assigns a unique value, called a codepoint, to each character, but does not specify how these values
are represented as raw text.
A codepoint is written in the form U+nnnn, where nnnn indicates the value of the codepoint in hexadecimal. For example, β is represented by the codepoint U+03B2.
UTF-8 is the most popular way to represent these codepoints as raw text. It is backward compatible with ASCII.
UTF-8 represents each codepoint with one to four bytes.
Each character in ASCII is encoded as a single byte in UTF-8 with the same value.
The high-order bits of the first byte of a character in UTF-8 indicate its length: 0 indicates 1 byte, 110 indicates two bytes, 1110 indicates three bytes, and 11110 indicates four bytes.
Every byte of a multi-byte character signals whether it starts a character: the first byte begins with 11, while each continuation byte begins with 10.
For example, the character for eight, 八 (bā), in Chinese is U+516B, in binary 01010001 01101011, which in UTF-8 is encoded as 11100101 10000101 10101011.
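The encoding rules above can be sketched by hand (in practice one would just call `chr(codepoint).encode("utf-8")`; this sketch shows where each bit goes):

```python
def utf8_encode(codepoint):
    # Hand-rolled UTF-8 encoding of one codepoint, following the
    # length-prefix scheme above.
    if codepoint < 0x80:        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([codepoint])
    if codepoint < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | codepoint >> 6,
                      0x80 | codepoint & 0x3F])
    if codepoint < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | codepoint >> 12,
                      0x80 | (codepoint >> 6) & 0x3F,
                      0x80 | codepoint & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | codepoint >> 18,
                  0x80 | (codepoint >> 12) & 0x3F,
                  0x80 | (codepoint >> 6) & 0x3F,
                  0x80 | codepoint & 0x3F])

utf8_encode(0x516B)  # -> b'\xe5\x85\xab', the three bytes shown above
```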
Character n-grams
So far the method we have described for translating a sequence of characters into a sequence of tokens for indexing is language specific.
I.e., normalization and whether and how to stem is language specific.
Wouldn't it be great if there were a generic approach that would work for any language?
That is the idea which motivates using character n-grams.
With character n-grams, we treat every overlapping sequence of n characters as a token.
For example, if n=5, the word "orienteering" would be split into the following 5-grams:
.orie orien rient iente entee nteer teeri eerin ering ring.
A word of three or fewer letters becomes a single 5-gram by putting a dot at both ends: .the.
Using n-grams tends to make the index larger and therefore query response time slower.
European Languages
European languages fall into several distinct categories
For example, French, Spanish, Italian, are Romance languages; Dutch and German are Germanic languages; Russian and Polish are Slavic languages.
Finnish and Hungarian belong to a fourth family, the Finno-Ugric languages; Irish and Gaelic, which are Celtic languages, represent a fifth family.
Within a group, the rules by which one writes down the language, the orthography, are similar.
Each group uses an alphabet with upper and lower-case letters.
Punctuation provides structure.
Unlike English, where diacritics appear only in borrowed words like naïve, most of these languages make regular use of diacritical marks on characters.
These marks are often omitted in queries to the search engine even though the text being searched for has them.
Often as part of processing, these marks either need to be removed or the term needs to be double indexed (stored both with and without the marks).
Some languages such as German and Dutch allow one to dynamically create compound words. For example, fietswiel for "bicycle wheel".
For these languages a segmenter might need to be used to try to split these into base words.
If you are indexing multiple languages, you also have the issue that stopwords for one language might be less common in another: for example, thé in French might get indexed as the and dropped.
There are stemmers for most of these languages.
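Compound splitting can be sketched as a greedy longest-prefix match against a word list. The tiny lexicon here is a made-up assumption for illustration; real segmenters use full dictionaries plus rules for linking elements (such as the German "s" in compounds):

```python
# Hypothetical mini-lexicon for illustration only.
LEXICON = {"fiets", "wiel", "boek", "winkel"}

def split_compound(word, lexicon=LEXICON):
    # Try the longest prefix first; recurse on the rest, backtracking
    # to shorter prefixes if the remainder cannot be split.
    if word == "":
        return []
    for i in range(len(word), 0, -1):
        prefix = word[:i]
        if prefix in lexicon:
            rest = split_compound(word[i:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None  # no complete split exists

split_compound("fietswiel")  # -> ["fiets", "wiel"]
```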
CJK(V) Languages
Chinese, Japanese, Korean, and old Vietnamese are called CJK(V) languages.
These languages tend to share orthographic conventions, deriving from their common history, even though they are not members
of a single language family.
A typical Chinese newspaper contains thousands of distinct characters.
In Chinese and Japanese, words are not often separated by spaces, so segmentation is more important for these languages.
Japanese uses three main scripts: two syllabaries and Chinese characters (Kanji).
In China, there are simplified and traditional forms of most characters.
A given Chinese character may be split into radicals (of which there are about 214), where these radicals convey either partial meanings for the character or how the character traditionally sounded.
Although there are some single character function words in Chinese, most words are two or more characters. So a 2-gram approach works well for Chinese.
Repeating the same character twice is common in Chinese and might alter the meaning intended. This can confuse bag of word models.
Each of these languages has one or more standard conventions for transliterating them into a Latin alphabet.
Since users may enter these transliterated forms into a search engine, conversion back to the original characters might be needed.
Quiz
Which of the following is true?
To use a namespace in PHP we use the using keyword.
The `F_1`-measure from class is the harmonic mean of the recall and precision score.
To define cosine ranking we made use of the notion of a cover for a query.
Inverted Index Intro
We are now going to look at inverted indexes in more detail.
So far we have been assuming that the index completely fits in RAM. We are now going to drop that assumption because it is often the case that an inverted index is too large to economically fit into memory.
We have already mentioned that an inverted index contains two main components: the dictionary and the posting lists.
Before we talk about disk-based versions of these we briefly mention that another common component of an inverted index is the document map.
A document map contains, for each document in the index, information such as the document's URL, its length, PageRank, and other data.
We now look at these components in the case of a static inverted index. This is an index built for a never-changing text collection.
The life-cycle of such an index is relatively straightforward: (1) Index construction -- process the data in the collection one token at a time and build the postings lists and dictionary (indexing time); (2) Query Processing -- after the index is built handle queries for documents (query time).
The Dictionary
The dictionary provides a mapping from the set of index terms to the location of their posting lists.
At query time, one of the first operations you do is use the dictionary to find the postings lists of each term in the query.
At indexing time, dictionary lookup is used to quickly find the end of the postings list of each incoming term so that new postings can be appended.
A simple ADT for a search engine dictionary might support:
(1) Insert a new entry for term `T`.
(2) Find and return the entry for term `T` (if present).
(3) Find and return the entries for all terms that start with a given prefix `P`.
Here (1) and (2) are often used at indexing time, and (2) and (3) at query time.
(3) is not strictly necessary, but it allows the engine to support prefix queries, i.e., queries like "inform*" which would match informal, informational, etc.
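The three operations above might be sketched with a hash-based dictionary as follows; the postings offsets here are hypothetical stand-ins for the locations of on-disk posting lists:

```python
class HashDictionary:
    # Minimal sketch of the three-operation dictionary ADT.
    def __init__(self):
        self.entries = {}          # Python dicts are hash tables

    def insert(self, term, postings_offset):   # operation (1)
        self.entries[term] = postings_offset

    def lookup(self, term):                     # operation (2)
        return self.entries.get(term)           # None if absent

    def prefix_lookup(self, prefix):            # operation (3)
        # A hash table keeps no order, so this is a linear scan.
        return {t: off for t, off in self.entries.items()
                if t.startswith(prefix)}
```

Note that `prefix_lookup` must scan every entry here, which foreshadows the sort-based versus hash-based trade-off discussed below.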
Dictionary Types
For a typical natural language text collection, the dictionary is relatively small compared to the total index size. In the book's example, the uncompressed dictionary for GOV2 is 0.6% of the total index size; for the Shakespeare collection it is 7%.
This makes sense because in a larger collection we will tend to have already seen most of the available words. Still, contrary to what you might think, the number of distinct terms does not level off: empirically, it grows like `O(\sqrt{n})`, where `n` is the number of documents seen.
The dictionary is small enough though that for moderately large indexes it can often still fit in memory.
In which case, the two most common in-memory dictionaries are:
A sort-based dictionary, in which all terms that appear in the text collection are arranged in a sorted array or in a search tree. Lookup operations are realized through binary search or tree traversal.
A hash-based dictionary, in which each index term has a corresponding entry in a hash table. Collisions in the hash table are resolved by means of chaining.
Storing Dictionary Terms
If a sort-based dictionary is stored as a sorted array, binary search requires that all the entries have the same size, so the i-th entry can be located by offset arithmetic.
In GOV2 though the longest alpha-numeric sequence is 74,147 bytes long; the average sequence is 9.2 bytes long.
We could imagine truncating or padding all terms to length 20, then storing the sorted 20-byte terms together with 8-byte integer offsets into the posting lists.
This approach would waste 10.8 bytes on average/term.
Instead, a dictionary-as-a-string approach is often used. Here we have two arrays: a primary sorted array of integer offsets into a secondary array containing the actual terms followed by their posting list offset.
Two adjacent elements in the primary array suffice to tell us the length of a term in the secondary array so we don't even have to store a null at the end of each string.
This scheme saves 10.8 - 4 (for the primary array) = 6.8 bytes over the original scheme.
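A minimal in-memory sketch of the dictionary-as-a-string layout follows; the offsets here are illustrative Python indices rather than a real on-disk format:

```python
# Build one concatenated term string, a primary array of start
# offsets into it, and a parallel array of postings-list offsets.
def build(term_postings):  # term_postings: sorted (term, postings_off) pairs
    text, starts, postings = "", [], []
    for term, off in term_postings:
        starts.append(len(text))
        postings.append(off)
        text += term
    starts.append(len(text))   # sentinel so the last term has a length
    return text, starts, postings

def term_at(text, starts, i):
    # Adjacent entries in the primary array delimit term i, so no
    # terminator byte is needed after each term.
    return text[starts[i]:starts[i + 1]]
```

For example, building from `[("cat", 0), ("dog", 40), ("fish", 90)]` packs the terms into the string `"catdogfish"`, and `term_at` recovers each term from two adjacent offsets.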
Sort-based versus Hash-based dictionaries
For most applications a hash-based approach is faster than a sort-based implementation as a binary search and/or tree traversal is avoided.
This assumes the collision chains are kept relatively small, i.e., that the hash table grows linearly with the number of terms.
The book gives a table showing that a properly scaled hash-table is about three times faster than a sort-based dictionary.
Unfortunately, the speed advantage holds only for single-term look-ups. Prefix lookups in a hash table require a linear scan, whereas with a sort-based approach they can be done via binary search and so much more efficiently.
For this reason, many engines actually implement both approaches.
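The prefix advantage of a sort-based dictionary can be sketched with two binary searches that bracket the matching range (using Python's `bisect` module); the sentinel character is an assumption that works for ordinary terms:

```python
import bisect

def prefix_range(terms, prefix):
    # terms must be sorted; two binary searches bracket all terms
    # that start with prefix.
    lo = bisect.bisect_left(terms, prefix)
    # "\uffff" is assumed to sort after any character in our terms,
    # so prefix + "\uffff" upper-bounds every matching term.
    hi = bisect.bisect_right(terms, prefix + "\uffff")
    return terms[lo:hi]

terms = ["inform", "informal", "information", "informative", "query"]
prefix_range(terms, "inform")  # -> the four "inform..." terms
```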
Posting Lists
As opposed to dictionaries, posting lists contain the majority of the information stored in an inverted index and tend to
be too large to store in memory.
Lists are transferred into memory on an as-needed basis.
To make the transfer of postings from disk to memory as efficient as possible, each term's posting list should be stored in a contiguous region of the hard-drive.
For single term queries one typically accesses a posting list in a sequential fashion.
On the other hand, for conjunctive queries we want to do things like galloping and binary search.
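Galloping search for the first posting greater than or equal to a target can be sketched as below; the function name `next_geq` is our own, chosen to echo the nextGEQ-style postings operations we will see later:

```python
import bisect

def next_geq(postings, target, start=0):
    # Return the index of the first posting >= target, searching
    # from position `start` in a sorted postings list.
    n = len(postings)
    if start >= n or postings[start] >= target:
        return start
    # Gallop: double the jump until we overshoot target or hit the end.
    jump = 1
    while start + jump < n and postings[start + jump] < target:
        jump *= 2
    lo = start + jump // 2          # last probe known to be < target
    hi = min(start + jump, n - 1)   # probe that was >= target, or list end
    # Finish with binary search inside the bracketed window.
    return bisect.bisect_left(postings, target, lo, hi + 1)
```

Because the jump doubles each step, advancing past k postings costs O(log k) comparisons, which is why galloping pays off for conjunctive queries where one list is much sparser than the other.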
We will look at posting list operations more on Wednesday.