Implementing the Inverted Index ADT

CS267

Chris Pollett

Sep. 10, 2012

Outline

Analysing Phrase Search
Quiz
Implementing the Inverted Index ADT

ADT Example: Phrase Search

We have presented an ADT for inverted indexes that had the operations: first(t), last(t), next(t, current), prev(t, current).

We then gave the following algorithm using this ADT for finding exact phrase matches in a document collection:

nextPhrase(t[1],t[2], .., t[n], position)
{
   v:=position
   for i := 1 to n do
     v:= next(t[i], v)
   if v == infty then // infty represents after the end of the posting list
      return [infty, infty]
   u := v
   for i := n-1 downto 1 do
     u := prev(t[i],u)
   if(v-u == n - 1) then
     return [u, v]
   else 
     return nextPhrase(t[1],t[2], .., t[n], u) 
}

Today, we are going to use this example to motivate various different ways one could try to implement the four operations of our ADT.

Generating all Occurrences.

Usually when we search the web, we don't only want to generate the first phrase match, but rather all (at least the first ten) occurrences of the phrase in the collection.

To do this, we merely have to put the code of the last slide into a loop:

u := -infty
while u < infty do
    [u, v] := nextPhrase(t[1],t[2], .., t[n], u) 
    if( u != infty) then
        report the interval [u, v]

Each call to nextPhrase makes `O(n)` calls to our low level ADT (in particular, next).
Let `l_(t_i)` denote the length of `t_i`'s posting list. Let `l = min_(1 le i le n) l_(t_i)`. Then the number of calls to nextPhrase will be bounded above by `l` since we will call next at least once on the shortest posting list per loop. So the total runtime can be bounded by the time to make `O(n cdot l)` calls to next.
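As a rough illustration, nextPhrase and the reporting loop can be sketched in Python. This is only a sketch under some assumptions: the index is a plain dict from terms to sorted position lists, positions start at 1, the names build_index, next_pos, prev_pos, next_phrase and all_phrases are made up for this example, next and prev are done by a naive scan, and the recursion is rewritten as a loop.

```python
import math

INF = math.inf  # plays the role of infty in the pseudocode


def build_index(tokens):
    """Map each term to the sorted list of 1-based positions where it occurs."""
    index = {}
    for pos, term in enumerate(tokens, start=1):
        index.setdefault(term, []).append(pos)
    return index


def next_pos(index, t, current):
    """First position of t strictly after current, or INF (naive scan)."""
    for p in index.get(t, []):
        if p > current:
            return p
    return INF


def prev_pos(index, t, current):
    """Last position of t strictly before current, or -INF (naive scan)."""
    for p in reversed(index.get(t, [])):
        if p < current:
            return p
    return -INF


def next_phrase(index, terms, position):
    """Iterative version of nextPhrase: first phrase match starting after position."""
    n = len(terms)
    while True:
        v = position
        for t in terms:                     # walk forward through t[1..n]
            v = next_pos(index, t, v)
        if v == INF:
            return (INF, INF)
        u = v
        for t in reversed(terms[:-1]):      # walk backward through t[n-1..1]
            u = prev_pos(index, t, u)
        if v - u == n - 1:                  # the terms are adjacent: a match
            return (u, v)
        position = u                        # otherwise retry from u


def all_phrases(index, terms):
    """The reporting loop: collect every interval [u, v] matching the phrase."""
    matches = []
    u = -INF
    while u < INF:
        u, v = next_phrase(index, terms, u)
        if u != INF:
            matches.append((u, v))
    return matches


index = build_index("a b c a b a b c".split())
print(all_phrases(index, ["a", "b"]))       # [(1, 2), (4, 5), (6, 7)]
```

Each pass of the loop in all_phrases consumes at least one posting of every term, which is where the bound of `l` calls to nextPhrase comes from.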

Implementing our ADT.

If our index were in RAM (we'll talk about disk-based implementations later in the semester), we would probably use a hash table to store the dictionary.
Then in the hash table, the key would be a term `t` and we store as values pointers to the first and last elements of an array `P_t[]` containing the posting list for term `t`.
Using this setup, the operations first and last could be implemented in roughly constant time.
One way to implement next and prev is to binary search a term's posting list for the first value larger than current (resp. the last value smaller than current).
Let `L = max_(1 le i le n) l_(t_i)`. Using such an approach, generating all phrase matches would have complexity `O(n cdot l cdot log(L))`.
Sometimes people use `kappa` for the number of candidate phrases. In which case, we would get a bound of `O(n cdot kappa cdot log(L))`.
This implementation of next and prev works well when `l` is much smaller than `L`. For example, if we searched for "the hogwarts", the word "the" is likely to be much more common than "hogwarts", so the posting list length of "the" corresponds to `L` and that of "hogwarts" to `l`.
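A minimal Python sketch of this in-memory layout follows. The assumptions are mine: a dict stands in for the hash table, each value is the whole sorted posting list rather than a pair of pointers, the class name BinarySearchIndex is invented, and the positions in the usage lines are made up. The standard bisect module does the binary search.

```python
import bisect
import math

INF = math.inf


class BinarySearchIndex:
    """In-memory index: dict (hash table) from term to sorted posting list."""

    def __init__(self, postings):
        self.postings = postings

    def first(self, t):
        plist = self.postings.get(t, [])
        return plist[0] if plist else INF

    def last(self, t):
        plist = self.postings.get(t, [])
        return plist[-1] if plist else -INF

    def next(self, t, current):
        """First posting strictly larger than current, found by binary search."""
        plist = self.postings.get(t, [])
        i = bisect.bisect_right(plist, current)
        return plist[i] if i < len(plist) else INF

    def prev(self, t, current):
        """Last posting strictly smaller than current, found by binary search."""
        plist = self.postings.get(t, [])
        i = bisect.bisect_left(plist, current)
        return plist[i - 1] if i > 0 else -INF


idx = BinarySearchIndex({"the": [1, 4, 9, 12], "hogwarts": [5]})
print(idx.next("the", 4))    # 9
print(idx.prev("the", 4))    # 1
```

Each next or prev call costs `O(log L)` comparisons no matter how close the answer is to the previous one, which is the weakness the later slides address.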

More on Implementing next and prev

Another way to implement next and prev is to do a sequential scan of the term `t`'s posting list, starting from our previous location in the posting list (which we could store in a static variable), until we find a value bigger (resp. smaller) than current.
If we used this as our implementation, then to return all matches, we would need at most `O(n cdot L)` time.
In the case where `l approx L` the first algorithm gives `O(n cdot L cdot log L)` time and so this sequential search is actually better.
So it's not clear which implementation to use.
Is there a way to get the best of both worlds? More, after this short quiz...
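For comparison, the sequential-scan version of next described on this slide might look roughly like the following in Python. This is a sketch with assumed names (ScanIndex, cursor); the dict cursor plays the role of the static variable, and prev would be symmetric.

```python
import math

INF = math.inf


class ScanIndex:
    """next() by sequential scan, resuming from the last position used."""

    def __init__(self, postings):
        self.postings = postings
        self.cursor = {}  # term -> index into its posting list (the "static variable")

    def next(self, t, current):
        plist = self.postings.get(t, [])
        i = self.cursor.get(t, 0)
        # If the remembered position is already past current, start over.
        if i > 0 and plist[i - 1] > current:
            i = 0
        # Scan forward until we pass current or run off the end of the list.
        while i < len(plist) and plist[i] <= current:
            i += 1
        self.cursor[t] = i
        return plist[i] if i < len(plist) else INF


idx = ScanIndex({"a": [2, 5, 9]})
print(idx.next("a", 0))   # 2
print(idx.next("a", 2))   # 5
```

When the values of current only increase, as they do in the phrase search loop, each posting list is walked at most once in total, which gives the `O(n cdot L)` bound.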

Quiz

Which of the following is true?

Sorting a term-frequency table for Shakespeare's works from most frequent to least frequent terms, gives an ordering of frequencies that approximately follows Zipf's law.
A 0th order language model based on `n+1` grams and an `n`-order language model are the same thing.
SOLR is a web crawler.

Galloping Search

The problem with our binary search of postings lists is that ideally we would like to binary search based on an estimate of how far away the next useful posting is likely to be, rather than always using the end of the posting list.
Consider the problem: I am thinking of a natural number (i.e., 0, 1, 2, ...). I am a black box which will answer correctly whether a number is bigger than, smaller than, or the same as the number I am thinking of. Using the smallest number of queries to me, find the number I am thinking of.
One way to do this is to use a sequential scan. Ask: Is it bigger, smaller or equal to 0? Is it bigger, smaller or equal to 1?... until we find the number.
We can't easily do binary search, because we don't know how big a number I might be thinking of.
So we split the problem into two steps: (1) Find how big a number I am thinking of. (2) Then binary search for the number using this.
To do (1) we ask the sequence of questions: Is it bigger, smaller or equal to 2^0? Is it bigger, smaller or equal to 2^1?, ... Is it bigger, smaller or equal to 2^i?
This phase is called a "gallop phase" and our whole search technique is called galloping search. If I am thinking of the number `N`, then after about `log N` steps I will find a number at least as big as `N`.
After I've found this number it will take at most `log N` more queries to get `N`. Thus, the total runtime is `O(log N)`.
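The guessing game can be sketched in Python as below. The names galloping_guess and make_oracle are invented for this example, and the oracle's return convention (-1, 0, or 1) is an assumption about how the black box answers.

```python
def make_oracle(secret):
    """Hypothetical black box: compares a guess with the secret and logs each query."""
    calls = []

    def oracle(guess):
        calls.append(guess)
        # 1 if guess is bigger than the secret, -1 if smaller, 0 if equal
        return (guess > secret) - (guess < secret)

    return oracle, calls


def galloping_guess(oracle):
    """Find the secret natural number with O(log N) queries."""
    low, high = 0, 1
    # Gallop phase: ask about 2^0, 2^1, 2^2, ... until a guess is not too small.
    while True:
        answer = oracle(high)
        if answer == 0:
            return high
        if answer > 0:
            break
        low = high + 1
        high *= 2
    high -= 1
    # Binary search phase on the remaining range [low, high].
    while low <= high:
        mid = (low + high) // 2
        answer = oracle(mid)
        if answer == 0:
            return mid
        if answer > 0:
            high = mid - 1
        else:
            low = mid + 1
    raise ValueError("oracle answers are inconsistent")


oracle, calls = make_oracle(5)
print(galloping_guess(oracle))  # 5
print(calls)                    # [1, 2, 4, 8, 6, 5]
```

For the secret 5, the gallop phase asks about 1, 2, 4, 8 and the binary search then needs two more queries inside the range 5 to 7.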

Using Galloping Search to Implement Next and Prev.

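function next(t, current)
{
   // P[][] = array of posting-list arrays, indexed from 1
   // l[] = array of the lengths of these posting lists
   static c = array(); // cached index of the last answer for each term

   // No posting of t lies after current.
   if(l[t] == 0 || P[t][l[t]] <= current) then
       return infty;
   // The very first posting already lies after current.
   if( P[t][1] > current) then
       c[t] := 1;
       return P[t][c[t]];

   // Start from the cached position if it is still at or before current.
   if( c[t] > 1 && P[t][c[t] - 1] <= current ) then
      low := c[t] - 1;
   else
      low := 1;

   jump := 1;

   high := low + jump;

   // Gallop: double the jump until P[t][high] passes current.
   while (high < l[t] && P[t][high] <= current) do
      low := high;
      jump := 2*jump;
      high := low + jump;
   if(high > l[t]) then
      high := l[t];
   // Binary search between low and high for the first posting after current.
   c[t] := binarySearch(t, low, high, current)
   return P[t][c[t]];
}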

The book gives a nice analysis of the runtime of returning all exact phrase matches when using this algorithm and shows it to be `O(n cdot l cdot log (L/l))`.
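A Python translation of the galloping next might look as follows. It is a sketch, not the book's code: indexes are 0-based, the class name GallopingIndex and the dict cache (standing in for the static array c) are invented, and the standard bisect module stands in for binarySearch.

```python
import bisect
import math

INF = math.inf


class GallopingIndex:
    def __init__(self, postings):
        self.postings = postings  # term -> sorted posting list
        self.cache = {}           # term -> 0-based index of the last answer

    def next(self, t, current):
        plist = self.postings.get(t, [])
        if not plist or plist[-1] <= current:
            return INF            # nothing lies after current
        if plist[0] > current:
            self.cache[t] = 0     # the first posting is already the answer
            return plist[0]
        c = self.cache.get(t, 0)
        # Resume from the cached position only if it is not past current.
        low = c - 1 if c > 0 and plist[c - 1] <= current else 0
        jump = 1
        high = low + jump
        last = len(plist) - 1
        # Gallop: double the jump until plist[high] passes current.
        while high < last and plist[high] <= current:
            low = high
            jump *= 2
            high = low + jump
        high = min(high, last)
        # Here plist[low] <= current < plist[high]; binary search that window.
        c = bisect.bisect_right(plist, current, low, high + 1)
        self.cache[t] = c
        return plist[c]


idx = GallopingIndex({"a": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]})
print(idx.next("a", 10))  # 11
```

The binary search only looks at the window found by the gallop, so a call whose answer is close to the cached position costs far less than `log L`.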

Documents and Other Elements

Most IR systems don't store term locations as raw numbers, as we have been doing in our schema independent set-up.
Instead, the collection is split into documents.
Within a document there might be tags, and so a document can be split into smaller units.
Using our absolute positioning, we could still find smaller units. For example, if we know "first witch" appears as the interval [745406, 745407], and we know prev("<SPEECH>", 745406) = 745404 and next("</SPEECH>", 745407) = 745408, we could conclude that "first witch" appeared in a SPEECH sub-document.
One can show using this set-up that our simple ADT approach would work for many XPath style queries.
At the whole document level, however, people usually do not use absolute offsets. Instead, they use a pair n:m. Here n is a docid of a document and m is a location within that document.

Document-oriented Indexes

Perhaps the most common way to split up a collection of text is into documents.
Because of this, we said last day that typically people optimize their indices around the notion of document.
We introduced the notation n:m to mean offset m into the document with id n.
Using this notation, we modify our inverted index methods first, last, prev, next to output such pairs.
So for example, next("thunder", `22:288`) = `22:310` means that the next occurrence of the term "thunder" after the occurrence in the `22`nd document at location `288` is in the `22`nd document at location `310`.
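One way to picture this in code: if each posting is stored as a (docid, offset) tuple, Python's lexicographic tuple comparison already orders postings first by document and then by position, so a binary-search next carries over unchanged. The sketch below assumes that representation; the name next_pos and all postings other than 22:288 and 22:310 are made up for illustration.

```python
import bisect

INF = float("inf")
INF_POS = (INF, INF)  # stands for "past the end of the posting list"


def next_pos(plist, current):
    """First (docid, offset) posting strictly after current, or INF_POS."""
    i = bisect.bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF_POS


# Hypothetical posting list for "thunder", sorted by docid and then by offset.
thunder = [(1, 12), (22, 288), (22, 310), (37, 5)]

print(next_pos(thunder, (22, 288)))  # (22, 310)
print(next_pos(thunder, (22, 310)))  # (37, 5)
```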

Document Statistics

When working with documents there are several common statistics which people typically keep track of:

`N_t` (document frequency): the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency): the number of occurrences of the term `t` in the document `d`.
`l_d` (document length): the length of document `d` measured in tokens.
`l_(avg)` (average length): the average document length across the collection.
`N` (document count): the total number of documents in the collection.
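To make these definitions concrete, here is a small Python sketch that computes all five statistics for a toy collection. The function name document_statistics, the dictionary keys, and the two sample documents are assumptions made for this example.

```python
from collections import Counter


def document_statistics(docs):
    """docs: dict mapping a docid to that document's list of tokens."""
    N = len(docs)                                             # document count
    l_d = {d: len(tokens) for d, tokens in docs.items()}      # document lengths
    l_avg = sum(l_d.values()) / N if N else 0.0               # average length
    f_td = {d: Counter(tokens) for d, tokens in docs.items()} # term frequencies
    N_t = Counter()                                           # document frequencies
    for counts in f_td.values():
        N_t.update(counts.keys())   # each document counts once per distinct term
    return {"N": N, "l_d": l_d, "l_avg": l_avg, "f_td": f_td, "N_t": N_t}


docs = {
    1: ["thunder", "and", "lightning"],
    2: ["thunder", "thunder"],
}
stats = document_statistics(docs)
print(stats["N_t"]["thunder"])      # 2
print(stats["f_td"][2]["thunder"])  # 2
print(stats["l_avg"])               # 2.5
```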

Also, when working with document-oriented indexes it is common to support coarser grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `mbox(current)`), and prevDoc(`t`, `mbox(current)`). The idea of a method like nextDoc is that it returns the first document after `current` in the corpus that contains the term `t`; that is, this method does not care about positions within a document.
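As a sketch of how nextDoc and prevDoc could sit on top of a posting list of (docid, offset) pairs, one can binary search with a sentinel offset so that every posting of the current document is skipped. The names next_doc and prev_doc, the tuple representation, and the sample postings are assumptions for this example.

```python
import bisect
import math

INF = math.inf


def next_doc(plist, current_doc):
    """Smallest docid greater than current_doc that contains the term, or INF."""
    # (current_doc, INF) sorts after every posting inside current_doc.
    i = bisect.bisect_right(plist, (current_doc, INF))
    return plist[i][0] if i < len(plist) else INF


def prev_doc(plist, current_doc):
    """Largest docid smaller than current_doc that contains the term, or -INF."""
    # (current_doc, -INF) sorts before every posting inside current_doc.
    i = bisect.bisect_left(plist, (current_doc, -INF))
    return plist[i - 1][0] if i > 0 else -INF


# Hypothetical posting list for one term, sorted by docid and then by offset.
thunder = [(1, 12), (22, 288), (22, 310), (37, 5)]

print(next_doc(thunder, 1))   # 22
print(next_doc(thunder, 22))  # 37
print(prev_doc(thunder, 22))  # 1
```

A real index would more likely keep a separate list of docids per term, so that these calls do not have to step over position data at all.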