Galloping Search; Document-Oriented Indexes; the Vector Space Model




CS267

Chris Pollett

Sep 11, 2019

Outline

Introduction

ADT Example: Phrase Search

Generating all Occurrences.

Implementing our ADT.

More on Implementing next and prev

Galloping Search (aka Exponential Search)

Using Galloping Search to Implement Next and Prev. (Bentley-Yao 1976)

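function next(t, current)
{
   // P[][] = array of posting list arrays (1-indexed)
   // l[] = array of lengths of these posting lists
   static c = []; // cached index of the last answer for each term

   // No occurrence of t after current.
   if(l[t] == 0 || P[t][l[t]] <= current) then
       return infty;
   // The first posting is already past current.
   if( P[t][1] > current) then
       c[t] := 1;
       return P[t][c[t]];

   // Start from the cached position if it is still at or before current.
   if( c[t] > 1 && P[t][c[t] - 1] <= current ) then
      low := c[t] - 1;
   else
      low := 1;

   jump := 1;
   high := low + jump;

   // Gallop: double the jump until P[t][high] > current or the list ends.
   while (high < l[t] && P[t][high] <= current) do
      low := high;
      jump := 2*jump;
      high := low + jump;
   if(high > l[t]) then
      high := l[t];
   // Binary search between low and high for the first posting > current.
   c[t] := binarySearch(t, low, high, current);
   return P[t][c[t]];
}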
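
The pseudocode above can be turned into a small runnable sketch. The version below is only an illustration, not the book's implementation: the class name `PostingLists`, the use of a dictionary from terms to Python lists, and the 0-based indexing are assumptions made here, and the standard library's `bisect.bisect_right` stands in for `binarySearch`.

```python
import bisect
import math


class PostingLists:
    """Toy in-memory index (hypothetical): term -> sorted list of positions."""

    def __init__(self, postings):
        self.P = postings   # dict: term -> sorted list of positions (0-based lists)
        self.c = {}         # cached index of the last answer for each term

    def next(self, t, current):
        plist = self.P.get(t, [])
        n = len(plist)
        # No posting lies after `current`.
        if n == 0 or plist[-1] <= current:
            return math.inf
        # The very first posting already lies after `current`.
        if plist[0] > current:
            self.c[t] = 0
            return plist[0]
        # Resume from the cached index if it is still at or before `current`.
        c = self.c.get(t, 0)
        low = c - 1 if c > 0 and plist[c - 1] <= current else 0
        # Gallop: double the jump until we pass `current` or reach the list end.
        jump = 1
        high = low + jump
        while high < n - 1 and plist[high] <= current:
            low = high
            jump *= 2
            high = low + jump
        high = min(high, n - 1)
        # Binary search within [low, high] for the first posting > current.
        self.c[t] = bisect.bisect_right(plist, current, low, high + 1)
        return plist[self.c[t]]


index = PostingLists({"a": [2, 5, 9, 14, 20, 33, 41]})
print(index.next("a", 10))
```

Because the cached index is reused only when it still lies at or before `current`, a call that moves backwards in the list simply restarts the gallop from the front.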

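The book analyzes the running time of returning all exact phrase matches with this algorithm and shows it to be `O(n cdot l cdot log (L/l))`, where `n` is the number of terms in the phrase, `l` is the length of the shortest posting list among them, and `L` is the length of the longest.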

In-Class Exercise

Documents and Other Elements

Document-oriented Indexes

Document Statistics

When working with documents, there are several common statistics that are typically tracked:

`N_t` (document frequency): the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency): the number of occurrences of the term `t` in the document `d`.
`l_d` (document length): the length of document `d` measured in tokens.
`l_(avg)` (average length): the average document length across the collection.
`N` (document count): the total number of documents in the collection.
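
As a rough illustration of these statistics, the sketch below computes all five for a tiny made-up collection. The function name `document_statistics`, the representation of a collection as a dictionary from document ids to token lists, and the sample documents are assumptions chosen for this example only.

```python
from collections import Counter


def document_statistics(docs):
    """docs: dict mapping doc id -> list of tokens (hypothetical layout)."""
    N = len(docs)                                          # document count
    l_d = {d: len(tokens) for d, tokens in docs.items()}   # document lengths
    l_avg = sum(l_d.values()) / N if N else 0.0            # average length
    f = {d: Counter(tokens) for d, tokens in docs.items()} # f[d][t] = term frequency
    N_t = Counter()                                        # document frequencies
    for counts in f.values():
        N_t.update(counts.keys())   # each document counts once per distinct term
    return N, l_d, l_avg, f, N_t


docs = {
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "if you do sir".split(),
}
N, l_d, l_avg, f, N_t = document_statistics(docs)
print(N, l_avg, f[2]["sir"], N_t["quarrel"])
```

Note the difference between `f[2]["sir"]`, which counts occurrences inside one document, and `N_t["sir"]`, which counts documents.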

Also, when working with document-oriented indexes, it is common to support coarser-grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `mbox(current)`), and prevDoc(`t`, `mbox(current)`). The idea of a method like nextDoc is that it returns the first document after `current` in the corpus that contains the term `t`. That is, with this method we don't care about the position within the document.
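
A minimal sketch of these document-level methods follows. It assumes integer document ids, stores one sorted list of ids per term, and uses plain binary search from the standard `bisect` module rather than galloping search. The class name `DocIndex`, the snake_case method names, and the choice to return `math.inf` or `-math.inf` when no document qualifies are assumptions of this example.

```python
import bisect
import math


class DocIndex:
    """Toy document-oriented index (hypothetical): term -> sorted doc ids."""

    def __init__(self, docs):
        # docs: dict mapping integer doc id -> list of tokens
        self.doc_ids = {}
        for d in sorted(docs):
            for t in set(docs[d]):      # record each doc once per distinct term
                self.doc_ids.setdefault(t, []).append(d)

    def first_doc(self, t):
        ids = self.doc_ids.get(t, [])
        return ids[0] if ids else math.inf

    def last_doc(self, t):
        ids = self.doc_ids.get(t, [])
        return ids[-1] if ids else -math.inf

    def next_doc(self, t, current):
        # First document containing t with id strictly greater than current.
        ids = self.doc_ids.get(t, [])
        i = bisect.bisect_right(ids, current)
        return ids[i] if i < len(ids) else math.inf

    def prev_doc(self, t, current):
        # Last document containing t with id strictly less than current.
        ids = self.doc_ids.get(t, [])
        i = bisect.bisect_left(ids, current)
        return ids[i - 1] if i > 0 else -math.inf


doc_index = DocIndex({
    1: "do you quarrel sir".split(),
    2: "quarrel sir no sir".split(),
    3: "if you do sir".split(),
})
print(doc_index.next_doc("you", 1))
```

Positions inside a document play no role here: document 2 contains "sir" twice, but it appears only once in the id list for "sir".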