Introduction

On Monday we did a whirlwind tour of the PHP language, which we'll use to understand how the Yioop search engine is coded.
We then started looking at how the inverted index ADT we presented last Wednesday might be implemented.
first(t) and last(t) are easy to implement using pointers.
We said that the function to generate all occurrences of a phrase in a corpus (let's call this function generateAll) will make `O(n cdot l)` calls to next(t, current), where `n` is the number of terms in the phrase and `l` is the length of the shortest posting list amongst those of the terms in the phrase.
Let `L` be the length of the longest posting list for the terms in the phrase.
We gave two ways to implement next(t, current): (1) Binary search between first and last. In this case, the run time of generateAll is `O(n cdot l cdot log L)`. (2) Sequential scan beginning with first(t). In this case, the run time of generateAll is `O(n cdot L)`.
Each has its advantages, as sometimes `l cdot log L < L` and sometimes `l cdot log L > L`.
We finished last day by asking: Is there a way to get the best of both of these algorithms?

Galloping Search

The problem with our binary search of posting lists is that it always searches all the way to the end of the list. Ideally, we would like to binary search over a range based on an estimate of how far away the next useful posting is likely to be.
Consider the problem: I am thinking of a natural number (i.e., 0, 1, 2, ...). I am a black box which will answer correctly whether a number is bigger than, smaller than, or equal to the number I am thinking of. Using the smallest number of queries to me, find the number I am thinking of.
One way to do this is to use a sequential scan. Ask: Is it bigger, smaller or equal to 0? Is it bigger, smaller or equal to 1? ... until we find the number.
We can't easily do binary search, because we don't know how big a number I might be thinking of.
So we split the problem into two steps: (1) Find how big a number I am thinking of. (2) Then binary search for the number using this.
To do (1) we ask the sequence of questions: Is it bigger, smaller or equal to `2^0`? Is it bigger, smaller or equal to `2^1`? ... Is it bigger, smaller or equal to `2^i`?
This phase is called the "gallop phase" and our whole search technique is called galloping search. If I am thinking of the number `N`, then after about `log N` queries you will have found a power of two that is at least as big as `N`.
Once you have found this number, `N` must lie between the last two powers of two you asked about, so a binary search over that range takes at most about `log N` more queries to get `N`. Thus, the total number of queries is `O(log N)`.
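To make the two phases concrete, here is a minimal PHP sketch of the guessing game. The function name gallopFind and the form of the black box (a callable that returns -1, 0, or 1 when the guess is smaller than, equal to, or bigger than the secret number) are assumptions made for illustration only.

```php
<?php
/**
 * Finds a secret natural number using a gallop phase followed by a
 * binary search. $oracle is a hypothetical black box: given a guess it
 * returns -1, 0, or 1 if the guess is smaller than, equal to, or bigger
 * than the secret. Returns [secret, number of queries made].
 */
function gallopFind(callable $oracle): array
{
    $queries = 0;
    $ask = function (int $guess) use ($oracle, &$queries): int {
        $queries++;
        return $oracle($guess);
    };
    // 0 is not a power of two, so check it separately
    if ($ask(0) == 0) {
        return [0, $queries];
    }
    // Gallop phase: try 2^0, 2^1, 2^2, ... until the guess is not too small
    $low = 0;
    $high = 1;
    while (($answer = $ask($high)) < 0) {
        $low = $high;
        $high *= 2;
    }
    if ($answer == 0) {
        return [$high, $queries];
    }
    // Binary search phase: the secret is strictly between $low and $high
    while ($high - $low > 1) {
        $mid = intdiv($low + $high, 2);
        $answer = $ask($mid);
        if ($answer == 0) {
            return [$mid, $queries];
        }
        if ($answer < 0) {
            $low = $mid;
        } else {
            $high = $mid;
        }
    }
    throw new RuntimeException("The black box gave inconsistent answers");
}

$secret = 100;
[$found, $queries] = gallopFind(fn(int $guess): int => $guess <=> $secret);
echo "Found $found\n";
```

The gallop phase doubles the guess each time, so the range handed to the binary search is never more than about twice as large as the secret itself.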

Using Galloping Search to Implement Next and Prev.

function next(t, current)
{
   // P[][] = array of posting list arrays
   // l[] = array of lengths of these posting lists
   static c = []; //last index positions for terms 

   if(l[t] == 0 || P[t][l[t]] <= current) then
       return infty;
   if( P[t][1] > current) then
       c[t] := 1;
       return P[t][c[t]];

   if( c[t] > 1 && P[t][c[t] - 1] <= current ) then
      low := c[t] -1;
   else
      low := 1;

   jump := 1;

   high := low + jump;

   while (high < l[t] && P[t][high] <= current) do
      low := high;
      jump := 2*jump;
      high := low + jump;
   if(high > l[t]) then
      high := l[t];
   c[t] := binarySearch(t, low, high, current);
   return P[t][c[t]];
}

The book gives a nice analysis of the runtime of returning all exact phrase matches when using this algorithm and shows it to be `O(n cdot l cdot log (L/l))`.
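As a rough illustration, the pseudocode above might be sketched in PHP as follows. This is not code from Yioop or the book. The names gallopNext and firstGreater are made up here (PHP already has a built-in next() for arrays, so that name cannot be reused), arrays are 0-indexed rather than 1-indexed, the posting lists are passed in as an argument, and INF stands in for infty. The static cache assumes the same posting lists are used on every call.

```php
<?php
/**
 * Hypothetical helper: returns the smallest index i with $low < i <= $high
 * such that $list[i] > $current, assuming
 * $list[$low] <= $current < $list[$high].
 */
function firstGreater(array $list, int $low, int $high, int $current): int
{
    while ($high - $low > 1) {
        $mid = intdiv($low + $high, 2);
        if ($list[$mid] <= $current) {
            $low = $mid;
        } else {
            $high = $mid;
        }
    }
    return $high;
}

/**
 * Galloping-search version of next(t, current) over 0-indexed posting
 * lists. $P maps each term to a sorted array of positions.
 */
function gallopNext(string $t, array $P, int $current)
{
    static $c = []; // last index position returned for each term
    $list = $P[$t] ?? [];
    $len = count($list);
    if ($len == 0 || $list[$len - 1] <= $current) {
        return INF;
    }
    if ($list[0] > $current) {
        $c[$t] = 0;
        return $list[0];
    }
    // start from the cached position if it is not already past current
    $cached = $c[$t] ?? 0;
    if ($cached > 0 && $list[$cached - 1] <= $current) {
        $low = $cached - 1;
    } else {
        $low = 0;
    }
    // gallop phase: double the jump until we pass current or the list end
    $jump = 1;
    $high = $low + $jump;
    while ($high < $len - 1 && $list[$high] <= $current) {
        $low = $high;
        $jump = 2 * $jump;
        $high = $low + $jump;
    }
    if ($high > $len - 1) {
        $high = $len - 1;
    }
    $c[$t] = firstGreater($list, $low, $high, $current);
    return $list[$c[$t]];
}

$P = ["dog" => [1, 2, 23, 25, 27, 50]];
echo gallopNext("dog", $P, 15), "\n"; // first position after 15
```

Because the gallop starts at the cached position, a run of calls with increasing values of current only ever searches the stretch of the posting list just ahead of the previous answer.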

In-Class Exercise

I'd like you to code a simple PHP script. It first declares a function:
```
function binarySearch($t, $P, $low, $high, $current)
```
and implements the binary search which could be used as part of a PHP implementation of the next(t, current) function of the previous slide.
Here $t is a term, $P is an array of posting lists, and $low, $high, and $current are ints.

Test your function by adding the following after the definition of binarySearch:

$t = "dog";
$P = [
   "cat" => [1, 6, 7],
   "dog" => [1, 2, 23, 25, 27, 50]
];
echo binarySearch($t, $P, 1, 4, 15);

Once you are done, upload your code to the Sep 12 In-Class Exercise Thread, and I'll try to look at it and give some feedback in class.

Documents and Other Elements

Most IR systems don't store term locations as raw numbers, as we have been doing in our schema independent set-up.
Instead, the collection is split into documents.
Within a document there might be tags, and so a document can be split into smaller units.
Using our absolute positioning, we could still find smaller units. For example, if we knew "first witch" appears as the interval [745406, 745407] and we knew prev("<SPEECH>", 745406) = 745404 and we knew next("</SPEECH>", 745407) = 745408, we could conclude that "first witch" appeared in a SPEECH sub-document.
One can show using this set-up that our simple ADT approach would work for many XPath style queries.
At the whole document level, however, people usually do not use absolute offsets. Instead, they use a pair n:m. Here n is a docid of a document and m is a location within that document.

Document-oriented Indexes

Perhaps the most common way to split up a collection of text is into documents.
Because of this, people typically optimize their indices around the notion of a document.
We use the notation n:m to mean offset m into the document with id n.
Using this notation, we modify our inverted index methods first, last, prev, next to output such pairs.
So for example, next("thunder", `22:288`) = `22:310` means that the next occurrence of the term "thunder" after the occurrence in the `22`nd document at location `288` is in the `22`nd document at location `310`.
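To see how next behaves once postings are pairs, here is a small sketch using a sequential scan. The names comparePairs and nextPair are invented for this example, each pair is stored as a two-element array [docid, offset], the posting list for "thunder" is made-up data chosen to match the example above, and null stands in for infinity.

```php
<?php
/**
 * Orders docid:offset pairs first by docid, then by offset.
 * Returns -1, 0, or 1.
 */
function comparePairs(array $a, array $b): int
{
    return ($a[0] <=> $b[0]) ?: ($a[1] <=> $b[1]);
}

/**
 * Returns the first posting strictly after $current, or null if there
 * is none. $postings must be sorted according to comparePairs.
 */
function nextPair(array $postings, array $current): ?array
{
    foreach ($postings as $pair) {
        if (comparePairs($pair, $current) > 0) {
            return $pair;
        }
    }
    return null;
}

// hypothetical posting list for "thunder"
$thunder = [[9, 30], [22, 288], [22, 310], [37, 5]];
$result = nextPair($thunder, [22, 288]);
echo $result[0] . ":" . $result[1] . "\n"; // 22:310
```

The only real change from the schema-independent case is the comparison: positions are compared by docid first and by offset only when the docids tie.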

Document Statistics

When working with documents there are several common statistics which people typically keep track of:

`N_t` (document frequency): the number of documents in the collection containing the term `t`.
`f_(t,d)` (term frequency): the number of occurrences of the term `t` in the document `d`.
`l_d` (document length): the length of document `d` measured in tokens.
`l_(avg)` (average length): the average document length across the collection.
`N` (document count): the total number of documents in the collection.

Also, when working with document-oriented indexes it is common to support coarser grained methods in our ADT, such as firstDoc(`t`), lastDoc(`t`), nextDoc(`t`, `mbox(current)`), and prevDoc(`t`, `mbox(current)`). The idea of a method like nextDoc is that it returns the first document containing the term `t` that comes after `current` in the corpus. That is, we don't care about the position within the document with this method.
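As a quick illustration of the statistics above, the following sketch computes them for a tiny made-up collection. The function name documentStatistics, the array keys it returns, and the tokenizer (lowercase, split on white space) are all assumptions for this example; a real system would tokenize more carefully.

```php
<?php
/**
 * Computes the common document statistics for a collection given as
 * an array mapping docid => text.
 */
function documentStatistics(array $docs): array
{
    $docLengths = []; // l_d for each document d
    $termFreqs = [];  // $termFreqs[t][d] = f_(t,d)
    foreach ($docs as $d => $text) {
        $tokens = preg_split('/\s+/', strtolower($text), -1,
            PREG_SPLIT_NO_EMPTY);
        $docLengths[$d] = count($tokens);
        foreach ($tokens as $t) {
            $termFreqs[$t][$d] = ($termFreqs[$t][$d] ?? 0) + 1;
        }
    }
    $N = count($docs);
    return [
        "N" => $N,
        "l_d" => $docLengths,
        "l_avg" => $N > 0 ? array_sum($docLengths) / $N : 0,
        "f_td" => $termFreqs,
        // N_t = number of documents with a nonzero count for t
        "N_t" => array_map('count', $termFreqs),
    ];
}

// hypothetical three document collection
$docs = [
    1 => "the quick brown fox",
    2 => "the lazy dog",
    3 => "The fox jumps over the dog",
];
$stats = documentStatistics($docs);
echo "N = ", $stats["N"], ", N_fox = ", $stats["N_t"]["fox"], "\n";
```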

Posting Lists and Index Types

When using a schema-dependent index, like a document index, it is common to modify the way we store posting lists for a term `t`.
Rather than have a sequence of offsets for that term, we instead have postings of the form:
`langle d, f_(t,d), langle p_0, ... p_(f_(t,d)-1)rangle rangle` in which we group all occurrences of the term `t` in the same document together. For example, a posting for "witch" might look like: `langle 1, 3, langle 1598, 27555, 31463 rangle rangle` which says witch occurs in document 1, 3 times and that it occurs at locations 1598, 27555, and 31463.
Often, storing the exact positions of the terms is not necessary, and can take a lot of space.
Document-oriented indexes can actually be classified by what components of this basic posting 3-tuple we decide to store:
- A docid index stores for each term just the documents in which it appears.
- A frequency index stores for each term and for each document it appears in the pair `langle d, f_(t,d) rangle`.
- A positional index stores for each term and for each document containing the term the full triple `langle d, f_(t,d), langle p_0, ... p_(f_(t,d)-1)rangle rangle`.
- Finally, as we have already discussed, a schema-independent index does not store any of the document oriented optimizations of a positional index, but otherwise stores the same info.
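The relationship between the three document-oriented index types can be sketched as follows: build the positional postings, then drop components to get the other two. The function names, the white-space tokenizer, and the choice to count positions from 0 within each document are assumptions for this example only.

```php
<?php
/**
 * Builds positional postings: for each term a list of triples
 * [d, f_(t,d), [p_0, ..., p_(f_(t,d)-1)]].
 * $docs maps docid => text; positions are 0-based token offsets.
 */
function positionalIndex(array $docs): array
{
    $positions = []; // $positions[t][d] = list of offsets of t in d
    foreach ($docs as $d => $text) {
        $tokens = preg_split('/\s+/', strtolower($text), -1,
            PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $offset => $t) {
            $positions[$t][$d][] = $offset;
        }
    }
    $index = [];
    foreach ($positions as $t => $byDoc) {
        foreach ($byDoc as $d => $offsets) {
            $index[$t][] = [$d, count($offsets), $offsets];
        }
    }
    return $index;
}

/** Keeps only the pairs [d, f_(t,d)] from positional postings. */
function frequencyIndex(array $positional): array
{
    return array_map(function ($postings) {
        return array_map(fn($p) => [$p[0], $p[1]], $postings);
    }, $positional);
}

/** Keeps only the docids from positional postings. */
function docidIndex(array $positional): array
{
    return array_map(function ($postings) {
        return array_map(fn($p) => $p[0], $postings);
    }, $positional);
}

// hypothetical two document collection
$docs = [1 => "the witch and the cat", 2 => "witch hunt"];
$positional = positionalIndex($docs);
print_r(docidIndex($positional)["witch"]);
```

Going in the other direction is not possible: a docid index cannot be turned back into a frequency or positional index without rereading the documents.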

Ranking and Retrieval

We now look at some ways we can use our document-oriented ADT for retrieving documents, and for ordering the results returned according to relevance.
Queries for ranked retrieval are often called term vectors.
The components of this vector when entered into an IR system are typically delimited by white space.
So the query:
william shakespeare wedding
would be viewed as the vector `langle "william", "shakespeare", "wedding" rangle`.
Using vectors rather than sets is useful because some meaningful queries might involve repeated terms. For example, if we ignore the duplicates in "to be or not to be", we might get sub-standard results because the query would not be so closely identified with Hamlet.
When we retrieve using term vectors we often are interested in documents whose own vector is somehow close to the query term vector.
Closeness might not mean the document has all of the terms that occur in the term vector.
Another way to build up queries is using the operations "AND", "OR", "NOT". For example,
"william" AND "shakespeare" AND NOT ("marlowe" OR "bacon")
For a document to be returned in the case of an AND, both of its inputs must hold true for that document. Similarly, for OR, either or both of its inputs must be true, and NOT means its input does not hold. So for a document to be returned on the query "william" AND "shakespeare", the document MUST contain both terms.
We call a query only using ANDs a conjunctive query and one only using ORs a disjunctive query.
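One simple way to picture these operations is as set operations on lists of docids: AND is intersection, OR is union, and NOT is complement with respect to all documents in the collection. The sketch below evaluates the example query that way. The function names and the tiny docid index are made up for illustration, and this is only one possible approach, not necessarily how a real engine would evaluate the query.

```php
<?php
/** Docids appearing in both lists. */
function queryAnd(array $a, array $b): array
{
    return array_values(array_intersect($a, $b));
}

/** Docids appearing in either list, in increasing order. */
function queryOr(array $a, array $b): array
{
    $union = array_values(array_unique(array_merge($a, $b)));
    sort($union);
    return $union;
}

/** Docids of the collection that are not in the list. */
function queryNot(array $a, array $allDocs): array
{
    return array_values(array_diff($allDocs, $a));
}

// hypothetical docid index for a six document collection
$allDocs = [1, 2, 3, 4, 5, 6];
$index = [
    "william" => [1, 2, 3, 5],
    "shakespeare" => [1, 3, 4, 5],
    "marlowe" => [3],
    "bacon" => [4, 6],
];

// "william" AND "shakespeare" AND NOT ("marlowe" OR "bacon")
$result = queryAnd(
    queryAnd($index["william"], $index["shakespeare"]),
    queryNot(queryOr($index["marlowe"], $index["bacon"]), $allDocs)
);
echo implode(", ", $result), "\n"; // 1, 5
```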

The Vector Space Model

The vector space model is one of the earliest information retrieval models, going back to work by Salton (on SMART at Harvard/Cornell) in the 1960s.
In this model, both queries and documents are viewed as vectors.
These vectors have one component for each term that occurs in the vocabulary `V` of the collection. You can think of a component as storing a real number that somehow measures how important that term was to the query or to the document.
We call the vector corresponding to a query, a query vector and a vector corresponding to a document a document vector.
Unlike term vectors, query vectors don't allow for repeated terms. They are also of much higher dimensionality, as they have a component (often many of these will be zero) for each vocabulary term.

Cosine Similarity

Given two such `|V|`-dimensional vectors: `vec x, vec y`; say one for a query, one for a document; we can measure their closeness by taking their dot product.
Recall from linear algebra that the dot product is straightforward to compute using the formula:
`vec x cdot vec y = sum_(i=1)^(|V|)x_i cdot y_i`.
The dot product has the following geometric meaning:
`vec x cdot vec y = |vec x||vec y| cos theta`
where `theta` is the angle between `vec x, vec y`. `|vec v|` means the length of the vector `vec v`. It can be computed as
`|vec v| = sqrt(sum_(i=1)^(|V|)v_i^2)`.
Using these equations, we can solve for `cos(theta)` as:
`cos(theta) = frac(sum_(i=1)^(|V|)x_i cdot y_i)(sqrt(sum_(i=1)^(|V|)x_i^2)sqrt(sum_(i=1)^(|V|)y_i^2))`.
Notice when `theta = 0^circ`, `cos(theta) = 1`, and the two vectors are collinear and could be viewed as similar, and if `theta = 90^circ`, `cos(theta) = 0`, and the two vectors are orthogonal, or as far apart as possible.
Our vectors will only have non-negative components, so we won't get negative cosine values.
Thus, given a query vector `vec q` and a document vector `vec d`, it makes sense to define their similarity as the cosine of the angle between them. That is, define: `sim(vec d, vec q) = frac(vec d)(|vec d|) cdot frac(vec q)(|vec q|)`.
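The similarity formula is easy to try out in code. In the sketch below, vectors are stored sparsely as PHP arrays mapping a term to its weight, so components that are zero are simply left out. The function names, the sparse representation, and the weights in the example are assumptions for illustration. The lecture has not yet said how the weights should be chosen.

```php
<?php
/** Dot product of two sparse vectors (term => weight). */
function dotProduct(array $x, array $y): float
{
    $sum = 0.0;
    foreach ($x as $term => $weight) {
        if (isset($y[$term])) {
            $sum += $weight * $y[$term];
        }
    }
    return $sum;
}

/** Length of a sparse vector: the square root of its dot product with itself. */
function vectorLength(array $v): float
{
    return sqrt(dotProduct($v, $v));
}

/**
 * Cosine of the angle between a document vector and a query vector.
 * Returns 0.0 if either vector has length zero.
 */
function cosineSimilarity(array $d, array $q): float
{
    $lengths = vectorLength($d) * vectorLength($q);
    return $lengths > 0 ? dotProduct($d, $q) / $lengths : 0.0;
}

// hypothetical weights
$doc = ["william" => 3, "shakespeare" => 4];
$query = ["william" => 4, "shakespeare" => 3];
echo cosineSimilarity($doc, $query), "\n"; // 24/25 = 0.96
```

Dividing by the lengths means a long document does not score higher just because its components are larger. Only the direction of the vector matters.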

Galloping Search; Document-Oriented Indexes; the Vector Space Model