Outline
- Index Types
- Vector Space Model
- Quiz
- Proximity Ranking
- Boolean Retrieval
Introduction
- Last week, we were talking about techniques to implement our inverted index ADT.
- We said the next and previous functions should be implemented using galloping search.
- We then discussed how our ADT should be modified if we view our corpus as a collection of documents rather than as a schema-independent sequence of tokens.
- In particular, we can view a term location in the corpus as a pair `n : m` where `n` is the docid and `m` is a position in a document.
- With this notation we extend our ADT with the functions: firstDoc(t), lastDoc(t), nextDoc(t, current), and prevDoc(t, current).
- Finally, we introduced some notation with respect to terms and documents: `N`, `N_t`, `f_{t,d}`, `l_d`, `l_{avg}`.
- Today, we begin by classifying document oriented inverted indexes based on what they store in their posting lists...
Posting Lists and Index Types
- When using a schema-dependent index, like a document index, it is common to modify the way we store posting lists for a term `t`.
- Rather than storing a flat sequence of offsets for that term, we instead store postings of the form:
`langle d, f_(t,d), langle p_0, ... p_(f_(t,d)-1)rangle rangle` in which we group all occurrences of the term `t` in the same document together. For example, a posting for "witch" might look like: `langle 1, 3, langle 1598, 27555, 31463 rangle rangle`, which says "witch" occurs 3 times in document 1, at locations 1598, 27555, and 31463.
- Oftentimes, storing the exact positions of the terms is unnecessary, and doing so can take a lot of space.
- Document-oriented indexes can actually be classified by what components of this basic posting 3-tuple we decide to store:
- A docid index stores for each term just the documents in which it appears.
- A frequency index stores for each term and for each document it appears in the pair `langle d, f_(t,d) rangle`
- A positional index stores for each term and for each document containing the term the full triple `langle d, f_(t,d), langle p_0, ... p_(f_(t,d)-1)rangle rangle`.
- Finally, as we have already discussed, a schema-independent index does not store any of the document-oriented grouping of a positional index, but otherwise stores the same information.
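These granularities can be sketched as plain Python data, reusing the "witch" posting above as the running example (a sketch only; a real index would store these lists in compressed form):

```python
# Hypothetical postings for the term "witch" at each document-oriented
# index granularity from the classification above.

# Docid index: just the documents containing the term.
docid_posting = [1]

# Frequency index: (docid, f_td) pairs.
frequency_posting = [(1, 3)]

# Positional index: (docid, f_td, [p_0, ..., p_{f_td - 1}]) triples.
positional_posting = [(1, 3, [1598, 27555, 31463])]

# A schema-independent index would instead store flat corpus offsets,
# with no per-document grouping.
```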
Ranking and Retrieval
- We now look at some ways we can use our document-oriented ADT for retrieving documents and ordering the results according to relevance.
- Queries for ranked retrieval are often called term vectors.
- The components of this vector when entered into an IR system are typically delimited by white space.
- So the query:
william shakespeare wedding
would be viewed as the vector `langle "william", "shakespeare", "wedding" rangle`.
- Using vectors rather than sets is useful because some meaningful queries might involve repeated terms. For example, "to be or not to be", which if we ignore the duplicates might yield sub-standard results because it would not be so closely identified with Hamlet.
- When we retrieve using term vectors we often are interested in documents whose own vector is somehow close to the query term vector.
- Closeness might not mean the document has all of the terms that occur in the term vector.
- Another way to build up queries is using the operations "AND", "OR", "NOT". For example,
"william" AND "shakespeare" AND NOT ("marlowe" OR "bacon")
- For a document to be returned in the case of an AND, both of its inputs must hold true for that document. Similarly, for OR, either or both of its inputs must be true, and NOT means its input does not hold. So for a document to be returned on the query "william" AND "shakespeare", the document must contain both terms.
- We call a query only using ANDs a conjunctive query and one only using ORs a disjunctive query.
The Vector Space Model
- The vector space model is one of the earliest information retrieval models, going back to work by Salton (on SMART at Harvard/Cornell) in the 1960s.
- In this model, both queries and documents are viewed as vectors.
- These vectors have one component for each term that occurs in the vocabulary `V` of the collection. You can think of a component as storing a real number that somehow measures how important that term was to the query or to the document.
- We call the vector corresponding to a query, a query vector and a vector corresponding to a document a document vector.
- Unlike term vectors, query vectors don't allow for repeated terms. They are also of much higher dimensionality, as they have one component (often zero) for each vocabulary term.
Cosine Similarity
- Given two such `|V|`-dimensional vectors `vec x, vec y` (say one for a query and one for a document), we can measure their closeness by taking their dot product.
- Recall from linear algebra that the dot product is straightforward to compute using the formula:
`vec x cdot vec y = sum_(i=1)^(|V|)x_i cdot y_i`.
- The dot product has the following geometric meaning:
`vec x cdot vec y = |vec x||vec y| cos theta`
where `theta` is the angle between `vec x, vec y`. `|vec v|` means the length of the vector `vec v`. It can be computed as
`|vec v| = sqrt(sum_(i=1)^(|V|)v_i^2)`.
- Using these equations, we can solve for `cos(theta)`:
`cos(theta) = frac(sum_(i=1)^(|V|)x_i cdot y_i)(sqrt(sum_(i=1)^(|V|)x_i^2)sqrt(sum_(i=1)^(|V|)y_i^2))`.
- Notice when `theta = 0^circ`, `cos(theta) = 1`, and the two vectors are collinear and could be viewed as similar, and if
`theta = 90^circ`, `cos(theta) = 0`, and the two vectors are orthogonal, or as far apart as possible.
- We will only have positive components of our vectors so we won't get negative cosine values.
- Thus, if given a query vector `vec q` and a document vector `vec d` it makes sense to define their similarity as the cosine of the angle between them. i.e., Define:
`sim(vec d, vec q) = frac(vec d)(|vec d|)cdot frac(vec q)(|vec q|)`
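A minimal sketch of this similarity computation in Python, with short dense lists standing in for the `|V|`-dimensional vectors:

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between vectors x and y (lists of floats)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Collinear vectors have similarity 1; orthogonal vectors have similarity 0.
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # 1.0 (up to float rounding)
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))  # 0.0
```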
TF-IDF
- Using our vector space model, the dimensionality of the vectors in question might be in the millions -- there could be a lot of distinct words in our corpus. So at first blush it might seem that computing this cosine similarity would be costly, even for a small corpus.
- The important thing to note is that the actual runtime depends on only the number of non-zero entries in the vector with fewer non-zero entries (usually the query), and this is typically small.
- We still have to figure out what real values to use for the components of our query and document vectors.
- The most common way to do this is to use TF-IDF weights.
- Here TF means Term Frequency and is some function which measures how common the term is in the given document, and IDF is Inverse Document Frequency, which typically relates the document frequency to the total number of documents in the corpus.
- Over the years, several different functions have been proposed for TF and for IDF. For this class, we will define `IDF = log(frac(N)(N_t))` and we will define `TF = log (f_(t,d)) + 1` if `f_(t,d) > 0` and `0` otherwise.
- We use logs (base 2) for TF since this models the number of bits of information we gain by knowing the exact number of occurrences of the term in the document.
- Notice if a term appears in close to every document, it doesn't tell you much to say the document has the term. The IDF in this case is small.
- For example, if we have a collection of five documents and the word "sir" appears in four, then `IDF(sir) = log(5/4) = 0.32`. If in document 2 the term "sir" occurs twice, then its TF would be `log(f_(t,d)) + 1 = 2`. So the TF-IDF("doc 2","sir"), which is the product of these two values, would be 0.64.
- This would represent the weight we would use for the "sir" component of doc 2's document vector. We could have one such component for each vocabulary term.
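The arithmetic in the "sir" example can be checked directly from the TF and IDF definitions above (base-2 logs):

```python
import math

def tf(f_td):
    # log_2(f_td) + 1 if f_td > 0, else 0, as defined in the notes
    return math.log2(f_td) + 1 if f_td > 0 else 0.0

def idf(N, N_t):
    # log_2(N / N_t)
    return math.log2(N / N_t)

# Five documents, "sir" appears in four of them, twice in document 2.
w = tf(2) * idf(5, 4)
print(round(w, 2))  # 0.64
```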
Algorithm to Compute Cosine Rank
rankCosine(t[1],...,t[n], k)
// t is an array of query terms
// k is the number of documents we want to return
{
    j := 1
    d := min_(1 <= i <= n) nextDoc(t[i], -infty)
    // we only need to consider docs containing at least one term
    while d < infty do
        Result[j].docid := d;
        Result[j].score := sim(vec d, vec t);
        j++;
        d := min_(1 <= i <= n) nextDoc(t[i], d);
    sort Result by score;
    return Result[1..k];
}
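The rankCosine pseudocode can be sketched as runnable Python over a toy frequency index. The index contents, the uniform query weights, and restricting the document norm to query terms are all simplifying assumptions (a real system would precompute full document norms):

```python
import math

# Toy frequency index: term -> {docid: f_td}; an assumption standing in
# for the course ADT. Docids are positive, so -1 stands in for -infty.
INDEX = {
    "william":     {1: 2, 3: 1},
    "shakespeare": {1: 1, 2: 1, 3: 2},
}
N = 4  # total documents in the toy corpus

def next_doc(term, current):
    """Smallest docid > current containing term, or None (stands in for infty)."""
    return next((d for d in sorted(INDEX.get(term, {})) if d > current), None)

def tfidf(term, d):
    f_td = INDEX[term].get(d, 0)
    if f_td == 0:
        return 0.0
    return (math.log2(f_td) + 1) * math.log2(N / len(INDEX[term]))

def sim(d, terms):
    # Cosine similarity restricted to query-term components; all query
    # weights are taken to be 1.0 (a simplification).
    doc_w = [tfidf(t, d) for t in sorted(set(terms))]
    dot = sum(doc_w)
    norm_d = math.sqrt(sum(w * w for w in doc_w)) or 1.0
    norm_q = math.sqrt(len(set(terms)))
    return dot / (norm_d * norm_q)

def rank_cosine(terms, k):
    result = []
    d = min((x for x in (next_doc(t, -1) for t in terms) if x is not None),
            default=None)
    while d is not None:  # only docs containing at least one term
        result.append((d, sim(d, terms)))
        d = min((x for x in (next_doc(t, d) for t in terms) if x is not None),
                default=None)
    result.sort(key=lambda p: p[1], reverse=True)
    return result[:k]
```

In this toy corpus, document 3 ranks first: it contains both terms with balanced weights, so its vector makes the smallest angle with the query.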
Quiz
Which of the following is true?
- PHP has different types for tuples, arrays, associative arrays.
- If `l` is the length of the shortest posting list and `L` is the length of the longest posting list for terms in a query, then the runtime of the return-all-phrase-occurrences algorithm is `O(n cdot l cdot log (L/l))` if galloping search is used for next() and prev().
- Using a schema-independent index it is impossible to handle simple XPath queries, like determining if a passage is contained within a pair of XML tags.
Proximity Ranking
- Results and ranking in the VSM depend only on term frequency, TF, and Inverse Document Frequency, IDF.
- The proximity ranking method, in contrast, depends explicitly on how close together the terms in the term vector occur;
it depends only implicitly on TF, and IDF plays no role at all.
- Given a term vector `langle t_1, ... t_n rangle`, we define a cover for this vector to be an interval
`[u,v]` in the collection that contains all the terms in the vector, such that no smaller interval `[u', v']`
contained in `[u,v]` also has a match to all the terms in the vector.
- For example, if the document 1 in our collection began as "To be or not to be", and the term vector was "to be" then the intervals `[1:1, 1:2]` and `[1:5, 1:6]`
would be covers, but `[1:1, 1:3]` would not be as it contains `[1:1, 1:2]`.
- Covers are allowed to overlap. If "meow meow" was the term vector and document 2 began "meow meow meow meow meow meow ..." then
the intervals `[2:1, 2:2]` and `[2:2, 2:3]` would both be covers.
- Even so, a query vector of `n` terms can have at most `n cdot l` covers in total, where `l` is the length of the shortest posting list.
Algorithm for Finding Covers
nextCover(t[1],..., t[n], position)
{
    v := max_(1≤ i ≤ n) next(t[i], position);
    if (v == infty) then
        return [infty, infty];
    u := min_(1≤ i ≤ n) prev(t[i], v+1);
    // covers need to be in the same document
    if (docid(u) == docid(v)) then
        return [u, v];
    else
        return nextCover(t[1],..., t[n], u);
}
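A runnable sketch of nextCover over a toy single-document collection; plain integer positions and binary search (bisect) stand in for the schema-independent ADT with galloping search. Note that it also surfaces the cover `[2, 5]` ("be or not to", which contains both terms) that the "to be" example above did not list, since a cover may contain the terms in any order:

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Toy schema-independent postings for document 1 = "to be or not to be",
# with tokens at positions 1..6 (an assumed layout).
POSTINGS = {"to": [1, 5], "be": [2, 6]}

def next_(t, current):
    ps = POSTINGS[t]
    i = bisect_right(ps, current)
    return ps[i] if i < len(ps) else INF

def prev_(t, current):
    ps = POSTINGS[t]
    i = bisect_left(ps, current)
    return ps[i - 1] if i > 0 else -INF

def docid(p):
    # single-document toy corpus: every finite position lives in document 1
    return 1 if p not in (INF, -INF) else None

def next_cover(terms, position):
    v = max(next_(t, position) for t in terms)
    if v == INF:
        return (INF, INF)
    u = min(prev_(t, v + 1) for t in terms)
    if docid(u) == docid(v):       # covers must lie in one document
        return (u, v)
    return next_cover(terms, u)

def all_covers(terms):
    covers, p = [], -INF
    while True:
        u, v = next_cover(terms, p)
        if u == INF:
            break
        covers.append((u, v))
        p = u
    return covers

print(all_covers(["to", "be"]))  # [(1, 2), (2, 5), (5, 6)]
```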
Ranking Covers
- We now want to come up with a formula that scores a document according to the covers it contains for the term query.
- We would like that smaller covers are worth more than larger covers.
- We would also like more covers in a document to count more than fewer covers in a document.
- Keeping this in mind, suppose document `d` has covers `[u_1, v_1]`, `[u_2, v_2]`, ...
- Then we define the score for `d` to be:
`mbox(score)(d) = sum_i frac(1)(v_i - u_i + 1)`.
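As a quick numeric check, a document with three hypothetical covers of widths 2, 2, and 4 scores:

```python
# Hypothetical covers for one document: widths 2, 2, and 4.
covers = [(1, 2), (5, 6), (2, 5)]
score = sum(1 / (v - u + 1) for u, v in covers)
print(score)  # 0.5 + 0.5 + 0.25 = 1.25
```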
Ranking Algorithm with Proximity Scores
rankProximity(t[1],..., t[n], k)
// t[] term vector
// k number of results to return
{
    u := -infty;
    [u,v] := nextCover(t[1],..., t[n], u);
    d := docid(u);
    score := 0;
    j := 0;
    while (u < infty) do
        if (d < docid(u)) then
            // if docid changes, record info about the last docid
            j := j + 1;
            Result[j].docid := d;
            Result[j].score := score;
            d := docid(u);
            score := 0;
        score := score + 1/(v - u + 1);
        [u,v] := nextCover(t[1],..., t[n], u);
    if (d < infty) then
        // record last score if not recorded
        j := j + 1;
        Result[j].docid := d;
        Result[j].score := score;
    sort Result[1..j] by score;
    return Result[1..k];
}
Using an analysis similar to that used for galloping search in the book, you can prove this algorithm has running time:
`O(n^2 l cdot log(L/l))`.
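The whole pipeline can be sketched in Python by encoding positions as (docid, offset) pairs, which order correctly under tuple comparison; the two-document corpus and the bisect-based next/prev are assumptions standing in for the real ADT:

```python
from bisect import bisect_left, bisect_right

INF = (float("inf"), 0)
NEG_INF = (float("-inf"), 0)

# Toy positional data as sorted (docid, offset) pairs; an assumed corpus
# where doc 1 = "to be or not to be" and doc 2 = "be to".
POSTINGS = {
    "to": [(1, 1), (1, 5), (2, 2)],
    "be": [(1, 2), (1, 6), (2, 1)],
}

def next_(t, cur):
    ps = POSTINGS[t]
    i = bisect_right(ps, cur)
    return ps[i] if i < len(ps) else INF

def prev_(t, cur):
    ps = POSTINGS[t]
    i = bisect_left(ps, cur)
    return ps[i - 1] if i > 0 else NEG_INF

def docid(p):
    return p[0]

def next_cover(terms, pos):
    v = max(next_(t, pos) for t in terms)
    if v == INF:
        return INF, INF
    u = min(prev_(t, (v[0], v[1] + 1)) for t in terms)
    if docid(u) == docid(v):       # covers must lie in one document
        return u, v
    return next_cover(terms, u)

def rank_proximity(terms, k):
    results = []
    u, v = next_cover(terms, NEG_INF)
    d, score = docid(u), 0.0
    while u != INF:
        if docid(u) != d:          # docid changed: record the previous doc
            results.append((d, score))
            d, score = docid(u), 0.0
        score += 1.0 / (v[1] - u[1] + 1)
        u, v = next_cover(terms, u)
    if d != float("inf"):          # record the last document, if any
        results.append((d, score))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]

print(rank_proximity(["to", "be"], 2))  # [(1, 1.25), (2, 0.5)]
```

Document 1 has covers of widths 2, 4, and 2 (score 1.25) while document 2 has a single width-2 cover (score 0.5), so document 1 ranks first.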
Boolean Retrieval
- Besides being used as implicit filters in some web search engines, explicit support for Boolean queries is important for some specific applications such as digital libraries and in the legal domain.
- For example, when we do a patent search, we want an exhaustive list of patents returned which match the term query, not just the top 10.
- In the Boolean retrieval model we thus return sets of results rather than ranked lists.
- The standard Boolean operators AND, OR, and NOT are used to construct queries.
- Here "A AND B" means the intersection of sets A and B; "A OR B" means the union of the sets A and B; and NOT A means those documents in the collection not contained in A.
- A term `t` corresponds in a query to the collection of documents containing `t`.
- So ("quarrel" OR "sir") AND "you" represents the set of documents that contain "you" and also contain at least one of the two terms "quarrel" and "sir".
- An algorithm to solve a Boolean query locates candidate solutions to the query, where each candidate solution represents a range of documents that together satisfy the query.
Extending Our ADT for Boolean Retrieval
- To simplify our Boolean search algorithm, we define two functions that operate over Boolean queries:
- docRight(Q, u)
- end point of the first candidate solution to `Q` starting after document `u`
- docLeft(Q, v)
- start point of the last candidate solution to `Q` ending before document `v`
- For terms we define docRight(t, u) := nextDoc(t, u) and docLeft(t,v) := prevDoc(t,v).
- For AND and OR operators we define:
docRight(A AND B, u) := max(docRight(A, u), docRight(B,u))
docLeft(A AND B, v) := min(docLeft(A,v), docLeft(B,v))
docRight(A OR B, u) := min(docRight(A, u), docRight(B,u))
docLeft(A OR B, v) := max(docLeft(A,v), docLeft(B,v))
- Notice the above rules give us a recursive algorithm given a Boolean query `Q` to compute docRight, docLeft.
Algorithm to Return the Next Solution to a Positive Boolean Query (no NOTs)
nextSolution(Q, position)
{
    v := docRight(Q, position);
    if (v == infty) then
        return infty;
    u := docLeft(Q, v+1);
    if (u == v) then
        return u;
    else
        return nextSolution(Q, v);
}
Algorithm to Return All Solutions to a Positive Boolean Query
u := -infty;
while u < infty do
    u := nextSolution(Q, u);
    if (u < infty) then
        report docid(u);
If we implement nextDoc, prevDoc with galloping search, the complexity of this algorithm is
`O(n cdot l cdot log(L/l))`
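Assuming a small hand-built docid index, the docRight/docLeft recursion and nextSolution can be sketched as follows (queries are nested tuples; over a docid index a candidate solution is a single document, hence the u == v test):

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Toy docid postings for each term (sorted); an assumed corpus.
POSTINGS = {
    "quarrel": [2],
    "sir":     [1, 2, 4, 5],
    "you":     [1, 3, 5],
}

def next_doc(t, u):
    ps = POSTINGS[t]
    i = bisect_right(ps, u)
    return ps[i] if i < len(ps) else INF

def prev_doc(t, v):
    ps = POSTINGS[t]
    i = bisect_left(ps, v)
    return ps[i - 1] if i > 0 else -INF

# A query is a term string, or ("AND", A, B), or ("OR", A, B).
def doc_right(q, u):
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    f = max if op == "AND" else min
    return f(doc_right(a, u), doc_right(b, u))

def doc_left(q, v):
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    f = min if op == "AND" else max
    return f(doc_left(a, v), doc_left(b, v))

def next_solution(q, position):
    v = doc_right(q, position)
    if v == INF:
        return INF
    u = doc_left(q, v + 1)
    if u == v:                 # candidate spans a single document: a solution
        return u
    return next_solution(q, v)

def all_solutions(q):
    out, u = [], -INF
    while u < INF:
        u = next_solution(q, u)
        if u < INF:
            out.append(u)
    return out

query = ("AND", ("OR", "quarrel", "sir"), "you")
print(all_solutions(query))  # [1, 5]
```

This matches the earlier example: documents 1 and 5 are exactly those containing "you" together with at least one of "quarrel" and "sir".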
Handling Queries with NOT
- To handle negations we first use De Morgan's rules to push negations so that they are only on terms.
- i.e., NOT (A AND B) = (NOT A) OR (NOT B) and NOT (A OR B) = (NOT A) AND (NOT B)
- Then we enhance our methods nextDoc and prevDoc, so that they can directly handle negations. i.e., So nextDoc(NOT t, u) and prevDoc(NOT t, u) make sense.
- Finally, we set docRight(NOT t, u) := nextDoc(NOT t, u) and docLeft(NOT t,v) := prevDoc(NOT t,v).