Query Processing

Prior to the midterm, we were talking about the data structures used to implement an inverted index.
We now look at methods to implement query processing using these data structures.
Many IR systems implement some kind of Boolean Model for retrieval.
We will look at ranked retrieval and lightweight structure (e.g., all documents that have "drug" and "aspirin" within 10 words of each other) as retrieval methods.
For now, we will be more interested in the efficiency of retrieval over effectiveness measures of results.

Query Processing for Ranked Retrieval

Boolean and ranked retrieval are often complementary: one typically determines a set of documents using a Boolean retrieval method and then ranks those documents according to some ranking method.
Given a query vector ("greek", "philosophy", "stoicism"), there are two natural ways to interpret the query: we are looking for the documents that have all of these terms; or we are looking for the documents that have any of these terms, where matching more terms is better than matching fewer.
The former case, called conjunctive query processing, is favored by search engines; the latter, called disjunctive query processing, by more traditional IR systems.
In TREC 45, according to the book, 7834 documents match the disjunctive interpretation, but only a single document matches the conjunctive version.
That one matching document was about a Mexican actor's film projects, so it was not relevant.
Many of the disjunctive results were relevant. One reason is that a document about stoicism can presume the reader already knows that stoicism is a philosophy and that it originated in Greece, so it need not mention either word.
The book prefers disjunctive query processing, both for the reason above and because the authors feel that adding terms to a query should not hurt search results.

Okapi BM25

Once we have retrieved our documents, we need to score and sort them.
For now, we will assume we are doing this using Okapi BM25 (BM25 for short).
Later in the semester we will show how sophisticated scoring functions can be derived. For BM25, we are just going to define it and give some intuition. First, the definition:
`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
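`TF_(BM25)(t,d) = frac(f_(t,d) cdot (k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
Here `N` is the number of documents in the collection, `N_t` is the number of documents containing `t`, `f_(t,d)` is the number of occurrences of `t` in `d`, `l_d` is the length of `d`, and `l_(avg)` is the average document length.
The two free parameters in the above, `k_1` and `b`, are usually set to `1.2` and `0.75` respectively.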
`b` controls the degree of document length normalization.
`k_1` caps the TF contribution of an individual query term, since
`lim_(f_(t,d) -> infty) TF_(BM25)(t,d) = k_1 +1`.
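
The definition above can be sketched in Python roughly as follows. This is a minimal illustration, not the book's code: the function names, the representation of a document as a list of tokens, and the `doc_freq` dictionary mapping a term to `N_t` are all assumptions made for the example. The natural logarithm is also an assumption; a different base only rescales every score by a constant.

```python
import math

def tf_bm25(f_td, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component for a term occurring f_td times in a document."""
    if f_td == 0:
        return 0.0
    length_norm = (1 - b) + b * (doc_len / avg_len)
    return f_td * (k1 + 1) / (f_td + k1 * length_norm)

def idf(n_docs, n_t):
    """IDF(t) = log(N / N_t); natural log is an assumption of this sketch."""
    return math.log(n_docs / n_t)

def score_bm25(query, doc_tokens, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum IDF(t) * TF_BM25(t, d) over the query terms that occur in the document."""
    total = 0.0
    for t in query:
        f_td = doc_tokens.count(t)  # hypothetical in-memory document, not an index lookup
        if f_td == 0 or t not in doc_freq:
            continue
        total += idf(n_docs, doc_freq[t]) * tf_bm25(f_td, len(doc_tokens), avg_len, k1, b)
    return total
```

For a document of average length, `tf_bm25(1, 100, 100)` is `1.0`, and the value approaches but never reaches `k1 + 1 = 2.2` as `f_td` grows.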

BM25 Example

Suppose we have a two-term query `q = (t_1, t_2)` where `N_(t_1) approx N_(t_2)`, and we are scoring two documents `d_1` and `d_2` of average length `l_(d_1) = l_(d_2) =l_(avg)`. Suppose `d_1` contains one occurrence of each of `t_1` and `t_2`, while `d_2` contains `10` occurrences of `t_1` and no occurrences of `t_2`.
Then we have: `S\c\o\r\e_(BM25)(q, d_1) = log(frac(N)(N_(t_1))) cdot (2 cdot frac(1 cdot (k_1 + 1))(1 + k_1)) approx 2 cdot log(frac(N)(N_(t_1)))`
`S\c\o\r\e_(BM25)(q, d_2) = log(frac(N)(N_(t_1))) cdot (1 cdot frac(10 cdot (k_1 + 1))(10 + k_1)) approx 1.96 cdot log(frac(N)(N_(t_1)))`
assuming `k_1 = 1.2`.
So `d_1`, which matches both terms once, scores slightly higher than `d_2`, which matches one term ten times.
Later, when doing disjunctive query processing, we will use the upper bounds provided by `k_1` to ignore postings for which we know a priori that they cannot push the corresponding document into the top of search results.
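
The arithmetic in the example can be checked with a few lines of Python. This is only a sketch: the helper name `tf_bm25_avg_len` is made up for the illustration, and it hard-codes the assumption `l_d = l_(avg)`, so the length normalization term equals 1 and `b` drops out.

```python
def tf_bm25_avg_len(f_td, k1=1.2):
    """BM25 TF component for a document of exactly average length."""
    return f_td * (k1 + 1) / (f_td + k1)

# d_1: one occurrence each of t_1 and t_2.
d1 = tf_bm25_avg_len(1) + tf_bm25_avg_len(1)
# d_2: ten occurrences of t_1, none of t_2.
d2 = tf_bm25_avg_len(10)

# Both factors multiply the same IDF value, log(N / N_(t_1)).
print(round(d1, 3), round(d2, 3))  # 2.0 1.964
```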

Quiz

Which of the following is true?

A per-term index is an index with both a sort-based and a hash-based dictionary.
The move-to-front heuristic is used as part of sort-based dictionary construction.
In merge-based index construction, if the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition.

Document-at-a-Time Query Processing

Document-at-a-Time Query Processing is the most popular form of query processing for ranked retrieval.
In this method all matching documents are enumerated, one after the other, and a score is computed for each.
Once all documents have been processed, the documents are sorted according to their score and the top `k` results are returned.
To use BM25 in such an algorithm, we can take our code for rankCosine and simply replace the line where the score is calculated with the BM25 score calculation.
The overall time complexity of this algorithm is `Theta(m cdot n + m cdot log m)`, where `n` is the number of query terms and `m` is the number of matching documents (those containing at least one query term).
The `m cdot n` comes from the while loop, the `m cdot log m` from doing a sort of the result.
Notice that sorting all `m` results is wasteful if we are only interested in returning the top `k`.
Also, in computing the score, as well as the min over nextDoc, we need to iterate over all terms in the query regardless of whether the document contains them or not.
We can use heaps to address these two problems.
To get a bound on the resulting algorithm, we introduce a little notation.
Let `N_q = N_(t_1) + N_(t_2) + cdots + N_(t_n)` be the total number of postings for the query terms. Then `m` is somewhere between `N_q/n` and `N_q`, and the run time bound above becomes `Theta(N_q cdot n + N_q cdot log N_q)`. In practice, most matching documents contain only one of the query terms, so `m` tends to be closer to `N_q` than to `N_q/n`.
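
The basic document-at-a-time loop described in this section might look as follows in Python. This is a sketch under stated assumptions, not the rankCosine code referred to above: the index is a hypothetical dictionary from term to a sorted list of docids, `next_doc` is implemented with binary search, and `score_term(t, d)` stands in for whatever per-term score is used (for BM25, `IDF(t)` times `TF_(BM25)(t,d)`).

```python
import math
from bisect import bisect_left, bisect_right

INF = math.inf

def next_doc(postings, current):
    """Smallest docid in the sorted list `postings` that is > current, else INF."""
    i = bisect_right(postings, current)
    return postings[i] if i < len(postings) else INF

def contains(postings, d):
    """True if docid d occurs in the sorted list `postings`."""
    i = bisect_left(postings, d)
    return i < len(postings) and postings[i] == d

def rank_document_at_a_time(query, index, score_term, k):
    lists = [(t, index.get(t, [])) for t in query]
    results = []
    d = min((next_doc(p, -INF) for _, p in lists), default=INF)
    while d < INF:
        score = 0.0
        for t, p in lists:          # visits all n terms for every document
            if contains(p, d):
                score += score_term(t, d)
        results.append((score, d))
        d = min(next_doc(p, d) for _, p in lists)
    # Sorts all m results even though only k are returned.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results[:k]
```

The two comments mark the inefficiencies discussed above: the inner loop over every query term, and the full sort of all matching documents.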

Binary Heaps

Recall that a (binary) heap is a binary tree that satisfies: (1) the empty binary tree is a heap, (2) a non-empty binary tree `T` is a heap if (a) `T` is completely filled on all levels, except possibly the deepest one, (b) `T`'s deepest level is filled from left to right, (c) for each node `v` in `T`, the value stored in `v` is no larger than the value stored in any of its children. This is the min-heap variant, in which the smallest value sits at the root.
Heaps can be represented either using trees or using arrays, the latter almost always being used as it is faster.
Heaps support an operation called REHEAP, in which we replace the root value of the tree with a new value and push the new value down the heap until the heap property is restored.
REHEAP can be done in `O(log n)` steps, where `n` here is the number of elements in the heap.
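
As an illustration, REHEAP on an array-based min-heap could be written as below. The function name `reheap` and the 0-based array layout (children of index `i` at `2i + 1` and `2i + 2`) are choices made for this sketch. Python's standard library offers the same operation as `heapq.heapreplace`, which may leave the array in a different but equally valid heap order.

```python
import heapq

def reheap(heap, new_value):
    """Replace the root of an array-based min-heap and sift the new value down."""
    heap[0] = new_value
    i, n = 0, len(heap)
    while True:
        left, right, smallest = 2 * i + 1, 2 * i + 2, i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return  # heap property holds at i, so it holds everywhere
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

heap = [1, 3, 2, 7, 4]
reheap(heap, 5)
print(heap)  # [2, 3, 5, 7, 4]
```

The loop descends one level per iteration, which gives the logarithmic bound.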

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one to manage the query terms and, for each term t, keep track of the next document that contains t; the other one to maintain the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[0], ..., t[n-1]), k) {
    // create a min-heap for the top k results; all scores start at 0
    for (i := 0 to k-1) {
        results[i].score := 0;
    }
    // create a min-heap for the terms, keyed on nextDoc
    for (i := 0 to n-1) {
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establish heap for terms
    while (terms[0].nextDoc < infty) {
        d := terms[0].nextDoc;
        score := 0;
        while (terms[0].nextDoc == d) {
            t := terms[0].term;
            score += log(N/N_t) * TF_(BM25)(t, d);
            terms[0].nextDoc := nextDoc(t, d);
            REHEAP(terms); // restore heap property for terms
        }
        if (score > results[0].score) {
            results[0].docid := d;
            results[0].score := score;
            REHEAP(results); // restore heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

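The complexity of this algorithm is `Theta(N_q cdot log(n) + N_q cdot log(k))`: each of the `N_q` postings triggers one REHEAP on the term heap, costing `log(n)`, and at most one REHEAP on the results heap, costing `log(k)`.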
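
A runnable Python sketch of the two-heap algorithm is given below. It makes several assumptions that are not in the pseudocode: the index is a dictionary from term to a sorted list of docids, `next_doc` uses binary search, `score_term(t, d)` stands in for `log(N/N_t)` times `TF_(BM25)(t,d)`, and `heapq.heapreplace` from the standard library plays the role of REHEAP.

```python
import heapq
import math
from bisect import bisect_right

INF = math.inf

def next_doc(postings, current):
    """Smallest docid in the sorted list `postings` that is > current, else INF."""
    i = bisect_right(postings, current)
    return postings[i] if i < len(postings) else INF

def rank_with_heaps(query, index, score_term, k):
    # Term heap: (next docid, term) pairs, smallest next docid at the root.
    terms = [(next_doc(index.get(t, []), -INF), t) for t in query]
    heapq.heapify(terms)
    # Result heap: (score, docid) pairs, lowest score at the root.
    # The k placeholders with score 0 are filtered out at the end.
    results = [(0.0, -1)] * k
    while terms and terms[0][0] < INF:
        d = terms[0][0]
        score = 0.0
        # Only the terms that actually contain d are touched here.
        while terms[0][0] == d:
            t = terms[0][1]
            score += score_term(t, d)
            heapq.heapreplace(terms, (next_doc(index.get(t, []), d), t))
        if k > 0 and score > results[0][0]:
            heapq.heapreplace(results, (score, d))
    ranked = [(s, d) for s, d in results if s > 0]
    ranked.sort(key=lambda r: (-r[0], r[1]))
    return ranked
```

With `index = {"a": [1, 3, 5], "b": [3, 4]}` and a `score_term` that always returns `1.0`, the call `rank_with_heaps(["a", "b"], index, score_term, 2)` returns `[(2.0, 3), (1.0, 1)]`: document 3 contains both terms, and document 1 keeps its place because later documents with the same score do not strictly beat it.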

MaxScore

Recall that the term frequency contribution of BM25 can never exceed `k_1 + 1 = 2.2` (for `k_1 = 1.2`).
So the overall score contribution of a term `t` is bounded from above by `2.2 cdot log(N/N_t)`.
This bound is called the MaxScore.
On the query (greek, philosophy, stoicism), the MaxScores for these three terms might be `15.1`, `16.1`, and `28.9` respectively.
If we are only interested in the top 10 results, it might happen that the lowest-scoring of the top 10 results we have seen so far has a score greater than `15.1`.
What this means is that a document that contains only the word "greek", and neither of the other two words, will never get into the top 10 results.
We can then remove "greek" from the term heap. We still add the "greek" contribution to a document's score, but only for documents found through the other two terms in the heap.
When the score of the `k`th best result so far gets above `31.2` (the sum of the MaxScores of "greek" and "philosophy"), we can remove "philosophy" from the heap and only look at documents containing "stoicism".
This strategy is called MaxScore. In the book's tests it triples the query speed for `k=10`, so that a disjunctive query takes only 1.5 times as long as a conjunctive query.
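
The decision of which terms may leave the term heap can be sketched as follows. This shows only the pruning test, not the full MaxScore query processor, and the function names and the dictionary of per-term bounds are assumptions made for the illustration. Terms are considered in increasing order of their MaxScore, and a term is removed while the running sum of bounds stays below the score of the `k`th best result so far.

```python
import math

def max_score(n_docs, n_t, k1=1.2):
    """Upper bound on a term's BM25 contribution: (k1 + 1) * log(N / N_t)."""
    return (k1 + 1) * math.log(n_docs / n_t)

def removable_terms(max_scores, threshold):
    """Terms that can be dropped from the term heap.

    max_scores maps term -> MaxScore bound; threshold is the score of the
    k-th best result seen so far.
    """
    removed, bound = [], 0.0
    for term, ms in sorted(max_scores.items(), key=lambda kv: kv[1]):
        if bound + ms >= threshold:
            break  # a document with only the removed terms plus this one could still qualify
        bound += ms
        removed.append(term)
    return removed

bounds = {"greek": 15.1, "philosophy": 16.1, "stoicism": 28.9}
print(removable_terms(bounds, 20.0))  # ['greek']
print(removable_terms(bounds, 32.0))  # ['greek', 'philosophy']
```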

Term-at-a-Time Query Processing

Instead of merging the query terms' postings lists with a heap, we could score all documents for term 1 and store the results in an array of accumulators, then scan the postings list of term 2, compute scores, and add these to the scores in the accumulators, and so on.
This approach to query processing is called term-at-a-time query processing.
The index is stored on disk, and if we use document-at-a-time query processing we tend to access this disk in a non-sequential fashion.
On the other hand, each postings list tends to be laid out sequentially on disk, so disk access in term-at-a-time query processing should be faster.
Since term-at-a-time accesses each posting list separately, it is typically only used for scoring functions that are of the form:
`sc\o\re(q,d) = quality(d) + sum_(t in q) sc\o\re(t,d)`.
Here `quality(d)` is an optional query-independent score component such as PageRank.
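
A minimal term-at-a-time sketch in Python, for a scoring function of the form above, could look like this. The index layout (a dictionary from term to a list of docids), the use of a dictionary rather than a fixed array for the accumulators, and the callables `score_term` and `quality` are all assumptions of the example.

```python
from collections import defaultdict

def rank_term_at_a_time(query, index, score_term, k, quality=None):
    """Process one postings list at a time, summing per-term scores per document."""
    acc = defaultdict(float)  # accumulators: docid -> partial score
    for t in query:
        for d in index.get(t, []):  # one sequential scan per postings list
            acc[d] += score_term(t, d)
    if quality is not None:
        for d in acc:
            acc[d] += quality(d)  # optional query-independent component
    ranked = sorted(acc.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

Each postings list is read once from start to end, which is what makes the access pattern sequential. The cost is that an accumulator is kept for every matching document until the last term has been processed.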
