Query Processing




CS267

Chris Pollett

Oct. 15, 2012

Outline

Query Processing

Query Processing for Ranked Retrieval

Quiz

Which of the following is true?

  1. Synchronization points are typically stored in the index dictionary.
  2. With the move-to-front heuristic one can reduce the table-size, excluding chains, of a hash table implementation of an inverted index dictionary, yet still achieve good performance.
  3. Merge-based Index construction makes use of n-way disk merge sorts.

HW Problem

Exercise 4.2. Implement a fake posting list lookup in two different ways to see whether a per-term index can help even when the posting lists are in RAM. In the first method, assume you have n 32-bit postings and do a lookup using binary search. In the second method, keep a per-term index entry for every 64th posting; do a binary search on the per-term index, then a sequential scan of the remainder. Test `n = 2^(12)`, `2^(16)`, `2^(20)`, `2^(24)` and analyze your findings.

Answer. I didn't actually get around to implementing this, but let me tell you roughly what I would expect. In terms of comparisons, method 1 would require `12`, `16`, `20`, and `24` comparisons on average, respectively, to do the lookup. For method 2, the binary search part would take `log(n/2^6)` comparisons, followed by `32` comparisons on average for the sequential scan. So the average number of comparisons would be `38`, `42`, `46`, and `50`, respectively. This is not the whole story, though, because of the way memory is read into the CPU cache. Some description of this is given in Appendix A.2. Memory is read from RAM into the cache in lines of 64 bytes. So if each binary search comparison resulted in a cache miss, we would still expect time proportional to the number of comparisons/cache-line lookups in the first case, but in the second case it would be proportional to `6 + 2 = 8`, `10 + 2 = 12`, `14 + 2 = 16`, and `18 + 2 = 20` cache-line reads. The `2` is because the expected `32` postings scanned sequentially (32 postings × 4 bytes = 128 bytes) fit in 2 cache lines.
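The two lookup methods can be sketched in Python as follows (a sketch only; the `stride = 64` default matches the per-term index granularity in the exercise, and timing/measurement code is omitted):

```python
import bisect

def lookup_binary(postings, target):
    """Method 1: plain binary search over all n postings."""
    i = bisect.bisect_left(postings, target)
    return i if i < len(postings) and postings[i] == target else -1

def build_per_term_index(postings, stride=64):
    """Keep every 64th posting as a per-term index entry."""
    return postings[::stride]

def lookup_per_term(postings, index, target, stride=64):
    """Method 2: binary search the small per-term index to find the
    right block of 64 postings, then scan that block sequentially."""
    block = bisect.bisect_right(index, target) - 1
    if block < 0:
        return -1
    start = block * stride
    for i in range(start, min(start + stride, len(postings))):
        if postings[i] == target:
            return i
        if postings[i] > target:
            break
    return -1
```

Wrapping each lookup in a timing loop over the four values of `n` would let you check the cache-behavior prediction above experimentally.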

Okapi BM25
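The body of this slide is not in the transcript; for reference, the usual Okapi BM25 ranking formula, in the notation the pseudocode below uses (with `TF_BM25` as the term-frequency component), is:

```latex
\mathrm{Score}_{\mathrm{BM25}}(q, d)
  = \sum_{t \in q} \log\!\frac{N}{N_t} \cdot \mathrm{TF}_{\mathrm{BM25}}(t, d),
\qquad
\mathrm{TF}_{\mathrm{BM25}}(t, d)
  = \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\!\left((1 - b) + b \cdot \frac{l_d}{l_{\mathrm{avg}}}\right)}
```

Here `f_{t,d}` is the frequency of term `t` in document `d`, `l_d` is the length of `d`, `l_avg` is the average document length, `N` is the number of documents, `N_t` is the number of documents containing `t`, and `k_1` (typically around 1.2) and `b` (typically around 0.75) are tuning parameters.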

BM25 Example

Document-at-a-Time Query Processing

Binary Heaps
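The body of this slide is not in the transcript, but the REHEAP operation used by the algorithm below is an ordinary sift-down on a binary min-heap. A minimal sketch (the `key` parameter is my addition, so the same routine serves both the terms heap, keyed on nextDoc, and the results heap, keyed on score):

```python
def reheap(heap, key, i=0):
    """Sift the element at index i down until the min-heap property
    (key(parent) <= key(child)) is restored; O(log n) swaps."""
    n = len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and key(heap[left]) < key(heap[smallest]):
            smallest = left
        if right < n and key(heap[right]) < key(heap[smallest]):
            smallest = right
        if smallest == i:
            return  # heap property holds; done
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
```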

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one over the query terms that, for each term t, keeps track of the next document containing t; and a min-heap that maintains the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], ..., t[n]), k) {
    for (i = 1 to k) { // create a min-heap of k empty top results
        results[i].score := 0;
    }
    for (i = 1 to n) { // create a min-heap over the query terms
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establish heap for terms
    while (terms[0].nextDoc < infty) {
        d := terms[0].nextDoc;
        score := 0;
        while (terms[0].nextDoc == d) { // accumulate score over all query terms in d
            t := terms[0].term;
            score += log(N/N_t) * TF_BM25(t, d);
            terms[0].nextDoc := nextDoc(t, d);
            REHEAP(terms); // restore the heap property for terms
        }
        if (score > results[0].score) { // d beats the current k-th best result
            results[0].docid := d;
            results[0].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}
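A compact executable version of the pseudocode above, assuming posting lists are sorted Python lists and using `heapq` in place of the explicit REHEAP calls (the `score_term` callback is assumed to compute `log(N/N_t) * TF_BM25(t, d)`; it is passed in rather than implemented here):

```python
import heapq
from bisect import bisect_right

def next_doc(postings, current):
    """Smallest docid in `postings` strictly greater than `current`."""
    i = bisect_right(postings, current)
    return postings[i] if i < len(postings) else float('inf')

def rank_bm25_daat(index, query_terms, k, score_term):
    """index: term -> sorted list of docids containing that term.
    Returns up to k (score, docid) pairs in decreasing score order."""
    # min-heap over terms, keyed on the next document containing each term
    terms = [(next_doc(index[t], float('-inf')), t) for t in query_terms]
    heapq.heapify(terms)
    # min-heap of (score, docid) holding the top k results seen so far;
    # docid -1 marks an empty slot
    results = [(0.0, -1)] * k
    while terms[0][0] < float('inf'):
        d = terms[0][0]
        score = 0.0
        while terms[0][0] == d:  # accumulate all query terms present in d
            _, t = terms[0]
            score += score_term(t, d)
            # advance this term's posting cursor and restore the heap
            heapq.heapreplace(terms, (next_doc(index[t], d), t))
        if score > results[0][0]:  # d beats the current k-th best result
            heapq.heapreplace(results, (score, d))
    return sorted(((s, d) for s, d in results if s > 0), reverse=True)
```

With a toy index such as `{"a": [1, 2, 5], "b": [2, 5, 7]}` and a constant `score_term`, documents 2 and 5 (which contain both terms) come out on top.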

The complexity of this algorithm is `Theta(m \cdot log(n) + m \cdot log(k))`, where `n` is the number of query terms, `m` is the total number of postings processed across all query terms, and `k` is the number of results requested.