More Query Processing




CS267

Chris Pollett

Oct. 24, 2011

Outline

Introduction

MaxScore

Quiz

Which of the following is true?

  1. Both sort-based indexing and merge-based indexing might involve doing merging on disk.
  2. The contribution of an individual term in BM25 could grow without bound as a function of `f_(t,d)`.
  3. The worst-case query lookup when heaps are used is `Omega(N_q^2)` where `N_q` is the sum of the length of the posting lists of the query terms.
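Statement 2 can be checked numerically: in the standard BM25 term-frequency formula the per-term contribution saturates rather than growing without bound. A quick sketch (the formula and default parameters `k1`, `b` are the usual textbook ones, not taken from these slides):

```python
def tf_bm25(f, dl=1.0, avg_dl=1.0, k1=1.2, b=0.75):
    # Standard BM25 term-frequency component; bounded above by k1 + 1.
    return f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avg_dl))

# The contribution increases with f_(t,d) but never exceeds k1 + 1 = 2.2.
for f in (1, 10, 1000):
    print(tf_bm25(f))
```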

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of `N_t_i`;
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'

        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy accumulators before d that came from earlier t[j]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N_t_i) * TFBM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score;
                //advance inPos so d's old accumulator is not copied again
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; //end-of-list marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}
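The pseudocode above can be sketched as runnable Python. This is illustrative only: it assumes an in-memory index where `postings[t]` is a docid-sorted list of `(docid, tf)` pairs, and `tf_bm25` is the standard BM25 term-frequency formula; the accumulator merge mirrors the acc/acc' arrays on the slide.

```python
import heapq
from math import log

def tf_bm25(tf, dl, avg_dl, k1=1.2, b=0.75):
    # Standard BM25 term-frequency component.
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))

def rank_term_at_a_time(postings, doc_len, N, terms, k):
    avg_dl = sum(doc_len.values()) / len(doc_len)
    # Process terms in increasing order of document frequency, as on the slide.
    terms = sorted(terms, key=lambda t: len(postings[t]))
    INF = float("inf")
    acc = [(INF, 0.0)]  # (docid, score) pairs; infinity is the end-of-list marker
    for t in terms:
        idf = log(N / len(postings[t]))
        new_acc, in_pos = [], 0
        for d, tf in postings[t]:
            # Copy accumulators for documents that precede d.
            while acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(tf, doc_len[d], avg_dl)
            if acc[in_pos][0] == d:  # merge with d's existing accumulator
                score += acc[in_pos][1]
                in_pos += 1
            new_acc.append((d, score))
        new_acc.extend(acc[in_pos:])  # copy the rest, including the end marker
        acc = new_acc
    # Select the top k accumulators by score using a heap.
    return heapq.nlargest(k, acc[:-1], key=lambda p: p[1])
```

For example, with two terms whose posting lists overlap on one document, that document accumulates contributions from both terms and ranks first.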

The worst case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`.

  • Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set once for every term `t[i]`, so this is slightly slower than our heap-based document-at-a-time approach.
Accumulator Pruning

More Accumulator Pruning

Assigning Accumulators

Precomputing Score Contributions