Max Score, Accumulator Pruning, Term-at-a-time




CS267

Chris Pollett

Oct. 24, 2011

Outline

Introduction

MaxScore

HW Problem

Exercise. Assume `k_1` and `b` are 1.2 and 0.75, respectively. Suppose we have a corpus of 25 million documents. The word "big" appears in 1 million, "mac" appears in 25 thousand, and "lots" appears in 10 thousand. The average document is 300 words long. Document 27 contains the word "big" eight times, "mac" three times, and "lots" once. It is 700 words long. Calculate the BM25 score for Document 27 for the queries "big lots" and "big mac".

Answer. From last day we have,
`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
`TF_(BM25)(t,d) = frac(f_(t,d) \cdot (k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
The IDF scores for the three terms above are:
`IDF(\b\i\g) = log(frac(N)(N_(\b\i\g))) = log(frac(25 times 10^6)(10^6)) approx 4.64`,
`IDF(\m\a\c) = log(frac(N)(N_(\m\a\c))) = log(frac(25 times 10^6)(25 times 10^3)) approx 9.97`, and
`IDF(\l\o\t\s) = log(frac(N)(N_(\l\o\t\s))) = log(frac(25 times 10^6)(10^4)) approx 11.29`.
The term frequency scores for Document 27 (`d_(27)`) are:
`TF_(BM25)(t, d_(27)) = frac(f_(t,d_(27)) \cdot (k_1 +1))(f_(t,d_(27)) + k_1 cdot ((1-b) + b cdot(l_(d_(27)) / l_(avg)) ))`
`\qquad = frac(f_(t,d_(27)) \cdot (1.2 +1))(f_(t,d_(27)) + 1.2 cdot ((1 - 0.75 ) + 0.75 cdot(700 / 300)) )`
`\qquad = frac(f_(t,d_(27)) \cdot 2.2)(f_(t,d_(27)) + 2.4)`
`TF_(BM25)(\b\i\g, d_(27)) = frac(8 \cdot 2.2)(8 + 2.4) = 1.69`
`TF_(BM25)(\m\a\c, d_(27)) = frac(3 \cdot 2.2)(3 + 2.4) = 1.22`
`TF_(BM25)(\l\o\t\s, d_(27)) = frac(1 \cdot 2.2)(1 + 2.4) = 0.65`
So
`S\c\o\r\e_(BM25)(\mbox(big lots), d_(27)) = IDF(\b\i\g)cdot TF_(BM25)(\b\i\g, d_(27)) + IDF(\l\o\t\s)\cdot TF_(BM25)(\l\o\t\s, d_(27))`
`\qquad = 4.64 cdot 1.69 + 11.29 cdot 0.65 = 7.84 + 7.34 = 15.18`
and
`S\c\o\r\e_(BM25)(\mbox(big mac), d_(27)) = IDF(\b\i\g)cdot TF_(BM25)(\b\i\g, d_(27)) + IDF(\m\a\c)\cdot TF_(BM25)(\m\a\c, d_(27))`
`\qquad = 4.64 cdot 1.69 + 9.97 cdot 1.22 = 7.84 + 12.16 = 20.00`
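
To make the arithmetic easy to check, here is a minimal Python sketch of the same calculation. The function and variable names (`idf`, `tf_bm25`, `bm25`, and so on) are just for this illustration; `log2` matches the base-2 IDF values above.

from math import log2

def idf(N, N_t):
    # inverse document frequency with the base-2 log used above
    return log2(N / N_t)

def tf_bm25(f_td, l_d, l_avg, k1=1.2, b=0.75):
    # BM25 term-frequency component
    return (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * (l_d / l_avg)))

N, l_avg, l_d27 = 25 * 10**6, 300, 700
N_t = {"big": 10**6, "mac": 25 * 10**3, "lots": 10**4}
f_d27 = {"big": 8, "mac": 3, "lots": 1}

def bm25(query_terms):
    return sum(idf(N, N_t[t]) * tf_bm25(f_d27[t], l_d27, l_avg) for t in query_terms)

print(bm25(["big", "lots"]))  # ~15.16 (15.18 above, after rounding intermediate values)
print(bm25(["big", "mac"]))   # ~20.04 (20.00 above, after rounding intermediate values)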

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of `N_(t_i)`;
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'

        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                // copy accumulators for docids before d that came from earlier terms t[j]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N_(t[i])) * TF_BM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos].score; 
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; // end-of-list marker
        swap acc and acc';
    }
    return the top k items of acc; //select using heap
}

The worst-case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`.

  • Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set for every term `t_i`, so this is slightly slower than our heap-based document-at-a-time approach.
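
As an illustration only, here is a minimal runnable Python sketch of the same term-at-a-time scoring. It uses a dictionary of accumulators in place of the two sorted arrays `acc` and `acc'`, and it assumes each term is supplied with its document frequency `N_t` and a docid-sorted posting list of `(docid, f_(t,d))` pairs together with a map from docids to document lengths; these input conventions and names are assumptions of the sketch, not part of the pseudocode above.

import heapq
from math import log2

def rank_bm25_term_at_a_time(terms, doc_len, N, l_avg, k, k1=1.2, b=0.75):
    # terms: list of (N_t, postings), where postings is a docid-sorted
    #        list of (docid, f_td) pairs; doc_len maps docid -> length
    terms = sorted(terms, key=lambda term: term[0])  # rarest terms first
    acc = {}                                         # docid -> partial BM25 score
    for N_t, postings in terms:
        term_idf = log2(N / N_t)
        for docid, f_td in postings:
            l_d = doc_len[docid]
            tf = (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * (l_d / l_avg)))
            acc[docid] = acc.get(docid, 0.0) + term_idf * tf
    # select the top k accumulators with a heap, as in the last line of the pseudocode
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

The dictionary hides the docid-ordered merge that the pseudocode makes explicit with `acc` and `acc'`; the sketch is only meant to show the scoring logic.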

Accumulator Pruning

More Accumulator Pruning

Assigning Accumulators

Precomputing Score Contributions