Introduction

On Monday, we started talking about query processing algorithms.
We introduced the BM25 document scoring function, which we said we would use because of its term frequency handling properties.
We said there were two main methods for query processing: document-at-a-time, in which the score for a document is computed completely before proceeding to the next document that matches at least one of the query terms, and term-at-a-time, in which we process each query term in turn, adding that term's score contribution to each matching document's score.
We gave an initial `Theta(N_q cdot n + N_q cdot log N_q)` algorithm for document-at-a-time query processing and said it could be improved using priority queues.
We refreshed our memory of how heaps work (so that we have an implementation of priority queues).
We start today by looking at document-at-a-time query processing using heaps, and then we look at term-at-a-time query processing.

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one to manage the query terms and, for each term t, keep track of the next document that contains t; the other one to maintain the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], .. t[n]), k) {
    // create a min-heap for top k results
    for(i = 1 to k) {
       results[i].score := 0;
    } 
    // create a min-heap for terms
    for (i = 1 to n) { 
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // a sorted array satisfies the min-heap property
    while (terms[1].nextDoc < infty) {
        d := terms[1].nextDoc;
        score := 0;
        while(terms[1].nextDoc == d) {
            t := terms[1].term;
            score += log(N/N_t)*TF_(BM25)(t,d);
            terms[1].nextDoc := nextDoc(t,d);
            REHEAP(terms); // restore heap property for terms;
        }
        if(score > results[1].score) {
            results[1].docid := d;
            results[1].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

The complexity of this algorithm is `Theta(N_q cdot log(n) + N_q cdot log(k))`.
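The two-heap idea above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the book's implementation: the in-memory `index` layout (term to a docid-sorted list of `(docid, tf)` pairs), the function names, the base-2 logarithm, and the simplified `tf_bm25` (length normalization switched off) are all assumptions made for illustration. Unlike the pseudocode, the result heap grows up to size `k` rather than starting with `k` zero-score entries.

```python
import heapq
import math

K1 = 1.2  # BM25 k_1, which gives the k_1 + 1 = 2.2 bound used in these notes


def tf_bm25(f):
    # Assumption: document length normalization is switched off (b = 0).
    return f * (K1 + 1) / (f + K1)


def rank_daat(index, num_docs, query, k):
    """index: term -> list of (docid, tf) sorted by docid (hypothetical layout)."""
    # Term heap: entries are (next docid, term, position in that term's postings).
    terms = [(index[t][0][0], t, 0) for t in query if index.get(t)]
    heapq.heapify(terms)
    results = []  # min-heap of (score, docid) holding the best k seen so far
    while terms:
        d = terms[0][0]
        score = 0.0
        # Consume every term whose next posting belongs to document d.
        while terms and terms[0][0] == d:
            _, t, pos = heapq.heappop(terms)
            postings = index[t]
            idf = math.log2(num_docs / len(postings))
            score += idf * tf_bm25(postings[pos][1])
            if pos + 1 < len(postings):
                heapq.heappush(terms, (postings[pos + 1][0], t, pos + 1))
        if score <= 0:
            continue  # mirrors "remove all items with score = 0"
        if len(results) < k:
            heapq.heappush(results, (score, d))
        elif score > results[0][0]:
            heapq.heapreplace(results, (score, d))
    # Highest score first; ties broken by docid.
    return sorted(results, key=lambda r: (-r[0], r[1]))
```

Each posting causes one pop and at most one push on the term heap (size at most `n`) and at most one update of the result heap (size at most `k`), which matches the two logarithmic factors in the bound above.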

MaxScore

Recall the term frequency contribution of BM25 can never exceed `k_1 + 1 = 2.2`.
So the overall score contribution of a term `t` is bounded from above by `2.2 cdot log(N/N_t)`.
This bound is called the MaxScore.
On the query (greek, philosophy, stoicism), the MaxScores for these three terms might be `15.1`, `16.1`, and `28.9` respectively.
If we are only interested in the top 10 results, it might happen the lowest of the top 10 results we have seen so far has a score greater than `15.1`.
What this means is that if a document only contains the word greek and not the two other words it will never get into the top 10 results.
We can remove "greek" from the term heap. We still add the score of the greek contribution of a document, but only for documents coming from the other two terms in the heap.
When the score of the `k`th best result so far gets above 31.2 (the sum of the MaxScores of greek and philosophy), we can remove "philosophy" from the heap and only look at documents containing "stoicism".
This strategy is called MaxScore. In the book's tests it triples the query speed for `k=10`, so that such a query takes only 1.5 times as long as a conjunctive query.
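To make the removal rule concrete, here is a small Python sketch. The function names are hypothetical and the base-2 logarithm is an assumption; the rule itself is the one described above: terms are considered from lowest to highest MaxScore, and a term can leave the heap as long as the sum of the MaxScores removed so far stays below the score of the current `k`th best result.

```python
import math


def max_score(num_docs, doc_freq, k1=1.2):
    # Upper bound on one term's contribution: (k_1 + 1) * log(N / N_t).
    return (k1 + 1) * math.log2(num_docs / doc_freq)


def removable_terms(max_scores, kth_best_score):
    """max_scores: term -> MaxScore. Returns the terms that can leave the term heap."""
    removed = []
    total = 0.0
    for term, bound in sorted(max_scores.items(), key=lambda item: item[1]):
        if total + bound >= kth_best_score:
            break  # a document containing only the removed terms could still qualify
        total += bound
        removed.append(term)
    return removed
```

With the MaxScores from the example (15.1, 16.1, 28.9), a `k`th best score of 20 lets us remove only greek, while a `k`th best score of 32 lets us remove greek and philosophy.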

Term-at-a-Time Query Processing

Instead of merging the query terms' postings lists using a heap, we could score all documents for term 1 and store the result in an array called an accumulator, then scan over the postings list of term 2, compute scores, and add these to the scores in the accumulator, and so on.
This approach to query processing is called term-at-a-time query processing.
The index is stored on disk and if we use document-at-a-time query processing we tend to access this disk in a non-sequential fashion.
On the other hand, posting lists tend to be sequentially laid out on the disk, so the seeks in doing term-at-a-time query processing should be faster.
Since term-at-a-time accesses each posting list separately, it is typically only used for scoring functions that are of the form:
`sc\o\re(q,d) = quality(d) + sum_(t in q) sc\o\re(t,d)`.
Here `quality(d)` is an optional query-independent score component such as PageRank.

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; //initialize accumulators.
      //acc used for previous round, acc' for next
    acc[0].docid := infty // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'
        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy previous round to current for docs not containing t[i]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N[t[i]]) * TFBM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score; 
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid :=infty; //end-of-list-marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}

The worst case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`.
Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set for every term `t`. So this is slightly slower than our heap-based document-at-a-time approach.
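The accumulator merge can be sketched in Python as below. As before, this is a sketch under assumptions rather than a definitive implementation: the `index` layout (term to a docid-sorted list of `(docid, tf)` pairs), the names, the base-2 logarithm, and the simplified `tf_bm25` are illustrative choices. Python lists know their length, so the end-of-list marker from the pseudocode is not needed.

```python
import heapq
import math

K1 = 1.2


def tf_bm25(f):
    # Assumption: document length normalization is switched off (b = 0).
    return f * (K1 + 1) / (f + K1)


def rank_taat(index, num_docs, query, k):
    """index: term -> list of (docid, tf) sorted by docid (hypothetical layout)."""
    # Process terms from least to most frequent.
    terms = sorted((t for t in query if index.get(t)), key=lambda t: len(index[t]))
    acc = []  # accumulators from the previous round: (docid, score), sorted by docid
    for t in terms:
        postings = index[t]
        idf = math.log2(num_docs / len(postings))
        new_acc = []
        in_pos = 0
        for docid, f in postings:
            # Copy accumulators for documents that do not contain t.
            while in_pos < len(acc) and acc[in_pos][0] < docid:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(f)
            if in_pos < len(acc) and acc[in_pos][0] == docid:
                score += acc[in_pos][1]
                in_pos += 1
            new_acc.append((docid, score))
        new_acc.extend(acc[in_pos:])  # copy the remaining accumulators
        acc = new_acc
    # Select the top k by score using a heap.
    return heapq.nlargest(k, acc, key=lambda a: a[1])
```

The full pass over `acc` for every term is visible in the copy loops; this is where the factor `n` in the bound comes from.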

Accumulator Pruning

We now look at some strategies to speed up the term-at-a-time approach.
One problem with the algorithm so far is that it assumes we can keep `a\c\c` in memory, which, given we are assuming the posting lists are long, seems questionable.
To fix this, we assume we have an upper bound `a_max` on the elements contained in the accumulator structures.
One strategy, called QUIT, to ensure we don't go above this bound is to check after each term if we have `|a\c\c| ge a_max`. If so, we stop and return the results we have.
Another strategy, called CONTINUE, is to continue processing postings lists if we find `|a\c\c| ge a_max` after processing a term, but never to add more accumulators.
Both these strategies are due to Moffat and Zobel (1996). Notice neither actually forces an upper bound on the memory used, since we do the check after a term is completely processed.
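A rough Python sketch of the two strategies follows. It uses a dictionary of accumulators instead of the sorted arrays from the pseudocode, and the function name, the `strategy` parameter, the `index` layout, and the simplified `tf_bm25` are assumptions for illustration only.

```python
import math

K1 = 1.2


def tf_bm25(f):
    # Assumption: document length normalization is switched off (b = 0).
    return f * (K1 + 1) / (f + K1)


def score_with_limit(index, num_docs, query, a_max, strategy="QUIT"):
    """Return docid -> score, applying QUIT or CONTINUE once a_max is reached."""
    terms = sorted((t for t in query if index.get(t)), key=lambda t: len(index[t]))
    acc = {}
    allow_new = True
    for t in terms:
        postings = index[t]
        idf = math.log2(num_docs / len(postings))
        for docid, f in postings:
            if docid in acc or allow_new:
                acc[docid] = acc.get(docid, 0.0) + idf * tf_bm25(f)
        # The check only happens after the whole term has been processed,
        # so len(acc) may already exceed a_max at this point.
        if len(acc) >= a_max:
            if strategy == "QUIT":
                break
            allow_new = False  # CONTINUE: only update existing accumulators
    return acc
```

Calling this with `a_max` smaller than the length of the first postings list shows the overshoot: all of that term's postings receive accumulators before the limit is ever checked.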
Lester et al. (2005) propose a better pruning strategy which we now consider.

More Accumulator Pruning

In our algorithm so far we process terms from least to most frequent.
Given that we only have a limited number of accumulators, this means we will tend to spend them on documents which are likely to make the top `k`.
We want to devise a rule that tells us for a given posting whether it deserves its own accumulator or not.
Suppose we have processed `i-1` query terms, and have used `a_(c\u\r\r\e\n\t)` many accumulators. We are about to start processing `t_i`.
There are three possible situations:
1. `a_(c\u\r\r\e\n\t) + N_(t_i) le a_(max)`. In this case we have enough free accumulators for all of `t_i`'s postings, so no pruning is done.
2. `a_(c\u\r\r\e\n\t) = a_(max)`. In this case the accumulator limit has already been reached. So none of `t_i`'s postings will be allowed to create accumulators.
3. `a_(c\u\r\r\e\n\t) < a_(max) < a_(c\u\r\r\e\n\t) + N_(t_i)`. In this case, we may need to do pruning.
Ideally, we would like to give the `a_(max) - a_(c\u\r\r\e\n\t)` most important postings their own accumulator, but how to find them?
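The three situations can be written down as a small Python helper. The function name and the integer return codes are hypothetical; the conditions are the ones listed above, except that case 2 is tested with `>=` rather than `=` as a defensive choice.

```python
def pruning_case(a_current, n_ti, a_max):
    """Return 1, 2 or 3 according to the three situations in the notes."""
    if a_current + n_ti <= a_max:
        return 1  # enough free accumulators for all of t_i's postings
    if a_current >= a_max:
        return 2  # limit reached: t_i may only update existing accumulators
    return 3      # some free accumulators, but not enough for every posting
```

In case 3 the number of accumulators still available is `a_max - a_current`.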

Assigning Accumulators

A possible two-pass approach to this problem is to: (1) make a pass and score contributions for all of `t_i` to find a threshold `T` such that only `a_(max) - a_(c\u\r\r\e\n\t)` docs have scores over this threshold. (2) Make a second pass and give anything above `T` its own accumulator.

Doing this for a long postings list might be costly. To fix this, we make a single pass in which we only approximate `T` as we go, comparing our current approximation against each posting's `TF` component.

rankBM25_TermAtATimeWithPruning((t[1], t[2], ..., t[n]), k, amax, u) {
    // max_f is bounded above by a maximum number max_terms of terms allowed in a document.
    // (assume there is some doc length after which we truncate a document to that length)
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty // end-of-list marker
    for i := 1 to n do {
        max_f := 0;
        quotaLeft := amax - length(acc) // the remaining accumulator quota
        if (N[t[i]] <= quotaLeft) { //plenty o' accumulators
            // do as we did in rankBM25_TermAtATime
            inPos := 0; //current pos in acc
            outPos := 0; // current position in acc'
            foreach document d in t[i]'s posting list do {
                while acc[inPos].docid < d do {
                    acc'[outPos++] := acc[inPos++]; 
                    //copy previous round to current for docs not containing t[i]
                }
                acc'[outPos].docid := d;
                acc'[outPos].score := log(N/N[t[i]]) * TFBM25(t[i], d);
                if(acc[inPos].docid == d) {
                    acc'[outPos].score += acc[inPos++].score; 
                }
                outPos++;
            }
        } else if (quotaLeft == 0) { //no accumulators left
            for j:=1 to length(acc) do {
                 acc[j].score := acc[j].score + 
                     log(N/N[t[i]]) * TFBM25(t[i], acc[j].docid);
            }
            continue; //acc was updated in place, so skip the copy and swap below
        } else { //still have some accumulators
            for j:= 1 to max_terms do { tfStats[j] := 0} //initialize TF stats
            T := 1; //init threshold for new accumulators
            postingsSeen := 0;
            inPos := 0; //current pos in acc
            outPos := 0; // current position in acc'
            foreach document d in t[i]'s posting list do {
                while acc[inPos].docid < d do {
                    acc'[outPos++] := acc[inPos++]; 
                    //copy previous round to current for docs not containing t[i]
                }
                if(acc[inPos].docid == d) {
                    acc'[outPos].docid := d;
                    acc'[outPos++].score := acc[inPos++].score + 
                         log(N/N[t[i]]) * TFBM25(t[i], d);
                } else if (quotaLeft > 0) {
                    if (f[t[i],d] ≥ T) { // if this happens, make a new accumulator
                        acc'[outPos].docid := d;
                        acc'[outPos++].score := log(N/N[t[i]]) * TFBM25(t[i], d);
                        quotaLeft--;
                    }
                    tfStats[f[t[i],d]]++;
                    if (f[t[i],d] > max_f) {
                        max_f = f[t[i],d]; //update largest observed frequency
                    }
                }
                postingsSeen++;
                if (postingsSeen % u == 0) {
                     q := (N[t[i]] - postingsSeen)/postingsSeen;
                     T := argmin_x{x in Nat|
                         sum_(j=x)^{max_f}(tfStats[j] * q) < quotaLeft}
                } 
            }
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid :=infty; //end-of-list-marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}

Notice that we recompute our threshold every `u` postings. The value of `u` is usually set to `128`; it affects how good our approximate value of `T` is.
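The threshold update can be isolated into a small Python function. This is a sketch of the `argmin` step as it is written in the pseudocode above; the function name and the use of a dictionary for `tf_stats` are assumptions. It extrapolates the term frequencies seen so far to the rest of the postings list and picks the smallest threshold for which the predicted number of new accumulators stays below the remaining quota.

```python
def estimate_threshold(tf_stats, postings_seen, n_t, quota_left):
    """tf_stats: term frequency -> number of postings seen so far with that
    frequency that did not already have an accumulator.
    Returns the smallest x such that the predicted number of remaining
    postings with frequency >= x is below quota_left."""
    q = (n_t - postings_seen) / postings_seen  # unseen postings per seen posting
    max_f = max(tf_stats, default=0)
    expected = 0.0
    threshold = max_f + 1  # the empty sum: no observed frequency qualifies
    for x in range(max_f, 0, -1):
        expected += tf_stats.get(x, 0) * q
        if expected >= quota_left:
            break
        threshold = x
    return threshold
```

For example, with `tf_stats = {1: 50, 2: 10, 3: 4}`, 100 of 300 postings seen and 30 accumulators left, the predicted counts for thresholds 3, 2 and 1 are 8, 28 and 128, so the threshold becomes 2.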

Precomputing Score Contributions

For scoring algorithms that follow the bag-of-words paradigm, it is not necessary to compute the query terms' score contributions at query time.
Instead, we may precompute each posting's score contribution during index construction and store it in the index.
I.e., we could have postings of the form (doc_offset, score) instead of (doc_offset, tf).
This can greatly reduce CPU cost during query processing.
Of course, if you do this, it is impossible to make any changes to the scoring functions after the index has been built.
Also, it makes it harder to use common index compression techniques which work well with tf values.
The book points out that rather than use floats, one can discretize scores to alleviate to some degree the second problem.
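One simple way to discretize scores is linear quantization into a fixed number of integer levels, sketched below in Python. The notes do not say which scheme the book uses, so the linear mapping, the 8-bit default, and the function names here are assumptions.

```python
def quantize(score, max_score, bits=8):
    """Map a score in [0, max_score] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    return round(score / max_score * levels)


def dequantize(code, max_score, bits=8):
    """Recover an approximate score from its integer code."""
    levels = (1 << bits) - 1
    return code / levels * max_score
```

With `max_score = 2.2` (the bound on the BM25 term frequency component) each posting stores a small integer instead of a float, at the cost of a rounding error of at most half a quantization step.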

Light-Weight Structures

We are now going to look at lightweight structures.
We have already given an ADT for inverted indices and shown that it can be used for phrase search as well as to compute simple relationships such as "this phrase is contained between these tags".
We now generalize this second query capability to region algebras.
Region algebras provide operators and functions for combining and manipulating text intervals in support of lightweight structure.
They represent an intermediate point between basic document retrieval and the complexity of full XML retrieval.

Concordance Lists

Region algebras typically work with sets of text intervals.
Any such interval can be expressed as a pair `[u, v]` indicating its start and end point.
We restrict our attention to sets of such intervals such that no interval in the set may have another interval from the set nested within it.
We call such a set of intervals a generalized concordance list or GC-list.
The name refers to paper-based concordances, alphabetical listings of words in a document along with the context in which they appeared.
Given two intervals `[u, v]` and `[u', v']` we write `[u, v] subset [u', v']` to indicate the first interval is a strict sub-interval of the latter.
As examples, {`[5,9]`, `[8, 12]`, `[15,20]`} is a GC-list, although `[5,9]` and `[8, 12]` overlap, but {`[1, 10]`, `[5, 9]`, `[8,12]`, `[15, 20]`} is not, because `[5, 9] subset [1, 10]`.
To make a GC-list out of a list of intervals we define the function `G(S)`:
`G(S) = {a | a in S mbox( and ) \neg exists b in S mbox( such that ) b subset a }`
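The definition of `G(S)` translates almost directly into Python. In this sketch intervals are `(u, v)` tuples, and "strict sub-interval" is taken to mean that one interval lies within the other and the two are not equal; that reading, and the function names, are assumptions. The quadratic scan is chosen for clarity, not efficiency.

```python
def is_strict_subinterval(a, b):
    """True if interval a = (u, v) lies within b = (u', v') and a != b."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b


def gc_list(intervals):
    """G(S): keep the intervals of S that have no other interval of S nested in them."""
    s = set(intervals)
    return sorted(a for a in s if not any(is_strict_subinterval(b, a) for b in s))
```

Applied to the second example above, `gc_list([(1, 10), (5, 9), (8, 12), (15, 20)])` drops `(1, 10)` because `(5, 9)` is nested inside it, and keeps the other three intervals.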

Heap query processing, Accumulator Pruning, Concordance Lists
