Introduction

On Monday, we started talking about query processing algorithms.
We said there are two main methods of query processing: document-at-a-time, in which the score for a document is computed completely before proceeding to the next document that matches at least one of the query terms, and term-at-a-time, in which we process each query term in turn, adding that term's score contribution to each document's running score.
We saw how using heaps, one for finding the nearest nextDoc and one for the top `k` results, can speed up document-at-a-time processing.
We saw how the Max Score heuristic can speed up finding the top `k` results in document-at-a-time processing by allowing us to eliminate terms from our term heap.
We begin today with term-at-a-time processing and algorithms for it.

Term-at-a-Time Query Processing

Instead of merging the query terms' postings lists using a heap, we could score all documents for term 1 and store the results in an array called an accumulator, then scan the postings list of term 2, compute its scores, add them to the scores in the accumulator, and so on.
This approach to query processing is called term-at-a-time query processing.
The index is stored on disk, and document-at-a-time query processing tends to access the disk non-sequentially because it alternates between the postings lists of the query terms.
On the other hand, each postings list tends to be laid out sequentially on disk, so term-at-a-time query processing, which reads one list at a time, should need fewer seeks and be faster.
Since term-at-a-time accesses each posting list separately, it is typically only used for scoring functions that are of the form:
`sc\o\re(q,d) = quality(d) + sum_(t in q) sc\o\re(t,d)`.
Here `quality(d)` is an optional query-independent score component such as PageRank.

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; //initialize accumulators.
      //acc used for previous round, acc' for next
    acc[0].docid := infty // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'
        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy previous round to current for docs not containing t[i]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N[t[i]]) * TFBM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score; 
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid :=infty; //end-of-list-marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}
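
The pseudocode above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the index layout (`postings` maps a term to a docid-sorted list of `(docid, f)` pairs, `doc_lens` maps a docid to its length), the helper `tf_bm25`, and the BM25 parameters `k1 = 1.2`, `b = 0.75` are all illustrative choices that do not come from the lecture. Python lists replace the end-of-list marker.

```python
import heapq
import math

def tf_bm25(f, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (k1 and b are assumed defaults)."""
    return f * (k1 + 1) / (f + k1 * ((1 - b) + b * doc_len / avg_len))

def rank_term_at_a_time(postings, doc_lens, query, k):
    """Return the top k (docid, score) pairs for the query terms."""
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    # Process terms from least to most frequent.
    terms = sorted((t for t in query if t in postings),
                   key=lambda t: len(postings[t]))
    acc = []  # accumulators from the previous round, sorted by docid
    for t in terms:
        idf = math.log(n_docs / len(postings[t]))
        new_acc = []  # accumulators for this round (acc' in the pseudocode)
        in_pos = 0
        for d, f in postings[t]:
            # Copy accumulators of documents that do not contain t.
            while in_pos < len(acc) and acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(f, doc_lens[d], avg_len)
            if in_pos < len(acc) and acc[in_pos][0] == d:
                score += acc[in_pos][1]
                in_pos += 1
            new_acc.append((d, score))
        new_acc.extend(acc[in_pos:])  # copy the remaining accumulators
        acc = new_acc
    return heapq.nlargest(k, acc, key=lambda a: a[1])  # select using a heap
```

Each round merges one postings list into the docid-sorted accumulator list, so the whole accumulator set is traversed once per term.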

The worst-case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`, where `N_q` is the total number of postings for the query terms.
Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set for every term `t_i`. So this is slightly slower than our heap-based document-at-a-time approach.

Accumulator Pruning

We now look at some strategies to speed up the term-at-a-time approach.
One problem with the algorithm so far is that it assumes we can keep `a\c\c` in memory, which seems questionable given that we are assuming the postings lists are long.
To fix this, we assume we have an upper bound `a_max` on the number of elements in the accumulator structures.
One strategy, called QUIT, to ensure we don't go above this bound is to check after each term if we have `|a\c\c| ge a_max`. If so, we stop and return the results we have.
Another strategy, called CONTINUE, is to continue processing postings lists if we find `|a\c\c| ge a_max` after processing a term, but never to create new accumulators.
Both these strategies are due to Moffat and Zobel (1996). Notice neither actually forces an upper bound on the memory used, since we do the check after a term is completely processed.
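
The two strategies can be sketched as follows. This is a minimal sketch under assumptions that are not in the lecture: accumulators are kept in a dictionary rather than a sorted array, and `term_contribs` is a hypothetical list holding, for each query term in processing order, a mapping from docid to that term's already computed score contribution.

```python
def accumulate(term_contribs, a_max, strategy="QUIT"):
    """Term-at-a-time accumulation with QUIT or CONTINUE pruning."""
    acc = {}
    limit_reached = False
    for contribs in term_contribs:
        for d, s in contribs.items():
            if d in acc:
                acc[d] += s            # existing accumulators are always updated
            elif not limit_reached:
                acc[d] = s             # new accumulators only before the limit
        # The check happens only after the whole term has been processed.
        if len(acc) >= a_max:
            if strategy == "QUIT":
                break                  # stop and return what we have
            limit_reached = True       # CONTINUE: no new accumulators from now on
    return acc
```

With `a_max = 3` and a second term that introduces two new documents, the accumulator set can grow to four entries before the check fires, which illustrates why neither strategy enforces a hard memory bound.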
Lester et al. (2005) propose a better pruning strategy, which we now consider.

More Accumulator Pruning

In our algorithm so far we process terms from least to most frequent.
Given that we only have a limited number of accumulators, this means we will tend to spend them on documents which are likely to make the top `k`, since the rarest terms carry the largest weights `log(N/N_t)`.
We want to devise a rule that tells us for a given posting whether it deserves its own accumulator or not.
Suppose we have processed `i-1` query terms, and have used `a_(c\u\r\r\e\n\t)` many accumulators. We are about to start processing `t_i`.
There are three possible situations:
1. `a_(c\u\r\r\e\n\t) + N_(t_i) le a_(max)`. In this case we have enough free accumulators for all of `t_i`'s postings, so no pruning is done.
2. `a_(c\u\r\r\e\n\t) = a_(max)`. In this case the accumulator limit has already been reached, so none of `t_i`'s postings will be allowed to create accumulators.
3. `a_(c\u\r\r\e\n\t) < a_(max) < a_(c\u\r\r\e\n\t) + N_(t_i)`. In this case, we may need to do pruning.
Ideally, we would like to give the `a_(max) - a_(c\u\r\r\e\n\t)` most important postings their own accumulators, but how do we find them?
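
The three situations above can be written as a small decision function. This is only an illustrative sketch; the function name and the returned labels are hypothetical.

```python
def pruning_case(a_current, n_ti, a_max):
    """Classify the situation before processing term t_i.

    a_current: accumulators in use, n_ti: number of postings of t_i,
    a_max: accumulator limit.
    """
    if a_current + n_ti <= a_max:
        return "no pruning"            # case 1: room for every posting
    if a_current >= a_max:
        return "no new accumulators"   # case 2: limit already reached
    return "prune"                     # case 3: only some postings get one
```

For example, with `a_max = 100` and 90 accumulators in use, a term with 5 postings needs no pruning, while a term with 50 postings falls into the third case and competes for the 10 remaining accumulators.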

Assigning Accumulators

A possible two-pass approach to this problem is to: (1) make a pass and score contributions for all of `t_i` to find a threshold `T` such that only `a_(max) - a_(c\u\r\r\e\n\t)` docs have scores over this threshold. (2) Make a second pass and give anything above `T` its own accumulator.
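
The two-pass idea can be sketched as follows. This is a minimal sketch, assuming the score contributions of the postings that do not yet have an accumulator are available as a hypothetical list `contribs` of `(docid, score)` pairs and that `quota` stands for `a_(max) - a_(c\u\r\r\e\n\t)`. Taking `T` to be the `(quota + 1)`-th largest score is one possible choice; with tied scores it may select fewer than `quota` postings.

```python
import heapq

def two_pass_select(contribs, quota):
    """Return the postings that receive their own accumulator."""
    if quota <= 0:
        return []
    if quota >= len(contribs):
        return list(contribs)          # enough room for every posting
    # Pass 1: find T, the (quota + 1)-th largest score contribution.
    threshold = heapq.nlargest(quota + 1, (s for _, s in contribs))[-1]
    # Pass 2: give every posting scoring strictly above T an accumulator.
    return [(d, s) for d, s in contribs if s > threshold]
```

Because at most `quota` scores can be strictly greater than the `(quota + 1)`-th largest one, the accumulator limit is never exceeded.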

Doing this for a long postings list might be costly. To fix this, we only approximate `T` as we go, and we compare each posting's `TF` component only against our current approximation.

rankBM25_TermAtATimeWithPruning((t[1], t[2], ..., t[n]), k, amax, u) {
    // max_f is bounded above by a maximum number max_terms of terms allowed in a document.
    // (assume there is some doc length after which we truncate a document to that length)
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty // end-of-list marker
    for i := 1 to n do {
        max_f := 0; //largest term frequency observed so far for t[i]
        quotaLeft := amax - length(acc) // the remaining accumulator quota
        if (N[t[i]] <= quotaLeft) { //plenty o' accumulators
            // do as we did in rankBM25_TermAtATime
            inPos := 0; //current pos in acc
            outPos := 0; // current position in acc'
            foreach document d in t[i]'s posting list do {
                while acc[inPos].docid < d do {
                    acc'[outPos++] := acc[inPos++]; 
                    //copy previous round to current for docs not containing t[i]
                }
                acc'[outPos].docid := d;
                acc'[outPos].score := log(N/N[t[i]]) * TFBM25(t[i], d);
                if(acc[inPos].docid == d) {
                    acc'[outPos].score += acc[inPos++].score; 
                }
                outPos++;
            }
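        } else if (quotaLeft == 0) { //no accumulators left
            // update the existing accumulators acc[0..length(acc)-1] in place
            // (the end-of-list marker is not counted; TFBM25 is taken to be 0
            // when the document does not contain t[i])
            for j := 0 to length(acc) - 1 do {
                 acc[j].score := acc[j].score + 
                     log(N/N[t[i]]) * TFBM25(t[i], acc[j].docid);
            }
            continue; // nothing was written to acc', so skip the copy and swap below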
        } else { //still have some accumulators
            for j:= 1 to max_terms do { tfStats[j] := 0} //initialize TF stats
            T := 1; //init threshold for new accumulators
            postingsSeen := 0;
            inPos := 0; //current pos in acc
            outPos := 0; // current position in acc'
            foreach document d in t[i]'s posting list do {
                while acc[inPos].docid < d do {
                    acc'[outPos++] := acc[inPos++]; 
                    //copy previous round to current for docs not containing t[i]
                }
                if(acc[inPos].docid == d) {
                    acc'[outPos].docid := d;
                    acc'[outPos++].score := acc[inPos++].score + 
                         log(N/N[t[i]]) * TFBM25(t[i], d);
                } else if (quotaLeft > 0) {
                    if (f[t[i],d] ≥ T) { // frequency reaches the threshold: make a new accumulator
                        acc'[outPos].docid := d;
                        acc'[outPos++].score := log(N/N[t[i]]) * TFBM25(t[i], d);
                        quotaLeft--;
                    }
                    tfStats[f[t[i],d]]++;
                    if (f[t[i],d] > max_f) {
                        max_f := f[t[i],d]; //update largest observed frequency
                    }
                }
                postingsSeen++;
                if (postingsSeen % u == 0) {
                     q := (N[t[i]] - postingsSeen)/postingsSeen;
                     T := argmin_x{x in Nat|
                         sum_(j=x)^{max_f}(tfStats[j] * q) < quotaLeft}
                } 
            }
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid :=infty; //end-of-list-marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}

Notice that we recompute our threshold every `u` postings. The value of `u` is usually set to `128`; it affects how good our approximate value of `T` is.
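
The threshold update in the pseudocode can be sketched in isolation as follows. This is a minimal sketch under assumptions: `tf_stats` is taken to be a list indexed by term frequency (index 0 unused) with at least `max_f + 1` entries, and the function and parameter names are illustrative.

```python
def estimate_threshold(tf_stats, max_f, postings_seen, n_t, quota_left):
    """Smallest x >= 1 such that the predicted number of remaining postings
    with frequency >= x is below quota_left."""
    # q extrapolates the counts seen so far to the rest of the postings list.
    q = (n_t - postings_seen) / postings_seen
    expected = 0.0   # predicted remaining postings with frequency >= x
    x = max_f + 1    # empty sum: no posting seen so far reaches this frequency
    # Lower x while the prediction still fits into the remaining quota.
    while x > 1 and expected + tf_stats[x - 1] * q < quota_left:
        x -= 1
        expected += tf_stats[x] * q
    return x
```

As a worked example, suppose 100 of 300 postings have been seen, so `q = 2`, with 60, 30, and 10 postings of frequency 1, 2, and 3. With 50 accumulators left, frequency 3 predicts 20 further postings, which fits, while frequencies 2 and above predict 80, which does not, so `T = 3`.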

Precomputing Score Contributions

For scoring algorithms that follow the bag-of-words paradigm, it is not necessary to compute the query terms' score contributions at query time.
Instead, we may precompute each posting's score contribution during index construction and store it in the index.
I.e., we could have postings of the form (doc_offset, score) instead of (doc_offset, tf).
This can greatly reduce CPU cost during query processing.
Of course, if you do this, it is impossible to make any changes to the scoring functions after the index has been built.
Also, it makes it harder to use common index compression techniques which work well with tf values.
The book points out that, rather than storing floats, one can discretize the scores, which alleviates the second problem to some degree.
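
One way to discretize precomputed scores is linear quantization to a small integer code, sketched below. This is an assumption for illustration only: the lecture does not say which discretization the book uses, and the 8-bit width, the function names, and the `max_score` parameter (an upper bound on any score contribution) are all hypothetical.

```python
def quantize(score, max_score, bits=8):
    """Map a score in [0, max_score] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    return round(score / max_score * levels)

def dequantize(code, max_score, bits=8):
    """Recover an approximate score from its integer code."""
    levels = (1 << bits) - 1
    return code / levels * max_score
```

The small integer codes can be stored in postings of the form (doc_offset, code) and compress much like tf values, at the price of a rounding error of at most about `max_score / (2 * levels)` per posting.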

Light-Weight Structures

We are now going to look at lightweight structures.
We have already given an ADT for inverted indices and shown that it could be used for phrase search as well as to compute simple relationships such as "this phrase is contained between these tags".
We now generalize this second query capability to region algebras.
Region algebras provide operators and functions for combining and manipulating text intervals in support of lightweight structure.
They represent an intermediate point between basic document retrieval and the complexity of full XML retrieval.

Concordance Lists

Region algebras typically work with sets of text intervals.
Any such interval can be expressed as a pair `[u, v]` indicating its start and end point.
We restrict our attention to sets of such intervals such that no interval in the set may have another interval from the set nested within it.
We call such a set of intervals a generalized concordance list or GC-list.
The name refers to paper-based concordances, alphabetical listings of words in a document along with the context in which they appeared.
Given two intervals `[u, v]` and `[u', v']`, we write `[u, v] subset [u', v']` to indicate that the first interval is a strict sub-interval of the second.
For example, {`[5,9]`, `[8, 12]`, `[15,20]`} is a GC-list even though `[5,9]` and `[8, 12]` overlap, because neither is nested in the other. However, {`[1, 10]`, `[5, 9]`, `[8,12]`, `[15, 20]`} is not a GC-list, because `[5, 9] subset [1, 10]`.
To make a GC-list out of a list of intervals we define the function `G(S)`:
`G(S) = {a | a in S mbox( and ) \neg exists b in S mbox( such that ) b subset a }`
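
The function `G(S)` can be sketched directly from its definition. This is a minimal sketch, assuming intervals are represented as `(start, end)` tuples and that "strict sub-interval" means contained in, but not equal to, the other interval; the quadratic-time check is chosen for clarity, not efficiency.

```python
def strictly_contains(outer, inner):
    """True if inner is a strict sub-interval of outer."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer != inner)

def G(intervals):
    """Keep exactly those intervals that have no other interval nested in them."""
    s = set(intervals)
    return sorted(a for a in s
                  if not any(strictly_contains(a, b) for b in s))
```

Applied to the second example above, `G` drops `[1, 10]` because `[5, 9]` is nested in it and returns the GC-list from the first example.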

Accumulator Pruning, Concordance Lists
