More Query Processing




CS267

Chris Pollett

Oct. 24, 2011

Outline

Introduction

MaxScore

Quiz

Which of the following is true?

  1. Both sort-based indexing and merge-based indexing might involve doing merging on disk.
  2. The contribution of an individual term in BM25 could grow without bound as a function of `f_(t,d)`.
  3. The worst-case query lookup when heaps are used is `Omega(N_q^2)` where `N_q` is the sum of the length of the posting lists of the query terms.
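Statement 2 can be checked numerically: in the standard BM25 term-frequency formula the per-term contribution saturates rather than growing without bound. A quick sketch (the formula and default parameters `k1`, `b` are the usual textbook ones, not taken from these slides):

```python
def tf_bm25(f, dl=1.0, avg_dl=1.0, k1=1.2, b=0.75):
    # Standard BM25 term-frequency component; bounded above by k1 + 1.
    return f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avg_dl))

# The contribution increases with f_(t,d) but never exceeds k1 + 1 = 2.2.
for f in (1, 10, 1000):
    print(tf_bm25(f))
```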

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of `N_t_i`;
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'

        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy accumulators before d that came from earlier t[j]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N_t_i) * TFBM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score;
                //advance inPos so d's old accumulator is not copied again
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; //end-of-list marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}
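The pseudocode above can be sketched as runnable Python. This is illustrative only: it assumes an in-memory index where `postings[t]` is a docid-sorted list of `(docid, tf)` pairs, and `tf_bm25` is the standard BM25 term-frequency formula; the accumulator merge mirrors the acc/acc' arrays on the slide.

```python
import heapq
from math import log

def tf_bm25(tf, dl, avg_dl, k1=1.2, b=0.75):
    # Standard BM25 term-frequency component.
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))

def rank_term_at_a_time(postings, doc_len, N, terms, k):
    avg_dl = sum(doc_len.values()) / len(doc_len)
    # Process terms in increasing order of document frequency, as on the slide.
    terms = sorted(terms, key=lambda t: len(postings[t]))
    INF = float("inf")
    acc = [(INF, 0.0)]  # (docid, score) pairs; infinity is the end-of-list marker
    for t in terms:
        idf = log(N / len(postings[t]))
        new_acc, in_pos = [], 0
        for d, tf in postings[t]:
            # Copy accumulators for documents that precede d.
            while acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(tf, doc_len[d], avg_dl)
            if acc[in_pos][0] == d:  # merge with d's existing accumulator
                score += acc[in_pos][1]
                in_pos += 1
            new_acc.append((d, score))
        new_acc.extend(acc[in_pos:])  # copy the rest, including the end marker
        acc = new_acc
    # Select the top k accumulators by score using a heap.
    return heapq.nlargest(k, acc[:-1], key=lambda p: p[1])
```

For example, with two terms whose posting lists overlap on one document, that document accumulates contributions from both terms and ranks first.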

The worst case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`.

  • Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set once for every term `t[i]`, so this is slightly slower than our heap-based document-at-a-time approach.
Accumulator Pruning

More Accumulator Pruning

Assigning Accumulators

Precomputing Score Contributions