Max Score, Accumulator Pruning, Term-at-a-time




CS267

Chris Pollett

Oct. 24, 2011

Outline

Introduction

MaxScore

HW Problem

Exercise. Assume `k_1` and `b` are 1.2 and 0.75, respectively. Suppose we have a corpus of 25 million documents. The word "big" appears in 1 million, "mac" appears in 25 thousand, and "lots" appears in 10 thousand. The average document is 300 words long. Document 27 contains the word "big" eight times, "mac" three times, and "lots" once. It is 700 words long. Calculate the BM25 score for Document 27 for the queries "big lots" and "big mac".

Answer. From last day we have,
`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
`TF_(BM25)(t,d) = frac(f_(t,d) \cdot (k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
The IDF scores for the three terms above are:
`IDF(\b\i\g) = log(frac(N)(N_(\b\i\g))) = log(frac(25 times 10^6)(10^6)) approx 4.64`,
`IDF(\m\a\c) = log(frac(N)(N_(\m\a\c))) = log(frac(25 times 10^6)(25 times 10^3)) approx 9.97`, and
`IDF(\l\o\t\s) = log(frac(N)(N_(\l\o\t\s))) = log(frac(25 times 10^6)(10^4)) approx 11.29`.
The term frequency scores for Document 27 (`d_(27)`) are:
`TF_(BM25)(t, d_(27)) = frac(f_(t,d_(27)) \cdot (k_1 +1))(f_(t,d_(27)) + k_1 cdot ((1-b) + b cdot(l_(d_(27)) / l_(avg)) ))`
`\qquad = frac(f_(t,d_(27)) \cdot (1.2 +1))(f_(t,d_(27)) + 1.2 cdot ((1 - 0.75 ) + 0.75 cdot(700 / 300)) )`
`\qquad = frac(f_(t,d_(27)) \cdot 2.2)(f_(t,d_(27)) + 2.4)`
`TF_(BM25)(\b\i\g, d_(27)) = frac(8 \cdot 2.2)(8 + 2.4) = 1.69`
`TF_(BM25)(\m\a\c, d_(27)) = frac(3 \cdot 2.2)(3 + 2.4) = 1.22`
`TF_(BM25)(\l\o\t\s, d_(27)) = frac(1 \cdot 2.2)(1 + 2.4) = 0.65`
So
`S\c\o\r\e_(BM25)(\mbox(big lots), d_(27)) = IDF(\b\i\g)cdot TF_(BM25)(\b\i\g, d_(27)) + IDF(\l\o\t\s)\cdot TF_(BM25)(\l\o\t\s, d_(27))`
`\qquad = 4.64 cdot 1.69 + 11.29 cdot 0.65 = 7.84 + 7.34 = 15.18`
and
`S\c\o\r\e_(BM25)(\mbox(big mac), d_(27)) = IDF(\b\i\g)cdot TF_(BM25)(\b\i\g, d_(27)) + IDF(\m\a\c)\cdot TF_(BM25)(\m\a\c, d_(27))`
`\qquad = 4.64 cdot 1.69 + 9.97 cdot 1.22 = 7.84 + 12.16 = 20.00`
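
To make the arithmetic easy to check, here is a minimal Python sketch of the same calculation. The function and variable names (`idf`, `tf_bm25`, `bm25`, and so on) are just for this illustration; `log2` matches the base-2 IDF values above.

from math import log2

def idf(N, N_t):
    # inverse document frequency with the base-2 log used above
    return log2(N / N_t)

def tf_bm25(f_td, l_d, l_avg, k1=1.2, b=0.75):
    # BM25 term-frequency component
    return (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * (l_d / l_avg)))

N, l_avg, l_d27 = 25 * 10**6, 300, 700
N_t = {"big": 10**6, "mac": 25 * 10**3, "lots": 10**4}
f_d27 = {"big": 8, "mac": 3, "lots": 1}

def bm25(query_terms):
    return sum(idf(N, N_t[t]) * tf_bm25(f_d27[t], l_d27, l_avg) for t in query_terms)

print(bm25(["big", "lots"]))  # ~15.16 (15.18 above, after rounding intermediate values)
print(bm25(["big", "mac"]))   # ~20.04 (20.00 above, after rounding intermediate values)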

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of `N_(t_i)`;
    acc := {}, acc' := {}; //initialize accumulators.
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'

        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                // copy accumulators for docids before d that came from earlier terms t[j]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N_(t[i])) * TF_BM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos].score; 
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; // end-of-list marker
        swap acc and acc';
    }
    return the top k items of acc; //select using heap
}

The worst-case complexity of this algorithm is `Theta(N_q cdot n + N_q cdot log(k))`.

  • Notice the `n` rather than `log n`. This is because we traverse the entire accumulator set for every term `t_i`, so this is slightly slower than our heap-based document-at-a-time approach.
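
As an illustration only, here is a minimal runnable Python sketch of the same term-at-a-time scoring. It uses a dictionary of accumulators in place of the two sorted arrays `acc` and `acc'`, and it assumes each term is supplied with its document frequency `N_t` and a docid-sorted posting list of `(docid, f_(t,d))` pairs together with a map from docids to document lengths; these input conventions and names are assumptions of the sketch, not part of the pseudocode above.

import heapq
from math import log2

def rank_bm25_term_at_a_time(terms, doc_len, N, l_avg, k, k1=1.2, b=0.75):
    # terms: list of (N_t, postings), where postings is a docid-sorted
    #        list of (docid, f_td) pairs; doc_len maps docid -> length
    terms = sorted(terms, key=lambda term: term[0])  # rarest terms first
    acc = {}                                         # docid -> partial BM25 score
    for N_t, postings in terms:
        term_idf = log2(N / N_t)
        for docid, f_td in postings:
            l_d = doc_len[docid]
            tf = (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * (l_d / l_avg)))
            acc[docid] = acc.get(docid, 0.0) + term_idf * tf
    # select the top k accumulators with a heap, as in the last line of the pseudocode
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

The dictionary hides the docid-ordered merge that the pseudocode makes explicit with `acc` and `acc'`; the sketch is only meant to show the scoring logic.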

Accumulator Pruning

More Accumulator Pruning

Assigning Accumulators

Precomputing Score Contributions