CS267
Chris Pollett
Oct. 17, 2018
Suppose we have a query containing 5 terms, each of which occurs in a 1/64 fraction of the documents in the corpus. By some miracle, document `d` has exactly the average document length and contains exactly one occurrence of each of these terms.
Write up your derivation and the final answer and post them to the Oct 17 In-Class Exercise Thread.
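One way to check a derivation for this exercise numerically is the small Python sketch below. It assumes the quantity of interest is the BM25 score of `d`, that the IDF factor uses a base-2 logarithm, and that `k1 = 1.2` and `b = 0.75`; none of these are stated in the exercise itself, and the function names are made up for illustration. Under these assumptions each TF factor is 1 (because `d` has average length and each term occurs once, `k1` and `b` cancel), each IDF factor is `log2(64) = 6`, and the score comes out to 30.

```python
import math

def bm25_tf(f_td, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor for a term occurring f_td times in a document."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * doc_len / avg_len))

def bm25_score(doc_fractions, f_td=1, doc_len=1.0, avg_len=1.0, k1=1.2, b=0.75):
    """Sum log2(N/N_t) * TF over the query terms.

    doc_fractions holds N_t/N for each query term (assumed input format).
    """
    return sum(
        math.log2(1.0 / frac) * bm25_tf(f_td, doc_len, avg_len, k1, b)
        for frac in doc_fractions
    )

# Five terms, each in 1/64 of the documents; d has average length, one occurrence each.
score = bm25_score([1 / 64] * 5)
print(score)
```

With a different logarithm base the TF factor is unchanged and only the per-term weight `log(64)` changes.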
We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one to manage the query terms and, for each term t, keep track of the next document that contains t; the other one to maintain the set of the top `k` search results seen so far:
rankBM25_DocumentAtATime_WithHeaps((t[1], ..., t[n]), k) {
    // both heaps are stored in arrays indexed from 1; the root is at index 1
    for (i = 1 to k) { // create a min-heap for the top k results
        results[i].score := 0;
    }
    for (i = 1 to n) { // create a min-heap for the n query terms
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establish heap property for terms
    while (terms[1].nextDoc < infty) {
        d := terms[1].nextDoc;
        score := 0;
        while (terms[1].nextDoc == d) {
            t := terms[1].term;
            score += log(N/N_t) * TF_(BM25)(t, d);
            terms[1].nextDoc := nextDoc(t, d);
            REHEAP(terms); // restore the heap property for terms
        }
        if (score > results[1].score) {
            results[1].docid := d;
            results[1].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}
The complexity of this algorithm is `Theta(N_q \cdot log(n) + N_q \cdot log(k))`, where `N_q = N_(t[1]) + ... + N_(t[n])` is the total number of postings of the query terms.
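The two-heap, document-at-a-time idea can be sketched in Python with the standard `heapq` module. This is a minimal illustration under several assumptions, not the course's reference implementation: postings are assumed to be in-memory lists of `(docid, tf)` pairs sorted by docid, the logarithm is taken base 2, `k1 = 1.2` and `b = 0.75`, and the function name and data layout are invented for the example. Instead of setting `nextDoc` to infinity, an exhausted term is simply dropped from the term heap, and the result heap grows up to size `k` rather than being pre-filled with zero scores.

```python
import heapq
import math

def rank_bm25_daat(postings, num_docs, doc_len, query, k, k1=1.2, b=0.75):
    """postings: term -> [(docid, tf), ...] sorted by docid; doc_len: docid -> length."""
    if k <= 0:
        return []
    avg_len = sum(doc_len.values()) / len(doc_len)

    def tf_bm25(f, d):
        return f * (k1 + 1) / (f + k1 * ((1 - b) + b * doc_len[d] / avg_len))

    # Term heap: (next docid, term, position in that term's postings list).
    terms = [(postings[t][0][0], t, 0) for t in query if postings.get(t)]
    heapq.heapify(terms)
    results = []  # min-heap of (score, docid) holding at most k entries

    while terms:
        d = terms[0][0]  # smallest docid not yet scored
        score = 0.0
        while terms and terms[0][0] == d:
            _, t, pos = heapq.heappop(terms)
            plist = postings[t]
            score += math.log2(num_docs / len(plist)) * tf_bm25(plist[pos][1], d)
            if pos + 1 < len(plist):  # advance this term to its next document
                heapq.heappush(terms, (plist[pos + 1][0], t, pos + 1))
        if len(results) < k:
            heapq.heappush(results, (score, d))
        elif score > results[0][0]:  # beats the weakest of the current top k
            heapq.heapreplace(results, (score, d))

    return [(d, s) for s, d in sorted(results, reverse=True)]

# Toy index (hypothetical data): four documents of equal length.
postings = {"a": [(1, 1), (2, 2), (4, 1)], "b": [(2, 1), (3, 1)]}
doc_len = {1: 10, 2: 10, 3: 10, 4: 10}
top = rank_bm25_daat(postings, 4, doc_len, ["a", "b"], k=2)
print(top)
```

Each posting triggers one operation on the term heap, costing `O(log n)`, and at most one on the result heap, costing `O(log k)`, which is where the two terms of the bound come from.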
rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; // initialize accumulators: acc holds the previous round, acc' the next
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; // current position in acc
        outPos := 0; // current position in acc'
        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                // copy accumulators for documents that do not contain t[i]
                acc'[outPos++] := acc[inPos++];
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N[t[i]]) * TF_(BM25)(t[i], d);
            if (acc[inPos].docid == d) {
                // d already has an accumulator: add its score and move past it
                acc'[outPos].score += acc[inPos++].score;
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; // end-of-list marker
        swap acc and acc';
    }
    return the top k items of acc; // select using a heap
}
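The term-at-a-time accumulator merge can be sketched in Python as follows. As before, this is only an illustration under assumptions: postings are in-memory lists of `(docid, tf)` pairs sorted by docid, the logarithm is base 2, `k1 = 1.2` and `b = 0.75`, and the function name and data layout are hypothetical. Bounds checks on the list length stand in for the infinity end-of-list marker used in the pseudocode.

```python
import heapq
import math

def rank_bm25_taat(postings, num_docs, doc_len, query, k, k1=1.2, b=0.75):
    """postings: term -> [(docid, tf), ...] sorted by docid; doc_len: docid -> length."""
    avg_len = sum(doc_len.values()) / len(doc_len)

    def tf_bm25(f, d):
        return f * (k1 + 1) / (f + k1 * ((1 - b) + b * doc_len[d] / avg_len))

    # Process terms from rarest to most frequent, as in the pseudocode.
    terms = sorted((t for t in query if postings.get(t)),
                   key=lambda t: len(postings[t]))
    acc = []  # accumulators from the previous round: (docid, score), sorted by docid
    for t in terms:
        plist = postings[t]
        idf = math.log2(num_docs / len(plist))
        new_acc = []  # plays the role of acc'
        in_pos = 0
        for d, f in plist:
            # Carry over accumulators for documents that do not contain t.
            while in_pos < len(acc) and acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(f, d)
            if in_pos < len(acc) and acc[in_pos][0] == d:
                score += acc[in_pos][1]  # d already had an accumulator
                in_pos += 1              # consume it so it is not copied again
            new_acc.append((d, score))
        new_acc.extend(acc[in_pos:])  # copy the remaining accumulators
        acc = new_acc
    # Select the k highest-scoring accumulators with a heap.
    return heapq.nlargest(k, acc, key=lambda entry: entry[1])

# Toy index (hypothetical data): four documents of equal length.
postings = {"a": [(1, 1), (2, 2), (4, 1)], "b": [(2, 1), (3, 1)]}
doc_len = {1: 10, 2: 10, 3: 10, 4: 10}
top_taat = rank_bm25_taat(postings, 4, doc_len, ["a", "b"], k=2)
print(top_taat)
```

Note the `in_pos += 1` after a matching accumulator is merged: without it the old accumulator for `d` would be copied again on the next iteration and the document would appear twice.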