CS267
Chris Pollett
Oct. 17, 2018
Suppose we have a query that contains 5 terms, each of which occurs in a 1/64 fraction of all the documents in the corpus. By some miracle, document `d` has exactly the average length and contains exactly one occurrence of each of these terms.
Write up your derivation and the final answer and post them to the Oct 17 In-Class Exercise Thread.
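As a sanity check on the exercise, here is a minimal Python sketch. It assumes that the quantity asked for is the BM25 score of `d`, that the score is the sum over the query terms of log(N/N_t) * TF_BM25(t, d) (the form used in the algorithms in these notes), that logarithms are base 2, and that k1 = 1.2 and b = 0.75. None of these choices is stated in the exercise text itself. The names `tf_bm25` and `bm25_score` and the document length of 100 are illustrative only.

```python
import math


def tf_bm25(f_td, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component for a term occurring f_td times in d.

    k1 and b are assumed default values, not taken from the exercise.
    """
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * doc_len / avg_len))


def bm25_score(term_stats, doc_len, avg_len, k1=1.2, b=0.75):
    """Sum of log2(N / N_t) * TF_BM25(t, d) over the query terms.

    term_stats holds one (N_t / N, f_td) pair per query term.
    The base-2 logarithm is an assumption.
    """
    return sum(
        math.log2(1 / fraction) * tf_bm25(f_td, doc_len, avg_len, k1, b)
        for fraction, f_td in term_stats
    )


# Five terms, each in 1/64 of the documents, each occurring once in d.
# d has the average length, so any equal pair of lengths gives the same result.
score = bm25_score([(1 / 64, 1)] * 5, doc_len=100, avg_len=100)
print(score)
```

Because `d` has exactly the average length, the length normalization factor equals 1 and TF_BM25 reduces to 1 for a single occurrence, so under these assumptions each term contributes log2(64) = 6 and the result does not depend on k1 or b.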
We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one to manage the query terms and, for each term `t`, keep track of the next document that contains `t`; the other one to maintain the set of the top `k` search results seen so far:
rankBM25_DocumentAtATime_WithHeaps((t[1], .. t[n]), k) {
// create a min-heap for the top k results; results[0] is the root (lowest score)
for (i = 0 to k-1) {
results[i].score := 0;
}
for (i = 0 to n-1) {
terms[i].term := t[i+1];
terms[i].nextDoc := nextDoc(t[i+1], -infty);
}
sort terms in increasing order of nextDoc; // establish the heap for terms; terms[0] is the root
while (terms[0].nextDoc < infty) {
d := terms[0].nextDoc;
score := 0;
while(terms[0].nextDoc == d) {
t := terms[0].term;
score += log(N/N_t)*TF_(BM25)(t,d);
terms[0].nextDoc := nextDoc(t,d);
REHEAP(terms); // restore heap property for terms;
}
if(score > results[0].score) {
results[0].docid := d;
results[0].score := score;
REHEAP(results); // restore the heap property for results
}
}
remove from results all items with score = 0;
sort results in decreasing order of score;
return results;
}
The complexity of this algorithm is `Theta(N_q cdot log(n) + N_q cdot log(k))`, where `N_q` is the total number of postings of all the query terms.
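The heap-based document-at-a-time algorithm above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the course's reference implementation: the index layout (a dict from term to a docid-sorted list of `(docid, f_td)` pairs), the base-2 logarithm, the values k1 = 1.2 and b = 0.75, and the names `rank_bm25_daat`, `tf_bm25`, `index`, and `doc_lens` are all hypothetical. It also departs from the pseudocode in two small ways: exhausted terms are dropped from the term heap instead of being given an infinite `nextDoc`, and the result heap grows up to size `k` instead of starting with `k` zero-score entries. It assumes `k >= 1`.

```python
import heapq
import math


def tf_bm25(f_td, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (k1 and b are assumed defaults)."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * doc_len / avg_len))


def rank_bm25_daat(query, index, doc_lens, k):
    """Return up to k (docid, score) pairs in decreasing order of score.

    index maps term -> docid-sorted list of (docid, f_td);
    doc_lens maps docid -> document length. Both layouts are hypothetical.
    """
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    # Term heap: one (next docid, term, position in postings list) per term.
    terms = [(index[t][0][0], t, 0) for t in query if index.get(t)]
    heapq.heapify(terms)
    results = []  # min-heap of (score, docid) holding at most k entries
    while terms:
        d = terms[0][0]  # smallest docid not yet scored
        score = 0.0
        # Pop every term whose next posting is for document d.
        while terms and terms[0][0] == d:
            _, t, pos = heapq.heappop(terms)
            postings = index[t]
            idf = math.log2(n_docs / len(postings))
            score += idf * tf_bm25(postings[pos][1], doc_lens[d], avg_len)
            if pos + 1 < len(postings):  # advance t to its next document
                heapq.heappush(terms, (postings[pos + 1][0], t, pos + 1))
        if score > 0:
            if len(results) < k:
                heapq.heappush(results, (score, d))
            elif score > results[0][0]:
                heapq.heapreplace(results, (score, d))  # evict current minimum
    return [(d, s) for s, d in sorted(results, reverse=True)]


# Toy corpus of four documents of equal length (made-up data).
index = {
    "quick": [(1, 1), (3, 2)],
    "fox": [(1, 1), (2, 1), (3, 1)],
}
doc_lens = {1: 4, 2: 4, 3: 4, 4: 4}
top = rank_bm25_daat(["quick", "fox"], index, doc_lens, k=2)
print(top)
```

Each posting costs one pop and at most one push on the term heap (size at most `n`), and each scored document costs at most one operation on the result heap (size at most `k`), which matches the two terms in the complexity bound.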
rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
sort(t) in increasing order of N[t[i]];
acc := {}, acc' := {}; //initialize accumulators.
//acc used for previous round, acc' for next
acc[0].docid := infty // end-of-list marker
for i := 1 to n do {
inPos := 0; //current pos in acc
outPos := 0; // current position in acc'
foreach document d in t[i]'s posting list do {
while acc[inPos].docid < d do {
acc'[outPos++] := acc[inPos++];
//copy previous round to current for docs not containing t[i]
}
acc'[outPos].docid := d;
acc'[outPos].score := log(N/N[t[i]]) * TF_(BM25)(t[i], d);
if(acc[inPos].docid == d) {
acc'[outPos].score += acc[inPos++].score; // consume the old accumulator for d
}
outPos++;
}
while acc[inPos].docid < infty do { // copy remaining acc to acc'
acc'[outPos++] := acc[inPos++];
}
acc'[outPos].docid := infty; // end-of-list marker
swap acc and acc'
}
return the top k items of acc; //select using heap
}
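The term-at-a-time procedure above can be sketched in Python along the following lines. As before, this is a sketch under assumptions: the index layout (term to a docid-sorted list of `(docid, f_td)` pairs), the base-2 logarithm, k1 = 1.2, b = 0.75, and the names `rank_bm25_taat`, `tf_bm25`, `index`, and `doc_lens` are hypothetical. Bounds checks on the accumulator list stand in for the `infty` end-of-list marker, and `heapq.nlargest` does the final heap-based selection.

```python
import heapq
import math


def tf_bm25(f_td, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (k1 and b are assumed defaults)."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * doc_len / avg_len))


def rank_bm25_taat(query, index, doc_lens, k):
    """Return up to k (docid, score) pairs in decreasing order of score.

    index maps term -> docid-sorted list of (docid, f_td);
    doc_lens maps docid -> document length. Both layouts are hypothetical.
    """
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    # Process terms from the shortest postings list to the longest.
    terms = sorted((t for t in query if index.get(t)), key=lambda t: len(index[t]))
    acc = []  # accumulators from the previous round, sorted by docid
    for t in terms:
        postings = index[t]
        idf = math.log2(n_docs / len(postings))
        new_acc = []  # accumulators being built in this round
        in_pos = 0    # current position in acc
        for d, f_td in postings:
            # Copy accumulators for documents before d that do not contain t.
            while in_pos < len(acc) and acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(f_td, doc_lens[d], avg_len)
            if in_pos < len(acc) and acc[in_pos][0] == d:
                score += acc[in_pos][1]
                in_pos += 1  # consume the old accumulator so it is not copied again
            new_acc.append((d, score))
        new_acc.extend(acc[in_pos:])  # copy the remaining accumulators
        acc = new_acc
    return heapq.nlargest(k, acc, key=lambda entry: entry[1])


# Toy corpus of four documents of equal length (made-up data).
index = {
    "quick": [(1, 1), (3, 2)],
    "fox": [(1, 1), (2, 1), (3, 1)],
}
doc_lens = {1: 4, 2: 4, 3: 4, 4: 4}
top = rank_bm25_taat(["quick", "fox"], index, doc_lens, k=2)
print(top)
```

Building a fresh accumulator list in each round corresponds to writing into `acc'` and then swapping `acc` and `acc'` in the pseudocode. Both lists stay sorted by docid, so each round is a linear merge of the old accumulators with the current term's postings.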