Query Processing




CS267

Chris Pollett

Oct. 17, 2018

Outline

Query Processing

Query Processing for Ranked Retrieval

Okapi BM25

BM25 Example

In-Class Exercise

Suppose we have a query that contains 5 terms, each of which occurs in 1/64 of the documents in the corpus. By some miracle, document `d` has exactly the average length and contains one occurrence of each of these terms.

  1. What would be the BM25 score for document `d`?
  2. What if document `d` were twice the average length?

Write up your derivation and the final answer and post them to the Oct 17 In-Class Exercise Thread.
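
As a way to check a derivation, the following is a minimal Python sketch of the BM25 sum for this scenario. It assumes the commonly used parameter values `k1 = 1.2` and `b = 0.75` and a base-2 logarithm for `log(N/N_t)`; none of these are stated on this slide, so adjust them if the course uses other values. The function name `bm25_score` and its arguments are hypothetical.

```python
import math

def bm25_score(term_freqs, doc_fractions, length_ratio, k1=1.2, b=0.75):
    """Sum log(N/N_t) * TF_BM25 over the query terms.

    term_freqs[i]    : occurrences of term i in the document
    doc_fractions[i] : fraction N_t/N of documents that contain term i
    length_ratio     : document length divided by the average document length
    k1, b            : assumed BM25 parameters (not given in the exercise)
    """
    score = 0.0
    for f, frac in zip(term_freqs, doc_fractions):
        idf = math.log2(1.0 / frac)  # log(N/N_t), base 2 assumed
        tf = f * (k1 + 1) / (f + k1 * ((1 - b) + b * length_ratio))
        score += idf * tf
    return score

# Five terms, each in 1/64 of the documents, one occurrence each.
avg_len_score = bm25_score([1] * 5, [1 / 64] * 5, length_ratio=1.0)
double_len_score = bm25_score([1] * 5, [1 / 64] * 5, length_ratio=2.0)
print(avg_len_score, double_len_score)
```

Under these assumptions each term contributes `log_2(64) = 6` at average length, because the TF factor is exactly 1. At twice the average length the TF factor drops to `2.2/3.1`.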

Document-at-a-Time Query Processing

Binary Heaps

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps. The first manages the query terms and, for each term `t`, keeps track of the next document that contains `t`. The second maintains the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], .. t[n]), k) {
    // create a min-heap for the top k results; results[0] is the root (lowest score)
    for (i = 0 to k-1) {
       results[i].score := 0;
    }
    for (i = 0 to n-1) {
        terms[i].term := t[i+1];
        terms[i].nextDoc := nextDoc(t[i+1], -infty);
    }
    sort terms in increasing order of nextDoc // establish min-heap for terms; terms[0] is the root
    while (terms[0].nextDoc < infty) {
        d := terms[0].nextDoc;
        score := 0;
        while(terms[0].nextDoc == d) {
            t := terms[0].term;
            score += log(N/N_t)*TF_(BM25)(t,d);
            terms[0].nextDoc := nextDoc(t,d);
            REHEAP(terms); // restore heap property for terms;
        }
        if(score > results[0].score) {
            results[0].docid := d;
            results[0].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

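The complexity of this algorithm is `Theta(N_q \cdot log(n) + N_q \cdot log(k))`, where `N_q` is the total number of postings for the query terms, `n` is the number of query terms, and `k` is the number of results requested.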
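
The pseudocode above can be sketched in Python using the standard `heapq` module. This is an illustrative sketch, not the course's reference implementation. The toy in-memory index format (term mapped to a sorted list of `(docid, frequency)` pairs), the helper `tf_bm25`, the parameters `k1 = 1.2` and `b = 0.75`, and the base-2 logarithm are all assumptions. It also differs from the pseudocode in two small ways: exhausted terms are dropped from the term heap instead of being given a `nextDoc` of infinity, and the result heap grows up to size `k` instead of starting with `k` zero-score entries.

```python
import heapq
import math

def tf_bm25(f, doc_len, avg_len, k1=1.2, b=0.75):
    # BM25 term-frequency component; k1 and b are assumed defaults.
    return f * (k1 + 1) / (f + k1 * ((1 - b) + b * doc_len / avg_len))

def rank_bm25_daat(index, doc_lens, query, k):
    """index: term -> sorted list of (docid, term frequency) postings."""
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    # Term heap entries: (next docid, term, position in the posting list).
    terms = [(index[t][0][0], t, 0) for t in query if index.get(t)]
    heapq.heapify(terms)
    results = []  # min-heap of (score, docid) holding at most k entries
    while terms:
        d = terms[0][0]  # smallest docid not yet scored
        score = 0.0
        while terms and terms[0][0] == d:
            _, t, pos = heapq.heappop(terms)
            postings = index[t]
            idf = math.log2(n_docs / len(postings))
            score += idf * tf_bm25(postings[pos][1], doc_lens[d], avg_len)
            if pos + 1 < len(postings):  # advance this term to its next doc
                heapq.heappush(terms, (postings[pos + 1][0], t, pos + 1))
        if len(results) < k:
            heapq.heappush(results, (score, d))
        elif score > results[0][0]:
            heapq.heapreplace(results, (score, d))  # evict the current minimum
    ranked = [(d, s) for s, d in results if s > 0]
    return sorted(ranked, key=lambda pair: -pair[1])

# Toy corpus: four documents of equal length (hypothetical data).
index = {
    "quick": [(1, 1), (3, 2)],
    "fox": [(1, 1), (2, 1), (3, 1)],
    "dog": [(2, 2)],
}
doc_lens = {1: 4, 2: 4, 3: 4, 4: 4}
top = rank_bm25_daat(index, doc_lens, ["quick", "fox"], k=2)
print(top)
```

Each posting causes one update of the term heap and at most one update of the result heap, which is where the `log(n)` and `log(k)` factors in the stated bound come from.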

MaxScore

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort(t) in increasing order of N[t[i]];
    acc := {}, acc' := {}; //initialize accumulators.
      //acc used for previous round, acc' for next
    acc[0].docid := infty // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'
        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy previous round to current for docs not containing t[i]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N[t[i]]) * TF_(BM25)(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score; // consume the merged accumulator
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid :=infty; //end-of-list-marker
        swap acc and acc'
    }
    return the top k items of acc; //select using heap
}
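
The accumulator merge above can be sketched in Python as follows. This is an illustrative sketch under assumptions: the toy in-memory index format (term mapped to a sorted list of `(docid, frequency)` pairs), the helper `tf_bm25`, the parameters `k1 = 1.2` and `b = 0.75`, and the base-2 logarithm are not taken from the slides. Accumulators are stored as `(docid, score)` tuples, and the final top-`k` selection uses `heapq.nlargest`.

```python
import heapq
import math

INF = float("inf")

def tf_bm25(f, doc_len, avg_len, k1=1.2, b=0.75):
    # BM25 term-frequency component; k1 and b are assumed defaults.
    return f * (k1 + 1) / (f + k1 * ((1 - b) + b * doc_len / avg_len))

def rank_bm25_taat(index, doc_lens, query, k):
    """index: term -> sorted list of (docid, term frequency) postings."""
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / n_docs
    # Process terms in increasing order of posting-list length (N_t).
    terms = sorted((t for t in query if index.get(t)),
                   key=lambda t: len(index[t]))
    acc = [(INF, 0.0)]  # (docid, score) accumulators plus end-of-list marker
    for t in terms:
        postings = index[t]
        idf = math.log2(n_docs / len(postings))
        new_acc = []
        in_pos = 0
        for d, f in postings:
            # Copy accumulators for documents that do not contain t.
            while acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            score = idf * tf_bm25(f, doc_lens[d], avg_len)
            if acc[in_pos][0] == d:
                score += acc[in_pos][1]
                in_pos += 1  # consume the merged accumulator
            new_acc.append((d, score))
        # Copy the rest of the previous round, including the end marker.
        new_acc.extend(acc[in_pos:])
        acc = new_acc
    candidates = [(d, s) for d, s in acc if d != INF]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Toy corpus: four documents of equal length (hypothetical data).
index = {
    "quick": [(1, 1), (3, 2)],
    "fox": [(1, 1), (2, 1), (3, 1)],
    "dog": [(2, 2)],
}
doc_lens = {1: 4, 2: 4, 3: 4, 4: 4}
top = rank_bm25_taat(index, doc_lens, ["quick", "fox"], k=2)
print(top)
```

Because every posting list is sorted by docid, each round is a linear merge of the previous accumulator list with one posting list, and a document's accumulator appears exactly once in the output.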