Query Processing




CS267

Chris Pollett

Apr 5, 2021

Outline

Query Processing

Query Processing for Ranked Retrieval

Okapi BM25

BM25 Example

Quiz

Which of the following is true?

  1. The move-to-front heuristic is used as part of sort-based dictionary construction.
  2. In merge-based index construction, if the index is too big to fit in memory, the in-memory index is written to disk as a file called a partition.
  3. A per-term index is an index with both a sort-based and a hash-based dictionary.

Document-at-a-Time Query Processing

Binary Heaps

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps. One heap manages the query terms, keeping track, for each term t, of the next document that contains t. The other heap maintains the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], ..., t[n]), k) {
    // create a min-heap for the top k results
    for (i = 1 to k) {
        results[i].score := 0;
    }
    // create a min-heap for the query terms
    for (i = 1 to n) {
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establishes the heap property for terms
    while (terms[1].nextDoc < infty) {
        d := terms[1].nextDoc;
        score := 0;
        while(terms[1].nextDoc == d) {
            t := terms[1].term;
            score += log(N/N_t)*TF_(BM25)(t,d);
            terms[1].nextDoc := nextDoc(t,d);
            REHEAP(terms); // restore heap property for terms;
        }
        if(score > results[1].score) {
            results[1].docid := d;
            results[1].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

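The complexity of this algorithm is `Theta(N_q cdot log(n) + N_q cdot log(k))`, where `N_q = N_(t_1) + ... + N_(t_n)` is the total number of postings of all query terms. Each posting triggers one REHEAP on the n-element terms heap, and each scored document triggers at most one REHEAP on the k-element results heap.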
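
As a rough illustration of the pseudocode above, here is a minimal Python sketch built on the standard `heapq` module. The function name, the `postings` layout (each term mapped to a docid-sorted list of `(docid, tf)` pairs), the `doc_lengths` dictionary, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions made for this example and do not come from the slides. The term-frequency component assumes the usual BM25 form. Unlike the pseudocode, exhausted terms are popped from the terms heap instead of being set to infinity, and the results heap grows up to size k instead of being prefilled with zero scores.

```python
import heapq
import math


def rank_bm25_daat_heaps(query_terms, k, postings, doc_lengths, k1=1.2, b=0.75):
    """Document-at-a-time BM25 ranking with two heaps (illustrative sketch).

    postings:    term -> list of (docid, tf) pairs sorted by docid (assumed layout)
    doc_lengths: docid -> document length in tokens (assumed layout)
    Returns up to k (docid, score) pairs in decreasing order of score.
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs

    def tf_bm25(tf, docid):
        # Usual BM25 term-frequency component with length normalization.
        norm = (1 - b) + b * doc_lengths[docid] / avg_len
        return tf * (k1 + 1) / (tf + k1 * norm)

    # Terms heap: entries are (next docid, term, position in the postings list).
    terms = []
    for t in query_terms:
        plist = postings.get(t, [])
        if plist:
            terms.append((plist[0][0], t, 0))
    heapq.heapify(terms)

    results = []  # min-heap of (score, docid) holding at most k entries
    while terms:
        d = terms[0][0]  # smallest docid not yet scored
        score = 0.0
        # Consume every term whose next posting is document d.
        while terms and terms[0][0] == d:
            _, t, pos = terms[0]
            plist = postings[t]
            # Natural log is used here; the base only rescales all scores.
            score += math.log(n_docs / len(plist)) * tf_bm25(plist[pos][1], d)
            if pos + 1 < len(plist):
                # Advance this term to its next posting and restore the heap.
                heapq.heapreplace(terms, (plist[pos + 1][0], t, pos + 1))
            else:
                heapq.heappop(terms)  # term has no more postings
        if score > 0 and len(results) < k:
            heapq.heappush(results, (score, d))
        elif results and score > results[0][0]:
            # Replace the current k-th best result.
            heapq.heapreplace(results, (score, d))

    return [(d, s) for s, d in sorted(results, reverse=True)]
```

Calling `rank_bm25_daat_heaps(["quick", "dog"], 2, postings, doc_lengths)` on a small in-memory index returns the two best `(docid, score)` pairs. Documents whose score is 0 never enter the results heap, which mirrors the final "remove all items with score = 0" step of the pseudocode.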

MaxScore

Term-at-a-Time Query Processing