Query Processing




CS267

Chris Pollett

Oct. 15, 2012

Outline

Query Processing

Query Processing for Ranked Retrieval

Quiz

Which of the following is true?

  1. Synchronization points are typically stored in the index dictionary.
  2. With the move-to-front heuristic one can reduce the table-size, excluding chains, of a hash table implementation of an inverted index dictionary, yet still achieve good performance.
  3. Merge-based Index construction makes use of n-way disk merge sorts.

HW Problem

Exercise 4.2. Implement a fake posting list lookup in two different ways to see whether a per-term index can help even when the posting lists are in RAM. In the first method, assume you have n 32-bit postings and do a lookup using binary search. In the second method, keep a per-term index entry for every 64th posting; do a binary search on the per-term index, then a sequential scan of the remainder. Test `n = 2^(12)`, `2^(16)`, `2^(20)`, `2^(24)` and analyze your findings.

Answer. I didn't actually get around to implementing this, but let me tell you roughly what I would expect. In terms of comparisons, method 1 would require `12`, `16`, `20`, and `24` comparisons on average, respectively, to do the lookup. For method 2, the binary search part would take `log(n/2^6)` comparisons, followed by `32` comparisons on average for the sequential scan. So the average number of comparisons would be `38`, `42`, `46`, and `50`, respectively. This is not the whole story, though, because of the way memory is read into the CPU cache. Some description of this is given in Appendix A.2. Memory is read from RAM into the cache in lines of 64 bytes. So if each binary search comparison resulted in a cache miss, we would still expect time proportional to the number of comparisons/cache-line lookups in the first case, but in the second case it would be proportional to `6 + 2 = 8`, `10 + 2 = 12`, `14 + 2 = 16`, and `18 + 2 = 20` cache-line reads. The `2` is because the expected `32` postings scanned sequentially (32 postings × 4 bytes = 128 bytes) fit in 2 cache lines.
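The two lookup methods can be sketched in Python as follows (a sketch only; the `stride = 64` default matches the per-term index granularity in the exercise, and timing/measurement code is omitted):

```python
import bisect

def lookup_binary(postings, target):
    """Method 1: plain binary search over all n postings."""
    i = bisect.bisect_left(postings, target)
    return i if i < len(postings) and postings[i] == target else -1

def build_per_term_index(postings, stride=64):
    """Keep every 64th posting as a per-term index entry."""
    return postings[::stride]

def lookup_per_term(postings, index, target, stride=64):
    """Method 2: binary search the small per-term index to find the
    right block of 64 postings, then scan that block sequentially."""
    block = bisect.bisect_right(index, target) - 1
    if block < 0:
        return -1
    start = block * stride
    for i in range(start, min(start + stride, len(postings))):
        if postings[i] == target:
            return i
        if postings[i] > target:
            break
    return -1
```

Wrapping each lookup in a timing loop over the four values of `n` would let you check the cache-behavior prediction above experimentally.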

Okapi BM25
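The body of this slide is not in the transcript; for reference, the usual Okapi BM25 ranking formula, in the notation the pseudocode below uses (with `TF_BM25` as the term-frequency component), is:

```latex
\mathrm{Score}_{\mathrm{BM25}}(q, d)
  = \sum_{t \in q} \log\!\frac{N}{N_t} \cdot \mathrm{TF}_{\mathrm{BM25}}(t, d),
\qquad
\mathrm{TF}_{\mathrm{BM25}}(t, d)
  = \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\!\left((1 - b) + b \cdot \frac{l_d}{l_{\mathrm{avg}}}\right)}
```

Here `f_{t,d}` is the frequency of term `t` in document `d`, `l_d` is the length of `d`, `l_avg` is the average document length, `N` is the number of documents, `N_t` is the number of documents containing `t`, and `k_1` (typically around 1.2) and `b` (typically around 0.75) are tuning parameters.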

BM25 Example

Document-at-a-Time Query Processing

Binary Heaps
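The body of this slide is not in the transcript, but the REHEAP operation used by the algorithm below is an ordinary sift-down on a binary min-heap. A minimal sketch (the `key` parameter is my addition, so the same routine serves both the terms heap, keyed on nextDoc, and the results heap, keyed on score):

```python
def reheap(heap, key, i=0):
    """Sift the element at index i down until the min-heap property
    (key(parent) <= key(child)) is restored; O(log n) swaps."""
    n = len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and key(heap[left]) < key(heap[smallest]):
            smallest = left
        if right < n and key(heap[right]) < key(heap[smallest]):
            smallest = right
        if smallest == i:
            return  # heap property holds; done
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
```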

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one over the query terms that, for each term t, keeps track of the next document containing t; and a min-heap that maintains the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], ..., t[n]), k) {
    for (i = 1 to k) { // create a min-heap of k empty top results
        results[i].score := 0;
    }
    for (i = 1 to n) { // create a min-heap over the query terms
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establish heap for terms
    while (terms[0].nextDoc < infty) {
        d := terms[0].nextDoc;
        score := 0;
        while (terms[0].nextDoc == d) { // accumulate score over all query terms in d
            t := terms[0].term;
            score += log(N/N_t) * TF_BM25(t, d);
            terms[0].nextDoc := nextDoc(t, d);
            REHEAP(terms); // restore the heap property for terms
        }
        if (score > results[0].score) { // d beats the current k-th best result
            results[0].docid := d;
            results[0].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}
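A compact executable version of the pseudocode above, assuming posting lists are sorted Python lists and using `heapq` in place of the explicit REHEAP calls (the `score_term` callback is assumed to compute `log(N/N_t) * TF_BM25(t, d)`; it is passed in rather than implemented here):

```python
import heapq
from bisect import bisect_right

def next_doc(postings, current):
    """Smallest docid in `postings` strictly greater than `current`."""
    i = bisect_right(postings, current)
    return postings[i] if i < len(postings) else float('inf')

def rank_bm25_daat(index, query_terms, k, score_term):
    """index: term -> sorted list of docids containing that term.
    Returns up to k (score, docid) pairs in decreasing score order."""
    # min-heap over terms, keyed on the next document containing each term
    terms = [(next_doc(index[t], float('-inf')), t) for t in query_terms]
    heapq.heapify(terms)
    # min-heap of (score, docid) holding the top k results seen so far;
    # docid -1 marks an empty slot
    results = [(0.0, -1)] * k
    while terms[0][0] < float('inf'):
        d = terms[0][0]
        score = 0.0
        while terms[0][0] == d:  # accumulate all query terms present in d
            _, t = terms[0]
            score += score_term(t, d)
            # advance this term's posting cursor and restore the heap
            heapq.heapreplace(terms, (next_doc(index[t], d), t))
        if score > results[0][0]:  # d beats the current k-th best result
            heapq.heapreplace(results, (score, d))
    return sorted(((s, d) for s, d in results if s > 0), reverse=True)
```

With a toy index such as `{"a": [1, 2, 5], "b": [2, 5, 7]}` and a constant `score_term`, documents 2 and 5 (which contain both terms) come out on top.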

The complexity of this algorithm is `Theta(m \cdot log(n) + m \cdot log(k))`, where `n` is the number of query terms, `m` is the total number of postings processed across all query terms, and `k` is the number of results requested.