Heap query processing, Accumulator Pruning, Concordance Lists




CS267

Chris Pollett

Oct 25, 2023

Outline

Introduction

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: one manages the query terms and, for each term `t`, keeps track of the next document that contains `t`; the other maintains the set of the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], .. t[n]), k) {
    // create a min-heap for top k results
    for(i = 1 to k) {
       results[i].score := 0;
    } 
    // create a min-heap for terms
    for (i = 1 to n) { 
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establishes the heap property for terms
    while (terms[1].nextDoc < infty) {
        d := terms[1].nextDoc;
        score := 0;
        while(terms[1].nextDoc == d) {
            t := terms[1].term;
            score += log(N/N_t) * TF_BM25(t, d);
            terms[1].nextDoc := nextDoc(t, d);
            REHEAP(terms); // restore the heap property for terms
        }
        if(score > results[1].score) {
            results[1].docid := d;
            results[1].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

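The complexity of this algorithm is `Theta(N_q \cdot log(n) + N_q \cdot log(k))`, where `N_q` is the total number of postings for the query terms. The first term accounts for the REHEAP operations on terms, the second for the REHEAP operations on results.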
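
As a rough illustration of the two-heap algorithm above, here is a minimal Python sketch. It is not the course's reference implementation. The names `rank_bm25_daat_heaps` and `tf_bm25`, the in-memory `postings` layout (term mapped to a docid-sorted list of `(docid, term_frequency)` pairs), the `doc_len` dictionary, the parameters `k1 = 1.2` and `b = 0.75`, and the use of the natural logarithm are all assumptions made for the example. Instead of pre-filling the results heap with `k` zero-score entries, the sketch lets the heap grow up to size `k`, which should give the same ranking.

```python
import heapq
import math


def tf_bm25(f_td, d_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (k1 and b are assumed defaults)."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * d_len / avg_len))


def rank_bm25_daat_heaps(query, k, postings, doc_len):
    """Document-at-a-time BM25 ranking with a term heap and a results heap.

    postings: dict term -> list of (docid, tf) sorted by docid (assumed layout)
    doc_len:  dict docid -> document length
    Returns up to k (docid, score) pairs in decreasing order of score.
    """
    if k <= 0:
        return []
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs

    # Term heap entries: (next docid, term, position in the posting list).
    terms = [(postings[t][0][0], t, 0) for t in set(query) if postings.get(t)]
    heapq.heapify(terms)

    results = []  # min-heap of (score, docid); the root is the weakest result
    while terms:
        d = terms[0][0]  # smallest docid not yet scored
        score = 0.0
        # Consume every term whose next posting is for document d.
        while terms and terms[0][0] == d:
            _, t, pos = terms[0]
            plist = postings[t]
            idf = math.log(n_docs / len(plist))
            score += idf * tf_bm25(plist[pos][1], doc_len[d], avg_len)
            if pos + 1 < len(plist):
                # Advance this term to its next posting (the REHEAP step).
                heapq.heapreplace(terms, (plist[pos + 1][0], t, pos + 1))
            else:
                heapq.heappop(terms)  # posting list exhausted
        if score > 0:
            if len(results) < k:
                heapq.heappush(results, (score, d))
            elif score > results[0][0]:
                heapq.heapreplace(results, (score, d))

    ranked = sorted(results, key=lambda r: -r[0])
    return [(docid, score) for score, docid in ranked]
```

Calling `rank_bm25_daat_heaps(["heap", "query"], 2, postings, doc_len)` on a small hand-built index returns the two best `(docid, score)` pairs. Each posting triggers at most one operation on the term heap and at most one on the results heap, which is where the two logarithmic factors in the complexity bound come from.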

MaxScore

Term-at-a-Time Query Processing

Term-at-a-Time Algorithm

rankBM25_TermAtATime((t[1], t[2], ..., t[n]), k) {
    sort t in increasing order of N[t[i]]; // shortest posting list first
    acc := {}, acc' := {}; // initialize accumulators
      // acc holds the previous round, acc' the next
    acc[0].docid := infty; // end-of-list marker
    for i := 1 to n do {
        inPos := 0; //current pos in acc
        outPos := 0; // current position in acc'
        foreach document d in t[i]'s posting list do {
            while acc[inPos].docid < d do {
                acc'[outPos++] := acc[inPos++]; 
                //copy previous round to current for docs not containing t[i]
            }
            acc'[outPos].docid := d;
            acc'[outPos].score := log(N/N[t[i]]) * TF_BM25(t[i], d);
            if(acc[inPos].docid == d) {
                acc'[outPos].score += acc[inPos++].score; 
            }
            outPos++;
        }
        while acc[inPos].docid < infty do { // copy remaining acc to acc'
            acc'[outPos++] := acc[inPos++];
        }
        acc'[outPos].docid := infty; // end-of-list marker
        swap acc and acc';
    }
    return the top k items of acc; //select using heap
}
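
The accumulator merge in the pseudo-code above can be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions: the function name `rank_bm25_taat`, the `postings` layout (term mapped to a docid-sorted list of `(docid, term_frequency)` pairs), the `doc_len` dictionary, the BM25 parameters `k1 = 1.2` and `b = 0.75`, and the natural logarithm are choices made for the example and are not taken from the slides.

```python
import heapq
import math

INFTY = float("inf")  # end-of-list marker for accumulator arrays


def rank_bm25_taat(query, k, postings, doc_len, k1=1.2, b=0.75):
    """Term-at-a-time BM25 ranking with docid-sorted accumulators.

    postings: dict term -> list of (docid, tf) sorted by docid (assumed layout)
    doc_len:  dict docid -> document length
    Returns up to k (docid, score) pairs in decreasing order of score.
    """
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs

    # Process terms in increasing order of posting-list length.
    terms = sorted((t for t in set(query) if postings.get(t)),
                   key=lambda t: len(postings[t]))

    acc = [(INFTY, 0.0)]  # (docid, score) pairs, terminated by the marker
    for t in terms:
        plist = postings[t]
        idf = math.log(n_docs / len(plist))
        new_acc = []  # plays the role of acc' in the pseudo-code
        in_pos = 0
        for d, f_td in plist:
            # Copy accumulators for documents that do not contain t.
            while acc[in_pos][0] < d:
                new_acc.append(acc[in_pos])
                in_pos += 1
            norm = (1 - b) + b * doc_len[d] / avg_len
            score = idf * f_td * (k1 + 1) / (f_td + k1 * norm)
            if acc[in_pos][0] == d:
                # d already has an accumulator: add its previous score.
                score += acc[in_pos][1]
                in_pos += 1
            new_acc.append((d, score))
        # Copy whatever is left of the previous round.
        while acc[in_pos][0] < INFTY:
            new_acc.append(acc[in_pos])
            in_pos += 1
        new_acc.append((INFTY, 0.0))
        acc = new_acc  # the "swap" step

    # Select the top k accumulators by score, ignoring the end marker.
    return heapq.nlargest(k, acc[:-1], key=lambda a: a[1])
```

Because both the posting list and the accumulator array are sorted by docid, each round is a single linear merge. The final `heapq.nlargest` call stands in for the heap-based top-`k` selection mentioned in the last line of the pseudo-code.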

Accumulator Pruning

More Accumulator Pruning

Assigning Accumulators

Precomputing Score Contributions

Light-Weight Structures

Concordance Lists

In-Class Exercise

Given a list of ordered pairs, `S`, suggest pseudo-code to compute `G(S)`. What is the runtime of your code?

Post your solutions to the Oct 25 In-Class Exercise.