Merge Based Index Construction - Query Processing




CS267

Chris Pollett

Apr 4, 2022

Outline

Introduction

Merge-based Index Construction

Merge-Based Index Pseudocode

buildIndex_mergeBase(inputTokenizer, memoryLimit) 
{
    n := 0;
    position := 0;
    memoryConsumption := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T;
        create new entry if necessary;
        append new position to T's posting list;
        position++;
        memoryConsumption++;
        if (memoryConsumption > memoryLimit) {
            createIndexPartition();
        }
    }
    if (memoryConsumption > 0) {
        createIndexPartition();
    }
    mergeIndexPartitions([I[0], ..., I[n-1]]);
        // produces the final index I_final
}

createIndexPartition()
{
    create empty on disk inverted file I[n];
    sort in-memory dictionary entries in lex order;
    for each term T in dictionary {
        add T's posting list to I[n];
    }
    delete all in-memory posting lists;
    write the dictionary to disk;
    reset the in-memory dictionary;
    memoryConsumption := 0;
    n++;
}

mergeIndexPartitions([I[0], ..., I[n-1]])
{
    create empty Inverted File I_final;
    for (k = 0; k < n; k++) {
        open partition I[k] for sequential processing;
    }
    currentIndex := I[0]; // any non-nil value, so the loop body runs at least once
    while (currentIndex != nil) {
        currentIndex = nil;
        for (k = 0; k < n; k++) {
            if (I[k] still has terms left) {
                if (currentIndex == nil || 
                    I[k].currentTerm < currentTerm) {
                    currentIndex := I[k];
                    currentTerm := I[k].currentTerm;
                }     
            }
        }
        if (currentIndex != nil) {
            I_final.addPostings(currentTerm,
                currentIndex.getPostings(currentTerm));
            currentIndex.advanceToNextTerm();
        }
    }
    delete I[0], ..., I[n-1];
}
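The pseudocode above finds the smallest current term with a linear scan over the partitions; the same merge can be done with a heap. Below is a hedged Python sketch using `heapq`, where each partition is a sorted list of `(term, postings)` pairs held in memory (the pair representation and omission of file I/O are simplifications, not the lecture's design).

```python
import heapq

def merge_index_partitions(partitions):
    """Merge sorted partitions into one sorted (term, postings) list.
    Postings for the same term are concatenated; lower-numbered partitions
    hold earlier positions, so the heap's tie-break on k preserves order."""
    iters = [iter(p) for p in partitions]
    heap = []  # entries: (term, partition_index, postings)
    for k, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], k, first[1]))
    final_index = []
    while heap:
        term, k, postings = heapq.heappop(heap)
        if final_index and final_index[-1][0] == term:
            final_index[-1][1].extend(postings)   # same term from another partition
        else:
            final_index.append((term, list(postings)))
        nxt = next(iters[k], None)                # advance partition k
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], k, nxt[1]))
    return final_index
```

With a heap the per-term selection costs O(log n) instead of O(n), which matters when many partitions are merged at once.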

Remarks on Merge Algorithm

Query Processing

Query Processing for Ranked Retrieval

Okapi BM25
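The notes for this slide give no formula; as a reference point, the usual Okapi BM25 score (the quantity the heap-based algorithm below accumulates per document) can be written as

`Score_BM25(d) = sum_{t in q} log(N/N_t) * TF_BM25(t, d)`, where
`TF_BM25(t, d) = f_{t,d} * (k_1 + 1) / (f_{t,d} + k_1 * ((1 - b) + b * l_d / l_avg))`

with `f_{t,d}` the frequency of t in d, `N` the number of documents, `N_t` the number of documents containing t, `l_d` the length of d, `l_avg` the average document length, and `k_1`, `b` tuning parameters (commonly `k_1 ≈ 1.2`, `b ≈ 0.75`). This is the standard formulation; the lecture's exact parameter choices are not recorded here.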

BM25 Example

Quiz

Which of the following is true?

  1. The insert-at-back heuristic is used as part of hash-based dictionary construction.
  2. Hash-based dictionaries are better if we want to support prefix queries.
  3. A per-term index is a B-tree used for posting lists.

Document-at-a-Time Query Processing

Binary Heaps

Query Processing with Heaps

We can overcome the two limitations of our first algorithm for ranked retrieval by using two heaps: a min-heap over the query terms, keyed for each term t on the next document that contains t, and a min-heap maintaining the top `k` search results seen so far:

rankBM25_DocumentAtATime_WithHeaps((t[1], ..., t[n]), k) {
    // create a min-heap for top k results
    for(i = 1 to k) {
       results[i].score := 0;
    } 
    // create a min-heap for terms
    for (i = 1 to n) { 
        terms[i].term := t[i];
        terms[i].nextDoc := nextDoc(t[i], -infty);
    }
    sort terms in increasing order of nextDoc; // establishes the heap property
    while (terms[1].nextDoc < infty) {
        d := terms[1].nextDoc;
        score := 0;
        while(terms[1].nextDoc == d) {
            t := terms[1].term;
            score += log(N/N_t) * TF_BM25(t,d);
            terms[1].nextDoc := nextDoc(t,d);
            REHEAP(terms); // restore heap property for terms;
        }
        if(score > results[1].score) {
            results[1].docid := d;
            results[1].score := score;
            REHEAP(results); // restore the heap property for results
        }
    }
    remove from results all items with score = 0;
    sort results in decreasing order of score;
    return results;
}

The complexity of this algorithm is `\Theta(N_q \cdot \log(n) + N_q \cdot \log(k))`, where `N_q` is the total number of postings for all query terms.
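As an illustration, here is a runnable Python sketch of the two-heap algorithm above. The toy in-memory index (`term -> {docid: freq}`), the document-length table, and the default parameters `k1 = 1.2`, `b = 0.75` in the BM25 term weight are standard assumptions for the sketch, not necessarily the lecture's exact setup.

```python
import heapq
import math

def rank_bm25_document_at_a_time(index, doc_len, avg_len, query_terms, k,
                                 k1=1.2, b=0.75):
    """Score documents term-heap-first; keep the top k in a results min-heap."""
    N = len(doc_len)
    results = [(0.0, -1)] * k     # min-heap of (score, docid), k placeholder slots
    postings = {t: sorted(index.get(t, {}).items()) for t in query_terms}
    terms = []                    # min-heap of (next_docid, term, postings_offset)
    for t in query_terms:
        if postings[t]:
            heapq.heappush(terms, (postings[t][0][0], t, 0))
    while terms:
        d = terms[0][0]           # smallest next document among all terms
        score = 0.0
        while terms and terms[0][0] == d:      # accumulate every term hitting d
            _, t, i = heapq.heappop(terms)
            f = postings[t][i][1]
            tf = f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len[d] / avg_len))
            score += math.log(N / len(postings[t])) * tf
            if i + 1 < len(postings[t]):       # advance t to its next document
                heapq.heappush(terms, (postings[t][i + 1][0], t, i + 1))
        if score > results[0][0]:              # beats the current k-th best
            heapq.heapreplace(results, (score, d))
    # drop unused placeholder slots; report best first
    return sorted(((s, d) for s, d in results if s > 0), reverse=True)
```

Both heaps follow the pseudocode's structure: popping and re-pushing a term costs `log(n)`, and each candidate document pays at most one `log(k)` update to the results heap.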

MaxScore

Term-at-a-Time Query Processing