Introduction

Last week, we were looking at techniques to build inverted indexes.
We first focused on sort-based and hash-based dictionary construction techniques.
We looked at the dictionary-as-a-string approach to sort-based dictionaries, and at various heuristics for collision resolution in hash-based dictionary construction, such as move-to-front and insert-at-back.
We talked about ways to lay out posting lists to make it easier to scan for a particular offset, such as the use of per-term indexes.
We then went over a sort-based, static, inverted index construction algorithm: as we process documents we emit pairs (termID, offset); once we have processed all the documents, we sort the pairs by termID, breaking ties by offset, and collapse the sorted pairs into posting lists as we scan them.
If the pairs are stored on disk, then to do the sorting we use an n-way merge sort, where n is the number of RAM blocks we have for sorting.
I.e., we first read in n blocks at a time, sort them, and write them back to disk, to create sorted runs of length n. We then merge n-1 such runs at a time (keeping one block in memory from each run plus an output block) to make runs of length `n cdot (n-1)`, and keep going until everything is sorted.
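The run-then-merge idea above can be sketched in Python. This is only an illustration under simplifying assumptions: in-memory lists stand in for disk blocks, run lengths are counted in pairs rather than blocks, and the helper names (`sorted_runs`, `merge_pass`, `external_sort`, `to_posting_lists`) are hypothetical.

```python
import heapq

def sorted_runs(pairs, run_length):
    """Cut the pairs into chunks of run_length and sort each chunk (one run)."""
    return [sorted(pairs[i:i + run_length])
            for i in range(0, len(pairs), run_length)]

def merge_pass(runs, fan_in):
    """Merge groups of fan_in sorted runs into longer sorted runs."""
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def external_sort(pairs, n):
    """Sort (termID, offset) pairs with runs of length n and (n-1)-way merges."""
    if n < 3:
        raise ValueError("need n >= 3 so that each merge combines >= 2 runs")
    runs = sorted_runs(pairs, n)
    while len(runs) > 1:
        runs = merge_pass(runs, n - 1)
    return runs[0] if runs else []

def to_posting_lists(sorted_pairs):
    """Collapse sorted (termID, offset) pairs into termID -> list of offsets."""
    index = {}
    for term_id, offset in sorted_pairs:
        index.setdefault(term_id, []).append(offset)
    return index

pairs = [(2, 5), (1, 3), (2, 1), (1, 0), (3, 4), (1, 7), (3, 2)]
index = to_posting_lists(external_sort(pairs, 3))
```

Because tuples compare first by termID and then by offset, a plain sort already breaks ties the way the algorithm requires.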
We begin today by describing a second approach to disk-based index construction directly using merging.

Merge-based Index Construction

This approach is a direct extension of our in-memory hash-based index approach.
If the index is small enough to fit in memory the two approaches will be the same.
If the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition.
The in-memory index is wiped, and we continue indexing as if from scratch.
After going through the collection, one has a sequence of partitions on disk.
In this set-up terms serve as their own IDs, and the posting lists in each partition are stored in lexicographical order of their terms.
The final stage of the algorithm is then to merge all of the partitions into the final index.

Merge-Based Index Pseudocode

buildIndex_mergeBase(inputTokenizer, memoryLimit) 
{
    n := 0;
    position := 0;
    memoryConsumption := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T;
        create new entry if necessary;
        append new position to T's posting list;
        position++;
        memoryConsumption++;
        if (memoryConsumption > memoryLimit) {
            createIndexPartition();
        }
    }
    if (memoryConsumption > 0) {
        createIndexPartition();
    }
    mergeIndexPartitions(I[0],...,I[n-1]); // to make final index I_final
}

createIndexPartition()
{
    create empty on disk inverted file I[n];
    sort in-memory dictionary entries in lex order;
    for each term T in dictionary {
        add T's posting list to I[n];
    }
    delete all in memory posting lists;
    write the dictionary to disk;
    reset the in-memory dictionary;
    memoryConsumption := 0;
    n++;
}

mergeIndexPartitions([I[0], ..., I[n-1]])
{
    create empty Inverted File I_final;
    for (k = 0; k < n; k++) {
        open partition I[k] for sequential processing;
    }
    currentIndex := I[0]; // anything other than nil, so we go through the loop once
    while (currentIndex != nil) {
        currentIndex := nil;
        for (k = 0; k < n; k++) {
            if (I[k] still has terms left) {
                if (currentIndex == nil || 
                    I[k].currentTerm < currentTerm) {
                    currentIndex := I[k];
                    currentTerm := I[k].currentTerm;
                }     
            }
        }
        if (currentIndex != nil) {
            I_final.addPostings(currentTerm,
                currentIndex.getPostings(currentTerm));
            currentIndex.advanceToNextTerm();
        }
    }
    delete I[0], ..., I[n-1];
}
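The pseudocode above can be approximated by the following Python sketch. It makes several assumptions: partitions are kept as in-memory lists of (term, postings) pairs instead of on-disk files, memory consumption is counted in postings as in the pseudocode, and `heapq.merge` replaces the linear scan over the partitions that the pseudocode uses to find the smallest current term. The function names are hypothetical.

```python
import heapq

def build_partitions(tokens, memory_limit):
    """Index tokens, flushing a term-sorted partition when memory runs out."""
    partitions, index, used = [], {}, 0
    for position, term in enumerate(tokens):
        index.setdefault(term, []).append(position)
        used += 1
        if used > memory_limit:
            partitions.append(sorted(index.items()))  # lexicographic by term
            index, used = {}, 0                       # start again from scratch
    if used > 0:
        partitions.append(sorted(index.items()))
    return partitions

def merge_partitions(partitions):
    """Merge term-sorted partitions into one list of (term, postings)."""
    # Tag each entry with its partition number so that equal terms are
    # visited in partition order, which keeps positions ascending.
    tagged = [[(term, k, postings) for term, postings in part]
              for k, part in enumerate(partitions)]
    final = []
    for term, _, postings in heapq.merge(*tagged):
        if final and final[-1][0] == term:
            final[-1][1].extend(postings)    # same term seen in a later partition
        else:
            final.append((term, list(postings)))
    return final

tokens = "to be or not to be that is the question".split()
partitions = build_partitions(tokens, memory_limit=3)
final_index = merge_partitions(partitions)
```

With a memory limit large enough to hold everything, `build_partitions` produces a single partition and the merge is a no-op, matching the remark that the approach then coincides with in-memory indexing.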

Remarks on Merge Algorithm

The algorithm takes time which grows only slightly more than linearly in the size of the collection.
You need to be able to keep at least a few pages from each partition in memory at a time, so your RAM limits the total size of the collection you can index.
Even if you can keep one page of each partition in memory, being able to keep more will often substantially improve your performance.

Query Processing

We now look at methods to implement query processing using these data structures.
Many IR systems implement some kind of Boolean Model for retrieval.
We will look at ranked retrieval and lightweight structure (e.g., all documents which have "drug" and "aspirin" within 10 words of each other) as retrieval methods.
For now, we will be more interested in the efficiency of retrieval over effectiveness measures of results.

Query Processing for Ranked Retrieval

Boolean and ranked retrieval are often complementary: one typically determines a set of documents using a Boolean retrieval method and then ranks those documents according to some ranking method.
Given a query vector ("greek", "philosophy", "stoicism"), there are two natural ways one could interpret the query: we are looking for the documents that have all of these terms; or we are looking for the documents that have any of these terms, where matching more terms is better than matching fewer.
The former case, called conjunctive query processing, is favored by search engines; the latter, called disjunctive query processing, by more traditional IR systems.
In TREC 45, according to the book, 7834 documents match the disjunctive interpretation, while only a single document matches the conjunctive version.
The single conjunctive match was about a Mexican actor's film projects, and so was not relevant.
Many of the disjunctive results were relevant. One reason for this is that if you are talking about stoicism, a reader can already be presumed to know that it is a philosophy and that it originated in Greece.
The book prefers disjunctive query processing, for the above reason, and because the authors feel that adding terms to the query should not hurt search results.
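The difference between the two interpretations can be illustrated with a small Python sketch. The toy index and its document IDs are made up for illustration (they are not the TREC data), and the function names are hypothetical.

```python
def conjunctive_matches(index, query):
    """Documents that contain every query term (Boolean AND)."""
    doc_sets = [set(index.get(term, ())) for term in query]
    return set.intersection(*doc_sets) if doc_sets else set()

def disjunctive_matches(index, query):
    """Documents that contain at least one query term (Boolean OR)."""
    return set().union(*(index.get(term, ()) for term in query))

# Hypothetical toy index: term -> list of IDs of documents containing it.
toy_index = {
    "greek": [1, 2, 5],
    "philosophy": [2, 3, 5],
    "stoicism": [3, 5],
}
query = ("greek", "philosophy", "stoicism")
all_terms = conjunctive_matches(toy_index, query)
any_term = disjunctive_matches(toy_index, query)
```

Here the conjunctive set is always a subset of the disjunctive one, which is why the disjunctive interpretation leaves it to the ranking function to reward documents that match more terms.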

Okapi BM25

Once we have retrieved our documents, we need to score and sort them.
For now, we will assume we are doing this using Okapi BM25 (BM25 for short).
Later in the semester we show how sophisticated scoring functions can be derived. For BM25, we are just going to define it and give some intuition. First, the definition:
`S\c\o\r\e_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
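`TF_(BM25)(t,d) = frac(f_(t,d) cdot (k_1 + 1))(f_(t,d) + k_1 cdot ((1-b) + b cdot (l_d / l_(avg))))`
Here `N` is the number of documents in the collection, `N_t` is the number of documents containing `t`, `f_(t,d)` is the number of occurrences of `t` in `d`, `l_d` is the length of `d`, and `l_(avg)` is the average document length.
The two free parameters in the above, `k_1` and `b`, are usually set to `1.2` and `0.75` respectively.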
`b` controls the degree of document length normalization.
`k_1` caps the score contribution of an individual query term as:
`lim_(f_(t,d) -> infty) TF_(BM25)(t,d) = k_1 +1`.
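The definition above translates almost directly into Python. The following is a minimal sketch, not a tuned implementation: the function and parameter names are hypothetical, and the natural logarithm is an assumption (the formula only says `log`; a different base rescales every score by the same constant and does not change the ranking).

```python
import math

def idf(N, N_t):
    """Inverse document frequency: log(N / N_t). Natural log is an assumption."""
    return math.log(N / N_t)

def tf_bm25(f_td, l_d, l_avg, k1=1.2, b=0.75):
    """BM25 term-frequency factor for a term occurring f_td times in d."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * (l_d / l_avg)))

def score_bm25(query, doc_tf, l_d, l_avg, N, doc_freq, k1=1.2, b=0.75):
    """Sum IDF(t) * TF_BM25(t, d) over the query terms known to the collection.

    doc_tf maps term -> occurrences in the document being scored;
    doc_freq maps term -> number of documents containing the term.
    """
    return sum(idf(N, doc_freq[t]) * tf_bm25(doc_tf.get(t, 0), l_d, l_avg, k1, b)
               for t in query if t in doc_freq)

# A document of average length that contains each query term once.
example = score_bm25(("t1", "t2"), {"t1": 1, "t2": 1}, l_d=100, l_avg=100,
                     N=1000, doc_freq={"t1": 10, "t2": 10})
```

Note that a term absent from the document contributes zero, since its TF factor has `f_(t,d) = 0` in the numerator.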

BM25 Example

Suppose we have a two term query `q = (t_1, t_2)` where `N_(t_1) approx N_(t_2)`, and we are scoring two documents `d_1` and `d_2` of average length `l_(d_1) = l_(d_2) =l_(avg)`. Suppose `d_1` contains one occurrence of each of `t_1` and `t_2`, while `d_2` contains `10` occurrences of `t_1` and no occurrences of `t_2`.
Then we have: `S\c\o\r\e_(BM25)(q, d_1) = log(frac(N)(N_(t_1))) cdot (2 cdot frac(1 cdot (k_1 + 1))(1 + k_1)) approx 2 cdot log(frac(N)(N_(t_1)))`
`S\c\o\r\e_(BM25)(q, d_2) = log(frac(N)(N_(t_1))) cdot (1 cdot frac(10 cdot (k_1 + 1))(10 + k_1)) approx 1.96 cdot log(frac(N)(N_(t_1)))`
assuming `k_1 = 1.2`.
Later, when doing disjunctive query processing, we will use the upper bound `k_1 + 1` on each term's TF contribution to ignore postings that we know a priori cannot push the corresponding document into the top search results.
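The arithmetic in the example can be checked with a few lines of Python. This sketch assumes, as the example does, that both documents have average length, so the length normalization term equals 1 and `b` drops out; the helper name is hypothetical.

```python
K1 = 1.2

def tf_bm25_avg_length(f_td, k1=K1):
    """BM25 TF factor for a document of average length (l_d == l_avg)."""
    return f_td * (k1 + 1) / (f_td + k1)

# Multiplier of log(N / N_t1) for each document in the example.
factor_d1 = tf_bm25_avg_length(1) + tf_bm25_avg_length(1)    # one of each term
factor_d2 = tf_bm25_avg_length(10) + tf_bm25_avg_length(0)   # ten t_1, no t_2
cap = K1 + 1                                                 # limit of the TF factor

print(f"d_1 factor: {factor_d1:.3f}")   # 2.2 / 2.2 twice, i.e. 2.000
print(f"d_2 factor: {factor_d2:.3f}")   # 22 / 11.2, i.e. 1.964
print(f"per-term cap: {cap:.1f}")       # 2.2
```

So `d_1` narrowly outscores `d_2`, and no number of additional occurrences of `t_1` alone could lift a document's multiplier above the cap of `k_1 + 1 = 2.2`.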

Quiz

Which of the following is true?

Hash-based dictionaries are better if we want to support prefix queries.
The move-to-front heuristic is used as part of hash-based dictionary construction.
Self-indexing is a synonym for using a B-tree for posting lists.

Document-at-a-Time Query Processing

Document-at-a-Time Query Processing is the most popular form of query processing for ranked retrieval.
In this method all matching documents are enumerated, one after the other, and a score is computed for each.
Once all documents have been processed, the documents are sorted according to their score and the top `k` results are returned.
To use BM25 in such an algorithm, we can take our code for rankCosine and simply replace the line where the score is calculated with the BM25 score calculation.
The overall time complexity of this algorithm is `Theta(m cdot n + m cdot log m)`, where `n` is the number of query terms and `m` is the number of documents returned.
The `m cdot n` comes from the while loop, the `m cdot log m` from doing a sort of the result.
Notice sorting all `m` results is a waste if we are only interested in returning the top `k`.
Also, in computing the score, as well as the min over nextDoc, we need to iterate over all terms in the query regardless of whether the document has them or not.
We can use heaps to address these two problems.
In order to get a bound on the resulting algorithm, we introduce a little notation.
Let `N_q = N_(t_1) + N_(t_2) + cdots + N_(t_n)`. Then `m` is somewhere between `N_q/n` and `N_q`, and we also get the run time bound `Theta(N_q cdot n + N_q cdot log N_q)`. In practice, a matching document will tend to contain only one of the query terms, and so `m` will tend to be closer to `N_q` than to `N_q/n`.
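As a rough illustration of document-at-a-time processing, the Python sketch below enumerates matching documents in docid order and keeps only the best `k` in a min-heap instead of sorting all `m` results. It is not the rankCosine code referred to above: the posting-list layout, the caller-supplied `score_term` function, and all names are assumptions of this sketch. It addresses only the first of the two problems (the full sort); it still loops over every query term for each document.

```python
import heapq

def rank_daat(postings, query, k, score_term):
    """Return the top-k (score, docid) pairs, best first.

    postings maps term -> list of (docid, f_td) pairs sorted by docid.
    score_term(term, f_td) gives one term's score contribution, for
    example IDF(t) * TF_BM25(t, d).
    """
    if k <= 0:
        return []
    terms = [t for t in query if postings.get(t)]
    cursor = {t: 0 for t in terms}       # next unread posting for each term
    top = []                             # min-heap holding at most k results
    while True:
        heads = [postings[t][cursor[t]][0] for t in terms
                 if cursor[t] < len(postings[t])]
        if not heads:
            break                        # every posting list is exhausted
        d = min(heads)                   # next matching document
        score = 0.0
        for t in terms:
            i = cursor[t]
            if i < len(postings[t]) and postings[t][i][0] == d:
                score += score_term(t, postings[t][i][1])
                cursor[t] = i + 1
        if len(top) < k:
            heapq.heappush(top, (score, d))
        elif (score, d) > top[0]:
            heapq.heapreplace(top, (score, d))   # evict the current worst
    return sorted(top, reverse=True)

toy_postings = {"a": [(1, 1), (3, 2)], "b": [(2, 1), (3, 1), (4, 5)]}
best_two = rank_daat(toy_postings, ("a", "b"), 2, lambda term, f_td: f_td)
```

Keeping a heap of size `k` replaces the `m cdot log m` sorting cost with at most `m cdot log k` heap work plus a final sort of `k` items.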

Binary Heaps

Recall a (binary) heap is a binary tree that satisfies: (1) the empty binary tree is a heap; (2) a non-empty binary tree `T` is a heap if (a) `T` is completely filled on all levels, except possibly the deepest one, (b) `T`'s deepest level is filled from left to right, and (c) for each node `v` in `T`, the value stored in `v` is less than or equal to the value stored in any of its children.
Heaps can be represented either using trees or using arrays; the array representation is almost always used because it is faster.
Heaps support an operation called REHEAP, in which we replace the root value of the tree with a new value and push the new value down the heap until the heap property is restored.
REHEAP can be done in `O(log n)` steps.
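A minimal Python sketch of REHEAP on an array-based min-heap follows. It assumes the usual array layout in which the children of index `i` sit at indexes `2i + 1` and `2i + 2`; the function name `reheap` is chosen here for illustration.

```python
def reheap(heap, new_value):
    """Replace the root of an array-based min-heap and restore the heap property."""
    if not heap:
        raise IndexError("reheap on an empty heap")
    n = len(heap)
    i = 0
    heap[0] = new_value
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and heap[left] < heap[smallest]:
            smallest = left
        if right < n and heap[right] < heap[smallest]:
            smallest = right
        if smallest == i:
            return                       # both children are >= the new value
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest                     # continue one level further down

heap = [1, 3, 2, 7, 4, 5]
reheap(heap, 6)                          # the old root 1 is replaced by 6
```

Each iteration moves one level down the tree, so the loop runs at most as many times as the tree is deep, which gives the `O(log n)` bound. Python's standard library offers `heapq.heapreplace` for the same purpose.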

Merge Based Index Construction - Query Processing
