Index Construction

Index construction can be classified into two categories: in-memory and disk-based.
As the former serves as a basis for the latter, we look at in-memory techniques first before talking about disk-based approaches.
We will focus our attention on schema-independent indexes, but the techniques generalize to other index types.

In-memory Index Construction

We assume for now that our text collection is small enough to fit in memory.
To index, we need:
- a dictionary that allows efficient single-term lookup and insertion operations
- an extensible list data structure that is used to store the postings for each term.

buildIndex

buildIndex (inputTokenizer)
{
   position := 0;
   while (inputTokenizer.hasNext()) {
      T := inputTokenizer.getNext();
      obtain dictionary entry for T; create new entry, if necessary;
      append new posting position to T's posting list;
      position ++;
   }
   sort all dictionary entries in lex order
   for each term T in the dictionary {
      write T's postings list to disk
   }
   write the dictionary to disk
   return;
}
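
As a rough illustration, the pseudocode above could be sketched in Python as follows. This is a minimal sketch, not the book's implementation: the function name build_index is made up for this example, a defaultdict stands in for the dictionary, ordinary Python lists stand in for the extensible postings lists, and the index is returned rather than written to disk.

```python
from collections import defaultdict


def build_index(tokens):
    """Build an in-memory schema-independent index from an iterable of tokens.

    Returns (sorted_terms, postings), where postings maps each term to the
    list of positions at which it occurs.
    """
    postings = defaultdict(list)  # dictionary: term -> extensible postings list
    for position, term in enumerate(tokens):
        # Looking up a missing term creates its entry; then append the posting.
        postings[term].append(position)
    sorted_terms = sorted(postings)  # lexicographic order, as in the pseudocode
    return sorted_terms, dict(postings)


terms, postings = build_index("to be or not to be".split())
for term in terms:
    print(term, postings[term])
```

Positions are numbered from 0, matching the pseudocode's initial value of position.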

Index-Time Dictionary

Recall that we use the dictionary at two different times: at index time and at query time.
At index time we need to perform efficient term lookups (so that we can find the postings list to append to) and term insertions (for new entries).
The book gives an example where they use C++ STL data structures to do this.
The relevant structures are map, which is tree-based, and hash_map, which uses a hash table.
On a 10,000-document subset of GOV2, map took on average 620 ns per lookup, whereas hash_map took 240 ns per lookup on their hardware.
If this scaled to the complete GOV2 collection of roughly 44 billion tokens, then the time spent doing just a single lookup for each token would total about 3 hours.
Circa 2010, the fastest publicly available IR systems took less than 4 hours to completely index GOV2, so they must be doing these operations considerably faster.
So how can we customize our dictionary to speed things up?
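
The three-hour estimate can be checked with a few lines of arithmetic. The sketch below assumes a GOV2 token count of about 44 billion, which is the figure the three-hour estimate implies for 240 ns per lookup. The function name total_lookup_hours is made up for this example.

```python
def total_lookup_hours(token_count, ns_per_lookup):
    """Total time, in hours, to perform one dictionary lookup per token."""
    seconds = token_count * ns_per_lookup * 1e-9
    return seconds / 3600


GOV2_TOKENS = 44e9  # assumed approximate token count for GOV2

for name, ns in [("map", 620), ("hash_map", 240)]:
    print(f"{name}: {total_lookup_hours(GOV2_TOKENS, ns):.1f} hours")
```

Under this assumption, lookups alone with hash_map cost almost as much as the whole indexing time of the fastest systems, and map costs more than twice that.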

Speeding up the Dictionary

Recall we said that term frequencies follow a Zipfian distribution.
So 90% of all term occurrences correspond to the 10,000 most frequent terms.
If we are using a hash table where collisions are resolved by chaining, then it is important to keep the most frequent terms near the start of the chain they belong to.
Two heuristics to do this are:
1. The insert-at-back heuristic: when adding a new item to a chain, add it to the end of the chain. If a term was discovered later, chances are it is less frequent.
2. The move-to-front heuristic: when a term lookup is done and the item is not found at the front of the chain, move it to the front. This takes O(1) overhead and tends to keep frequent items at the front of the chain.
Using move-to-front, the dictionary implementation is largely insensitive to the size of the hash table. Even a small table (around 65,000 slots) can deal with millions of terms. Further, using these heuristics the book found the average lookup time was around 90 ns, so faster than the STL implementations.
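
The two heuristics can be sketched in Python as below. This is only an illustration under stated assumptions: the class name ChainedDictionary and its methods are made up for this example, and each chain is a Python list rather than a true linked list, so moving an entry to the front costs time proportional to the chain length here instead of O(1).

```python
class ChainedDictionary:
    """Hash table with chaining, insert-at-back and move-to-front."""

    def __init__(self, table_size, hash_fn=hash):
        self.table_size = table_size
        self.hash_fn = hash_fn
        # One chain per slot; each entry is a [term, postings] pair.
        self.chains = [[] for _ in range(table_size)]

    def lookup_or_insert(self, term):
        """Return the postings list for term, creating an entry if needed."""
        chain = self.chains[self.hash_fn(term) % self.table_size]
        for i, entry in enumerate(chain):
            if entry[0] == term:
                if i > 0:
                    # Move-to-front: found, but not at the head of the chain.
                    chain.insert(0, chain.pop(i))
                return entry[1]
        postings = []
        chain.append([term, postings])  # insert-at-back for new terms
        return postings

    def chain_terms(self, slot):
        """Terms currently stored in the given slot, front to back."""
        return [entry[0] for entry in self.chains[slot]]


# Small example: table of size 2, integer terms, identity hash function.
d = ChainedDictionary(2, hash_fn=lambda n: n)
for position, term in enumerate([1, 3, 5, 5, 5, 3]):
    d.lookup_or_insert(term).append(position)
print(d.chain_terms(1))
```

Every term in this example hashes to slot 1, so the printed chain shows how repeated lookups reorder a single chain.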

In-Class Exercise

Suppose we have a hash-table of size 3, our terms are just integers, and we use the hash function n%3.
Suppose our corpus looks like: 1 2 3 4 5 6 6 6 9 5 2 3.
Assume we are using both the insert-at-back and move-to-front heuristics.
Show the sequence of dictionary operations (inserts and lookups) used when indexing this corpus and what the final dictionary would look like.
Post your solutions to the Mar 17 In-Class Exercise Thread

Extensible in-memory postings lists

One quick way to implement postings lists would be to just use a linked list.
The drawback is that for each 32- or 64-bit posting, you need to maintain another 32- or 64-bit pointer.
This can almost double the space requirements of your postings lists.
If you use fixed-size arrays to store the postings lists, you need at least two passes over the data: one to determine the size of each array, and a second to store postings into them.
Another approach is somewhat like vectors in Java: on seeing a new term, you initially allocate space for, say, 10 postings. When this is exhausted, you allocate a new array whose size is some multiple of the old array size, copy the old elements to the new array, and then free the memory for the old array.
If the multiple is 2, then experimentally the book says the wasted memory runs around 25%.
A final approach, which the book says performed best in their experiments, is to use a linked list with grouping.
In this approach, the first node of a postings list is large enough to hold, say, 16 postings. When this is exhausted, a new node is created whose size is some multiple of the total space currently allocated for this list, up to some limit, say 256 postings. Storing more than one posting per node is sometimes called using an unrolled linked list.
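
A minimal sketch of the grouped linked-list idea follows. The class name GroupedPostingsList is made up for this example, and the growth factor of 1.0 (each new node is as large as everything allocated so far) is an assumption chosen for illustration, not a value taken from the book. A Python list of nodes stands in for the next pointers of a real linked list.

```python
class GroupedPostingsList:
    """Postings list stored as a chain of nodes of growing capacity."""

    def __init__(self, first_capacity=16, growth=1.0, limit=256):
        self.first_capacity = first_capacity
        self.growth = growth      # assumed multiple of the total allocated space
        self.limit = limit        # maximum capacity of a single node
        self.nodes = []           # each node holds up to capacities[i] postings
        self.capacities = []
        self.total_allocated = 0

    def append(self, posting):
        if not self.nodes or len(self.nodes[-1]) == self.capacities[-1]:
            # Last node is full (or there is none yet): allocate a new node.
            if not self.nodes:
                capacity = self.first_capacity
            else:
                capacity = min(self.limit,
                               max(1, int(self.growth * self.total_allocated)))
            self.nodes.append([])
            self.capacities.append(capacity)
            self.total_allocated += capacity
        self.nodes[-1].append(posting)

    def __iter__(self):
        for node in self.nodes:
            yield from node


pl = GroupedPostingsList()
for position in range(1000):
    pl.append(position)
print(pl.capacities)
```

Compared with one pointer per posting, this layout pays for one pointer per node, and the cap on node size bounds the unused space at the end of the list.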

Sort-based Index Construction

We now begin to look at disk-based approaches.
Suppose we are in the schema-independent setting and are just reading our corpus.
As we read each term, we could imagine emitting a pair (termID, position) and writing it to disk.
When we have finished scanning our corpus, our collection of pairs is sorted by position.
To make lookup efficient in an inverted index, we want to re-sort these pairs by termID.
After re-sorting, the ordered pairs are in a format suitable for use by one of our in-memory dictionary structures.

Sort-based Index Construction Code

buildIndex_sortBased(inputTokenizer)
{
    position := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        termID := unique termID of T;
        write record R[position] := (termID, position) to disk;
        position++;
    }
    tokenCount := position;
    sort R[0], .., R[tokenCount-1] by first component; break ties with second component;
    perform a sequential scan of R[0], .., R[tokenCount-1] creating the final index;
    return;
}
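
The same steps can be sketched in Python, keeping the records in memory instead of on disk so that the logic is easy to follow. The function name build_index_sort_based and the choice to assign termIDs in order of first occurrence are assumptions made for this example.

```python
from itertools import groupby


def build_index_sort_based(tokens):
    """Emit (termID, position) records, sort them, and group them by termID."""
    term_ids = {}   # term -> globally unique termID
    records = []    # stands in for the on-disk records R[0], ..., R[tokenCount-1]
    for position, term in enumerate(tokens):
        term_id = term_ids.setdefault(term, len(term_ids))
        records.append((term_id, position))
    # Sort by termID; ties are broken by position because tuples compare
    # component by component.
    records.sort()
    # One sequential scan over the sorted records builds the postings lists.
    index = {
        term_id: [position for _, position in group]
        for term_id, group in groupby(records, key=lambda record: record[0])
    }
    return term_ids, index


term_ids, index = build_index_sort_based("to be or not to be".split())
print(index[term_ids["to"]])
```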

Disk-Based Sorting

To get the above to work we need to be able to implement disk-based sorting.
This is often done by reading n items into memory at a time, sorting them, and writing them back out as a sorted block.
This gives $\left\lceil \frac{\text{tokenCount}}{n} \right\rceil$ many blocks.
These sorted blocks are then typically merged using a multiway disk merge operation in logarithmically many passes.
Disk-based sorting can take a fair bit of space to store temporary files. For GOV2, the book says 492 GB compared to the 426 GB size of the original collection.
Keeping track of globally unique termIDs (so that we can decide whether a new entry needs to be created) can be memory-intensive; for GOV2 it takes more than 1 GB of RAM.
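
A small sketch of disk-based sorting is given below. It is an illustration only: the names external_sort, _write_run and _read_run are made up for this example, records are serialized with pickle for simplicity, and all sorted blocks are merged in a single pass with heapq.merge rather than in several passes.

```python
import heapq
import os
import pickle
import tempfile


def _write_run(sorted_records, directory):
    """Write one sorted block to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for record in sorted_records:
            pickle.dump(record, f)
    return path


def _read_run(path):
    """Yield the records of one sorted block sequentially."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_sort(records, n):
    """Yield records in sorted order, holding at most n of them while sorting."""
    with tempfile.TemporaryDirectory() as directory:
        runs, block = [], []
        for record in records:
            block.append(record)
            if len(block) == n:
                runs.append(_write_run(sorted(block), directory))
                block = []
        if block:
            runs.append(_write_run(sorted(block), directory))
        # Merge all sorted blocks; only one record per block is in memory.
        yield from heapq.merge(*(_read_run(path) for path in runs))


records = [(3, 0), (1, 1), (2, 2), (1, 3), (3, 4)]
print(list(external_sort(records, n=2)))
```

With five records and n = 2, three sorted blocks are written, matching the ceiling of tokenCount / n.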

Merge-based Index Construction

This approach is a direct extension of our in-memory hash-based index approach.
If the index is small enough to fit in memory the two approaches will be the same.
If the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition.
The in-memory index is wiped, and we continue indexing as if from scratch.
After going through the collection one has a sequence of partitions on disk.
In this set-up, terms serve as their own IDs, and the postings lists in each partition are stored in lexicographical order of their terms.
The final stage of the algorithm is to then merge each of the partitions into the final index.

Merge-Based Index Pseudocode

buildIndex_mergeBased(inputTokenizer, memoryLimit)
{
    n := 0;
    position := 0;
    memoryConsumption := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T;
        create new entry if necessary;
        append new position to T's posting list;
        position++;
        memoryConsumption++;
        if (memoryConsumption > memoryLimit) {
            createIndexPartition();
        }
    }
    if (memoryConsumption > 0) {
        createIndexPartition();
    }
    merge index partitions I[0],...,I[n-1]
        to make final index I_final;
}

createIndexPartition()
{
    create empty on disk inverted file I[n];
    sort in-memory dictionary entries in lex order;
    for each term T in dictionary {
        add T's posting list to I[n];
    }
    delete all in memory posting lists;
    write the dictionary to disk;
    reset the in-memory dictionary;
    memoryConsumption := 0;
    n++;
}

mergeIndexPartitions([I[0], ..., I[n-1]])
{
    create empty Inverted File I_final;
    for (k = 0; k < n; k++) {
        open partition I[k] for sequential processing;
    }
    currentIndex := 0;
    while (currentIndex != nil) {
        currentIndex := nil;
        for (k = 0; k < n; k++) {
            if (I[k] still has terms left) {
                if (currentIndex == nil || 
                    I[k].currentTerm < currentTerm) {
                    currentIndex := I[k];
                    currentTerm := I[k].currentTerm;
                }     
            }
        }
        if (currentIndex != nil) {
            I_final.addPostings(currentTerm,
                currentIndex.getPostings(currentTerm));
            currentIndex.advanceToNextTerm();
        }
    }
    delete I[0], ..., I[n-1];
}
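
The partition-and-merge scheme can be sketched in Python with in-memory lists standing in for the on-disk partitions. The names build_partitions and merge_partitions are made up for this example, memory consumption is counted in postings as in the pseudocode, and heapq.merge replaces the explicit loop over partitions.

```python
import heapq
from collections import defaultdict


def build_partitions(tokens, memory_limit):
    """Split the index into partitions, each a sorted list of (term, postings)."""
    partitions = []
    current, used = defaultdict(list), 0
    for position, term in enumerate(tokens):
        current[term].append(position)
        used += 1
        if used > memory_limit:
            # "Write" the partition in lexicographic term order and start over.
            partitions.append(sorted(current.items()))
            current, used = defaultdict(list), 0
    if used > 0:
        partitions.append(sorted(current.items()))
    return partitions


def merge_partitions(partitions):
    """Merge sorted partitions into one list of (term, postings)."""
    final = []
    # Entries compare by term first. Ties on the term fall back to comparing
    # the postings lists, and earlier partitions hold smaller positions, so
    # postings for a term arrive in increasing position order.
    for term, postings in heapq.merge(*partitions):
        if final and final[-1][0] == term:
            final[-1][1].extend(postings)
        else:
            final.append((term, list(postings)))
    return final


tokens = "to be or not to be that is the question".split()
index = merge_partitions(build_partitions(tokens, memory_limit=4))
print(index)
```

When the memory limit is larger than the collection, a single partition is produced and the result is the same as that of the in-memory approach.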

Remarks on Merge Algorithm

The algorithm takes time which grows only slightly more than linearly in the size of the collection.
You need to be able to keep at least a few pages from each partition in memory at a time, so your RAM limits the total size of the collection you can index.
Even if you can keep one page of each partition in memory, being able to keep more will often substantially improve your performance.
