Index-Time Dictionary

Last Monday, we began talking about in-memory inverted index construction algorithms as a prelude to disk-based algorithms.
Recall we use the dictionary at two different times: at index time and at query time.
At index time we need to be able to perform efficient term lookups (so we can find the posting list to append to) and term insertions (for new entries).
The book gives an example where they use C++ STL data structures to do this.
The relevant structures are map, which is tree-based, and hash_map, which uses a hash table.
On a 10,000-document subset of GOV2, map took on average 620 ns per lookup, whereas hash_map took 240 ns per lookup on their hardware.
If this scaled to the complete GOV2 collection of roughly 44 billion tokens, then the time spent doing just a single hash_map lookup for each token would total about 3 hrs.
Currently, the fastest publicly available IR systems take less than 4 hrs to completely index GOV2, so they must be doing these operations considerably faster.
So how can we customize our dictionary to speed things up?
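
The lookup-time estimate above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes GOV2 has roughly 44 billion tokens and that every token costs exactly one lookup at the quoted average time. The names `GOV2_TOKENS` and `total_lookup_hours` are made up for illustration.

```python
# Assumption: GOV2 contains roughly 44 billion tokens.
GOV2_TOKENS = 44 * 10**9


def total_lookup_hours(num_tokens, ns_per_lookup):
    """Hours spent if every token costs exactly one dictionary lookup."""
    seconds = num_tokens * ns_per_lookup * 1e-9
    return seconds / 3600


hash_map_hours = total_lookup_hours(GOV2_TOKENS, 240)  # hash-table timing
map_hours = total_lookup_hours(GOV2_TOKENS, 620)       # tree-based timing
print(f"hash_map: {hash_map_hours:.1f} h, map: {map_hours:.1f} h")
```

Under these assumptions the hash table alone accounts for close to 3 hours, and the tree-based map for more than twice that.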

Speeding up the Dictionary

Recall we said that term frequencies follow a Zipfian distribution.
So 90% of all term occurrences correspond to the 10,000 most frequent terms.
If we are using a hash table where collisions are resolved by chaining then it is important to keep the most frequent terms at the start of the given chain they belong to.
Two heuristics to do this are:
1. The insert-at-back heuristic -- when adding a new item to a chain, add it to the end of the chain; if it was discovered later, chances are it is less frequent.
2. The move-to-front heuristic -- when a term lookup is done and the item is not found at the front of the chain, move it to the front. This takes O(1) overhead and tends to keep frequent items at the front of the chain.
Using move-to-front, the dictionary implementation is largely insensitive to the size of the hash table. Even a small table (around 65,000 slots) can deal with millions of terms. Further, using these heuristics the book found the average lookup time was around 90 ns, so faster than the STL implementations.
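
The two heuristics above can be sketched as a small chained hash table. This is an illustrative sketch, not the book's implementation: the class `ChainedDictionary`, its helper `chain_terms`, and the default slot count are hypothetical, and Python's built-in `hash` stands in for whatever hash function a real system would use.

```python
class _Node:
    """One chain entry: a term, its postings, and a link to the next entry."""
    __slots__ = ("term", "postings", "next")

    def __init__(self, term):
        self.term = term
        self.postings = []
        self.next = None


class ChainedDictionary:
    """Hash table with chaining, insert-at-back and move-to-front."""

    def __init__(self, num_slots=65536):
        self.heads = [None] * num_slots

    def lookup(self, term):
        """Return the postings list for term, creating an entry if needed."""
        slot = hash(term) % len(self.heads)
        prev, node = None, self.heads[slot]
        while node is not None:
            if node.term == term:
                if prev is not None:
                    # Move-to-front: unlink the node and relink it at the
                    # head of the chain (constant extra work).
                    prev.next = node.next
                    node.next = self.heads[slot]
                    self.heads[slot] = node
                return node.postings
            prev, node = node, node.next
        # Not found. Insert-at-back: prev is now the last node of the chain.
        new_node = _Node(term)
        if prev is None:
            self.heads[slot] = new_node
        else:
            prev.next = new_node
        return new_node.postings

    def chain_terms(self, slot):
        """List the terms of one chain in order (for inspection only)."""
        terms, node = [], self.heads[slot]
        while node is not None:
            terms.append(node.term)
            node = node.next
        return terms
```

For example, with a single slot every term lands in the same chain, so after indexing "a", "b", "c" and then looking up "c" again, the chain order becomes c, a, b.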

Extensible in-memory postings lists

One quick way to implement postings lists would be to just use a linked-list.
The drawback to this is that for each 32- or 64-bit posting, you need to maintain another 32- or 64-bit pointer.
This can almost double the space requirements of your postings lists.
If you use fixed-size arrays to store the posting lists, you need at least two passes over the data: one to determine the size of each array, and a second to store postings into them.
Another approach is somewhat like vectors in Java: on seeing a new term you initially allocate space for, say, 10 postings. When this is exhausted, you allocate a new array of some multiple of the old array size, copy the old elements to the new array, and then free the memory for the old array.
If the multiple is 2, the book reports that, experimentally, the wasted memory runs around 25%.
A final approach, which the book says performed best in their experiments, is to use a linked list with grouping.
In this approach, the first node for a posting list is large enough to hold, say, 16 postings. When this is exhausted, a new node is created whose size is some multiple of the total space currently allocated for this list, up to some limit, say 256 postings. Storing more than one posting per node is sometimes called using an unrolled linked list.
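
The grouping scheme can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `GroupedPostingsList` and its parameters are hypothetical, a growth multiple of 1.0 is chosen only for illustration, and a Python list stands in for each fixed-capacity node.

```python
class GroupedPostingsList:
    """Unrolled linked list: each node (group) holds many postings."""

    def __init__(self, first_size=16, growth=1.0, max_size=256):
        self.growth = growth            # hypothetical multiple of allocated space
        self.max_size = max_size        # upper limit on a single node's capacity
        self.capacities = [first_size]  # capacity of each node, in postings
        self.groups = [[]]              # each inner list stands in for one node
        self.allocated = first_size     # total capacity allocated so far

    def append(self, posting):
        if len(self.groups[-1]) == self.capacities[-1]:
            # Last node is full: size the new node relative to everything
            # allocated so far, but never above max_size.
            new_size = min(self.max_size,
                           max(1, int(self.growth * self.allocated)))
            self.capacities.append(new_size)
            self.groups.append([])
            self.allocated += new_size
        self.groups[-1].append(posting)

    def __iter__(self):
        for group in self.groups:
            yield from group
```

With these illustrative defaults, a list of 1,000 postings ends up with node capacities 16, 16, 32, 64, 128, 256, 256, 256, so only one pointer is needed per node rather than per posting.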

HW Problem

Exercise 4.1. Suppose the posting list for some term consists of 64 million 4-byte postings. To carry out a random access of this posting list, the search engine performs two disk read operations: (1) it loads the per-term index into RAM; (2) it loads a block B of postings into RAM, identified by binary search of the synchronization points. Let the granularity (Gran) be the number of postings per sync point. What is the optimal granularity for this situation? What is the total number of bytes read from disk?

Let `NumSync` be the total number of sync points, so `NumSync = (64 times 10^6)/(Gran)`.

Let `BIO` be the number of bytes per I/O, and let `TIO` be the total number of I/Os.

Assuming each sync point, like each posting, takes 4 bytes, `TIO = (4 times NumSync)/(BIO) + (4 times Gran) /(BIO) = (256 times 10^6)/(Gran times BIO) + (4 times Gran) /(BIO)`.

To minimize, we differentiate with respect to `Gran`, giving:
`frac(d TIO)(d Gran) = frac(d)(d Gran)((256 times 10^6)/(Gran times BIO)) + frac(d)(d Gran)((4 times Gran) /(BIO))`
`= -((256 times 10^6)/(Gran^2 times BIO)) + 4/(BIO) = 0`

Solving this gives `Gran = sqrt(64 times 10^6) = 8000` postings, and hence `NumSync = 8000` synchronization points. Assuming `4`KB block sizes, reading `B` (`32,000` bytes) would take `8` I/Os, and reading all the synchronization records (also `32,000` bytes) would take another `8` I/Os, for a total of `16` I/Os covering `64,000` bytes of data.
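
The result can be double-checked numerically. The sketch below assumes, as in the derivation, 4 bytes per posting and per sync point, plus 4 KB blocks; the names `POSTINGS`, `ENTRY_BYTES`, `BLOCK_BYTES` and `bytes_read` are made up for this check.

```python
import math

POSTINGS = 64 * 10**6   # postings in the list
ENTRY_BYTES = 4         # bytes per posting and (assumed) per sync point
BLOCK_BYTES = 4096      # assumed 4 KB disk blocks


def bytes_read(gran):
    """Bytes read for one random access: all sync points plus one block B."""
    num_sync = POSTINGS / gran
    return ENTRY_BYTES * num_sync + ENTRY_BYTES * gran


# The derivative vanishes where Gran^2 = 64 * 10^6.
optimal_gran = math.isqrt(POSTINGS)
ios_per_read = math.ceil(ENTRY_BYTES * optimal_gran / BLOCK_BYTES)
total_ios = 2 * ios_per_read

print(optimal_gran, bytes_read(optimal_gran), total_ios)
```

Evaluating `bytes_read` at neighbouring granularities such as 7999 or 8001 gives slightly larger values, which is consistent with 8000 being the minimum.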

Sort-based Index Construction

We now begin to look at disk-based approaches.
Suppose we are in the schema-independent setting, and are just reading our corpus.
As we read each term, we could imagine emitting a pair (termID, position) and writing it to disk.
When we have finished scanning our corpus, our collection of pairs would be sorted by position.
To make lookup efficient in an inverted index we want to re-sort these by termID.
After re-sorting, the ordered pairs would be in a format suitable for use by one of our in-memory dictionary structures.

Sort-based Index Construction Code

buildIndex_sortBased(inputTokenizer)
{
    position := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T, create new entry if necessary;
        termID := unique termID of T;
        write record R[position] := (termID, position) to disk;
        position++;
    }
    tokenCount := position;
    sort R[0], .., R[tokenCount-1] by first component; break ties with second component;
    perform a sequential scan of R[0], .., R[tokenCount-1] creating the final index;
    return;
}
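
As a rough illustration, the pseudocode above can be mirrored in a few lines of Python. This is only a sketch under simplifying assumptions: the records are kept in an in-memory list instead of being written to disk, a plain dict plays the role of the dictionary, and the function name `build_index_sort_based` is hypothetical.

```python
from itertools import groupby


def build_index_sort_based(tokens):
    """Return (term_ids, index), where index maps termID -> sorted positions."""
    term_ids = {}   # dictionary: term -> unique termID
    records = []    # stands in for the records R[...] written to disk
    for position, token in enumerate(tokens):
        # Obtain the dictionary entry, creating a new termID if necessary.
        term_id = term_ids.setdefault(token, len(term_ids))
        records.append((term_id, position))

    # Sort by first component; ties are broken by the second component.
    records.sort()

    # Sequential scan: consecutive records with equal termID form one list.
    index = {}
    for term_id, group in groupby(records, key=lambda record: record[0]):
        index[term_id] = [position for _, position in group]
    return term_ids, index


term_ids, index = build_index_sort_based("to be or not to be".split())
print(index[term_ids["to"]])
```

In a real disk-based setting the `records.sort()` call is the step that has to be replaced by an external sort.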

Disk-Based Sorting

To get the above to work we need to be able to implement disk-based sorting.
This is often done by reading n items into memory at a time, sorting them, and writing them back out as a sorted block.
This gives `|~ frac(mbox(tokenCount))(n) ~|` many blocks.
These blocks are then often merged using a multiway disk merge operation in logarithmically many passes.
Disk-based sorting can take a fair bit of space to store temporary files. For GOV2, the book says 492 GB compared to the 426 GB size of the original collection.
Keeping track of globally unique termIDs can be RAM intensive; for GOV2 it takes more than 1 GB of RAM.
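
A small sketch of the run-then-merge idea follows. It is not a tuned implementation: temporary files and `pickle` stand in for a real on-disk record format, the helper names are hypothetical, and all runs are merged in a single pass with `heapq.merge` rather than in several passes as a system with limited buffers might need.

```python
import heapq
import pickle
import tempfile


def _write_run(sorted_records):
    """Write one sorted block to a temporary file and rewind it."""
    run = tempfile.TemporaryFile()
    for record in sorted_records:
        pickle.dump(record, run)
    run.seek(0)
    return run


def _read_run(run):
    """Stream records back from a run file one at a time."""
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return


def external_sort(records, n):
    """Yield records in sorted order, holding at most n unsorted items in RAM."""
    runs, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) == n:
            runs.append(_write_run(sorted(buffer)))
            buffer = []
    if buffer:
        runs.append(_write_run(sorted(buffer)))
    try:
        # Multiway merge: repeatedly emit the smallest head among all runs.
        yield from heapq.merge(*(_read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

For instance, `list(external_sort(pairs, n=3))` on a list of (termID, position) pairs returns the same result as `sorted(pairs)`, while only ever sorting three records at a time.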

Merge-based Index Construction

This approach is a direct extension of our in-memory hash-based index approach.
If the index is small enough to fit in memory the two approaches will be the same.
If the index is too big to fit in memory, the in-memory index is written to the disk into a file called a partition.
The in-memory index is wiped, and we continue indexing as if from scratch.
After going through the collection one has a sequence of partitions on disk.
In this set-up, terms are their own IDs, and the posting lists in each partition are sorted in lexicographical order of their terms.
The final stage of the algorithm is to then merge each of the partitions into the final index.
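
The partition-and-merge steps described above can be sketched as below. This is a simplified illustration, not the book's algorithm as given: partitions are kept as in-memory lists rather than files, the memory limit is expressed as a number of postings, and the names `build_index_merge_based` and `max_postings_in_memory` are hypothetical.

```python
import heapq
from itertools import groupby


def build_index_merge_based(tokens, max_postings_in_memory):
    """Return the final index as a list of (term, postings), sorted by term."""
    partitions = []   # each partition stands in for one file on disk
    in_memory = {}    # current in-memory index: term -> list of positions
    count = 0

    def flush():
        # Write the in-memory index out as a partition sorted by term.
        # Entries carry the partition number so equal terms merge in order.
        number = len(partitions)
        partitions.append([(term, number, postings)
                           for term, postings in sorted(in_memory.items())])

    for position, term in enumerate(tokens):
        in_memory.setdefault(term, []).append(position)
        count += 1
        if count == max_postings_in_memory:
            flush()
            in_memory, count = {}, 0   # wipe and continue as if from scratch
    if in_memory:
        flush()

    # Final stage: multiway merge of all partitions, concatenating the
    # posting lists of each term in partition order.
    final_index = []
    merged = heapq.merge(*partitions)
    for term, entries in groupby(merged, key=lambda entry: entry[0]):
        postings = []
        for _, _, partial in entries:
            postings.extend(partial)
        final_index.append((term, postings))
    return final_index
```

Because partitions are created in collection order, concatenating the partial lists in partition order keeps each final posting list sorted by position.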

Merge-Based Index Pseudocode

Merge-based index construction pseudo-code

Remarks on Merge Algorithm

The algorithm takes time which grows only slightly more than linearly in the size of the collection.
You need to be able to keep at least a few pages from each partition in memory at a time, so your RAM limits the total size of the collection you can index.
Even if you can keep one page of each partition in memory, being able to keep more will often substantially improve your performance.

More Index Construction

Outline