More Index Construction




CS267

Chris Pollett

Oct. 17, 2011

Outline

Index-Time Dictionary

Speeding up the Dictionary

Extensible in-memory postings lists

Quiz

Which of the following is true?

  1. A per-term index means that a separate dictionary file is used for every term in the inverted index.
  2. Dictionary interleaving connects the dictionary directly to a document map without the use of posting lists.
  3. A dictionary-as-string approach is often more space efficient than using a fixed number of characters for terms in the dictionary.

Sort-based Index Construction

Sort-based Index Construction Code

buildIndex_sortBased(inputTokenizer)
{
    position := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T, create new entry if necessary;
        termID := unique termID of T;
        write record R[position] := (termID, position) to disk;
        position++;
    }
    tokenCount := position;
    sort R[0], ..., R[tokenCount-1] by first component; break ties with second component;
    perform a sequential scan of R[0], ..., R[tokenCount-1], creating the final index;
    return;
}
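
The pseudocode above can be sketched in Python roughly as follows. This is a minimal in-memory illustration, not the course's implementation: the function name `build_index_sort_based` is hypothetical, the input is assumed to be any iterable of term strings, and a plain list stands in for the records that the pseudocode writes to disk.

```python
from itertools import groupby
from operator import itemgetter


def build_index_sort_based(tokens):
    """Build a positional index from an iterable of term strings.

    Returns (dictionary, index), where dictionary maps term -> termID and
    index maps termID -> sorted list of token positions.
    """
    dictionary = {}  # term -> termID, assigned in order of first appearance
    records = []     # assumption: an in-memory stand-in for the on-disk records R

    for position, term in enumerate(tokens):
        # Obtain the dictionary entry for the term, creating it if necessary.
        term_id = dictionary.setdefault(term, len(dictionary))
        records.append((term_id, position))

    # Sort by termID; tuple comparison breaks ties by position.
    records.sort()

    # Sequential scan: consecutive records with the same termID form one postings list.
    index = {}
    for term_id, group in groupby(records, key=itemgetter(0)):
        index[term_id] = [position for _, position in group]

    return dictionary, index
```

For the token stream `"to be or not to be".split()`, the term `to` receives termID 0 and its postings list is `[0, 4]`. In a real collection the records would not fit in memory, which is why the sort step in the pseudocode has to be performed on disk.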

Disk-Based Sorting

Merge-based Index Construction

Merge-Based Index Pseudocode

[Figure: merge-based index construction pseudo-code]
[Figure: merge partition pseudo-code]

Remarks on Merge Algorithm