Merge Based Index Construction - Query Processing




CS267

Chris Pollett

Oct 23, 2023

Outline

Introduction

Merge-based Index Construction

Merge-Based Index Pseudocode

buildIndex_mergeBase(inputTokenizer, memoryLimit) 
{
    n := 0;
    position := 0;
    memoryConsumption := 0;
    while (inputTokenizer.hasNext()) {
        T := inputTokenizer.getNext();
        obtain dictionary entry for T;
        create new entry if necessary;
        append new position to T's posting list;
        position++;
        memoryConsumption++;
        if (memoryConsumption > memoryLimit) {
            createIndexPartition();
        }
    }
    if (memoryConsumption > 0) {
        createIndexPartition();
    }
    mergeIndexPartitions(I[0], ..., I[n-1]);
       // to make final index I_final
}

createIndexPartition()
{
    create empty on-disk inverted file I[n];
    sort in-memory dictionary entries in lexicographic order;
    for each term T in dictionary {
        add T's posting list to I[n];
    }
    delete all in-memory posting lists;
    write the dictionary to disk;
    reset the in-memory dictionary;
    memoryConsumption := 0;
    n++;
}

mergeIndexPartitions(I[0], ..., I[n-1])
{
    create empty inverted file I_final;
    for (k = 0; k < n; k++) {
        open partition I[k] for sequential processing;
    }
    currentIndex := I[0]; // any non-nil value, so the loop body runs at least once
    while (currentIndex != nil) {
        currentIndex := nil;
        for (k = 0; k < n; k++) {
            if (I[k] still has terms left) {
                if (currentIndex == nil || 
                    I[k].currentTerm < currentTerm) {
                    currentIndex := I[k];
                    currentTerm := I[k].currentTerm;
                }     
            }
        }
        if (currentIndex != nil) {
            I_final.addPostings(currentTerm,
                currentIndex.getPostings(currentTerm));
            currentIndex.advanceToNextTerm();
        }
    }
    delete I[0], ..., I[n-1];
}
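
The pseudocode above can be tried out with a small Python sketch. This is only an illustration under several assumptions: partitions are kept as in-memory sorted lists rather than on-disk inverted files, memory consumption is counted as one unit per posting, and the names build_partitions, merge_partitions, and build_index_merge_based are made up for this example. The merge step also swaps the linear scan over all n partitions for a heap-based n-way merge via heapq.merge; the result should be the same, since ties on a term are broken by partition number, so postings stay in increasing position order.

```python
import heapq
from collections import defaultdict


def build_partitions(tokens, memory_limit):
    """Split a token stream into sorted partitions.

    Each partition is a list of (term, positions) pairs sorted by term,
    standing in for an on-disk inverted file I[k] (an assumption made
    here to keep the sketch self-contained).
    """
    partitions = []
    dictionary = defaultdict(list)  # term -> posting list of positions
    memory_consumption = 0
    for position, term in enumerate(tokens):
        dictionary[term].append(position)
        memory_consumption += 1
        if memory_consumption > memory_limit:
            # createIndexPartition(): sort terms, flush, reset
            partitions.append(sorted(dictionary.items()))
            dictionary = defaultdict(list)
            memory_consumption = 0
    if memory_consumption > 0:
        partitions.append(sorted(dictionary.items()))
    return partitions


def merge_partitions(partitions):
    """n-way merge of sorted partitions into one final index."""
    final_index = []
    # Tag each entry with its partition number k. Tuples compare by term
    # first, then by k, so equal terms are emitted in partition order and
    # the posting lists themselves are never compared.
    tagged = (
        [(term, k, postings) for term, postings in partition]
        for k, partition in enumerate(partitions)
    )
    for term, _k, postings in heapq.merge(*tagged):
        if final_index and final_index[-1][0] == term:
            final_index[-1][1].extend(postings)  # same term, later partition
        else:
            final_index.append((term, list(postings)))
    return final_index


def build_index_merge_based(tokens, memory_limit):
    """Build partitions, then merge them into the final index."""
    return merge_partitions(build_partitions(tokens, memory_limit))
```

For example, build_index_merge_based("to be or not to be".split(), 3) produces two partitions (the first is flushed once a fourth posting arrives) and merges them into a single term-sorted list in which "to" has positions [0, 4] and "be" has positions [1, 5].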

Remarks on Merge Algorithm

Query Processing

Query Processing for Ranked Retrieval

Okapi BM25

BM25 Example

Quiz

Which of the following is true?

    1. Hash-based dictionaries are better if we want to support prefix queries.
    2. The move-to-front heuristic is used as part of hash-based dictionary construction.
    3. Self-indexing is a synonym for using a B-tree for posting lists.

Document-at-a-Time Query Processing

Binary Heaps