Gap Compression, Dynamic Inverted Indexes




CS267

Chris Pollett

Oct 28, 2019

Outline

Finishing Up General Text Compression

Compressing Posting Lists: Δ-values

Nonparametric Gap Compression

Quiz

Which of the following is true?

  1. The trec_eval program makes use of a sub-module to automatically judge relevance of documents without human involvement.
  2. A static, symbol-wise compression method compresses symbols in the same way, independent of the message being compressed.
  3. To code a string using arithmetic compression we first build a Huffman tree.

Parametric Gap Compression

Geometric Distributions and Posting Lists

Golomb/Rice Codes

Finding the Modulus

Byte-Aligned Codes

Dynamic Inverted Indexes

Batch Updates

REBUILD versus REMERGE

Incremental Index Updates

In-memory Hash Index

NO MERGE Index Updates

  • Suppose that, while the search engine is building an index, say after creating `n` on-disk index partitions, we want it to process a keyword query composed of `m` query terms.
  • We could repeat the following procedure for each of the query terms:
    1. Fetch the term's postings list fragment from each of the `n` on-disk index partitions.
    2. Use the in-memory hash table to fetch the term's in-memory list fragment.
    3. Concatenate all `n+1` fragments to form the term's postings list.
  • This strategy is called the NO MERGE index update strategy.
  • It tends not to be a very attractive strategy because processing a search query requires a large number of disk seeks: one for every combination of query term and index partition, so roughly `m·n` in total.
  • It is often used as a baseline to which other strategies are compared.
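
The three-step procedure above can be sketched in Python. This is a toy illustration under stated assumptions, not the implementation of any particular search engine: the class `NoMergeIndex`, its methods, and the `seeks` counter are hypothetical names, the "on-disk" partitions are simulated with in-memory dictionaries, and postings are plain docids added in increasing order (so concatenating fragments oldest-first yields a sorted list).

```python
from collections import defaultdict


class NoMergeIndex:
    """Toy NO MERGE index: n frozen partitions plus one in-memory hash index."""

    def __init__(self):
        self.partitions = []                # stand-ins for on-disk partitions, oldest first
        self.in_memory = defaultdict(list)  # term -> docids not yet written out
        self.seeks = 0                      # simulated disk seeks (hypothetical counter)

    def add_document(self, docid, text):
        # Assumes docids arrive in increasing order.
        for term in set(text.split()):
            self.in_memory[term].append(docid)

    def flush(self):
        # Freeze the in-memory index as a new partition; old partitions are never merged.
        if self.in_memory:
            self.partitions.append(dict(self.in_memory))
            self.in_memory = defaultdict(list)

    def postings(self, term):
        fragments = []
        for partition in self.partitions:
            self.seeks += 1                 # step 1: one seek per term per partition
            fragments.append(partition.get(term, []))
        fragments.append(self.in_memory.get(term, []))  # step 2: in-memory fragment
        # Step 3: concatenate all n+1 fragments.
        return [docid for fragment in fragments for docid in fragment]

    def query(self, terms):
        return {term: self.postings(term) for term in terms}


index = NoMergeIndex()
index.add_document(1, "a b")
index.add_document(2, "b c")
index.flush()                               # partition 1
index.add_document(3, "a c")
index.flush()                               # partition 2
index.add_document(4, "a")                  # still in memory

print(index.query(["a", "c"]))              # {'a': [1, 3, 4], 'c': [2, 3]}
print(index.seeks)                          # 4, i.e. m = 2 terms times n = 2 partitions
```

The seek counter makes the cost visible: it grows with both the number of query terms and the number of partitions, which is why NO MERGE serves mainly as a baseline.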

Contiguous Inverted Lists

REMERGE UPDATE

In-place Index Updates