Dynamic Inverted Indexes




CS267

Chris Pollett

Nov. 14, 2012

Outline

Dynamic Inverted Indexes

Batch Updates

REBUILD versus REMERGE

HW Problem

Exercise 6.6. What is the expected number of bits per codeword when using a Rice code with parameter `M = 2^7` to compress a geometrically distributed postings list for a term `T` with `N_t/N approx 0.01`?

Answer. `E[mbox(bits/code word)] = sum_(k=1)^(infty) P(mbox(gap is k))cdot (mbox(length of codeword when the gap is k)).`
Since the postings are distributed geometrically with `N_t/N approx 0.01` we have:
`P(mbox(gap is k)) = (1 - 0.01)^(k-1)(0.01) = 0.01(0.99)^(k-1)`.
The length of a codeword when the gap is `k` using a Rice code is `q(k) +1 + lfloor log_2(M) rfloor` where `q(k) = lfloor (k-1)/M rfloor`. Since we are given `M = 2^7`, the length is
`1 + 7 + lfloor (k-1)/(2^7) rfloor = 8 + lfloor (k-1)/(2^7) rfloor`.
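So `E[mbox(bits/code word)]` is
`sum_(k=1)^(infty) 0.01(0.99)^(k-1) cdot (8 + lfloor (k-1)/(2^7) rfloor)`
Dropping the floor, so that `lfloor (k-1)/(2^7) rfloor approx (k-1)/(2^7)` (this gives an upper bound), and using `0.01/2^7 approx 7.81 times 10^(-5)`, this is
`approx 0.08 sum_(k=1)^(infty) (0.99)^(k-1) + 7.81 times 10^(-5) sum _(k=1)^(infty) (k-1)(0.99)^(k-1)`
Writing `(k-1)(0.99)^(k-1) = k(0.99)^(k-1) - (0.99)^(k-1)` and absorbing the second piece into the first sum (`0.08 - 7.81 times 10^(-5) approx 0.0799`):
` approx 0.0799 sum_(k=1)^(infty) (0.99)^(k-1) + 7.81 times 10^(-5) sum _(k=1)^(infty) k(0.99)^(k-1)`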
Temporarily set `r = 0.99`. Since `sum_(k=1)^(infty) k r^(k-1) = d/(dr)(sum_(k=0)^(infty) r^k)` and `sum_(k=1)^(infty) r^(k-1) = sum_(k=0)^(infty) r^k`, we have
` = 7.81 times 10^(-5) d/(dr)(sum _(k= 0)^(infty) r^k) + 0.0799 sum_(k=0)^(infty) r^(k)`
Using that the sum of the geometric series `sum_(k=0)^(infty) r^k` is `1/(1-r)`:
` = 7.81 times 10^(-5) d/(dr)(1/(1 - r)) + 0.0799 cdot 1/(1-r)`
Taking the derivative:
` = (7.81 times 10^(-5))/(1-r)^2 + 0.0799/(1-r)`
Finally, substituting `r = 0.99`:
`= 0.781 + 7.99 = 8.77` bits.
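
The arithmetic above can be checked numerically. The following is a minimal Python sketch, not part of the original exercise: the function names `rice_codeword_length` and `expected_bits` and the truncation point `max_gap` are illustrative assumptions. It evaluates the expectation term by term with the floor kept, and compares it with the value obtained when the floor is replaced by `(k-1)/M`, as the derivation does.

```python
def rice_codeword_length(k: int, M: int) -> int:
    """Bits in the Rice codeword for gap k >= 1, with M a power of two."""
    q = (k - 1) // M                    # quotient, sent in unary using q + 1 bits
    log2_M = M.bit_length() - 1         # exact log_2(M) when M is a power of two
    return q + 1 + log2_M               # unary part plus log_2(M) remainder bits


def expected_bits(p: float, M: int, max_gap: int = 20_000) -> float:
    """Truncated sum of P(gap = k) * length(k) for geometrically distributed gaps.

    max_gap is an assumed cutoff; the neglected tail is about (1 - p)**max_gap.
    """
    return sum(p * (1 - p) ** (k - 1) * rice_codeword_length(k, M)
               for k in range(1, max_gap + 1))


p, M = 0.01, 2 ** 7
r = 1 - p

# Term-by-term sum with the floor kept.
numeric = expected_bits(p, M)

# Closed form with the floor kept:
# E[floor((k-1)/M)] = sum_{j>=1} P(k-1 >= j*M) = r^M / (1 - r^M).
closed_form = 8 + r ** M / (1 - r ** M)

# Value from the derivation, where floor((k-1)/M) is replaced by (k-1)/M.
approximation = (p / M) / (1 - r) ** 2 + (8 * p - p / M) / (1 - r)

print(f"sum with floor kept : {numeric:.2f} bits")
print(f"closed form         : {closed_form:.2f} bits")
print(f"floor dropped       : {approximation:.2f} bits")
```

With the floor kept, the sum comes out at about 8.38 bits, while dropping the floor reproduces the 8.77 bits obtained above. The difference is the cost of replacing `lfloor (k-1)/(2^7) rfloor` by `(k-1)/(2^7)`.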

Incremental Index Updates

In-memory Hash Index

NO MERGE Index Updates

  • Suppose that, while the search engine is building an index, say after creating `n` on-disk index partitions, we want it to process a keyword query composed of `m` query terms.
  • We could repeat the following procedure for each of the query terms:
    1. Fetch the term's postings list fragment from each of the `n` on-disk index partitions.
    2. Use the in-memory hash table to fetch the term's in-memory list fragment.
    3. Concatenate all `n+1` fragments to form the term's postings list.
  • This strategy is called the NO MERGE index update strategy.
  • It tends not to be a very attractive strategy because of the large number of disk seeks required to process a search query: one for every combination of query term and index partition, so about `m cdot n` seeks in total.
  • It is often used as a baseline against which other strategies are compared.
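
The three-step procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: each on-disk partition is modelled as a plain dictionary rather than a file, the names `no_merge_postings` and `count_seeks` are hypothetical, and the partitions are assumed to have been written in increasing docid order so that concatenation yields a sorted list.

```python
def no_merge_postings(term, disk_partitions, in_memory_index):
    """Build a term's postings list from n on-disk fragments plus the in-memory one."""
    postings = []
    # Step 1: fetch the term's fragment from each of the n on-disk partitions.
    # In a real system each lookup would cost roughly one disk seek.
    for partition in disk_partitions:
        postings.extend(partition.get(term, []))
    # Step 2: fetch the fragment held in the in-memory hash table.
    postings.extend(in_memory_index.get(term, []))
    # Step 3: the n + 1 fragments are now concatenated. This is sorted only
    # under the assumption that partitions were written in docid order.
    return postings


def count_seeks(query_terms, disk_partitions):
    """One seek per (query term, on-disk partition) pair, i.e. m * n."""
    return len(query_terms) * len(disk_partitions)


# Toy data: n = 2 on-disk partitions and one in-memory index (docids only).
disk_partitions = [
    {"index": [1, 4], "merge": [2]},
    {"index": [7], "update": [8, 9]},
]
in_memory_index = {"index": [12], "merge": [11]}

query = ["index", "merge"]  # m = 2 query terms
result = {t: no_merge_postings(t, disk_partitions, in_memory_index) for t in query}
seeks = count_seeks(query, disk_partitions)

print(result)  # {'index': [1, 4, 7, 12], 'merge': [2, 11]}
print(seeks)   # 4
```

Even in this toy setting the seek count grows as the product of query length and partition count, which is why NO MERGE mainly serves as a baseline.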
Contiguous Inverted Lists

REMERGE UPDATE

In-place Index Updates