Divergence-from-randomness, Parallel Information Retrieval




CS267

Chris Pollett

Nov. 26, 2012

Outline

Introduction

The Basic DFR Formula

More on the Basic DFR Formula

Quiz

Which of the following is true?

  1. In our language modeling approach to relevance we first reduced the probability ranking principle to the question of calculating `p(Q|D,r)`.
  2. Pseudo-relevance Feedback was a modification to our logarithmic merging algorithm that alters when merging is done.
  3. The Jelinek Mercer Divergence approach to language modeling is based on the relative entropy of the language model determined by the query and that determined by the document.

The Binomial Coefficient

Estimating `P_1`

  • To compute `P_1`, suppose `d` is found to contain `f_(t,d)` occurrences of `t`. How many ways can we distribute the remaining `l_t - f_(t,d)` occurrences among the other `N - 1` documents? Using the binomial coefficient, we get:
    `(((N -1) + (l_t - f_(t,d)) - 1),(l_t -f_(t,d))) = frac(((N - 1)+ (l_t - f_(t,d)) - 1)!)((N-2)!(l_t - f_(t,d))!)` (**)
  • We can estimate `P_1` as the ratio (**)/(*). Unfortunately, this has lots of nasty factorials in it, which make it difficult to compute. There are different ways to estimate factorials such as Stirling's approximation. Using one of these, Amati and van Rijsbergen get the following estimates for `P_1` and `-log P_1`:
    `P_1 = (frac(1)(1+l_t/N))(frac(l_t/N)(1 + l_t/N))^(f_(t,d))`
    and
    `-log P_1 = log(1 + l_t/N) + f_(t,d) cdot log(1 + N/l_t)`.
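As a rough numerical check on the estimate above, the following Python sketch compares the factorial-based ratio (**)/(*) with the closed-form approximation of `-log P_1`. It rests on assumptions that are not stated on this slide: the count (*) is taken to be the binomial coefficient `((N + l_t - 1),(l_t))` (the number of ways to distribute all `l_t` occurrences among `N` documents), logarithms are natural logs, and the function names are made up for illustration. Factorials are handled through `math.lgamma` so large values do not overflow.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), computed with lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def neg_log_p1_exact(f_td, l_t, N):
    """-log of the ratio (**)/(*), without any approximation of the factorials."""
    rest = l_t - f_td
    # (**): ways to spread the remaining occurrences over the other N - 1 documents.
    log_num = log_binom((N - 1) + rest - 1, rest)
    # (*): ASSUMED to be C(N + l_t - 1, l_t); this slide does not show (*).
    log_den = log_binom(N + l_t - 1, l_t)
    return log_den - log_num


def neg_log_p1_approx(f_td, l_t, N):
    """Closed-form estimate: log(1 + l_t/N) + f_td * log(1 + N/l_t)."""
    return math.log(1 + l_t / N) + f_td * math.log(1 + N / l_t)


if __name__ == "__main__":
    # Hypothetical collection statistics, chosen only for illustration.
    N, l_t = 1_000_000, 5_000
    for f_td in (0, 1, 3, 10):
        exact = neg_log_p1_exact(f_td, l_t, N)
        approx = neg_log_p1_approx(f_td, l_t, N)
        print(f"f_td={f_td:2d}  exact={exact:.4f}  approx={approx:.4f}")
```

For a large collection and a small `f_(t,d)`, the two values should agree closely, which is what makes the closed form a practical replacement for the factorial expression.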

Estimating `P_2` and Computing DFR

Document Length Normalization

Parallel Information Retrieval

Parallel Query Processing

Index Partitioning Schemes