Divergence-from-randomness, Parallel Information Retrieval




CS267

Chris Pollett

Nov. 30, 2011

Outline

Introduction

The Basic DFR Formula

More on the Basic DFR Formula

The Binomial Coefficient

Estimating `P_1`

  • To compute `P_1`, suppose `d` is found to contain `f_(t,d)` occurrences of `t`. How many ways can we distribute the remaining `l_t - f_(t,d)` occurrences among the other `N - 1` documents? Using the binomial coefficient, we get:
    `(((N -1) + (l_t - f_(t,d)) - 1),(l_t -f_(t,d))) = frac(((N - 1)+ (l_t - f_(t,d)) - 1)!)((N-2)!(l_t - f_(t,d))!)` (**)
  • We can estimate `P_1` as the ratio (**)/(*). Unfortunately, this ratio has lots of nasty factorials in it, which make it difficult to compute directly. There are different ways to estimate factorials, such as Stirling's approximation. Using one of these, Amati and van Rijsbergen get the following estimates for `P_1` and `-log P_1`:
    `P_1 = (frac(1)(1+l_t/N))(frac(l_t/N)(1 + l_t/N))^(f_(t,d))`
    and
    `-log P_1 = log(1 + l_t/N) + f_(t,d) cdot log(1 + N/l_t)`.
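
As a rough numerical check of the estimates above, the following Python sketch compares the exact ratio (**)/(*) with the closed-form approximation. It assumes that (*), which is defined on an earlier slide, is the count `((N + l_t - 1),(l_t))` of ways to distribute all `l_t` occurrences over `N` documents. The function names are hypothetical, and base-2 logarithms are an assumption, since the slide does not fix a base.

```python
import math


def p1_exact(N, l_t, f_td):
    """Exact ratio (**)/(*).

    Assumption: (*) = C(N + l_t - 1, l_t), the number of ways to
    distribute all l_t occurrences of t over N documents.
    """
    remaining = l_t - f_td
    # (**): distribute the remaining occurrences over the other N - 1 documents.
    numerator = math.comb((N - 1) + remaining - 1, remaining)
    denominator = math.comb(N + l_t - 1, l_t)
    return numerator / denominator


def p1_approx(N, l_t, f_td):
    """Closed-form estimate of P_1 from the slide."""
    lam = l_t / N
    return (1 / (1 + lam)) * (lam / (1 + lam)) ** f_td


def neg_log_p1(N, l_t, f_td):
    """-log P_1 from the slide, using base-2 logs (an assumption)."""
    return math.log2(1 + l_t / N) + f_td * math.log2(1 + N / l_t)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real collection.
    N, l_t = 1000, 200
    for f_td in (0, 1, 3, 5):
        print(f_td, p1_exact(N, l_t, f_td), p1_approx(N, l_t, f_td),
              neg_log_p1(N, l_t, f_td))
```

For a collection that is large relative to `f_(t,d)`, the two probabilities should agree closely, and `neg_log_p1` only needs two logarithms instead of any factorials.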

Estimating `P_2` and Computing DFR

Document Length Normalization

Parallel Information Retrieval

Parallel Query Processing

Index Partitioning Schemes