Outline
- Divergence-from-Randomness
- In-Class Exercise
- Parallel Information Retrieval
Introduction
- On Monday, we considered language modeling approaches to constructing relevance measures.
- The idea was: given a document `d`, calculate the probability that a given query `q` would be used to
retrieve that document. Then use some variant of this probability as a relevance measure.
- The probabilities that we calculated were based on a language model for a document which took the basic maximum likelihood model, then did smoothing using a linear, Dirichlet, or random walk approach.
- When such smoothing is done, the model one comes up with consists of a background corpus model combined with those features of `d` on which it differs from the background model.
- The approach we begin with today, divergence-from-randomness (DFR), explicitly assumes a random process for the distribution of terms in documents, and then ranks documents by considering the probability that the actual term distribution found in a document would occur by chance.
- Unlike BM25 with `k_1` and `b`, LMJM with `lambda`, and LMD with `mu`, DFR does not have arbitrary parameter values that need to be set.
The Basic DFR Formula
- The basic starting formula for DFR is the following:
`(1 - P_2)cdot(-log P_1)`.
- `P_1` represents the probability that a random document `d` contains exactly `f_(t,d)` occurrences of `t`.
- The value `- log P_1` may be viewed as the number of bits of information (self information) associated with the knowledge that `d` has exactly `f_(t,d)` occurrences of `t`.
- `P_1` decreases rapidly as `f_(t,d)` increases and so `-log P_1` increases rapidly in `f_(t,d)`.
- The second factor, `P_2`, is related to eliteness and compensates for this rapid change. A document is said to be elite for the term `t` when it is "about" the topic associated with the term.
- If `d` is elite in `t`, we might assume that the occurrences of `t` within it are not accidental. If we start reading `d` and `d` is elite in `t`, we might well discover `f_(t,d) - 1` occurrences of `t` before reaching the end. Given this, we would expect to find more occurrences of `t` in `d`; `P_2` represents the probability of finding at least one more.
- `P_2` increases as `f_(t,d)` increases. Thus `(1-P_2)` decreases as `f_(t,d)` increases.
More on the Basic DFR Formula
- DFR was first developed by Amati and van Rijsbergen (2002). It has performed well in various TREC conferences since.
- To determine the relevance of a document to a query with this model we calculate:
`sum_(t in q) q_t(1 - P_(2,t)) cdot(-log P_(1,t))`
where `P_(1,t)` and `P_(2,t)` are the `P_1` and `P_2` associated with the particular term `t`. Here `q_t` is the number of occurrences of the term `t` in the query.
- This of course leaves the problem of how to estimate `P_1` and `P_2`.
- In their paper, Amati and van Rijsbergen presented seven different methods for doing this. They also presented different methods of incorporating document length normalization into DFR.
- We will now look at one way of estimating each.
The Binomial Coefficient
- Suppose we randomly distribute terms into documents. If we have `l_t` occurrences of term `t` distributed across `N` documents, then
`f_(t,1) + f_(t,2) + cdots + f_(t,N) = l_t`.
- How many different ways can `l_t` occurrences be distributed across `N` documents?
- The answer is given by the binomial coefficient:
`((N + l_t - 1),(l_t)) = frac((N+l_t - 1)!)((N-1)!l_t!)` (*)
- To see this we represent one way to partition occurrences in the format ++|+|+ ... where '+' represents an occurrence and '|' represents a document boundary.
- To completely represent one such partition, we need to choose, from the `N + l_t - 1` slots, the `l_t` slots that will hold '+'s. Order doesn't matter.
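As a quick sanity check on this stars-and-bars argument, the following Python sketch (function names are illustrative) compares the binomial coefficient against brute-force enumeration for a small case:

```python
from itertools import product
from math import comb

# Stars-and-bars count: ways to distribute l_t occurrences of a term
# across N documents, i.e. C(N + l_t - 1, l_t).
def stars_and_bars(N, l_t):
    return comb(N + l_t - 1, l_t)

# Brute force for small cases: count tuples (f_1, ..., f_N) of
# non-negative integers with f_1 + ... + f_N = l_t.
def brute_force(N, l_t):
    return sum(1 for fs in product(range(l_t + 1), repeat=N)
               if sum(fs) == l_t)

N, l_t = 4, 5
assert stars_and_bars(N, l_t) == brute_force(N, l_t)
print(stars_and_bars(N, l_t))  # prints 56
```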
Estimating `P_1`
- To compute `P_1`, suppose `d` is found to contain `f_(t,d)` occurrences of `t`. How many ways can we distribute the remaining `l_t - f_(t,d)` occurrences across the other `N - 1` documents? Using the binomial coefficient, we get:
`(((N -1) + (l_t - f_(t,d)) - 1),(l_t -f_(t,d))) = frac(((N - 1)+ (l_t - f_(t,d)) - 1)!)((N-2)!(l_t - f_(t,d))!)` (**)
- We can estimate `P_1` as the ratio (**)/(*). Unfortunately, this has lots of nasty factorials in it, which make it difficult to compute directly. Factorials can be estimated with approximations such as Stirling's formula. Using one of these, Amati and van Rijsbergen get the following estimates for `P_1` and `-log P_1`:
`P_1 = (frac(1)(1+l_t/N))(frac(l_t/N)(1 + l_t/N))^(f_(t,d))`
and
`-log P_1 = log(1 + l_t/N) + f_(t,d) cdot log(1 + N/l_t)`.
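Numerically, the closed form for `-log P_1` agrees with taking the logarithm of the `P_1` estimate directly. A small Python sketch (base-2 logarithms assumed, matching the bits interpretation of self-information; names are illustrative):

```python
import math

# Estimate of P_1 with lam = l_t / N, the mean number of
# occurrences of t per document.
def p1(l_t, N, f_td):
    lam = l_t / N
    return (1 / (1 + lam)) * (lam / (1 + lam)) ** f_td

# Closed form for the self-information -log2 P_1.
def neg_log_p1(l_t, N, f_td):
    lam = l_t / N
    return math.log2(1 + lam) + f_td * math.log2(1 + 1 / lam)

# The two computations agree up to floating-point error.
l_t, N, f_td = 1000, 500_000, 3
assert math.isclose(-math.log2(p1(l_t, N, f_td)), neg_log_p1(l_t, N, f_td))
```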
Estimating `P_2` and Computing DFR
- To estimate `P_2`, Amati and van Rijsbergen use Laplace's law of succession (rule of succession).
- Suppose we know an event can take one of two values (say the sun rises in the morning -- it might happen, it might not).
- We have observed that the event has occurred in the last `m-1` trials. What odds should we assign to it occurring in the `m`th trial, in the absence of any additional information?
- The rule of succession says that since we know both outcomes are possible, it is as if, in addition to our `m-1` successful trials, we had observed one trial of each outcome. Out of these `m+1` trials, `m` succeeded, so we should estimate the probability as `m/(m+1)`.
- Taking the `m - 1` observed successes to be the `f_(t,d) - 1` occurrences seen so far, we can estimate `P_2` as:
`frac(f_(t,d))(f_(t,d) + 1)`
- Using this we can estimate `(1 - P_2)(-log P_1)` as:
`frac(log(1 + l_t/N) + f_(t,d) log(1 + N/l_t))(f_(t,d) + 1)`
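Combining the two estimates, the per-term score `(1 - P_2)(-log P_1)` reduces to the fraction above, since `1 - P_2 = 1/(f_(t,d) + 1)`. A Python sketch (base-2 logarithms assumed; names are illustrative):

```python
import math

# Per-term DFR weight (1 - P_2)(-log P_1) with the estimates above:
# P_2 = f_td / (f_td + 1), so (1 - P_2) = 1 / (f_td + 1).
def dfr_term_weight(f_td, l_t, N):
    lam = l_t / N  # expected occurrences of t per document
    info = math.log2(1 + lam) + f_td * math.log2(1 + 1 / lam)
    return info / (f_td + 1)

# The weight still grows with f_td, but much more slowly than
# the raw self-information does.
assert dfr_term_weight(5, 100, 10_000) > dfr_term_weight(1, 100, 10_000)
```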
Document Length Normalization
- Our model so far assumes that all documents have the same length.
- Amati and van Rijsbergen suggest a normalization in which an adjusted `f'_(t,d)` replaces `f_(t,d)` in the last equation.
- They derive the following adjustment:
`f'_(t,d) = f_(t,d) cdot log(1 + l_(avg)/l_d)`
which boosts the frequency for documents shorter than average and dampens it for longer ones.
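Putting the pieces together, here is a sketch of a complete DFR scorer for a query against a single document, with the normalized `f'_(t,d)` substituted for `f_(t,d)` (base-2 logarithms assumed; the function and argument names are illustrative, not from the text):

```python
import math

# DFR score of a query against one document:
#   sum over query terms of q_t * (1 - P_2)(-log P_1),
# with f_td replaced by f' = f_td * log2(1 + l_avg / l_d).
def dfr_score(query_tf, doc_tf, l_d, l_avg, corpus_tf, N):
    score = 0.0
    for t, q_t in query_tf.items():
        f_td = doc_tf.get(t, 0)
        if f_td == 0:
            continue  # skip terms absent from the document (a common convention)
        f_norm = f_td * math.log2(1 + l_avg / l_d)  # length normalization
        lam = corpus_tf[t] / N  # mean occurrences of t per document
        info = math.log2(1 + lam) + f_norm * math.log2(1 + 1 / lam)
        score += q_t * info / (f_norm + 1)
    return score
```

Skipping absent terms keeps them from contributing the small positive score the raw formula would otherwise assign when `f_(t,d) = 0`.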
- With this correction, the book gives an experimental comparison of LMJM, LMD, and DFR on TREC45 and GOV2.
- LMD and DFR perform comparably, and both perform better than LMJM. Both also perform at about the same level as BM25.
In-Class Exercise
- Suppose we had a corpus of 500,000 previous queries and we are trying to rank them according to DFR with a query that someone is currently typing.
- So far they've typed: California Business.
- The average length of a query in our corpus of queries is 3.
- The number of occurrences of the terms "California", "Business", "Tax", and "Return" in the whole corpus are respectively 104, 501, 254, 607.
- Assuming we are using the document length normalization of the previous slide, how would the following queries score under DFR?
- California Business Tax
- California Business Tax Return
- Post your solution to the Apr 28 In-Class Exercise Thread.
Parallel Information Retrieval
- The next topic we are considering this semester is techniques for parallelizing Information Retrieval.
- The main reason why we want to do this is that a single computer just doesn't have the computational power to index and store the world-wide web.
- There are two aspects of this problem we will consider: how to handle queries in parallel, and how to do indexing and other related tasks in parallel.
Parallel Query Processing
- Two popular approaches to making search engines process queries faster are index partitioning and
replication.
- We refer to servers in our discussion as nodes.
- By making `n` replicas of the index and assigning each replica to a separate node, we can realize an `n`-fold increase in the search engine's service rate without affecting the time to process a single query.
- This is called inter-query parallelism -- multiple queries can be processed in parallel but each individual query is processed sequentially.
- Alternatively, we can split the index into `n` parts and have each node work only on its small part of the index. This is called intra-query parallelism because each query is processed by multiple servers in parallel.
- This approach can improve both the engine's service rate and the average time per query.
- We now look at the latter approach in more detail.
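As a rough illustration of intra-query parallelism (a toy sketch, not an implementation from the text), the following Python code has each "node" score a query against only its own index partition, with a broker merging the per-partition top-`k` lists; the shard contents and scores are made-up values:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Each shard stands in for one node's index partition, mapping
# doc id -> precomputed relevance score for the current query
# (a stand-in for a real per-node retrieval step).
def search_shard(shard, k):
    scored = [(score, doc) for doc, score in shard.items()]
    return heapq.nlargest(k, scored)  # this node's local top-k

# The broker fans the query out to all shards in parallel and
# merges the partial result lists into a global top-k.
def broker_search(shards, k):
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)

shards = [{"d1": 0.9, "d2": 0.4}, {"d3": 0.7}, {"d4": 0.2, "d5": 0.8}]
top = broker_search(shards, k=2)
print([doc for _, doc in top])  # prints ['d1', 'd5']
```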