We are now going to return to the topic of relevance measures for documents.
For lack of time we are largely going to skip over the material in Chapter 8.
Chapter 8 gives the derivation of BM25 which, you may recall, is the following equation for relevance:
`"Score"_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
`TF_(BM25)(t,d) = frac(f_(t,d) cdot (k_1 + 1))(f_(t,d) + k_1 cdot ((1-b) + b cdot (l_d / l_(avg))))`
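As a quick illustration, here is a minimal Python sketch of this formula; the collection statistics (`N`, `df`, `doc_len`, `avg_len`, and the term frequency table `f`) are hypothetical inputs, not tied to any particular index:

```python
import math

def bm25_score(query_terms, doc_id, f, doc_len, avg_len, N, df, k1=1.2, b=0.75):
    """Score_BM25(q, d) = sum over t in q of IDF(t) * TF_BM25(t, d).

    f[(t, d)] is the frequency of term t in document d; df[t] is the
    number of documents containing t (N_t in the notes).
    """
    score = 0.0
    for t in set(query_terms):
        f_td = f.get((t, doc_id), 0)
        if f_td == 0 or t not in df:
            continue  # terms absent from d contribute a TF of 0 anyway
        idf = math.log(N / df[t])  # IDF(t) = log(N / N_t)
        tf = (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * doc_len[doc_id] / avg_len))
        score += idf * tf
    return score
```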
A couple of notions from Chapter 8 that I'd still like to briefly mention are BM25F and pseudo-relevance feedback.
The idea behind BM25F (F is for field weights) is to split the documents in our collection into different components, for example, title and body.
We then calculate the BM25 score for just the title words and, similarly, the BM25 score for just the body text.
We then take a weighted sum of these two scores. The book gives an example where titles are weighted 10 and bodies are weighted 1. On GOV2, BM25F doesn't perform much better, but on the TREC 2006 web tasks, BM25F yields about a 15-20% improvement in the top 10 results.
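In code the idea is just a weighted combination; a sketch, where the scorer functions (e.g. `title_bm25`, `body_bm25`) stand in for BM25 scorers restricted to each field:

```python
def bm25f_score(query_terms, doc_id, field_scorers, field_weights):
    """Weighted sum of per-field BM25 scores, as described above.

    field_scorers maps a field name (e.g. "title", "body") to a function
    that computes the BM25 score of doc_id using only that field's text.
    """
    return sum(weight * field_scorers[field](query_terms, doc_id)
               for field, weight in field_weights.items())

# The book's example weighting: titles count 10 times as much as bodies.
# bm25f_score(q, d, {"title": title_bm25, "body": body_bm25},
#             {"title": 10, "body": 1})
```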
Pseudo-relevance feedback is where we perform an initial query and take the top `m` results. We then treat these results as if they were relevant, score the terms appearing in them using a term selection value, and keep the highest scoring terms.
We then add these terms to our query and compute our final results based on this expanded query.
The book shows this yields an improvement of about 20% over straight BM25 for P@10 and MAP scores on GOV2.
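A sketch of this feedback loop, assuming hypothetical `search` and `term_selection_value` helpers (the book uses specific term selection values; any scoring of candidate terms fits the pattern):

```python
def pseudo_relevance_feedback(query, search, term_selection_value, m=20, k=10):
    """Run query, expand it with terms from its own top-m results, rerun.

    search(terms) -> ranked list of (doc_id, doc_terms) pairs.
    term_selection_value(term, feedback_docs) -> score for a candidate
    expansion term given the pseudo-relevant documents.
    """
    feedback_docs = search(query)[:m]          # pretend the top m are relevant
    candidates = {t for _, doc_terms in feedback_docs for t in doc_terms}
    expansion = sorted(candidates,
                       key=lambda t: term_selection_value(t, feedback_docs),
                       reverse=True)[:k]       # keep the k best terms
    return search(list(query) + expansion)     # rerun the expanded query
```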
Modeling Relevance
We now begin our discussion of how to formally model relevance.
The BM25 approach to this is a so-called probabilistic retrieval model; the approach we are going to begin looking at today is called a language modeling approach, and divergence from randomness is a related method from the same chapter.
The starting point for our derivations is the so-called Probability Ranking Principle: If an IR system's response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximized.
To make this into a starting equation for a relevance measure, we instead ask the "Basic Question": What is the probability that a user will judge this document relevant to this query?
This can be represented as the equation:
`p(R=1 | D=d, Q=q)`
We will often write `r` to mean `R=1` and `bar(r)` to mean `R=0`, and just `D` and `Q` for `D=d` and `Q=q`. This gives the shorter-to-write version of the above:
`p(r | D, Q)`
And we have `p(r | D, Q) = 1 - p(bar(r) | D, Q)`.
Applying Bayes' rule, we can rewrite these probabilities as:
`p(r | D, Q) = (p(D,Q | r)p(r))/(p(D,Q))` and
`p(bar(r) | D, Q) = (p(D,Q | bar(r))p(bar(r)))/(p(D,Q))`.
It is often convenient to switch from a formulation using probabilities, which lie between 0 and 1 with 0.5 representing even odds, to a formulation where values range over `(-infty, infty)` with `0` representing even odds.
This is called switching to log-odds and can be done using the transformation:
`logit(p) = log(p/(1-p))`.
This transformation is rank/order-preserving.
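A quick numeric sanity check of the transformation (nothing here is from the book; the probabilities are arbitrary):

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds in (-infty, infty)."""
    return math.log(p / (1 - p))

# Monotone increasing, hence rank-preserving; 0.5 maps to even odds, 0.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))   # -2.197, 0.0, 2.197
```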
Applying it to our equation above gives:
`log(frac(p(r|D, Q))(1 - p(r|D,Q))) = log(frac(p(r|D, Q))(p(bar(r)|D,Q)))`
`= log(frac(p(D,Q | r)p(r))(p(D,Q | bar(r))p(bar(r))))`.
Using the chain rule `p(D,Q | r) = p(Q | D, r) cdot p(D | r)` (and similarly for `bar(r)`), and noting `frac(p(D|r)p(r))(p(D|bar(r))p(bar(r))) = frac(p(r|D))(p(bar(r)|D))`, this equals:
`log p(Q | D, r) - log p(Q | D, bar(r)) + logit(p(r|D))`.
`p(Q | D, r)` represents the probability that the user would enter the query `q` in order to retrieve the relevant document `d`.
So it makes sense that if a term appears in `d` with a frequency higher than random chance, it would be more likely to appear in the query.
Next consider `p(Q | D, bar(r))`. This tells us something about the query based on an example of a document `d` which is not relevant to the user. That doesn't seem to tell us much about `Q`, and it is reasonable to assume it is independent of the particular non-relevant document `d`. This implies `p(Q | D, bar(r))` is constant, so we can drop it from our equation and still get a rank-equivalent formula:
`log p(Q | D, r) + logit(p(r|D))`
If we look at `p(r|D)`, it is often assumed to be the same for all documents, so it too can be dropped without affecting rankings.
We are thus left with `p(Q | D, r)`. We often make the conditioning on relevance implicit, giving `p(Q|D)`, and so we want to estimate this probability for a particular query `q` and document `d`.
This leads to the viewpoint of taking a document as a model for generating the query `q`.
Language Models
The simplest document language model is the maximum likelihood model:
`M_d^(ml)(t) = f_(t,d)/l_d`
where `f_(t,d)` is the frequency of the term in the document and `l_d` is the document's length.
For example,
`M_("Hamlet")^(ml)("lord") = 624/43314 ~~ 1.44%`
`M_("Macbeth")^(ml)("lord") = 78/26807 ~~ 0.291%`
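In code the model is a one-liner; here it is checked against the Shakespeare counts above:

```python
def ml_model(f_td, l_d):
    """Maximum likelihood estimate: M_d^ml(t) = f_{t,d} / l_d."""
    return f_td / l_d

print(ml_model(624, 43314))  # Hamlet, "lord": ~0.0144
print(ml_model(78, 26807))   # Macbeth, "lord": ~0.0029
```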
One drawback of using this as a model of `p(q|d)` is that `d` is just a single example of a relevant document and may not contain many words, so it might give a poor estimate.
It also assigns probability `0` to all terms not appearing in the document, even though it is possible for a person to make a query involving terms not in, say, Hamlet.
Smoothing
People usually smooth the maximum likelihood model with a model generated from the collection as a whole:
`M_C(t) = l_t/l_C`
Here `l_t` is the number of occurrences of `t` in the whole corpus and `l_C` is the total length of (number of tokens in) the corpus.
Two common smoothing techniques are Jelinek-Mercer smoothing (a linear combination):
`M_d^lambda(t) = (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t)`
and Dirichlet smoothing (imagine extending `d` with `mu` additional tokens distributed according to the underlying corpus model):
`M_d^mu(t) = frac(f_(t,d) + mu cdot M_C(t))(l_d + mu)`
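Both smoothers as small Python functions (a sketch; `l_t` and `l_C` are the corpus statistics defined above):

```python
def jelinek_mercer(f_td, l_d, l_t, l_C, lam=0.5):
    """M_d^lambda(t): linear mix of the document and corpus models."""
    return (1 - lam) * (f_td / l_d) + lam * (l_t / l_C)

def dirichlet(f_td, l_d, l_t, l_C, mu=1000):
    """M_d^mu(t): as if d were extended by mu corpus-distributed tokens."""
    return (f_td + mu * (l_t / l_C)) / (l_d + mu)
```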
Quiz
Which of the following is true?
LLRUN is an example of a global parametric gap compression technique.
Remerging an inverted index is always faster than rebuilding it.
Hybrid index maintenance and logarithmic merging are two names for the same thing.
Ranking with Language Models
Our basic model deals with single terms.
We could calculate:
`p(q|d) = p(|q| = n) cdot prod_(i=1)^n p(t_i |d)`.
In practice, though, people usually ignore the query length term. They do, however, allow the same term to appear multiple times in the query, so we get:
`p(q|d) = prod_(i=1)^n p(t_i |d) = prod_(t in q) p(t |d)^(q_t)`
where `q_t` is the number of appearances of `t` in the query.
Rewriting this with one of the language models from the last slide, we have either:
`p(q | d) = prod_(t in q)( (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t))^(q_t)`
or
`p(q | d) = prod_(t in q)(frac(f_(t,d) + mu cdot M_C(t))(l_d + mu))^(q_t)`
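Putting these together, a sketch of the Dirichlet-smoothed query likelihood (the Jelinek-Mercer version just swaps in the other model):

```python
from collections import Counter

def query_likelihood(query_terms, doc_counts, l_d, corpus_counts, l_C, mu=1000):
    """p(q|d) = product over distinct t in q of M_d^mu(t) ** q_t."""
    p = 1.0
    for t, q_t in Counter(query_terms).items():
        m_c = corpus_counts.get(t, 0) / l_C                # M_C(t)
        m_d = (doc_counts.get(t, 0) + mu * m_c) / (l_d + mu)
        p *= m_d ** q_t                                    # p(t|d)^(q_t)
    return p
```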
Massaging our equations
Let's take the log of our probability and split the sum into the terms that are in the document
and those that are not:
`log p(q|d) = log prod_(t in q) p(t |d)^(q_t) = sum_(t in q cap d) q_t cdot log p(t|d) + sum_(t in q\\d) q_t cdot log p(t|d)`
Let `M_d(t)` be either of our models, and suppose we use it to estimate `p(t|d)`. When `t` is not in our document this takes the form `alpha_d M_C(t)`, where `alpha_d` is either `lambda` or `mu/(l_d + mu)`.
If we substitute this back into our log equation above, then add and subtract `sum_(t in q cap d) q_t cdot log(alpha_d M_C(t))` (the half page of grinding in the book), we get:
`log p(q|d) = sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t)) + n log alpha_d + sum_(t in q) q_t cdot log M_C(t)`.
The last term is constant for all documents, so can be dropped without affecting rankings and we get:
`[sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t))] + n log alpha_d`
Substituting in a particular model
If we substitute in the Jelinek-Mercer smoothing model we get:
`sum_(t in q cap d) q_t cdot log frac((1-lambda)f_(t,d)/l_d + lambda l_t/l_C)(lambda l_t/l_C) + n log lambda`
As the last term is constant, it can be dropped, giving (after some rearranging) the final equation:
`sum_(t in q) q_t log (1 + (1 - lambda)/lambda cdot f_(t,d)/l_d cdot l_C/l_t)`.
We could then rank documents according to this equation. This is called language modeling with Jelinek-Mercer smoothing (LMJM). People often take `lambda = 0.5` in this equation.
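The LMJM formula as a scoring function, a sketch assuming we have per-document and corpus term counts:

```python
import math
from collections import Counter

def lmjm_score(query_terms, doc_counts, l_d, corpus_counts, l_C, lam=0.5):
    """Rank-equivalent LMJM score from the formula above."""
    score = 0.0
    for t, q_t in Counter(query_terms).items():
        f_td, l_t = doc_counts.get(t, 0), corpus_counts.get(t, 0)
        if f_td == 0 or l_t == 0:
            continue  # log(1 + 0) = 0, so absent terms add nothing
        score += q_t * math.log(1 + (1 - lam) / lam * (f_td / l_d) * (l_C / l_t))
    return score
```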
For the Dirichlet Model a similar substitution gives the following final equation:
`[sum_(t in q) q_t log(1 + f_(t,d)/mu cdot l_C/l_t)] - n cdot log(1 + l_d/mu)`
which is called language modeling with Dirichlet Smoothing (LMD).
A value of `mu = 1000` is often used.
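Similarly for LMD, including the document length penalty term (again a sketch with assumed count tables):

```python
import math
from collections import Counter

def lmd_score(query_terms, doc_counts, l_d, corpus_counts, l_C, mu=1000):
    """Rank-equivalent LMD score from the formula above."""
    n = len(query_terms)                   # query length, counting repeats
    score = 0.0
    for t, q_t in Counter(query_terms).items():
        f_td, l_t = doc_counts.get(t, 0), corpus_counts.get(t, 0)
        if f_td == 0 or l_t == 0:
            continue
        score += q_t * math.log(1 + (f_td / mu) * (l_C / l_t))
    return score - n * math.log(1 + l_d / mu)
```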
Kullback-Leibler Divergence
Kullback-Leibler (KL) divergence is another approach to language modeling, based on the relative entropy between the
language model determined by the query and the one determined by the document.
For two discrete distributions `f` and `g` the KL divergence is given by the equation:
`sum_x f(x) cdot log frac(f(x))(g(x))`
The value of this expression is not symmetric in `f` and `g` in general.
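A direct transcription of the formula, with a small check of the asymmetry (the two distributions are made up):

```python
import math

def kl_divergence(f, g):
    """D(f || g) = sum over x of f(x) * log(f(x) / g(x))."""
    return sum(p * math.log(p / g[x]) for x, p in f.items() if p > 0)

f = {"a": 0.9, "b": 0.1}
g = {"a": 0.5, "b": 0.5}
print(kl_divergence(f, g), kl_divergence(g, f))  # ~0.368 vs ~0.511: asymmetric
```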
Just as we had a language model for the document, we can imagine making a language model for the query. For example, the max likelihood model would be:
`M_q^(ml)(t) = q_t/n.`
Substituting in models into the KL divergence gives:
`sum_(t in V) M_q(t) cdot log frac(M_q(t))(M_d(t)) = sum_(t in V) M_q(t) cdot log M_q(t) - sum_(t in V) M_q(t) cdot log M_d(t)`.
The first summation is the same for all documents. The second summation, without the negative sign, increases as
divergence decreases, so it can be used as a ranking formula. If we substitute in our maximum likelihood model, we thus arrive at:
`sum_(t in V) M_q^(ml)(t) cdot log M_d(t) = 1/n sum_(t in q) q_t cdot log M_d(t)`.
If we drop the `1/n`, as it does not affect the ranking, this actually reduces to the log of our starting point
from a couple of slides back (i.e., the log of the following, after substituting the model):
`p(q|d) = prod_(t in q) p(t |d)^(q_t)`
People often don't use `M_q^(ml)(t)` as the model of the query, but instead do query expansion first, then continue with the derivation we just did.
Lafferty and Zhai (2001) do this, expanding the query using a random walk: pick a random query term according to the query model, then pick at random a document containing that term, select a term from that document using its language model, and add the term to the query. Then choose at random whether to stop or continue.
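A sketch of one such walk as I read the description; the sampling helpers and the stopping probability are assumptions, not Lafferty and Zhai's exact formulation:

```python
import random

def random_walk_expansion(sample_query_term, docs_with_term,
                          sample_doc_term, stop_prob=0.5):
    """One run of the expansion walk described above.

    sample_query_term() -> a term drawn from the query model.
    docs_with_term(t)   -> the documents containing term t.
    sample_doc_term(d)  -> a term drawn from document d's language model.
    """
    expansion = []
    while True:
        t = sample_query_term()                 # pick a random query term
        d = random.choice(docs_with_term(t))    # pick a doc containing it
        expansion.append(sample_doc_term(d))    # draw a term from d's model
        if random.random() < stop_prob:         # randomly stop or continue
            return expansion
```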
Lafferty and Zhai report this method performs better than the non-query expanded model we have presented.