Introduction

On Monday, we began exploring the language modeling approach to defining relevance measures.
Our goal was to derive the divergence-from-randomness relevance measure.
Today, we will continue to work towards this goal, deriving two relevance measures on our way and one technique for pseudo-relevance feedback. After the midterm, we will finish our derivation of DFR.
We started off with the probability ranking principle and asked the basic question: how do we estimate whether a user will judge a document relevant, given the document and the query?
We reduced this to the question of ranking according to the probability `p(Q|D)`.
That is, we want to estimate, given a document `d`, how likely it is that someone would enter the query `q` to retrieve that document. The document is thus treated as providing a model for generating queries.

Language Models

The simplest document language model is the maximum likelihood model:
`M_d^(ml)(t) = f_(t,d)/l_d`
where `f_(t,d)` is the frequency of the term `t` in the document and `l_d` is the document's length.
For example,
`M_(mbox(Hamlet))^(ml)(\l\o\r\d) = 624/43314 = 1.44%`
`M_(mbox(Macbeth))^(ml)(\l\o\r\d) = 78/26807 = 0.291%`
One drawback of using this as a model of `p(q|d)` is that `d` is just a single example of a relevant document and may not contain many words, so it might give a poor estimate.
It also assigns probability `0` to all terms not appearing in the document, even though a person could well make a query involving terms that are not in, say, Hamlet.
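
As a small illustration of the maximum likelihood model and its zero-probability problem, here is a minimal Python sketch. The function name `ml_model` and the toy document are made up for this example and are not part of the lecture.

```python
from collections import Counter

def ml_model(tokens):
    """Maximum likelihood model: M_d^ml(t) = f_{t,d} / l_d."""
    counts = Counter(tokens)   # f_{t,d} for every term in the document
    length = len(tokens)       # l_d
    return {term: freq / length for term, freq in counts.items()}

# Toy document (hypothetical, chosen only for illustration).
doc = "to be or not to be".split()
model = ml_model(doc)

print(model["to"])           # f = 2, l_d = 6
print(model.get("lord", 0))  # a term not in the document gets probability 0
```

Any query containing a term outside the document would get probability 0 under this model, which is the motivation for smoothing.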

Smoothing

People usually smooth the maximum likelihood model with a model generated from the collection as a whole:
`M_C(t) = l_t/l_C`
Here `l_t` is the number of occurrences of `t` in the whole corpus and `l_C` is the total length of the corpus.
Two common smoothing techniques are Jelinek-Mercer smoothing (linear combination):
`M_d^lambda(t) = (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t)`
and Dirichlet smoothing (imagine lengthening `d` by `mu` additional tokens distributed according to the underlying corpus model):
`M_d^mu(t) = frac(f_(t,d) + mu cdot M_C(t))(l_d + mu)`
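
The two smoothing formulas can be sketched in Python as follows. The function names, the default parameter values, and the toy counts are assumptions made for illustration only.

```python
def jelinek_mercer(f_td, l_d, l_t, l_C, lam=0.5):
    """M_d^lambda(t) = (1 - lambda) * f_{t,d}/l_d + lambda * l_t/l_C."""
    m_ml = f_td / l_d   # maximum likelihood document model
    m_c = l_t / l_C     # collection model
    return (1 - lam) * m_ml + lam * m_c

def dirichlet(f_td, l_d, l_t, l_C, mu=1000):
    """M_d^mu(t) = (f_{t,d} + mu * l_t/l_C) / (l_d + mu)."""
    m_c = l_t / l_C
    return (f_td + mu * m_c) / (l_d + mu)

# Hypothetical counts: a term seen 3 times in a 100-token document
# and 50 times in a 1,000,000-token corpus.
print(jelinek_mercer(3, 100, 50, 1_000_000))
print(dirichlet(3, 100, 50, 1_000_000))

# A term absent from the document still gets a nonzero probability.
print(jelinek_mercer(0, 100, 50, 1_000_000))
print(dirichlet(0, 100, 50, 1_000_000))
```

Note that with Dirichlet smoothing the weight given to the collection model, `mu/(l_d + mu)`, shrinks as the document gets longer, whereas with Jelinek-Mercer it is the fixed value `lambda`.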

In-Class Exercise

Let's see what happens as we vary the document length and `mu` when doing Dirichlet smoothing...

Suppose the term "star" appears twice in a document `d` of length 250. We smooth the language model of `d` using the Corpus of Contemporary English, in which "star" occurs 73,695 times among 450,000,000 words.

Compute `M_d^mu(t)` for the case when (a) `mu=100`, (b) `mu=1000`, (c) `mu=10000`.

Suppose "star" did not appear in `d` and that `mu=1000`. What would `M_d^mu(t)` be?

Post your answers to the Nov 7 Discussion Thread.

Ranking with Language Models

Our basic model deals with single terms.
We could calculate:
`p(q|d) = p(|q| = n) cdot prod_(i=1)^n p(t_i |d)`.
In practice, though, people usually ignore the query length. However, they do allow the same term to appear multiple times in the query, so we get:
`p(q|d) = prod_(i=1)^n p(t_i |d) = prod_(t in q) p(t |d)^(q_t)`
where `q_t` is the number of appearances of `t` in the query.
Rewriting this with one of the language models on the last slide we have either:
`p(q | d) = prod_(t in q)( (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t))^(q_t)`
or
`p(q | d) = prod_(t in q)(frac(f_(t,d) + mu cdot M_C(t))(l_d + mu))^(q_t)`
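
Here is a minimal sketch of ranking by query likelihood with the Dirichlet-smoothed model. It works with `log p(q|d)` rather than the raw product to avoid numerical underflow. The function name, the toy documents, and the small value `mu=10` are assumptions for illustration, and the sketch assumes every query term occurs somewhere in the corpus.

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, corpus_counts, corpus_len, mu=1000):
    """log p(q|d) = sum over t in q of q_t * log M_d^mu(t)."""
    f = Counter(doc)      # f_{t,d}
    q = Counter(query)    # q_t
    l_d = len(doc)
    total = 0.0
    for t, q_t in q.items():
        m_c = corpus_counts[t] / corpus_len    # M_C(t), assumed > 0
        p = (f[t] + mu * m_c) / (l_d + mu)     # Dirichlet-smoothed p(t|d)
        total += q_t * math.log(p)
    return total

# Toy collection (hypothetical).
docs = {
    "d1": "the star shines over the sea".split(),
    "d2": "the ship sails over the sea".split(),
}
corpus = [t for d in docs.values() for t in d]
corpus_counts = Counter(corpus)
query = "star sea".split()

ranking = sorted(
    docs,
    key=lambda name: log_query_likelihood(
        query, docs[name], corpus_counts, len(corpus), mu=10),
    reverse=True,
)
print(ranking)
```

In this toy collection `d1` contains both query terms while `d2` contains only one, so `d1` comes out on top.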

Massaging our equations

Let's take the log of our probability and split our equations into the terms that are in the document and those that are not:
`log p(q|d) = log prod_(t in q) p(t |d)^(q_t) = sum_(t in q cap d) q_t cdot log p(t|d) + sum_(t in q\\d) q_t cdot log p(t|d)`
Let `M_d(t)` be either of our models, and suppose we use it to estimate `p(t|d)`. When `t` is not in our document this takes the form `alpha_d M_C(t)`, where `alpha_d` is either `lambda` or `mu/(l_d + mu)`.
If we substitute this back into our log equation above, then do a half page of grunge, we get:
`log p(q|d) = sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t)) + n log alpha_d + sum_(t in q) q_t cdot log M_C(t)`.
The last term is the same for all documents, so it can be dropped without affecting rankings, and we get:
`sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t)) + n log alpha_d`
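
One way to sanity-check this rearrangement is numerically: for every document, the full `log p(q|d)` and the reduced score should differ by the same document-independent amount, `sum_(t in q) q_t cdot log M_C(t)`. The sketch below does this for Dirichlet smoothing. The function name, the toy documents, and `mu=10` are assumptions for illustration.

```python
import math
from collections import Counter

def full_and_reduced(query, doc, corpus_counts, corpus_len, mu=10):
    """Return (log p(q|d), reduced score) under Dirichlet smoothing."""
    f, q, l_d = Counter(doc), Counter(query), len(doc)
    n = sum(q.values())        # query length
    alpha = mu / (l_d + mu)    # alpha_d for Dirichlet smoothing

    def m_c(t):
        return corpus_counts[t] / corpus_len

    def m_d(t):
        return (f[t] + mu * m_c(t)) / (l_d + mu)

    full = sum(q_t * math.log(m_d(t)) for t, q_t in q.items())
    # Only terms that occur in both the query and the document contribute.
    reduced = sum(q_t * math.log(m_d(t) / (alpha * m_c(t)))
                  for t, q_t in q.items() if f[t] > 0) + n * math.log(alpha)
    return full, reduced

# Toy collection with documents of different lengths (hypothetical).
docs = [
    "the star shines over the sea".split(),
    "the ship sails".split(),
    "star star sea".split(),
]
corpus = [t for d in docs for t in d]
corpus_counts = Counter(corpus)
query = "star sea sea".split()

gaps = []
for d in docs:
    full, reduced = full_and_reduced(query, d, corpus_counts, len(corpus))
    gaps.append(full - reduced)
print(gaps)  # the same value (up to rounding) for every document
```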

Substituting in a particular model

If we substitute in the Jelinek-Mercer smoothing model we get:
`sum_(t in q cap d) q_t cdot log frac((1-lambda)f_(t,d)/l_d + lambda l_t/l_C)(lambda l_t/l_C) + n log lambda`
As the last term is constant, it can be dropped. Terms with `f_(t,d) = 0` contribute `log 1 = 0`, so the sum can be extended to all query terms, giving the final equation (after some rearranging):
`sum_(t in q) q_t log (1 + (1 - lambda)/lambda cdot f_(t,d)/l_d cdot l_C/l_t)`.
We could then rank documents according to this equation. This is called language modeling with Jelinek-Mercer smoothing (LMJM). People often take `lambda = 0.5` in this equation.
For the Dirichlet model a similar substitution gives the following final equation:
`sum_(t in q) q_t log(1 + f_(t,d)/mu cdot l_C/l_t) - n cdot log(1 + l_d/mu)`
which is called language modeling with Dirichlet smoothing (LMD).
A value of `mu = 1000` is often used.
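
The two final ranking formulas can be sketched directly in Python. The function names `lmjm` and `lmd` and the toy collection are assumptions for illustration, and the sketch assumes every query term occurs at least once in the corpus, so that `l_t > 0`.

```python
import math
from collections import Counter

def lmjm(query, doc, corpus_counts, corpus_len, lam=0.5):
    """sum_t q_t * log(1 + (1-lam)/lam * f_{t,d}/l_d * l_C/l_t)."""
    f, q, l_d = Counter(doc), Counter(query), len(doc)
    return sum(
        q_t * math.log(1 + (1 - lam) / lam
                       * f[t] / l_d * corpus_len / corpus_counts[t])
        for t, q_t in q.items()
    )

def lmd(query, doc, corpus_counts, corpus_len, mu=1000):
    """sum_t q_t * log(1 + f_{t,d}/mu * l_C/l_t) - n * log(1 + l_d/mu)."""
    f, q, l_d = Counter(doc), Counter(query), len(doc)
    n = sum(q.values())   # query length
    match = sum(
        q_t * math.log(1 + f[t] / mu * corpus_len / corpus_counts[t])
        for t, q_t in q.items()
    )
    return match - n * math.log(1 + l_d / mu)

# Toy collection (hypothetical).
d1 = "the star shines over the sea".split()
d2 = "the ship sails".split()
corpus = d1 + d2
corpus_counts = Counter(corpus)
query = "star sea".split()

for name, d in (("d1", d1), ("d2", d2)):
    print(name,
          lmjm(query, d, corpus_counts, len(corpus)),
          lmd(query, d, corpus_counts, len(corpus), mu=10))
```

A document containing no query terms scores exactly 0 under LMJM, while under LMD it scores `-n cdot log(1 + l_d/mu)`, so the Dirichlet formula keeps a document-length component even when nothing matches.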

Kullback-Leibler Divergence

Kullback-Leibler (KL) divergence gives another approach to language modeling, based on the relative entropy between the language model determined by the query and the one determined by the document.
For two discrete distributions `f` and `g` the KL divergence is given by the equation:
`sum_x f(x) cdot log frac(f(x))(g(x))`
In general, the KL divergence is not symmetric in `f` and `g`.
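
A short Python sketch makes the asymmetry concrete. The function name and the two toy distributions are made up for illustration. The sketch uses the natural logarithm (the base is an assumption, since the formula above does not fix one) and the usual convention that terms with `f(x) = 0` contribute nothing. It also assumes `g(x) > 0` wherever `f(x) > 0`.

```python
import math

def kl_divergence(f, g):
    """sum_x f(x) * log(f(x) / g(x)), skipping terms where f(x) = 0."""
    return sum(p * math.log(p / g[x]) for x, p in f.items() if p > 0)

# Two toy distributions over the same two outcomes (hypothetical).
f = {"a": 0.5, "b": 0.5}
g = {"a": 0.9, "b": 0.1}

print(kl_divergence(f, g))  # divergence of g from f
print(kl_divergence(g, f))  # swapping the arguments gives a different value
print(kl_divergence(f, f))  # a distribution has zero divergence from itself
```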
Just as we had a language model for the document, we can imagine making a language model for the query. For example, the maximum likelihood model would be:
`M_q^(ml)(t) = q_t/n`.
Substituting our models into the KL divergence gives:
`sum_(t in V) M_q(t) cdot log frac(M_q(t))(M_d(t)) = sum_(t in V) M_q(t) cdot log M_q(t) - sum_(t in V) M_q(t) cdot log M_d(t)`.
The first summation is the same for all docs. The second summation, without the negative sign, increases as the divergence decreases, so it can be used as a ranking formula. If we substitute in our maximum likelihood model for the query, we thus arrive at:
`sum_(t in V) M_q^(ml)(t) cdot log M_d(t) = 1/n sum_(t in q) q_t cdot log M_d(t)`.
If we drop the `1/n`, which does not affect the ranking, this reduces to the log of our starting point a couple of slides back (i.e., the log of the following after substituting the model):
`p(q|d) = prod_(t in q) p(t |d)^(q_t)`
People often don't use `M_q^(ml)(t)` as the model of the query, but instead do query expansion first, then continue with the derivation we just did.
Lafferty and Zhai (2001) do this, expanding the query using a random walk: pick a random query term according to the query model, then pick at random a doc containing that term, use this doc's model to select a term from the document, and add that term to the query. At random, choose whether to stop or continue.
Lafferty and Zhai report this method performs better than the non-query expanded model we have presented.
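
The random walk described above can be sketched as a toy Python loop. This is only an illustration of the steps as stated on this slide, not a reproduction of Lafferty and Zhai's actual method. The function name, the uniform choice among matching documents, the fixed original query model, the maximum likelihood document model, and the `stop_prob` parameter are all assumptions.

```python
import random
from collections import Counter

def expand_query(query, docs, stop_prob=0.5, rng=None):
    """Toy random-walk query expansion following the steps in the text."""
    rng = rng or random.Random()
    expanded = list(query)
    q_counts = Counter(query)
    terms = list(q_counts)
    weights = [q_counts[t] for t in terms]   # ML query model, up to 1/n
    while True:
        # 1. Pick a query term according to the (original) query model.
        t = rng.choices(terms, weights=weights)[0]
        # 2. Pick a document containing that term (uniformly: an assumption).
        candidates = [d for d in docs if t in d]
        if candidates:
            doc = rng.choice(candidates)
            # 3. Picking a uniform random token samples from the doc's ML model.
            expanded.append(rng.choice(doc))
        # 4. Randomly decide whether to stop or continue.
        if rng.random() < stop_prob:
            return expanded

# Toy collection (hypothetical).
docs = [
    "the star shines over the sea".split(),
    "the ship sails over the sea".split(),
]
query = "star sea".split()
print(expand_query(query, docs, rng=random.Random(0)))
```

The expanded query could then be turned into a query model `M_q(t)` and used in the KL divergence ranking formula in place of `M_q^(ml)(t)`.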
