Last week, we began exploring the language modeling approach to coming up with relevance measures.
One of our goals was to explain the divergence-from-randomness relevance measure.
We started off with the probability ranking principle and asked the basic question: how do we estimate whether a user will judge a document relevant, given the document and the query?
We reduced this to the question of ranking according to the probability `p(q|d)`.
That is, given a document `d`, we want to estimate how likely it is that someone would enter the query `q` to retrieve that document -- the document is thus treated as providing a model for generating queries.
Language Models
The simplest document language model is the maximum likelihood model:
`M_d^(ml)(t) = f_(t,d)/l_d`
where `f_(t,d)` is the frequency of the term `t` in the document and `l_d` is the document's length.
For example,
`M_(mbox(Hamlet))^(ml)(mbox(lord)) = 624/43314 = 1.44%`
`M_(mbox(Macbeth))^(ml)(mbox(lord)) = 78/26807 = 0.291%`
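As a quick sketch in Python (using a hypothetical toy document rather than the real Shakespeare counts), the maximum likelihood model is just relative term frequency:

```python
from collections import Counter

def ml_model(doc_tokens):
    """Maximum likelihood document model: M_d^ml(t) = f_{t,d} / l_d."""
    counts = Counter(doc_tokens)
    l_d = len(doc_tokens)
    return {t: f / l_d for t, f in counts.items()}

# Toy document: "to" occurs 2 times out of 6 tokens, so M_d^ml("to") = 1/3.
model = ml_model(["to", "be", "or", "not", "to", "be"])
```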
One drawback of using this as a model of `p(q|d)` is that `d` is just a single example of a relevant document and may not contain many words, so it can give a poor estimate.
It also assigns probability `0` to every term not appearing in the document, even though a person might well make a query involving terms not in, say, Hamlet.
Smoothing
People usually smooth the maximum likelihood model with a model generated from the collection as a whole:
`M_C(t) = l_t/l_C`
where `l_t` is the number of occurrences of `t` in the collection and `l_C` is the total length of the collection.
Two common smoothing techniques are Jelinek-Mercer smoothing (a linear combination):
`M_d^lambda(t) = (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t)`
and Dirichlet smoothing (imagine increasing the length of `d` by a factor `mu` according to the underlying corpus):
`M_d^mu(t) = frac(f_(t,d) + mu cdot M_C(t))(l_d + mu)`
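A minimal sketch of both smoothers, assuming `p_coll` holds the precomputed collection probability `M_C(t)` (the function names here are my own):

```python
def jm_smooth(f_td, l_d, p_coll, lam=0.5):
    """Jelinek-Mercer: M_d^lambda(t) = (1-lambda)*f_td/l_d + lambda*M_C(t)."""
    return (1 - lam) * (f_td / l_d) + lam * p_coll

def dirichlet_smooth(f_td, l_d, p_coll, mu=1000):
    """Dirichlet: M_d^mu(t) = (f_td + mu*M_C(t)) / (l_d + mu)."""
    return (f_td + mu * p_coll) / (l_d + mu)
```

Note that for a term absent from the document (`f_td = 0`) these give `lambda cdot M_C(t)` and `mu/(l_d + mu) cdot M_C(t)` respectively, a fact used later in the derivation.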
Quiz
Which of the following is true?
The I/O complexity of NO MERGE is quadratic in the number of tokens
Hybrid Index Maintenance can have non-linear performance
The log-odds transformation does not preserve rankings.
Ranking with Language Models
Our basic model deals with single terms.
We could calculate:
`p(q|d) = p(|q| = n) cdot prod_(i=1)^n p(t_i |d)`.
People usually ignore the query length, though, but do allow the same term to appear multiple times in the query, so we get:
`p(q|d) = prod_(i=1)^n p(t_i |d) = prod_(t in q) p(t |d)^(q_t)`
where `q_t` is the number of appearances of `t` in the query.
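In code, this product looks like the following sketch, where `p_t_given_d` is assumed to hold smoothed probabilities `p(t|d)`:

```python
from collections import Counter

def query_likelihood(query_tokens, p_t_given_d):
    """p(q|d) = product over distinct t in q of p(t|d)^{q_t}."""
    prob = 1.0
    for t, q_t in Counter(query_tokens).items():
        prob *= p_t_given_d.get(t, 0.0) ** q_t
    return prob

# A repeated query term raises its probability to the power q_t.
p = query_likelihood(["lord", "lord", "king"], {"lord": 0.01, "king": 0.02})
```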
Rewriting this with one of the language models on the last slide we have either:
`p(q | d) = prod_(t in q)( (1 - lambda) cdot M_d^(ml)(t) + lambda cdot M_C(t))^(q_t)`
or
`p(q | d) = prod_(t in q)(frac(f_(t,d) + mu cdot M_C(t))(l_d + mu))^(q_t)`
Massaging our equations
Let's take the log of our probability and split the sum into the terms that appear in the document and those that do not:
`log p(q|d) = log prod_(t in q) p(t |d)^(q_t) = sum_(t in q cap d) q_t cdot log p(t|d) + sum_(t in q\\d) q_t cdot log p(t|d)`
Let `M_d(t)` be either of our models and suppose we use it to estimate `p(t|d)`. When `t` is not in our document, this takes the form `alpha_d M_C(t)`, where `alpha_d` is either `lambda` or `mu/(l_d + mu)`.
If we substitute this back into our log equation above and do a half page of grungy algebra, we get:
`log p(q|d) = sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t)) + n log alpha_d + sum_(t in q) q_t cdot log M_C(t)`.
The last term is the same for all documents, so it can be dropped without affecting rankings, leaving:
`sum_(t in q cap d) q_t log frac(M_d(t))(alpha_d M_C(t)) + n log alpha_d`
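As a quick numerical sanity check of this algebra, here is a toy computation (all numbers made up) under Jelinek-Mercer smoothing, where `alpha_d = lambda`: the massaged form should equal the direct sum `sum_t q_t log p(t|d)` exactly.

```python
import math

# Toy numbers (hypothetical): one query term in the document, one not.
lam = 0.5
f = {"lord": 3, "king": 0}            # term frequencies f_{t,d}
l_d = 100                             # document length
p_C = {"lord": 0.01, "king": 0.02}    # collection model M_C(t)
q = {"lord": 1, "king": 1}            # query term counts q_t
n = sum(q.values())

def M_d(t):
    """Jelinek-Mercer smoothed model (so alpha_d = lam)."""
    return (1 - lam) * f[t] / l_d + lam * p_C[t]

# Direct form: sum over t in q of q_t * log p(t|d).
direct = sum(q_t * math.log(M_d(t)) for t, q_t in q.items())

# Massaged form: sum over t in q cap d, plus n*log(alpha_d),
# plus the document-independent sum over all of q.
massaged = (sum(q_t * math.log(M_d(t) / (lam * p_C[t]))
                for t, q_t in q.items() if f[t] > 0)
            + n * math.log(lam)
            + sum(q_t * math.log(p_C[t]) for t, q_t in q.items()))
```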
Substituting in a particular model
If we substitute in the Jelinek-Mercer smoothing model, we get:
`sum_(t in q cap d) q_t cdot log frac((1-lambda)f_(t,d)/l_d + lambda l_t/l_C)(lambda l_t/l_C) + n log lambda`
As the last term, `n log lambda`, is constant across documents, it too can be dropped, giving the final equation (after some rearranging):
`sum_(t in q) q_t log (1 + (1 - lambda)/lambda cdot f_(t,d)/l_d cdot l_C/l_t)`.
We could then rank documents according to this equation. This is called language modeling with Jelinek-Mercer smoothing (LMJM). People often take `lambda = 0.5` in this equation.
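Here is a sketch of an LMJM scorer, assuming `coll_freq[t]` gives `l_t` (occurrences of `t` in the collection) and `l_C` is the total collection length:

```python
import math
from collections import Counter

def lmjm_score(query_tokens, doc_tokens, coll_freq, l_C, lam=0.5):
    """sum over t in q of q_t * log(1 + (1-lam)/lam * f_td/l_d * l_C/l_t)."""
    f = Counter(doc_tokens)
    l_d = len(doc_tokens)
    score = 0.0
    for t, q_t in Counter(query_tokens).items():
        l_t = coll_freq.get(t, 0)
        if l_t == 0:
            continue  # term never seen in the collection: no contribution
        score += q_t * math.log(1 + (1 - lam) / lam * (f[t] / l_d) * (l_C / l_t))
    return score
```

Terms in the query but not in the document contribute `log(1 + 0) = 0`, which is why the final sum can range over all of `q`.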
For the Dirichlet model a similar substitution gives the following final equation:
`sum_(t in q) q_t log(1 + f_(t,d)/mu cdot l_C/l_t) - n cdot log(1 + l_d/mu)`
which is called language modeling with Dirichlet Smoothing (LMD).
A value of `mu = 1000` is often used.
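A matching sketch for LMD, again assuming `coll_freq[t]` gives `l_t` and `l_C` is the total collection length:

```python
import math
from collections import Counter

def lmd_score(query_tokens, doc_tokens, coll_freq, l_C, mu=1000):
    """sum over t in q of q_t*log(1 + f_td/mu * l_C/l_t) - n*log(1 + l_d/mu)."""
    f = Counter(doc_tokens)
    l_d = len(doc_tokens)
    n = len(query_tokens)
    score = -n * math.log(1 + l_d / mu)  # document-length penalty term
    for t, q_t in Counter(query_tokens).items():
        l_t = coll_freq.get(t, 0)
        if l_t == 0:
            continue  # term never seen in the collection: no contribution
        score += q_t * math.log(1 + (f[t] / mu) * (l_C / l_t))
    return score
```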
Kullback-Leibler Divergence
Kullback-Leibler Divergence is another approach to language modeling based on the relative entropy of the
language model determined by the query and that determined by the document.
For two discrete distributions `f` and `g` the KL divergence is given by the equation:
`sum_x f(x) cdot log frac(f(x))(g(x))`
Note that this quantity is not symmetric in `f` and `g`.
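A direct sketch of this formula for two finite distributions (represented as dicts over the same outcomes), which also illustrates the asymmetry:

```python
import math

def kl_divergence(f, g):
    """KL(f || g) = sum_x f(x) * log(f(x) / g(x)); 0 * log(0/g(x)) is taken as 0."""
    return sum(fx * math.log(fx / g[x]) for x, fx in f.items() if fx > 0)

f = {"a": 0.9, "b": 0.1}
g = {"a": 0.5, "b": 0.5}
# kl_divergence(f, g) and kl_divergence(g, f) give different values.
```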
Just as we had a language model for the document, we can imagine making a language model for the query. For example, the max likelihood model would be:
`M_q^(ml)(t) = q_t/n.`
Substituting our models into the KL divergence gives:
`sum_(t in V) M_q(t) cdot log frac(M_q(t))(M_d(t)) = sum_(t in V) M_q(t) cdot log M_q(t) - sum_(t in V) M_q(t) cdot log M_d(t)`.
The first summation is the same for all documents. The second summation, without the negative sign, increases with
decreasing divergence, so it can be used as a ranking formula. If we substitute in our maximum likelihood query model, we thus arrive at:
`sum_(t in V) M_q^(ml)(t) cdot log M_d(t) = 1/n sum_(t in q) q_t cdot log M_d(t)`.
If we drop the `1/n`, which does not affect the ranking, this actually reduces to the log of our starting point
a couple of slides back (i.e., the log of the following after substituting the model):
`p(q|d) = prod_(t in q) p(t |d)^(q_t)`
People often don't use `M_q^(ml)(t)` as the model of the query; instead, they do query expansion first and then continue with the derivation we just did.
Lafferty and Zhai (2001) do this, expanding the query using a random walk: pick a random query term according to the query model; randomly pick a document containing that term; using this document's model, select a term from the document; add the term to the query; then randomly choose whether to stop or continue.
Lafferty and Zhai report that this method performs better than the non-expanded model we have presented.
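The walk just described can be sketched roughly as follows. This is a simplified toy version for illustration, not Lafferty and Zhai's exact procedure; documents are token lists and the resulting counts form the expanded query:

```python
import random
from collections import Counter

def expand_query(query_tokens, docs, stop_prob=0.5, walks=200, seed=0):
    """Toy random-walk query expansion: query term -> document containing it
    -> term drawn from that document; repeat until a coin flip says stop."""
    rng = random.Random(seed)
    expanded = Counter(query_tokens)
    for _ in range(walks):
        t = rng.choice(query_tokens)              # pick a query term (ML query model)
        while True:
            containing = [d for d in docs if t in d]
            if not containing:
                break
            d = rng.choice(containing)            # pick a document with the term
            t = rng.choice(d)                     # pick a term from that document
            expanded[t] += 1                      # add the term to the query
            if rng.random() < stop_prob:          # stop or continue at random
                break
    return expanded
```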