Outline
- Hybrid Index Maintenance
- Logarithmic Merging
- In-Class Exercise
- BM25F, Pseudo-Relevance Feedback
- Modeling Relevance
Introduction
- On Monday, we were talking about two different strategies to maintain contiguous, dynamic inverted indexes:
IMMEDIATE MERGE and INPLACE updates
- We said that although IMMEDIATE MERGE keeps each postings list contiguous, the total update cost is quadratic in the number of tokens, so it is much slower than NO MERGE for index updates.
- With the INPLACE strategy, when we write our partition to disk, for each newly seen token whose postings list has size `b`, we allocate `k cdot b` space on disk for the list.
- Thereafter, when we write new partitions that contain the token, we write the new postings list fragment into the leftover space in the `k cdot b` block we initially allocated.
- If all `k cdot b` bytes fill up, we invalidate this block, let `b' = k cdot b`, allocate `k cdot b'` new space, and copy the postings there. And so on...
- Notice that writing a partition now involves a highly non-sequential disk access pattern to locate the existing postings blocks. For this reason, updates with this strategy also tend to be slow (a minimal sketch of the growth policy follows this list).
- So maybe a hybrid of IMMEDIATE MERGE and INPLACE might work better?
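- Before moving on, here is a minimal Python sketch of the INPLACE growth policy just described. The class and function names and the choice `k = 2` are illustrative assumptions, and blocks are modeled as Python objects rather than real disk regions.

```python
# Minimal sketch of the INPLACE growth policy (illustrative only).

K = 2  # pre-allocation factor k; the real value is a tuning parameter

class PostingsBlock:
    def __init__(self, capacity):
        self.capacity = capacity  # bytes reserved on "disk" for this list
        self.used = 0             # bytes consumed so far
        self.postings = []        # the postings themselves

def allocate_for_new_term(fragment, size):
    """First time we see a term with a size-b list, reserve k * b bytes."""
    block = PostingsBlock(K * size)
    return append_fragment(block, fragment, size)

def append_fragment(block, fragment, size):
    """Append a postings fragment; if the block overflows, relocate to a
    block k times larger (b' = k * b) and copy the postings over."""
    if block.used + size > block.capacity:
        bigger = PostingsBlock(K * block.capacity)  # invalidate old block
        bigger.postings = list(block.postings)
        bigger.used = block.used
        block = bigger
    block.postings.extend(fragment)
    block.used += size
    return block
```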
Hybrid Index Maintenance
- IMMEDIATE MERGE works well when the size of the new postings we are adding is not too much smaller than the entire posting list for the term.
- INPLACE works better when the number of disk accesses is kept low.
- The book describes a hybrid set-up where we have two indexes: a merge index and an in-place index.
- Initially, all new terms are placed into the merge index.
- When a list's size reaches a certain threshold (chosen relative to disk seek time; around 0.5 MB for typical hard drives), it is switched from the merge index to the in-place index.
- Since most terms never reach this threshold, for the majority of terms we use IMMEDIATE MERGE, and this is fast because the postings lists in question are all short.
- Since there are only a few longer postings lists, we don't need to perform that many seeks, and we achieve the performance benefits of INPLACE for these longer lists.
- The book shows this approach's time complexity is only slightly non-linear in the number of tokens (a sketch of the threshold-based routing follows).
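- To make the hybrid policy concrete, here is a minimal Python sketch of the routing decision. The index interfaces, the fixed posting width, and the exact threshold value are illustrative assumptions; the periodic re-merging of the merge index is omitted.

```python
# Sketch of hybrid maintenance (illustrative). Short lists live in the
# merge-maintained index; once a list crosses LONG_LIST_THRESHOLD bytes
# it is promoted to the in-place index and stays there.

LONG_LIST_THRESHOLD = 512 * 1024  # ~0.5 MB, roughly one seek's worth of data

def list_size_bytes(postings, bytes_per_posting=8):
    """Assume a fixed-width posting for this sketch."""
    return len(postings) * bytes_per_posting

def flush_partition(partition, merge_index, inplace_index):
    """Write an in-memory partition to disk under the hybrid policy.
    Indexes are modeled as dicts from term to a postings list."""
    for term, fragment in partition.items():
        if term in inplace_index:
            inplace_index[term].extend(fragment)  # in-place append
        else:
            merge_index.setdefault(term, []).extend(fragment)
            if list_size_bytes(merge_index[term]) > LONG_LIST_THRESHOLD:
                # Promote the now-long list out of the merge index.
                inplace_index[term] = merge_index.pop(term)
```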
Logarithmic Merging
- One drawback of the hybrid approach is that the non-linearity in its run-time is bounded by
`Theta(N^(1 + 1/alpha)/M)`, where `N` is the size of the collection, `M` is the amount of memory, and `alpha` depends on the Zipf-power of the collection's term distribution.
- So the behavior of this algorithm tends to be worse if we don't have much memory.
- To get better performance in low-memory settings, one typically has to settle for non-contiguous indexes of some sort.
- One strategy that yields relatively few non-contiguous indexes is to use logarithmic merging...
How Logarithmic Merging Works
- To begin, we imagine our index is split into generations `g`, which are whole numbers `1, 2, ...`
- When we write a partition, the postings go on disk into a generation 1 index.
- We then check whether there are two generation 1 indexes. If so, we merge them to make a generation 2 index and then delete the two generation 1 indexes.
- We then repeat the process: if we now have two generation 2 indexes, we merge them, and so on (see the sketch after this list).
- At any given time, the number of active generations is at most logarithmic in the number of partitions we have written.
- I.e., a small number, so to look up a token we can query each of these indexes and concatenate the resulting postings lists.
- The number of postings transferred from/to disk using this strategy is `Theta(N cdot log (N/M))`
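- Here is a minimal Python sketch of logarithmic merging under the assumptions that indexes are dicts from term to postings list and that merging is just per-term concatenation.

```python
# Sketch of logarithmic merging (illustrative). Each on-disk index carries
# a generation number; flushing a partition creates a generation-1 index,
# and two indexes of equal generation are merged in a cascade.

def merge(a, b):
    """Merge two indexes by concatenating postings term by term."""
    for term, postings in b.items():
        a.setdefault(term, []).extend(postings)
    return a

def flush(generations, partition):
    """Add an in-memory partition to the disk-resident generations.
    `generations` maps generation number -> index."""
    g, index = 1, partition
    while g in generations:                  # cascade: 1+1 -> 2, 2+2 -> 3, ...
        index = merge(generations.pop(g), index)
        g += 1
    generations[g] = index

def lookup(generations, term):
    """A term's full postings list is the concatenation across generations."""
    result = []
    for g in sorted(generations):
        result.extend(generations[g].get(term, []))
    return result
```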
In-Class Exercise
- Suppose we can index 50,000 documents before we run out of memory and need to merge with a disk-based index that uses logarithmic merging.
- We want to index a billion pages, using 6 machines, each responsible for the index of 1/6 of the documents.
- What's the largest generation number that will be used (assuming the in-memory generation is generation 0)?
- What is the most generations that will need to be merged when we have to flush the in-memory partition to disk?
- Post your answers to the Oct 30 In-Class Exercise Thread.
BM25F and Pseudo-Relevance Feedback
- We are now going to return to the topic of relevance measures for documents.
- For lack of time we are largely going to skip over the material in Chapter 8.
- Chapter 8 gives the derivation of BM25, which, you may recall, is given by the following equations for relevance (a direct Python transcription follows the equations):
`Score_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where
`IDF(t) = log(frac(N)(N_t))`, and
`TF_(BM25)(t,d) = frac(f_(t,d) cdot (k_1 + 1))(f_(t,d) + k_1 cdot ((1-b) + b cdot (l_d / l_(avg))))`
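- As a sketch, the equations above transcribe directly into Python; the parameter defaults `k_1 = 1.2` and `b = 0.75` are common choices rather than values from the slide.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """BM25 score of one document, following the equations above.
    doc_tf maps term -> f_{t,d}; df maps term -> N_t; N is the number of
    documents; doc_len and avg_len are l_d and l_avg."""
    score = 0.0
    for t in query_terms:
        f_td = doc_tf.get(t, 0)
        if f_td == 0 or t not in df:
            continue  # term contributes nothing to this document's score
        idf = math.log(N / df[t])
        tf = (f_td * (k1 + 1)) / (f_td + k1 * ((1 - b) + b * (doc_len / avg_len)))
        score += idf * tf
    return score
```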
- A couple of notions from Chapter 8 I'd still like to briefly mention are BM25F and pseudo-relevance feedback.
- The idea behind BM25F (F is for field weights) is to split the documents in our collection into different components; for example, title and body.
- We then calculate the BM25 score for just the title words and, similarly, the BM25 score for just the body text.
- We then take a weighted sum of these two scores (sketched below). The book gives an example where the title is weighted 10 and the body is weighted 1. On GOV2, BM25F doesn't perform much better, but for the TREC 2006 web tasks it yields about a 15-20% improvement in the top 10 results.
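- As a sketch, BM25F as described here can reuse the `bm25_score` function above, scoring each field separately and weighting; the per-field tuple layout is an assumed representation, not from the book.

```python
def bm25f_score(query_terms, fields, field_weights, df, N, k1=1.2, b=0.75):
    """Weighted sum of per-field BM25 scores, as described above.
    `fields` maps a field name ("title", "body") to a tuple
    (doc_tf, doc_len, avg_len) computed over that field alone."""
    total = 0.0
    for name, (doc_tf, doc_len, avg_len) in fields.items():
        total += field_weights[name] * bm25_score(
            query_terms, doc_tf, doc_len, avg_len, df, N, k1, b)
    return total

# The book's example weighting: titles count ten times as much as the body.
field_weights = {"title": 10, "body": 1}
```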
- Pseudo-relevance feedback is where we perform an initial query and take the top `m` results. Treating these results as if they were known to be relevant, we use our scoring system to select the expansion terms with the highest term-selection values.
- We then add these terms to our query and compute our results based on this expanded query (see the sketch below).
- The book shows this yields an improvement of about 20% over straight BM25 for P@10 and MAP scores on GOV2.
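- Below is a minimal Python sketch of the pseudo-relevance feedback loop. The term-selection value used here (document frequency in the top `m`, weighted by overall rarity) is a simple stand-in, not the book's exact formula; `score` can be any relevance function, e.g., the `bm25_score` sketch above.

```python
import math

def pseudo_relevance_feedback(query, docs, score, m=10, k=5):
    """Run `query`, take the top m documents, add the k expansion terms
    with the highest selection values, and re-rank with the expanded query.
    `docs` maps doc id -> list of terms; `score(query, doc_id)` is any
    relevance function."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    top = ranked[:m]                       # initial top-m results
    n_docs = len(docs)
    candidates = {}
    for d in top:
        for t in set(docs[d]):
            if t in query:
                continue
            df = sum(1 for x in docs if t in docs[x])  # document frequency
            # Stand-in selection value: frequent in top docs, rare overall.
            candidates[t] = candidates.get(t, 0.0) + math.log(n_docs / df)
    expansion = sorted(candidates, key=candidates.get, reverse=True)[:k]
    expanded = list(query) + expansion
    return sorted(docs, key=lambda d: score(expanded, d), reverse=True)
```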
Modeling Relevance
- We now begin our discussion of how to formally model relevance.
- The BM25 approach to this is a so-called probabilistic retrieval model; the approach we are going to begin looking at today (divergence-from-randomness) is called a language modeling approach.
- The starting point for our derivations is the so-called Probability Ranking Principle:
If an IR system's response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximized.
- To make this into a starting equation for a relevance measure, we ask instead the "Basic Question":
What is the probability that a user will judge this document relevant to this query?
- This can be represented as the equation:
`p(R=1 | D=d, Q=q)`
- We will often write `r` to mean `R=1` and `bar(r)` to mean `R=0`, and just `D` and `Q` for `D=d` and `Q=q`. This gives the shorter version of the above:
`p(r | D, Q)`
And we have `p(r | D, Q) = 1 - p(bar(r) | D, Q)`.
Log-Odds
- Recall Bayes' Theorem says `p(A|B) = frac(p(B|A)p(A))(p(B))`.
- So we can rewrite our starting equations as:
`p(r | D, Q) = (p(D,Q | r)p(r))/(p(D,Q))` and
`p(bar(r) | D, Q) = (p(D,Q | bar(r))p(bar(r)))/(p(D,Q))`.
- It is often convenient to switch from a formulation using probabilities, which range between 0 and 1 with 0.5 representing even odds, to a formulation where values range over `(-infty, infty)` with `0` representing even odds.
- This is called switching to log-odds and can be done using the transformation:
`logit(p) = log(p/(1-p))`.
- This transformation is rank/order-preserving (demonstrated in the sketch below).
- Applying it to our equation above gives:
`log(frac(p(r|D, Q))(1 - p(r|D,Q))) = log(frac(p(r|D, Q))(p(bar(r)|D,Q)))`
`= log(frac(p(D,Q | r)p(r))(p(D,Q | bar(r))p(bar(r))))`.
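- A quick numeric check in Python that `logit` preserves order; the probability values are made up for the example, not from the text.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds in (-inf, inf)."""
    return math.log(p / (1 - p))

# logit is strictly increasing, so ranking by probability and ranking
# by log-odds give the same order.
probs = [0.2, 0.5, 0.9]
print([round(logit(p), 3) for p in probs])  # [-1.386, 0.0, 2.197]
assert sorted(probs, key=logit) == sorted(probs)
```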
Generating Queries from Documents
- Using conditional probabilities, we have `p(D,Q | r) = p(Q | D, r) cdot p(D | r)`, and applying Bayes' Theorem again gives:
`log(frac(p(r|D, Q))(p(bar(r)|D,Q))) = log(frac(p(D,Q | r)p(r))(p(D,Q | bar(r))p(bar(r))))`
`= log(frac(p(Q | D, r)p(D| r)p(r))(p(Q | D, bar(r))p(D|bar(r))p(bar(r))))`
`= log(frac(p(Q | D, r)p(r |D))(p(Q | D, bar(r))p(bar(r)|D)))`
`= log p(Q | D, r) - log p(Q | D, bar(r)) + log(frac(p(r|D))(p(bar(r) |D)))`
`= log p(Q | D, r) - log p(Q | D, bar(r)) + logit(p(r|D))`
- `p(Q | D, r)` represents the probability that the user would enter the query `q` in order to retrieve document `d`.
- So it makes sense that if a term appears in `d` with a frequency higher than random chance it would be more likely to appear in the query.
- Next consider `p(Q | D, bar(r))`. This tells us something about the query based on an example of a document `d` that is not relevant to the user. This doesn't seem to tell us much about `Q`, and it is reasonable to assume it is independent of the particular non-relevant document `d`. This implies `p(Q | D, bar(r))` is constant, so we can drop it from our equation and still get a rank-equivalent formula:
`log p(Q | D, r) + logit(p(r|D))`
- The prior `p(r|D)` is often assumed to be the same for all documents, so it too can be dropped without affecting rankings.
- We are thus left with `p(Q | D, r)`. We often make the conditioning on relevance implicit, writing `p(Q|D)`, and so we want to estimate this probability for a particular query `q` and document `d`.
- This leads to the viewpoint of taking a document as a model for generating the query `q` (a first sketch is below), which we will continue talking about next week.
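- As a preview, here is a minimal sketch of estimating `p(Q|D)` with a unigram maximum-likelihood document model. This simple estimator is an assumption for illustration, not the model we will settle on, and it shows why smoothing will be needed.

```python
import math

def log_p_query_given_doc(query_terms, doc_tf, doc_len):
    """log p(Q | D) under a unigram maximum-likelihood model: each query
    term is generated independently with probability f_{t,d} / l_d.
    doc_tf maps term -> f_{t,d}."""
    log_p = 0.0
    for t in query_terms:
        f_td = doc_tf.get(t, 0)
        if f_td == 0:
            # Unsmoothed model: any query term absent from d zeroes p(Q|D).
            return float("-inf")
        log_p += math.log(f_td / doc_len)
    return log_p
```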