Higher-order Models

Let's recall the notion of higher-order language model we introduced on Monday ...

You can imagine using a language model to generate random text by picking a sequence of words according to the probabilities given by the model.
This would tend to produce gibberish, because high-probability words like "the" and "our" would occur often in this random text, yet they almost never occur next to each other in real English.
Higher-order language models, which look at sequences of terms, can solve the problem above. The model estimated using MLE can be considered a zeroth-order model.
A first-order language model consists of conditional probabilities that depend on the previous symbol. For example: `M_1(sigma_2 | sigma_1) = frac(mbox(frequency)(sigma_1 sigma_2))(sum_(sigma' in V)(mbox(frequency)(sigma_1 sigma')))`
A first-order language model for terms is equivalent to the zero-order model for term bi-grams (pairs) estimated using MLE:
`M_1(sigma_2 | sigma_1) = frac(M_0(sigma_1 sigma_2))(sum_(sigma' in V)(M_0(sigma_1 sigma')))`.
An `n`th order model can be expressed in terms of a zero-order `(n+1)`-gram model:
`M_n(sigma_(n+1) | sigma_1 ... sigma_n) = frac(M_0(sigma_1 ... sigma_(n+1)))(sum_(sigma' in V)(M_0(sigma_1 ... sigma_n sigma')))`.
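As a rough illustration of the first-order formula above, here is a minimal Python sketch that estimates `M_1` from bi-gram frequencies. The function name `first_order_model` and the toy token list are assumptions made for this example; they are not part of the lecture or the Shakespeare corpus.

```python
from collections import Counter

def first_order_model(tokens):
    """Estimate M_1(s2 | s1) from adjacent-pair frequencies in tokens."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Denominator of the formula: total frequency of bi-grams starting with s1.
    prefix_counts = Counter()
    for (s1, _s2), count in bigram_counts.items():
        prefix_counts[s1] += count

    def m1(s2, s1):
        if prefix_counts[s1] == 0:
            return 0.0  # s1 never starts a bi-gram in this text
        return bigram_counts[(s1, s2)] / prefix_counts[s1]

    return m1

# Toy text (an assumption for illustration only).
tokens = "to be or not to be that is the question".split()
m1 = first_order_model(tokens)
print(m1("be", "to"))    # 1.0: "to" is always followed by "be" here
print(m1("or", "be"))    # 0.5: "be" is followed by "or" once and "that" once
print(m1("the", "our"))  # 0.0: the bi-gram "our the" never appears
```

Note that an unseen bi-gram gets probability 0 under this estimate, which is the issue smoothing is meant to address.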

Example

Consider the phrase "first witch" in the Shakespeare corpus.
This occurs, for instance, to identify which of three witches is speaking in Macbeth.
The phrase appears 23 times, whereas "first" appears 1,349 times. There are a total of 912,051 bi-grams in Shakespeare.
So we have:
`M_0(mbox("first witch")) = frac(23)(912,051) approx 0.0025%`
`M_1(mbox("witch") | mbox("first")) = frac(23)(1349) approx 1.7%`.
On the other hand, the first-order model assigns probability 0 to the phrase "our the", because the phrase never appears in the text.

Smoothing

Notice that although "our the" does not occur in usual speech, we have been using it a whole bunch in our discussion.
So maybe in the real world the probability of this bi-gram should be small but non-zero.
Other phrases like "fourth witch" seem plausible, but because they don't appear as bi-grams in the Shakespeare corpus, they would be assigned probability `0` in our first-order model.
One way to modify our first-order model to improve it for these cases is to smooth it with the zero-order model.
We define a smoothed first-order `M_1'` as:
`M_1'(sigma_2 | sigma_1) = gamma cdot M_1(sigma_2 | sigma_1) + (1 - gamma) cdot M_0(sigma_2)`
and
`M_0'(sigma_1 sigma_2) = gamma cdot M_0(sigma_1 sigma_2) + (1 - gamma)cdot M_0(sigma_1) cdot M_0(sigma_2)`,
where `0 le gamma le 1` is a smoothing parameter.
As an example, if `gamma = 0.5`, the zero-order, smoothed bi-gram model gives "first witch" a probability now of 0.0013% -- less than before.
This is because it now assigns bi-grams like "fourth witch" a non-zero probability, in this case 0.00000030%.
Often we will smooth a model built from a small collection, say the works of one author, with a model built from a larger collection, say a big corpus of English text. This might be done with an equation like:
`M_(S,0)' = gamma cdot M_(S,0) + (1 - gamma) cdot M_(L,0)`.
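The smoothed first-order formula for `M_1'` can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `smoothed_first_order`, the toy text, and the choice `gamma = 0.5` are made up for the example.

```python
from collections import Counter

def smoothed_first_order(tokens, gamma):
    """Return M_1'(s2 | s1) = gamma * M_1(s2 | s1) + (1 - gamma) * M_0(s2)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    prefix_counts = Counter()
    for (s1, _s2), count in bigram_counts.items():
        prefix_counts[s1] += count
    total = len(tokens)

    def m0(s):
        # Zero-order (MLE) probability of a single term.
        return unigram_counts[s] / total

    def m1(s2, s1):
        # Unsmoothed first-order probability; 0 if s1 never starts a bi-gram.
        if prefix_counts[s1] == 0:
            return 0.0
        return bigram_counts[(s1, s2)] / prefix_counts[s1]

    def m1_smoothed(s2, s1):
        return gamma * m1(s2, s1) + (1 - gamma) * m0(s2)

    return m1_smoothed

# Toy text and gamma are assumptions for illustration only.
tokens = "to be or not to be that is the question".split()
m1s = smoothed_first_order(tokens, gamma=0.5)
print(round(m1s("be", "to"), 3))        # 0.6  = 0.5 * 1.0 + 0.5 * 0.2
print(round(m1s("question", "to"), 3))  # 0.05 = 0.5 * 0.0 + 0.5 * 0.1
```

The second call shows the effect of smoothing: "to question" never occurs in the toy text, yet it now receives a small non-zero probability.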

Markov Models

The above picture is an example of a Markov Model, another important method for representing term distributions.
A Markov Model is essentially a finite-state machine augmented with transition probabilities.
When used as a language model, each transition is labeled with a term, in addition to a probability.
Following a transition corresponds to predicting or generating that term.
Starting in state `1`, the string "to be or not to be" could be generated by following the state sequence:
`1 -> 2 -> 3 -> 4 -> 1 -> 2 -> 3`.
The associated probability would be: `0.40 times 0.30 times 0.20 times 0.10 times 0.40 times 0.30 = 0.029%`.
Missing transitions are equivalent to transitions with `0` probability.
The probability predicted depends on the starting state. If we had started in state `4` rather than `1`, the probability of "to be or not to be" would have been: `0.90 times 0.30 times 0.20 times 0.10 times 0.40 times 0.30 = 0.065%`.
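To make the computation concrete, here is a small Python sketch that multiplies the transition probabilities along a path. The table lists only the labeled transitions mentioned in these notes (the full diagram has more), and the entry for "to" out of state 4 is inferred from the transition matrix rather than read off the picture. The names `TRANSITIONS` and `string_probability` are assumptions for this example, and the sketch assumes at most one transition per (state, term) pair.

```python
# (state, term) -> (next state, probability).
# Partial table: only the transitions discussed in the notes are included.
TRANSITIONS = {
    (1, "to"): (2, 0.40),
    (2, "be"): (3, 0.30),
    (3, "or"): (4, 0.20),
    (4, "not"): (1, 0.10),
    (4, "to"): (2, 0.90),  # inferred from the 4 -> 2 matrix entry
}

def string_probability(terms, start_state):
    """Probability of generating terms when starting in start_state."""
    state, probability = start_state, 1.0
    for term in terms:
        if (state, term) not in TRANSITIONS:
            return 0.0  # a missing transition has probability 0
        state, p = TRANSITIONS[(state, term)]
        probability *= p
    return probability

phrase = "to be or not to be".split()
print(round(string_probability(phrase, 1), 6))  # 0.000288, i.e. about 0.029%
print(round(string_probability(phrase, 4), 6))  # 0.000648, i.e. about 0.065%
```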

More Markov Models

An `n`-state Markov model can be represented by an `n times n` transition matrix `M`, where `M[i][j]` gives the probability of transitioning from state `i` to state `j`.
The transition matrix corresponding to the Markov model of the last slide is:
` ((0.00, 0.40, 0.00, 0.60), (0.70, 0.00, 0.30, 0.00), (0.00, 0.80, 0.00, 0.20), (0.10, 0.90, 0.00, 0.00))`
Given a transition matrix, we can compute the outcome of a transition by multiplying a state vector representing the current state by the transition matrix. For example, suppose we start in state 1, represented by the vector `(1, 0, 0, 0)`. After one step, we have:
`(1, 0, 0, 0) cdot ((0.00, 0.40, 0.00, 0.60), (0.70, 0.00, 0.30, 0.00), (0.00, 0.80, 0.00, 0.20), (0.10, 0.90, 0.00, 0.00)) = (0.00, 0.40, 0.00, 0.60)`
Multiplying again by the matrix gives us the state after two steps: `(0.00, 0.40, 0.00, 0.60) cdot ((0.00, 0.40, 0.00, 0.60), (0.70, 0.00, 0.30, 0.00), (0.00, 0.80, 0.00, 0.20), (0.10, 0.90, 0.00, 0.00)) = (0.34, 0.54, 0.12, 0.00)`
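The two multiplications can be checked with a short Python sketch. It uses plain lists rather than a matrix library, and the helper name `step` is an assumption for this example.

```python
# Transition matrix of the Markov model; M[i][j] is the probability of
# moving from state i+1 to state j+1 (Python lists are 0-indexed).
M = [
    [0.00, 0.40, 0.00, 0.60],
    [0.70, 0.00, 0.30, 0.00],
    [0.00, 0.80, 0.00, 0.20],
    [0.10, 0.90, 0.00, 0.00],
]

def step(state, matrix):
    """Return the row vector state multiplied by matrix."""
    n = len(matrix)
    return [sum(state[i] * matrix[i][j] for i in range(n)) for j in range(n)]

after_one = step([1, 0, 0, 0], M)
after_two = step(after_one, M)
print([round(x, 2) for x in after_one])  # [0.0, 0.4, 0.0, 0.6]
print([round(x, 2) for x in after_two])  # [0.34, 0.54, 0.12, 0.0]
```

Each row of `M` sums to 1, so each resulting vector is again a probability distribution over the four states.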

HW1 -- Exercise 1.4

Exercise 1.4 Starting in an unknown state, the Markov model above generates "to be". What state or states could be the current state of the model after generating this text?

Answer. In the above diagram, only the states 1 and 4 have a non-zero probability transition on the word "to". For both of these states, there is exactly one non-zero probability transition on this word, and it goes to state 2. From state 2, there is exactly one non-zero probability transition on the word "be" and it is to state 3. Therefore, starting in an unknown state, if the Markov model generates "to be", it must be in state 3.
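The same reasoning can be sketched in Python by tracking the set of states the model could be in after each term. The table below contains only the transitions on "to" and "be" named in the answer; the names `TERM_TRANSITIONS` and `possible_states` are assumptions for this example.

```python
# (state, term) -> next state, for the transitions named in the answer.
TERM_TRANSITIONS = {
    (1, "to"): 2,
    (4, "to"): 2,
    (2, "be"): 3,
}

def possible_states(terms, all_states=(1, 2, 3, 4)):
    """States the model could be in after generating terms from an unknown start."""
    current = set(all_states)
    for term in terms:
        # Keep only states that can emit this term, and follow the transition.
        current = {
            TERM_TRANSITIONS[(state, term)]
            for state in current
            if (state, term) in TERM_TRANSITIONS
        }
    return current

print(possible_states(["to"]))        # {2}
print(possible_states(["to", "be"]))  # {3}
```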

Test Collections

Researchers have developed many substantial collections for evaluating IR systems.
Many of these collections have been created as part of TREC, a series of evaluation efforts conducted annually since 1991 by the U.S. National Institute of Standards and Technology (NIST).
TREC provides a forum for researchers to test their IR systems on a broad range of problems. For example, more than 100 groups from universities, industry, and government participated in TREC 2007.
Each year TREC experiments are structured into six or seven tracks, each devoted to a different area of information retrieval.
Examples include enterprise search, genomic IR, legal discovery, e-mail spam filtering, and blog search.
TREC focuses researchers on particular common problems and it provides a set of reusable tests that researchers can conduct experiments with.
Europe has two similar conferences: INEX for XML and CLEF for multi-lingual support. Japan has NTCIR for Asian-language IR, and India has FIRE.

TREC Tasks

Basic search tasks, in which the IR system returns a ranked list from a static set of documents, are called adhoc tasks in TREC.
Along with a set of documents, a test collection for an adhoc task consists of a set of topics, from which queries may be created, and a set of relevance judgments (known as "qrels files" or "qrels"), indicating which documents are relevant or not relevant to each topic.

Document collections for older TREC tasks, before 2000, were often drawn from sources such as newspaper articles. For example, you might have a collection of documents like:

<DOC>
<DOCNO>LA051990-0141</DOCNO>
<HEADLINE>COUNCIL VOTES TO EDUCATE DOG OWNERS</HEADLINE>
<P>
The City Council stepped carefully around enforcement of the dog-curbing
ordinance this week, vetoing the use of police to enforce the law.
</P>
...
</DOC>

Each document is surrounded by DOC tags and DOCNO serves as a unique identifier for the document. This makes it easy to combine collections.
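A minimal Python sketch of how such a file might be split into documents keyed by DOCNO is given below. It assumes well-formed tags (each `<DOC>` closed by `</DOC>` and each `<DOCNO>` closed by `</DOCNO>`) and uses regular expressions rather than a real SGML parser. The function name `split_documents` and the second sample document are made up for illustration.

```python
import re

# Non-greedy matches so that each <DOC>...</DOC> pair is handled separately.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)

def split_documents(text):
    """Yield (docno, body) pairs for each DOC element found in text."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno:  # skip documents without an identifier
            yield docno.group(1), body.strip()

sample = """
<DOC>
<DOCNO>LA051990-0141</DOCNO>
<HEADLINE>COUNCIL VOTES TO EDUCATE DOG OWNERS</HEADLINE>
</DOC>
<DOC>
<DOCNO>EXAMPLE-0001</DOCNO>
<HEADLINE>A MADE-UP SECOND DOCUMENT</HEADLINE>
</DOC>
"""

docs = dict(split_documents(sample))
print(list(docs))  # ['LA051990-0141', 'EXAMPLE-0001']
```

Because each DOCNO is unique, collections can be combined by simply merging dictionaries like `docs`.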
For newer collections, the documents for adhoc tasks are often taken from the web.
Until 2009, GOV2, which consisted of 25,000,000 U.S. Government web pages from early 2004, was the largest collection used in TREC. It took about 426GB of storage.
Since 2009, the ClueWeb09 data set has been the largest used by TREC.
It consists of an open web crawl of 1 billion web pages and takes 5TB of compressed disk space, or 25TB uncompressed.
In a given year, each track consists of about 50 new topics. Participants are required to freeze their systems before downloading the topics. They can then create queries from the topics, run these queries against the document set, and return ranked lists of results for NIST evaluation.

Open-Source IR Systems

We next look at some open-source IR systems.
Some well-known ones are:
- Lucene - this is the indexing library associated with Nutch/Lucene/Solr. It is written in Java and was developed by Doug Cutting beginning in 1997. It is now a project of the Apache Foundation. It is used by Wikipedia. We will look at installing and using it on Monday.
- Indri - this project was developed at UMass and is part of the Lemur project of UMass and CMU. Indri is written in C++. It, like Lucene, can handle multiple fields per document, such as title, body, and anchor text. It supports automatic query expansion by pseudo-relevance feedback. It also uses techniques to prefer more recent documents over less recent documents when scoring search results.
- Wumpus - is an academic search engine developed at the University of Waterloo by one of the book's authors. It is written in C++. Results from it are used throughout the book.
- Yioop! -- an open-source search engine developed by me with some input by students here at SJSU. It is written in PHP.
Next Wednesday I will go over how to configure Yioop! and Nutch and how to do a crawl with them.

Inverted Indices

diagram with dictionary and posting list of an inverted index

An inverted index provides a mapping between terms and their locations in a text collection `C`.
The two main components of such an index are: (1) a dictionary which lists terms contained in the vocabulary of the collection and (2) for each term in the dictionary, a posting list which gives the positions in the collection at which the term occurs.
The diagram above is an example of a schema-independent index because it makes no assumption about the structure of the underlying text. In particular, it gives a raw position number rather than assume the collection is split into documents and further positions within documents.
An inverted index can be viewed as an abstract data type with the following four methods:
- first(t) returns the first position at which the term `t` occurs in the collection.
- last(t) returns the last position at which the term `t` occurs in the collection.
- next(t, current) returns the position of the first occurrence of `t` after the position current in the collection.
- prev(t, current) returns the position of the last occurrence of `t` before the position current in the collection.
For example, in the above diagram: first("hurlyburly") = 316669, last("thunder") = 1247139, next("witch", 745451)=745467, prev("witch", 745451) = 745429.
A sequential scan of a posting list might be implemented by applying the first function on the term then repeatedly applying the next function till the end of the posting list.
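The four methods can be sketched in Python as follows. This is an in-memory illustration, not how any particular IR system implements its index: it assumes each posting list is a sorted Python list, uses binary search from the standard `bisect` module, and returns `INF` or `-INF` when no position exists (a convention chosen for this example). The class name `InvertedIndex` and the toy text are also assumptions.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # stands for "past the end of the posting list"

class InvertedIndex:
    """Schema-independent index: term -> sorted list of positions (from 1)."""

    def __init__(self, tokens):
        self.postings = {}
        for position, term in enumerate(tokens, start=1):
            self.postings.setdefault(term, []).append(position)

    def first(self, t):
        plist = self.postings.get(t, [])
        return plist[0] if plist else INF

    def last(self, t):
        plist = self.postings.get(t, [])
        return plist[-1] if plist else -INF

    def next(self, t, current):
        # First occurrence of t strictly after current.
        plist = self.postings.get(t, [])
        i = bisect_right(plist, current)
        return plist[i] if i < len(plist) else INF

    def prev(self, t, current):
        # Last occurrence of t strictly before current.
        plist = self.postings.get(t, [])
        i = bisect_left(plist, current)
        return plist[i - 1] if i > 0 else -INF

index = InvertedIndex("to be or not to be".split())
print(index.first("be"), index.last("to"))        # 2 5
print(index.next("to", 1), index.prev("be", 6))   # 5 2

# Sequential scan of the posting list for "to" using first and next.
scan = []
position = index.first("to")
while position != INF:
    scan.append(position)
    position = index.next("to", position)
print(scan)  # [1, 5]
```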

ADT Example: Phrase Search

As an example of using the primitives of our ADT, consider the problem of phrase search.
A phrase search is a search for an exact match of a phrase in a document. For example, when we search on "first witch" in quotes, we want back only those documents that contain the phrase "first witch", not documents that contain both terms non-adjacent or in a different order.

This could be implemented using our ADT with the following pseudo-code:

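nextPhrase(t[1], t[2], ..., t[n], position)
{
   // Forward pass: find an occurrence of each term, in order, after position
   v := position
   for i := 1 to n do
     v := next(t[i], v)
   if v == infty then // infty represents after the end of the posting list
      return [infty, infty]
   // Backward pass: from the match for t[n], pull the earlier terms as close as possible
   u := v
   for i := n-1 downto 1 do
     u := prev(t[i], u)
   if (v - u == n - 1) then // the n terms are adjacent and in order
      return [u, v]
   else
      return nextPhrase(t[1], t[2], ..., t[n], u)
}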
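For those who want to experiment, here is one possible runnable Python rendering of the pseudo-code above. It is a sketch under assumptions: posting lists are sorted in-memory lists, `next_pos` and `prev_pos` are hypothetical stand-ins for the ADT's next and prev, and the recursive call is replaced by a loop, which is equivalent here but avoids deep recursion.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # plays the role of infty in the pseudo-code

def build_postings(tokens):
    """Map each term to its sorted list of positions (starting at 1)."""
    postings = {}
    for position, term in enumerate(tokens, start=1):
        postings.setdefault(term, []).append(position)
    return postings

def next_pos(postings, t, current):
    plist = postings.get(t, [])
    i = bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF

def prev_pos(postings, t, current):
    plist = postings.get(t, [])
    i = bisect_left(plist, current)
    return plist[i - 1] if i > 0 else -INF

def next_phrase(postings, terms, position):
    """Return (u, v), the first occurrence of the phrase after position."""
    n = len(terms)
    while True:
        # Forward pass: find each term, in order, after position.
        v = position
        for t in terms:
            v = next_pos(postings, t, v)
        if v == INF:
            return (INF, INF)
        # Backward pass: pull the earlier terms as close to v as possible.
        u = v
        for t in reversed(terms[:-1]):
            u = prev_pos(postings, t, u)
        if v - u == n - 1:
            return (u, v)
        position = u  # not adjacent: retry from u

postings = build_postings("to be or not to be that is the question".split())
print(next_phrase(postings, ["to", "be"], 0))    # (1, 2)
print(next_phrase(postings, ["to", "be"], 1))    # (5, 6)
print(next_phrase(postings, ["be", "that"], 0))  # (6, 7)
print(next_phrase(postings, ["not", "be"], 0))   # (inf, inf)
```

Calling `next_phrase` repeatedly, each time passing the previous `u` as the new position, enumerates all occurrences of the phrase.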

Language Modeling, Test Collections, Open-Source IR Systems, Inverted Indexes

Outline