Introduction

Last week, we began this course by going over some of the uses of information retrieval: web and desktop search, document management systems, etc.
We then presented a basic architecture whereby a user could query a search engine about some topic and the search engine would use its inverted index to try to retrieve a ranked list of entries specifying documents that might fulfill this information need.
We also talked a little about the efficiency and effectiveness of retrieval: the former is concerned with the speed, volume, and cost of query processing; the latter is concerned with the relevance of the results.
We said relevance is often judged by an assessor who reviews document/topic pairs and assigns either a binary or graded score.
We begin today by discussing an important principle in relevance ranking.

Probability Ranking Principle

The fundamental goal of relevance ranking is often expressed in terms of the Probability Ranking Principle:

If an IR system's response to each query is a ranking of the documents 
in the collection in order of decreasing probability of relevance, then 
the overall effectiveness of the system to its users will be maximized.

This serves as a guiding principle that underlies much research, but nevertheless overlooks such things as the size and scope of the document.
The specificity of the document measures the degree to which its contents are focused on the information need. For example, the document might satisfy the need but also contain lots of other garbage.
The exhaustivity of the document reflects the degree to which it covers the information related to the topic.
The novelty of a document in a document list reflects how much new information the n-th search result gives to the user given the previous results.

Text Formats

Human language data in electronic text format represents the raw material of IR.
To build an IR system it is important to understand the different formats this text can come in.
Example formats include: HTML, Word DOC, Word DOCX, XML, LaTeX, RTF, PDF, RSS, etc.
HTML and XML are perhaps the most important of these for writing a search engine.
This is because unlike plain text, these formats allow us to explicitly use links to represent relationships between documents and parts of documents.

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>
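
To make the point about links concrete, here is a minimal sketch, using Python's standard html.parser module, of how a crawler might pull the link targets and the visible text out of the page above. The class name LinkAndTextExtractor and the helper extract are made up for this illustration, and a real crawler would need to handle far messier markup.

```python
from html.parser import HTMLParser

# The example page from the notes, copied verbatim.
PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>"""


class LinkAndTextExtractor(HTMLParser):
    """Hypothetical helper: collects href targets of <a> tags and visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Skip the whitespace-only runs that sit between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract(page):
    """Return (links, text_parts) for one HTML page given as a string."""
    parser = LinkAndTextExtractor()
    parser.feed(page)
    parser.close()
    return parser.links, parser.text_parts


links, text_parts = extract(PAGE)
print(links)
print(text_parts)
```

The links list is what lets a search engine relate this page to the full text of the play; plain text would give us no such explicit relationship.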

XML

XML is a language for specifying new tag-based languages. These new languages can be geared to the semantic use at hand.
In HTML the tags give us a little semantic information: we know the contents of an h tag are more likely to be short and important than the contents of a p tag.
In XML, we can create new languages specific to a domain at hand. For example, in a medical domain, we might create a whole new language called "patient" with tags like name, age, etc., which are very specific in the content they hold.
An IR system should be able to make use of this additional information -- especially, if the system knows the XML language in question.
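
As a rough illustration of how an IR system could exploit tags it knows about, the sketch below parses a made-up "patient" document with Python's standard xml.etree.ElementTree module. Only the name and age tags come from the discussion above; the notes tag, the sample values, and the function fields_by_tag are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the "patient" language; the <notes> tag and all
# values are invented for illustration.
PATIENT_XML = """<patient>
  <name>Jane Doe</name>
  <age>47</age>
  <notes>Reports mild headaches.</notes>
</patient>"""


def fields_by_tag(xml_text):
    """Map each child tag of the root element to its stripped text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = fields_by_tag(PATIENT_XML)
print(fields["name"], fields["age"])
```

Because the system knows which field each piece of text came from, it could, for example, index ages as numbers and names as phrases rather than treating everything as one bag of words.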

More Text Formats

Here is a brief description of some other text formats:

Microsoft formats - originally, Word (doc), Excel (xls), and PowerPoint (ppt) were all proprietary binary formats. Due to lawsuits, the specs for these formats are now available online. Modern versions of Office use .docx, .xlsx, and .pptx files. These files are all really ZIP archives which contain a directory of XML files. They are thus relatively straightforward to parse.
Open Office formats - nowadays, these are also ZIP archives of XML documents.
PDF and PostScript - PostScript is essentially a program, written in a stack-based language similar to Forth, for how to render a document on a printer. In PDF, a document consists of a sequence of objects (with potentially embedded PostScript instructions) from one or more container types. The easiest of the object types to extract data from is essentially text compressed with the same Deflate algorithm used in ZIP files.
JavaScript and other web-scripting languages - These programming languages are often used to generate HTML pages on the client side, and as they are full-fledged programming languages, this makes it difficult for a web-crawler/search-engine to figure out the text or meaning they encapsulate.
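
To show why the ZIP-of-XML formats are comparatively easy to parse, here is a minimal sketch using Python's standard zipfile and xml.etree.ElementTree modules. It reads the main document part (word/document.xml) of a .docx archive and collects the contents of its w:t text elements. So that the example runs without an external file, the hypothetical helper make_toy_docx builds a stripped-down stand-in archive in memory; it is not a complete .docx that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(file_obj):
    """Return the text runs found in word/document.xml of a .docx archive."""
    with zipfile.ZipFile(file_obj) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return [node.text for node in root.iter(f"{{{W_NS}}}t") if node.text]


def make_toy_docx(paragraphs):
    """Hypothetical helper: build a minimal stand-in archive in memory."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


runs = docx_text(make_toy_docx(["To be, or not to be", "that is the question"]))
print(runs)
```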

Tokenization of English Text

To process a document to build an inverted index for it, we need to convert that document into a sequence of tokens.
As a preliminary step, each of the formats discussed before must be converted to some kind of raw text. This often involves discarding low-level things such as formatting tags (for example, br tags), font information, etc.
As part of this step we might convert the text to a standard character encoding. For example, we might convert both ASCII and Big-5 to UTF-8.
For English XML, we might process this raw text, treating each successive sequence of consecutive alphanumeric characters as a token.
We might convert tag tokens to upper-case and word tokens to lower-case letters.
Later, we might also run a stemmer on each word.
The set of distinct tokens, or symbols, obtained by the above process is called the vocabulary, `V`, of the document or document collection.
For the works of Shakespeare, `V` might look like:
`V = \{`a, aaron, abaissiez, ..., zounds, zwaggered, ..., <PLAY>, ...`\}`
We will often make things like a term-frequency table listing for each term the number of times it occurred.
We will use the word occurrence to refer to a particular time a term occurred. So to spell out an occurrence you need to say which document and where within the document the term happens.
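
The tokenization steps above can be sketched roughly as follows. This is one possible reading of the rules, not a definitive tokenizer: it assumes tags carry no attributes, and the names TOKEN_RE, tokenize, and occurrences, along with the sample text and document id, are invented for the illustration.

```python
import re
from collections import Counter

# A token is either a simple open/close tag or a run of alphanumerics.
TOKEN_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>|[A-Za-z0-9]+")


def tokenize(raw_text):
    """Upper-case tag tokens and lower-case word tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(raw_text):
        piece = match.group()
        tokens.append(piece.upper() if piece.startswith("<") else piece.lower())
    return tokens


def occurrences(doc_id, tokens):
    """Map each term to the (document, position) pairs where it occurs."""
    table = {}
    for position, token in enumerate(tokens):
        table.setdefault(token, []).append((doc_id, position))
    return table


SAMPLE = ("<speech><speaker>HAMLET</speaker>"
          "<line>To be, or not to be,--that is the question:--</line></speech>")

tokens = tokenize(SAMPLE)
vocabulary = set(tokens)
term_frequency = Counter(tokens)
where = occurrences("hamlet.xml", tokens)
print(term_frequency["to"], where["to"])
```

Note that the term-frequency table only counts, while the occurrence table records which document and which position, which is what an inverted index with positional information needs.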

Quiz

Which of the following is true?

An IR system's effectiveness is nowadays measured without using any human judgments of search results.
This semester in CS 267 students will receive five homework scores and an overall quiz score. The lowest of these scores will be dropped.
An inverted index is just a sorting of a document collection from the most relevant document to the least relevant document.

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to

The above represents a term-frequency table for the 10 most common terms in an XML collection of Shakespeare's plays.
As you can see, 'the' is the most common word, followed by conjunctions, pronouns, prepositions, etc.
Of the 22,943 words and 43 tags in this collection, 8,336 terms occur only once. For example, zwaggered occurs once.
The relative frequency of tags is hard to predict, but for English words (and this is true of other languages as well) there is a predictable relationship between the frequency of a term and its rank, called Zipf's Law.
It says that the frequency of the `i`th most common term, `F_i`, will be proportional to `1/i^{alpha}` for some constant `alpha`. (For English, `alpha` is usually close to 1.)
Knowing this law allows us to figure out things like the most important terms in a query. It also can be used to develop a model for the relevance of a term to a document.
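
Taking logarithms of Zipf's Law gives `log F_i = log C - alpha log i`, so `alpha` can be estimated as the negated slope of a straight-line fit on a log-log plot of frequency against rank. The sketch below does this with an ordinary least-squares fit; the function names are made up for the example, and the input is synthetic data that follows the law exactly, not the Shakespeare counts.

```python
import math


def zipf_predicted(top_frequency, rank, alpha=1.0):
    """Frequency Zipf's Law predicts for a given rank, given the rank-1 frequency."""
    return top_frequency / rank ** alpha


def fit_alpha(frequencies):
    """Estimate alpha from frequencies listed in decreasing order (rank 1 first).

    Uses a least-squares line through the points (log rank, log frequency);
    assumes at least two strictly positive frequencies.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic frequencies generated from the law itself with alpha = 1.
synthetic = [zipf_predicted(1000.0, rank, alpha=1.0) for rank in range(1, 51)]
print(fit_alpha(synthetic))
```

On real collections the fit is only approximate, and the very highest-ranked terms often deviate from the straight line.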

Language Modeling

Imagine that a previously unknown Shakespearean play was discovered. Can we predict anything about the content of this play given the contents of the plays we do know?
One technique to make predictions about an unknown text is to use a special kind of probability distribution known as a language model.
A language model, `M`, is a probability distribution over the terms of a vocabulary. That is,
`M: mbox{term} -> mbox{probability of that term}`
So we have: `sum_(sigma in V) M(sigma) = 1`.
For an existing corpus (collection of documents), we can make a simple language model by defining:
`M(sigma) = frac{mbox{frequency}(sigma)}{sum_{sigma' in V}mbox{frequency}(sigma')}`
We can use our model to determine things like the probability of a phrase like "to be or not to be".
To do this we could take the product of the probabilities of each of the terms in the phrase:
`2.18% times 0.76% times 0.27% times 0.93% times 2.18% times 0.76% = 0.000000000069% `
Such an estimate of the probability of a phrase is called a maximum likelihood estimate (MLE).
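
A minimal sketch of the simple language model and the phrase-probability calculation above, assuming the text has already been tokenized. The function names and the ten-token toy corpus are invented for the example; the last lines just redo the arithmetic for "to be or not to be" using the percentages quoted above.

```python
from collections import Counter
from math import prod


def mle_model(tokens):
    """M(sigma) = frequency(sigma) / total number of term occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def phrase_probability(model, phrase_tokens):
    """Product of the individual term probabilities (0 for unseen terms)."""
    return prod(model.get(term, 0.0) for term in phrase_tokens)


toy_tokens = "to be or not to be that is the question".split()
model = mle_model(toy_tokens)
toy_probability = phrase_probability(model, ["to", "be"])

# Term probabilities quoted in the notes for to, be, or, not, to, be.
quoted = [0.0218, 0.0076, 0.0027, 0.0093, 0.0218, 0.0076]
quoted_product = prod(quoted)
print(toy_probability, quoted_product)
```

The quoted product comes out near 6.9e-13, which is the 0.000000000069% figure above once expressed as a percentage.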

Unknown Shakespeare

We were interested in the problem: given an unknown play purportedly by Shakespeare, determine whether it was really by him.
In the book's example of texts by Shakespeare, fully a third of the terms in this collection occur only once.
So we would expect any new work by Shakespeare to include words not in the rest of the collection.
To get a language model then that can take this into account, we could extend our vocabulary to `V' = V cup {mbox(UNKNOWN)}`.
We can then extend our model to `M'` by setting:
`M'(mbox(UNKNOWN)) = beta`,
`M'(sigma) = M(sigma) cdot (1-beta)`.
where `M` is the maximum likelihood model and `sigma` is any term in `V`.
To guess a good value of `beta` you might set it to half the probability of a term that occurs only once in the existing text:
`beta = 0.5 cdot frac(1)(sum_(sigma' in V) mbox(frequency)(sigma'))`
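
The extended model `M'` can be sketched as below. This follows the definitions above directly; the function names are made up, and the sketch assumes the literal token UNKNOWN does not itself appear in the corpus.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"  # assumed not to occur as a real term in the corpus


def smoothed_model(tokens):
    """Build M' over V' = V union {UNKNOWN} from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # beta is half the probability of a term that occurs exactly once.
    beta = 0.5 * (1 / total)
    model = {term: (count / total) * (1 - beta) for term, count in counts.items()}
    model[UNKNOWN] = beta
    return model


def probability(model, term):
    """Look up a term, falling back to the UNKNOWN mass for unseen terms."""
    return model.get(term, model[UNKNOWN])


toy_tokens = "to be or not to be that is the question".split()
m_prime = smoothed_model(toy_tokens)
print(probability(m_prime, "to"), probability(m_prime, "zounds"))
```

Scaling every known term by `1 - beta` is what keeps the probabilities summing to 1 after UNKNOWN has been given its share.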
We can make a vector out of a language model by listing coordinates `langle M'(sigma_1), M'(sigma_2), ... M'(sigma_n) rangle` where `sigma_i`'s range over our vocabulary.
We can use language models for many things in IR. For example, we can view a query as a document and compare its vector to those in a corpus to look for close matches.
We can also use vectors for forensics:
- Given a new purported "Shakespeare" document, we can calculate a vector for the maximum likelihood model of just the new document.
- We can compare the Euclidean distance between this vector and the vector from the MLE model of the whole Shakespeare corpus with the distance between this vector and the vector for a larger corpus of randomly chosen documents to guess if this unknown text is really Shakespeare.
- We could then calculate a standard deviation between vectors within the corpus, view some fixed number of deviations from our model as a "ball" of Shakespeareness and see if the unknown text lives in that ball.
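
The distance comparison in the list above might look like the following sketch. It uses tiny invented token lists in place of real corpora, lays every model out over one shared vocabulary so the coordinates line up, and leaves out the standard-deviation "ball" step. All names are hypothetical.

```python
import math
from collections import Counter


def mle_vector(tokens, vocabulary):
    """MLE probabilities of a non-empty token list, one coordinate per term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[term] / total for term in vocabulary]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for the Shakespeare corpus, the purported new play,
# and a corpus of randomly chosen documents.
known = "to be or not to be".split()
candidate = "to be or to be".split()
random_docs = "buy cheap pills now".split()

vocabulary = sorted(set(known) | set(candidate) | set(random_docs))
candidate_vec = mle_vector(candidate, vocabulary)
to_known = euclidean_distance(candidate_vec, mle_vector(known, vocabulary))
to_random = euclidean_distance(candidate_vec, mle_vector(random_docs, vocabulary))
print(to_known < to_random)
```

With real data the vectors have tens of thousands of coordinates, but the comparison is the same.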

Higher-order Models

You can imagine using a language model to generate random text by picking a sequence of words according to the probabilities given by the model.
This would tend to produce gibberish because words like "the" and "our", which have high probability, would often occur in this random text and would sometimes land next to each other, even though they almost never occur next to each other in real English.
Higher-order language models, which look at sequences of terms, can solve the problem above. The MLE model above can be considered a zeroth-order model.
A first-order language model consists of conditional probabilities that depend on the previous symbol. For example: `M_1(sigma_2 | sigma_1) = frac(mbox(frequency)(sigma_1 sigma_2))(sum_(sigma' in V)(mbox(frequency)(sigma_1 sigma')))`
A first-order language model for terms is equivalent to the zero-order model for term bi-grams (pairs) estimated using MLE:
`M_1(sigma_2 | sigma_1) = frac(M_0(sigma_1 sigma_2))(sum_(sigma' in V)(M_0(sigma_1 sigma')))`.
An `n`th order model can be expressed in terms of a zero-order `(n+1)`-gram model:
`M_n(sigma_(n+1) | sigma_1 ... sigma_n) = frac(M_0(sigma_1 ... sigma_(n+1)))(sum_(sigma' in V)(M_0(sigma_1 ... sigma_n sigma')))`.
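
A first-order model can be estimated from bigram counts exactly as in the formula for `M_1` above. The sketch below is one way to do it, with made-up function names and a toy token list rather than the Shakespeare corpus.

```python
from collections import Counter


def first_order_model(tokens):
    """Map (sigma_1, sigma_2) to M_1(sigma_2 | sigma_1) from bigram counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Denominator: total count of bigrams that start with sigma_1.
    context_totals = Counter()
    for (first, _second), count in pair_counts.items():
        context_totals[first] += count
    return {
        pair: count / context_totals[pair[0]]
        for pair, count in pair_counts.items()
    }


def conditional_probability(model, previous, current):
    """M_1(current | previous); bigrams never seen get probability 0."""
    return model.get((previous, current), 0.0)


toy_tokens = "the witch and the king and the witch".split()
m1 = first_order_model(toy_tokens)
print(conditional_probability(m1, "the", "witch"))
print(conditional_probability(m1, "our", "the"))
```

In the toy text "the" is followed by "witch" twice and by "king" once, so the model gives "witch" two thirds of the probability after "the", and it gives probability 0 to any pair it has never seen.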

Example

Consider the phrase "first witch" in the Shakespeare corpus.
This occurs, for instance, to identify which of three witches is speaking in Macbeth.
The phrase appears 23 times, whereas "first" appears 1,349 times. There are a total of 912,051 bi-grams in Shakespeare.
So we have:
`M_0(mbox("first witch")) = frac(23)(912,051) approx 0.0025%`
`M_1(mbox("witch") | mbox("first")) = frac(23)(1349) approx 1.7%`.
On the other hand, the first-order model assigns 0 probability to the phrase "our the", because the phrase never appears in the text.

Text Formats, Tokenization, Term Distributions, Language Models