Introduction

Last week, we began this course by going over some of the uses of information retrieval: web and desktop search, document management systems, etc.
We then presented a basic architecture whereby a user could query a search engine about some topic and the search engine would use its inverted index to try to retrieve a ranked list of entries specifying documents that might fulfill this information need.
We stated the probability ranking principle, which roughly says that we should rank our results in the order most likely to satisfy the user's information need, and that this ordering should take into account the results seen so far.
We continue our introduction to information retrieval by now discussing text formats which might provide the inputs to an information retrieval system as well as how to model the languages encapsulated by a set of documents.

Text Formats

Human language data in electronic text format represents the raw material of IR.
To build an IR system it is important to understand the different formats this text can come in.
Example formats include HTML, Word DOC, Word DOCX, XML, LaTeX, RTF, PDF, and RSS.
HTML and XML are perhaps the most important of these for writing a search engine.
This is because unlike plain text, these formats allow us to explicitly use links to represent relationships between documents and parts of documents.

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>
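As a rough illustration of how a crawler might pull the link and the visible text out of a page like this one, here is a minimal sketch using Python's standard `html.parser` module. The class name `LinkAndTextExtractor` is our own invention for this example, and a real system would need to handle much messier HTML.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collects visible text and the targets of <a href=...> links."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values in document order
        self.text = []    # non-empty text chunks in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        chunk = data.strip()
        if chunk:
            self.text.append(chunk)


PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>"""

extractor = LinkAndTextExtractor()
extractor.feed(PAGE)
print(extractor.links)
print(extractor.text)
```

The link targets are what let a search engine relate this page to the document it points at; the text chunks are what would later be tokenized.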

XML

XML also has a notion of link -- XLink.
XML is a language for specifying new tag-based languages. These new languages can be geared to the semantic use at hand.
In HTML the tags give us a little semantic information: we know the content of an h tag is more likely to be short and important than the content of a p tag.
In XML, we can create new languages specific to the domain at hand. For example, in a medical domain, we might create a whole new language called "patient" with tags like name, age, etc., which are very specific about the content they hold.
An IR system should be able to make use of this additional information -- especially, if the system knows the XML language in question.
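To make this concrete, here is a small sketch of reading a record in such a "patient" language with Python's standard `xml.etree.ElementTree` module. The record itself, including the `notes` tag and all of its values, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in a made-up "patient" language.
RECORD = """<patient>
  <name>Jane Doe</name>
  <age>42</age>
  <notes>Reports mild headaches.</notes>
</patient>"""

root = ET.fromstring(RECORD)

# Map each tag to its text so that fields can be indexed separately.
fields = {child.tag: (child.text or "").strip() for child in root}
print(fields)
```

Because the system knows which tag each piece of text came from, it could, for example, restrict a query to the `name` field only.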

More Text Formats

Here is a brief description of some other text formats:

Microsoft formats - originally, Word (doc), Excel (xls), and PowerPoint (ppt) were all proprietary binary formats. Due to lawsuits, the specs for these formats are now available online. Modern versions of Office use .docx, .xlsx, and .pptx files. These files are all really zip files which contain a directory of XML files. They are thus relatively straightforward to parse.
OpenOffice formats - nowadays, these are also zip files of XML documents.
PDF and PostScript - A PostScript file is essentially a program, written in a stack-based language similar to Forth, for how to render a document on a printer. In PDF, a document consists of a sequence of objects (with potentially embedded PostScript-like instructions) from one or more container types. The easiest of the object types to extract data from is essentially text compressed with the same deflate algorithm that ZIP uses.
JavaScript or Flash - These programming languages are often used to write HTML pages on the client side, and as they are full-fledged programming languages, this makes it difficult for a web crawler or search engine to figure out the text or meaning they encapsulate.
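As a rough sketch of why zip-of-XML formats are straightforward to parse, the following uses Python's standard `zipfile` module. To keep the example self-contained it builds a tiny stand-in archive in memory. A real .docx keeps its main text in `word/document.xml`, but the XML used here is a heavily simplified assumption and not the real Word schema.

```python
import io
import re
import zipfile

# Build a tiny stand-in archive in memory. A real .docx stores its main text in
# word/document.xml; the XML below is heavily simplified for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        "<document><p><t>To be, or not to be</t></p></document>",
    )


def extract_text(zip_bytes, member="word/document.xml"):
    """Return the text of one XML member with all tags stripped."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        xml = archive.read(member).decode("utf-8")
    # Crude tag removal: replace each tag with a space, then collapse whitespace.
    return " ".join(re.sub(r"<[^>]+>", " ", xml).split())


print(extract_text(buffer.getvalue()))
```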

Quiz

Which of the following is true?

This semester, in CS 267, students will receive five homework scores and an overall quiz score. The lowest of these scores will be dropped.
Text summarization is in no way connected with information retrieval.
One of the most important data structures in information retrieval is the inverted egg timer.

Tokenization of English Text

To process a document to build an inverted index for it, we need to convert that document into a sequence of tokens.
As a preliminary step, each of the formats discussed before must be converted to some kind of raw text. This often involves discarding low-level things such as formatting tags (for example, br tags), font information, etc.
As part of this step we might convert the text to a standard character encoding. For example, we might convert both ASCII and Big-5 to UTF-8.
For English XML, we might process this raw text, treating each successive sequence of consecutive alphanumeric characters as a token.
We might convert tag tokens to upper-case letters and word tokens to lower-case letters.
Later, we might also run a stemmer on each word.
The set of distinct tokens, or symbols, obtained by the above process is called the vocabulary, `V`, of the document or document collection.
For the works of Shakespeare, `V` might look like:
`V = \{`a, aaron, abaissiez, ..., zounds, zwaggered, ..., <PLAY>, ...`\}`
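The tokenization steps described above can be sketched as follows. This is only one plausible reading of those steps: the regular expression assumes tags carry no attributes, and the sample line is our own.

```python
import re

# A tag (<...> or </...>) or a run of alphanumeric characters.
TOKEN_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>|[A-Za-z0-9]+")


def tokenize(text):
    """Return tokens with tags upper-cased and words lower-cased."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        tokens.append(token.upper() if token.startswith("<") else token.lower())
    return tokens


sample = "<speech><line>To be, or not to be</line></speech>"
tokens = tokenize(sample)
vocabulary = sorted(set(tokens))   # the distinct tokens, i.e. V for this sample
print(tokens)
print(vocabulary)
```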
We will often build a term-frequency table, which lists for each term the number of times it occurs.
We will use the word occurrence to refer to one particular instance of a term. So to specify an occurrence, you need to say which document the term is in and where within that document it appears.
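A term-frequency table and a list of occurrences might be built along the following lines. The two-document collection and the choice to record an occurrence as a (document id, position) pair are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical two-document collection, already tokenized.
docs = {
    "doc1": ["to", "be", "or", "not", "to", "be"],
    "doc2": ["to", "sleep"],
}

term_frequency = Counter()
occurrences = defaultdict(list)   # term -> list of (document id, position)

for doc_id, tokens in docs.items():
    for position, term in enumerate(tokens):
        term_frequency[term] += 1
        occurrences[term].append((doc_id, position))

print(term_frequency.most_common(3))
print(occurrences["to"])
```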

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to

The above represents a term-frequency table for the 10 most common terms in an XML collection of Shakespeare's plays.
As you can see, after the tags, 'the' is the most common word, followed by conjunctions, pronouns, prepositions, etc.
Of the 22,943 words and 43 tags in the Shakespeare collection, 8,336 terms occur only once. For example, zwaggered occurs once.
The relative frequency of tags is hard to predict, but for English words (and this is true of other languages as well) there is a predictable relationship between the frequency of a term and its rank, called Zipf's Law.
It says that the frequency of the `i`th most common term, `F_i`, will be proportional to `1/i^{alpha}` for some constant `alpha`. (For English, `alpha` is usually close to 1.)
Knowing this law allows us to figure out things like the most important terms in a query. It also can be used to develop a model for the relevance of a term to a document.
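One way to check how well a collection follows Zipf's Law is to fit a line to log(frequency) against log(rank); the negated slope is an estimate of `alpha`. The sketch below does this with ordinary least squares. The function name `estimate_alpha` is ours, and the data is synthetic, generated to follow the law exactly, so it says nothing about the Shakespeare collection.

```python
import math


def estimate_alpha(frequencies):
    """Least-squares slope of log(frequency) against log(rank), negated."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic counts that follow F_i = 1000 / i exactly, so alpha should be 1.
synthetic = [1000 / i for i in range(1, 51)]
alpha = estimate_alpha(synthetic)
print(round(alpha, 3))
```

On real text the fit is usually much rougher, especially for the handful of most frequent terms.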

Language Modeling

Imagine that a previously unknown Shakespearean play was discovered. Can we predict anything about the content of this play given the contents of the plays we do know?
One technique to make predictions about an unknown text is to use a special kind of probability distribution known as a language model.
A language model, `M`, is a probability distribution over the terms of a vocabulary. That is,
`M: mbox{term} -> mbox{probability of that term}`
So we have: `sum_(sigma in V) M(sigma) = 1`.
For an existing corpus (collection of documents), we can make a simple language model by defining:
`M(sigma) = frac{mbox{frequency}(sigma)}{sum_{sigma' in V}mbox{frequency}(sigma')}`
We can use our model to determine things like the probability of a phrase like "to be or not to be".
To do this we could take the product of the probabilities of each of the terms in the phrase:
`2.18% times 0.76% times 0.27% times 0.93% times 2.18% times 0.76% = 0.000000000069% `
Such an estimate of the probability of a phrase is called a maximum likelihood estimate (MLE).
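The definitions above can be sketched in a few lines of Python. The corpus here is a ten-word toy example rather than the Shakespeare collection, and the function names are our own.

```python
from collections import Counter


def mle_model(tokens):
    """Map each term to frequency(term) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def phrase_probability(model, phrase):
    """Product of per-term probabilities; 0.0 if a term is not in the model."""
    probability = 1.0
    for term in phrase:
        probability *= model.get(term, 0.0)
    return probability


# Toy corpus, not the Shakespeare collection.
corpus = "to be or not to be that is the question".split()
model = mle_model(corpus)
p = phrase_probability(model, "to be".split())
print(model["to"], p)
```

Note that any phrase containing a term outside the vocabulary gets probability 0 under this model.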

Unknown Shakespeare

We were interested in the following problem: given an unknown play purportedly by Shakespeare, determine whether it was really written by him.
In the book's example of texts by Shakespeare, fully a third of the terms in this collection occur only once.
So we would expect any new work by Shakespeare to include words not in the rest of the collection.
To get a language model that takes this into account, we could extend our vocabulary to `V' = V cup {mbox(UNKNOWN)}`.
We can then extend our model to `M'` by setting:
`M'(mbox(UNKNOWN)) = beta`,
`M'(sigma) = M(sigma) cdot (1-beta)`.
where `M` is the maximum likelihood model and `sigma` is any term in `V`.
To guess a good value of `beta`, you might set it to half the probability of a term that occurs only once in the existing text:
`beta = 0.5 cdot frac(1)(sum_(sigma' in V) mbox(frequency)(sigma'))`
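A minimal sketch of this extension, assuming we start from a table of term counts (the counts below are made up):

```python
def add_unknown(counts):
    """Return M' over V plus UNKNOWN, given a dict of term counts."""
    total = sum(counts.values())
    beta = 0.5 * (1 / total)   # half the probability of a once-seen term
    extended = {
        term: (count / total) * (1 - beta) for term, count in counts.items()
    }
    extended["UNKNOWN"] = beta
    return extended


# Toy counts for illustration only.
counts = {"to": 2, "be": 2, "or": 1, "not": 1}
m_prime = add_unknown(counts)
print(m_prime["UNKNOWN"], sum(m_prime.values()))
```

Scaling every known term by `(1-beta)` is what keeps the probabilities summing to 1 after UNKNOWN is added.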
We can make a vector out of a language model by listing coordinates `langle M'(sigma_1), M'(sigma_2), ... M'(sigma_n) rangle` where `sigma_i`'s range over our vocabulary.
We can use language models for many things in IR. For example, we can view a query as a document and compare its vector to those in a corpus to look for close matches.
We can also use vectors for forensics:
- Given a new purported "Shakespeare" document, we can calculate a vector for the maximum likelihood model of just the new document.
- We can compare the Euclidean distance between this vector and the vector from the MLE model of the whole Shakespeare corpus with the distance between this vector and the vector for a larger corpus of randomly chosen documents, to guess if this unknown text is really Shakespeare.
- We could then calculate a standard deviation between vectors within the corpus, view some fixed number of deviations from our model as a "ball" of Shakespeareness, and see if the unknown text lives in that ball.
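The distance comparison can be sketched as below. The three token lists are tiny made-up stand-ins for real collections, terms missing from a model are treated as having probability 0, and the standard-deviation "ball" step is not shown.

```python
import math
from collections import Counter


def mle_model(tokens):
    """Map each term to its relative frequency in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def distance(model_a, model_b):
    """Euclidean distance between two models over their combined vocabulary."""
    vocabulary = set(model_a) | set(model_b)
    return math.sqrt(sum(
        (model_a.get(term, 0.0) - model_b.get(term, 0.0)) ** 2
        for term in vocabulary
    ))


# Tiny made-up token lists standing in for real collections.
known = mle_model("to be or not to be".split())
candidate = mle_model("to be or to be".split())
unrelated = mle_model("stock prices fell sharply today".split())

print(distance(candidate, known) < distance(unrelated, known))
```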

Higher-order Models

You can imagine using a language model to generate random text by picking a sequence of words according to the probabilities given by the model.
This would tend to produce very gibberish-like text because words like "the" and "our", which are high probability, would have a high chance of occurring in this random text, but almost never occur next to each other in real English.
Higher-order language models, which look at sequences of terms, can solve the problem above. The MLE model we have used so far can be considered a zeroth-order model.
A first-order language model consists of conditional probabilities that depend on the previous symbol. For example: `M_1(sigma_2 | sigma_1) = frac(mbox(frequency)(sigma_1 sigma_2))(sum_(sigma' in V)(mbox(frequency)(sigma_1 sigma')))`
A first-order language model for terms is equivalent to the zero-order model for term bi-grams (pairs) estimated using MLE:
`M_1(sigma_2 | sigma_1) = frac(M_0(sigma_1 sigma_2))(sum_(sigma' in V)(M_0(sigma_1 sigma')))`.
An `n`th order model can be expressed in terms of a zero-order `(n+1)`-gram model:
`M_n(sigma_(n+1) | sigma_1 ... sigma_n) = frac(M_0(sigma_1 ... sigma_(n+1)))(sum_(sigma' in V)(M_0(sigma_1 ... sigma_n sigma')))`.
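A first-order model can be estimated directly from bi-gram counts, as in this sketch. The token sequence is a toy example, and representing `M_1` as a dictionary keyed by pairs is just one convenient choice.

```python
from collections import Counter


def first_order_model(tokens):
    """Return M_1 as a dict: (sigma_1, sigma_2) -> P(sigma_2 | sigma_1)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Number of bi-grams that start with each term (the denominator in M_1).
    starts = Counter(first for first, _ in bigrams.elements())
    return {
        (first, second): count / starts[first]
        for (first, second), count in bigrams.items()
    }


# Toy token sequence for illustration only.
tokens = "to be or not to be".split()
m1 = first_order_model(tokens)
print(m1[("to", "be")], m1.get(("be", "to"), 0.0))
```

Any pair that never occurs in the text, such as ("be", "to") here, is simply absent from the dictionary, which corresponds to probability 0.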

Example

Consider the phrase "first witch" in the Shakespeare corpus.
This occurs, for instance, to identify which of three witches is speaking in Macbeth.
The phrase appears 23 times, whereas "first" appears 1,349 times. There are a total of 912,051 bi-grams in Shakespeare.
So we have:
`M_0(mbox("first witch")) = frac(23)(912,051) approx 0.0025%`
`M_1(mbox("witch") | mbox("first")) = frac(23)(1349) approx 1.7%`.
On the other hand, the first-order model assigns probability 0 to the phrase "our the", because the phrase never appears in the text.
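The arithmetic in this example can be checked with a few lines of Python, using only the counts quoted above (the variable names are ours):

```python
# Counts quoted in the example above.
first_witch = 23          # occurrences of the bi-gram "first witch"
first = 1_349             # occurrences of "first"
total_bigrams = 912_051   # bi-grams in the collection

m0 = first_witch / total_bigrams   # zero-order bi-gram probability
m1 = first_witch / first           # P("witch" | "first")

print(f"{m0:.4%}  {m1:.1%}")
```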

Smoothing

Notice that although "our the" does not occur in usual speech, we have been using it quite a bit in our discussion.
So maybe in the real world the probability of this bi-gram should be small but non-zero.
Other phrases like "fourth witch" seem plausible but don't appear as bi-grams in the Shakespeare corpus, so they would also be assigned probability `0` in our first-order model.
One way to modify our first-order model to improve it for these cases is to smooth it with the zero-order model.
We define a smoothed first-order `M_1'` as:
`M_1'(sigma_2 | sigma_1) = gamma cdot M_1(sigma_2 | sigma_1) + (1 - gamma) cdot M_0(sigma_2)`
and
`M_0'(sigma_1 sigma_2) = gamma cdot M_0(sigma_1 sigma_2) + (1 - gamma)cdot M_0(sigma_1) cdot M_0(sigma_2)`,
where `0 le gamma le 1` is a smoothing parameter.
As an example, if `gamma = 0.5`, the zero-order, smoothed bi-gram model now gives "first witch" a probability of 0.0013%, which is less than before.
This is because it now assigns bi-grams like "fourth witch" a non-zero probability, in this case 0.00000030%.
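Here is a small sketch of the smoothed first-order model `M_1'`. The probabilities in `m0` and `m1` are made up so that the arithmetic is easy to follow; they are not taken from the Shakespeare collection.

```python
def smooth_first_order(m1, m0, gamma):
    """Return a function computing M_1'(second | first)."""
    def smoothed(first, second):
        return (gamma * m1.get((first, second), 0.0)
                + (1 - gamma) * m0.get(second, 0.0))
    return smoothed


# Made-up probabilities for illustration only.
m0 = {"first": 0.5, "fourth": 0.1, "witch": 0.4}
m1 = {("first", "witch"): 1.0}   # "fourth witch" was never observed

p = smooth_first_order(m1, m0, gamma=0.5)
print(p("first", "witch"), p("fourth", "witch"))
```

The unseen pair gets a non-zero probability from the zero-order term alone, while the seen pair gives up some of its probability mass.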
Often we will smooth a model built from a small collection, say the works of one author, with a model built from a larger collection, say a big corpus of English text. This might be done with an equation like:
`M_(S,0)' = gamma cdot M_(S,0) + (1 - gamma) cdot M_(L,0)`.
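In code, this kind of interpolation between a small-collection model and a large-collection model might look like the following sketch. Both models are made up, and terms missing from one model are treated as having probability 0 in it.

```python
def interpolate(small, large, gamma):
    """gamma * small + (1 - gamma) * large over the union of both vocabularies."""
    vocabulary = set(small) | set(large)
    return {
        term: gamma * small.get(term, 0.0) + (1 - gamma) * large.get(term, 0.0)
        for term in vocabulary
    }


# Made-up zero-order models: one for a single author, one for general English.
author = {"thou": 0.5, "the": 0.5}
general = {"the": 0.6, "you": 0.4}

mixed = interpolate(author, general, gamma=0.5)
print(mixed)
```

Since both inputs sum to 1, so does the mixture, and every term seen in either collection ends up with a non-zero probability when `0 < gamma < 1`.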

Text Formats, Tokenization, Term Distributions, Language Models
