Probability Ranking Principle
The fundamental goal of relevance ranking is often expressed in terms of the Probability Ranking Principle:
If an IR system's response to each query is a ranking of the documents
in the collection in order of decreasing probability of relevance, then
the overall effectiveness of the system to its users will be maximized.
- This guiding principle underlies much IR research, but it overlooks properties such as the size and scope of the document.
- The specificity of the document measures the degree to which its contents are focused on the information need. For example, a document might satisfy the need but also contain a large amount of unrelated material, which lowers its specificity.
- The exhaustivity of the document reflects the degree to which it covers the information related to the topic.
- The novelty of a document in a document list reflects how much new information the n-th search result gives to the user given the previous results.
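As a minimal sketch of the ranking the principle describes, the snippet below sorts documents by decreasing estimated probability of relevance. The document ids and probability values are hypothetical, and the function name `rank_by_relevance` is introduced here only for illustration; how a system would estimate the probabilities is not shown.

```python
def rank_by_relevance(probabilities):
    """Return document ids sorted by decreasing estimated probability of relevance."""
    # sorted() is stable, so documents with equal estimates keep their input order.
    return sorted(probabilities, key=probabilities.get, reverse=True)


# Hypothetical estimates for one query; a real system would compute these.
estimates = {"doc_a": 0.20, "doc_b": 0.85, "doc_c": 0.55}
ranking = rank_by_relevance(estimates)
print(ranking)
```

Note that this ranking scores each document independently, so it does not account for novelty: two near-duplicate documents would both be ranked highly.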
Text Formats
- Human language data in electronic text format represents the raw material of IR.
- To build an IR system it is important to understand the different formats this text can come in.
- Example formats include HTML, Word DOC, Word DOCX, XML, LaTeX, RTF, PDF, and RSS.
- HTML and XML are perhaps the most important of these for writing a search engine.
- This is because unlike plain text, these formats allow us to explicitly use links to represent relationships
between documents and parts of documents.
Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>
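To illustrate how a search engine might read the link structure out of such a page, here is a small sketch using `html.parser` from Python's standard library. The class name `LinkExtractor` and its attributes are assumptions made for this example, and the page is a copy of the one above; a production crawler would also need to handle malformed markup, relative URLs, and character encodings.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the page title and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._href = None        # href of the <a> element currently open, if any
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor_text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_text.append(data)


PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>"""

parser = LinkExtractor()
parser.feed(PAGE)
parser.close()
print(parser.title)
print(parser.links)
```

The anchor text is kept alongside the target URL because it describes the linked document in the linking author's words.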
Term Distributions
Rank Frequency Token
1 107,833 <LINE>
2 107,833 </LINE>
3 31,081 <SPEAKER>
4 31,081 </SPEAKER>
5 31,028 <SPEECH>
6 31,028 </SPEECH>
7 28,317 the
8 26,022 and
9 22,639 i
10 19,898 to
- The above represents a term-frequency table for the 10 most common terms in an XML collection
of Shakespeare's plays.
- As you can see, 'the' is the most common word, followed by a conjunction ('and'), a pronoun ('i'), and a preposition ('to').
- Of the 22,943 distinct words and 43 distinct tags in this collection, 8,336 terms occur only once. For example, 'zwaggered' occurs once.
- The relative frequency of tags is hard to predict, but for English words (and for words in other languages as well)
there is a predictable relationship between the frequency of a term and its rank, known as Zipf's Law.
- It says that the frequency of the `i`th most common term, `F_i`, will be proportional to `1/i^{alpha}` for some constant `alpha`. (For English, `alpha` is usually close to 1.)
- Knowing this law allows us to figure out things like the most important terms in a query. It also can be used to develop a model for the relevance of a term to a document.
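The sketch below shows one way to build a rank-frequency table like the one above and to estimate `alpha`. Taking logarithms of the proportionality gives `log F_i = log c - alpha * log i`, so `alpha` can be estimated as the negated slope of a least-squares line through the points `(log i, log F_i)`. The tokenizer (a simple regular expression), the function names, and the synthetic frequencies are all assumptions for illustration; they are not the procedure used to produce the table.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, frequency, token) rows, most frequent token first."""
    # Assumed tokenizer: an XML tag, or a run of letters and apostrophes.
    tokens = re.findall(r"</?[A-Za-z]+>|[A-Za-z']+", text)
    # Lowercase words but leave tags untouched.
    tokens = [t if t.startswith("<") else t.lower() for t in tokens]
    counts = Counter(tokens)
    return [(rank, freq, tok)
            for rank, (tok, freq) in enumerate(counts.most_common(), start=1)]


def estimate_alpha(frequencies):
    """Estimate alpha from frequencies listed in rank order (rank 1 first)."""
    xs = [math.log(i) for i in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of log(frequency) against log(rank).
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


rows = rank_frequency("<LINE>To be, or not to be</LINE>")
for rank, freq, tok in rows:
    print(rank, freq, tok)

# Synthetic frequencies that follow F_i = 1000 / i exactly, so alpha should be 1.
synthetic = [1000 / i for i in range(1, 51)]
alpha = estimate_alpha(synthetic)
print(round(alpha, 6))
```

On real text the fit is only approximate, and the estimate depends on tokenization and on how many ranks are included in the fit.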