Text Formats
- Human language data in electronic text format represents the raw material of IR.
- To build an IR system it is important to understand the different formats this text can come in.
- Example formats include HTML, Word DOC, Word DOCX, XML, LaTeX, RTF, PDF, and RSS.
- HTML and XML are perhaps the most important of these for writing a search engine.
- This is because, unlike plain text, these formats let us use links to explicitly represent relationships
between documents and parts of documents.
Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>
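To illustrate how a search engine might recover the link relationships encoded in such a page, here is a minimal sketch using Python's standard `html.parser` module. The class name `LinkExtractor` and the variable `page` are illustrative choices, not part of these notes, and a real crawler would also need to handle relative URLs, malformed markup, and character encodings.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements (illustrative)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# A copy of the example page above.
page = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>"""

parser = LinkExtractor()
parser.feed(page)
parser.close()

for href, text in parser.links:
    print(text, "->", href)
```

Each extracted pair gives both the target of the link and the anchor text describing it, which is the kind of information plain text cannot express explicitly.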
Term Distributions
Rank Frequency Token
1 107,833 <LINE>
2 107,833 </LINE>
3 31,081 <SPEAKER>
4 31,081 </SPEAKER>
5 31,028 <SPEECH>
6 31,028 </SPEECH>
7 28,317 the
8 26,022 and
9 22,639 i
10 19,898 to
- The above represents a term-frequency table for the 10 most common terms in an XML collection
of Shakespeare's plays.
- As you can see, 'the' is the most common word, followed by other function words: the conjunction 'and', the pronoun 'i', and the preposition 'to'.
- Of the 22,943 distinct words and 43 distinct tags in the Shakespeare collection, 8,336 terms occur only once. For example, 'zwaggered' occurs once.
- The relative frequency of tags is hard to predict, but for English words (and for words in other languages as well)
there is a predictable relationship between the frequency of a term and its rank, called Zipf's Law.
- It says that the frequency of the `i`th most common term, `F_i`, will be proportional to `1/i^{alpha}` for some constant `alpha`. (For English, `alpha` is usually close to 1.)
- Knowing this law helps us figure out things like which terms in a query are the most important. It can also be used to develop a model for the relevance of a term to a document.
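The rank-frequency table and Zipf's Law above can be explored with a short sketch. It assumes a simple tokenizer in which attribute-free tags such as `<LINE>` are single tokens and words are lower-cased runs of letters; the real collection may have been tokenized differently. Taking logarithms of `F_i = C / i^{alpha}` gives `log F_i = log C - alpha * log i`, so `alpha` can be estimated as the negated slope of a least-squares line through the points `(log i, log F_i)`. The function names and the `sample` text are illustrative, not taken from the collection's actual tooling.

```python
import math
import re
from collections import Counter

# Assumption: tags have no attributes; words may contain inner apostrophes.
TOKEN_RE = re.compile(r"</?[A-Za-z]+>|[A-Za-z]+(?:'[A-Za-z]+)*")


def tokenize(text):
    """Split XML-like text into tag tokens and lower-cased word tokens."""
    return [t if t.startswith("<") else t.lower()
            for t in TOKEN_RE.findall(text)]


def rank_frequency(tokens):
    """Return (rank, frequency, token) rows, most frequent first."""
    counts = Counter(tokens)
    return [(rank, freq, tok)
            for rank, (tok, freq) in enumerate(counts.most_common(), start=1)]


def estimate_alpha(frequencies):
    """Estimate alpha as the negated least-squares slope of log F_i vs. log i.

    `frequencies` must be sorted in decreasing order and contain at least
    two positive values.
    """
    xs = [math.log(i) for i in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return -num / den


# A tiny made-up sample in the style of the collection.
sample = """<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>To be, or not to be: that is the question:</LINE>
</SPEECH>"""

rows = rank_frequency(tokenize(sample))
for rank, freq, tok in rows[:5]:
    print(rank, freq, tok)

# Terms that occur exactly once in the sample.
singletons = [tok for _, freq, tok in rows if freq == 1]
print(len(singletons), "of", len(rows), "terms occur once")

# Sanity check on synthetic frequencies that follow F_i = 1000 / i exactly.
ideal = [1000 / i for i in range(1, 51)]
print(round(estimate_alpha(ideal), 6))
```

On a sample this small the fit is meaningless, which is why the last lines check the estimator against synthetic frequencies instead. On a full collection one would pass the frequency column of `rows` for the word tokens only, since the notes point out that tag frequencies do not follow the law.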