Text Formats, Tokenization, Term Distribution, Language Models




CS254

Chris Pollett

Aug. 29, 2011

Outline

Text Formats

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>
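
As a rough illustration of how an indexer might pull the indexable text and the link targets out of a page like the one above, here is a minimal sketch using Python's standard html.parser module. The class name PageTextExtractor, the helper extract, and the PAGE constant are made up for this example and are not part of the course materials.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text runs and href targets."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # non-empty text runs, in document order
        self.links = []       # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Skip the whitespace-only runs that sit between tags
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# The example page from the slide, embedded as a string
PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>"""


def extract(page):
    """Return (text_parts, links) for an HTML string."""
    parser = PageTextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text_parts, parser.links


text_parts, links = extract(PAGE)
print(text_parts)
print(links)
```

Note that this sketch keeps the title text alongside the body text; a real system would likely record which element each run came from so that titles and headings can be weighted differently.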

XML

More Text Formats

Here is a brief description of some other text formats:

Quiz

Which of the following is true?

  1. The word "document" in an information retrieval setting always means a web page.
  2. The inverted index for a document collection is typically much smaller than the collection itself.
  3. Question answering is an example application of information retrieval.

Tokenization of English Text

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to
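
A table like the one above can be produced by tokenizing the collection and counting. The following is a minimal sketch, assuming that each XML tag is kept as a single token and that every other token is a lowercased run of letters or digits; that rule is an assumption chosen to match the shape of the table, not necessarily the exact rule used to build it. SAMPLE is a tiny hand-written fragment, so its counts have nothing to do with the figures above, and the function names are illustrative.

```python
import re
from collections import Counter

# Assumed token rule: an opening/closing tag, or a run of letters/digits
TOKEN_RE = re.compile(r"</?[A-Za-z]+>|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tag tokens (kept as-is) and lowercased word tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        tokens.append(tok if tok.startswith("<") else tok.lower())
    return tokens


def rank_frequency(tokens):
    """Return (rank, frequency, token) rows, most frequent first."""
    counts = Counter(tokens)
    # Ties are broken alphabetically so the ordering is deterministic
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, tok)
            for rank, (tok, freq) in enumerate(ordered, start=1)]


# Small illustrative sample, not the full collection
SAMPLE = """<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>To be, or not to be: that is the question:</LINE>
<LINE>Whether 'tis nobler in the mind to suffer</LINE>
</SPEECH>"""

table = rank_frequency(tokenize(SAMPLE))
for rank, freq, tok in table[:5]:
    print(f"{rank:<6}{freq:>7}  {tok}")
```

Because opening and closing tags always occur in pairs, they necessarily share the same frequency, which is why the tags in the table above come in equal-count pairs.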

Language Modeling