Text Formats, Tokenization, Term Distribution, Language Models




CS267

Chris Pollett

Aug. 27, 2012

Outline

Text Formats

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>

XML

More Text Formats

Here is a brief description of some other text formats:

Quiz

Which of the following is true?

  1. The probability ranking principle says: "If an IR system's response to each query is a ranking of the documents in the collection in order of increasing probability of relevance, then the overall effectiveness of the system to its users will be maximized".
  2. Document relevance is sometimes determined by human judgment.
  3. Question answering and web search are the same thing.

Tokenization of English Text
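
As a rough illustration of how a page like the Hamlet example above might be turned into a token stream, here is a minimal sketch. It assumes a very simple tokenization rule (lowercase the text, split on anything that is not a letter or digit, and keep each tag as a token of its own); the names PAGE, TokenCollector, and tokenize_html are hypothetical and chosen only for this example.

```python
import re
from html.parser import HTMLParser

# A copy of the example page, embedded so the sketch is self-contained.
PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of hamlet</a></p>
</body>
</html>"""


class TokenCollector(HTMLParser):
    """Collects tags and words, in document order, as a flat token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes (such as href) are dropped in this simplified scheme.
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Assumed rule: lowercase, then keep maximal runs of letters/digits.
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def tokenize_html(page):
    collector = TokenCollector()
    collector.feed(page)
    collector.close()
    return collector.tokens


tokens = tokenize_html(PAGE)
print(tokens)
```

Under this rule punctuation such as ",--" disappears entirely, and "To" and "to" become the same token. Other choices (keeping case, handling apostrophes or hyphens specially) are equally plausible and would give a different token stream.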

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to
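
A table like the one above could be produced by counting tokens and sorting by frequency. The following is a minimal sketch, assuming the collection has already been tokenized into a list of strings with tags kept as tokens; rank_frequency and sample are hypothetical names, and the tiny sample stands in for the real collection.

```python
from collections import Counter


def rank_frequency(tokens, top=10):
    """Return (rank, frequency, token) rows for the most frequent tokens."""
    counts = Counter(tokens)
    return [
        (rank, freq, term)
        for rank, (term, freq) in enumerate(counts.most_common(top), start=1)
    ]


# Toy input standing in for a tokenized collection (not the real data).
sample = "<LINE> to be or not to be that is the question </LINE>".split()

print("Rank Frequency Token")
for rank, freq, term in rank_frequency(sample, top=5):
    print(f"{rank:<5} {freq:>7,}  {term}")
```

Because every opening tag has a matching closing tag, tag tokens come in pairs with equal counts, which is the pattern visible in ranks 1 through 6 of the table.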

Language Modeling

Unknown Shakespeare

Higher-order Models