Text Formats, Tokenization, Term Distributions, Language Models




CS267

Chris Pollett

Feb 1, 2021

Outline

Introduction

Text Formats

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>
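
As a rough illustration of how an indexer might separate the markup of the page above from its content, here is a minimal sketch using Python's standard `html.parser` module. The class name `TextAndLinks` and the helper `extract` are hypothetical names chosen for this example, not part of the course material.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collects visible text runs and href targets (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text = []   # non-empty text runs, in document order
        self.links = []  # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        # Skip runs that are only whitespace between tags
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


PAGE = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>"""


def extract(page):
    """Return (text runs, link targets) for an HTML string."""
    parser = TextAndLinks()
    parser.feed(page)
    parser.close()
    return parser.text, parser.links


text, links = extract(PAGE)
```

Here `text` holds the title, the heading, the line of the play, and the link text, while `links` holds the single Gutenberg URL. A real system would likely also record which tag each text run came from.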

XML

More Text Formats

Here is a brief description of some other text formats:

Quiz

Which of the following is true?

  1. This semester, in CS 267, students will receive five homework scores and an overall quiz score. The lowest of these scores will be dropped.
  2. Text summarization is in no way connected with information retrieval.
  3. One of the most important data structures in information retrieval is the inverted egg timer.

Tokenization of English Text

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to
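
A table like the one above could be produced along the following lines. This is a sketch under stated assumptions, not the method actually used for the slide: it assumes XML tags without attributes count as single tokens, words are runs of alphanumeric characters folded to lowercase, and ties are broken alphabetically. The names `tokenize`, `rank_frequency`, and `SAMPLE` are made up for illustration, and the sample text is a tiny stand-in for the full collection.

```python
import re
from collections import Counter

# Either an attribute-free XML tag such as <LINE> or </LINE>,
# or a run of alphanumeric characters.
TOKEN_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tag tokens (kept as-is) and lowercased word tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        tokens.append(tok if tok.startswith("<") else tok.lower())
    return tokens


def rank_frequency(tokens, top=10):
    """Return (rank, frequency, token) triples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, tok)
            for rank, (tok, freq) in enumerate(ranked[:top], start=1)]


SAMPLE = """<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>To be, or not to be,--that is the question:--</LINE>
</SPEECH>"""

for rank, freq, tok in rank_frequency(tokenize(SAMPLE), top=5):
    print(f"{rank:<6}{freq:>7}  {tok}")
```

On the sample, `be` and `to` each occur twice and every other token once. Punctuation is discarded by this tokenizer, which is one possible design choice among several.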

Language Modeling

Unknown Shakespeare

Higher-order Models

Example

Smoothing