Text Formats, Tokenization, Term Distributions, Language Models




CS267

Chris Pollett

Aug 26, 2019

Outline

Introduction

Text Formats

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>

XML

More Text Formats

Here is a brief description of some other text formats:

Tokenization of English Text

Quiz

Which of the following is true?

  1. An inverted index is just a sorting of a document collection from the most relevant document to the least relevant document.
  2. An IR system's efficiency is measured using human judgments of its search results.
  3. This semester in CS 267, students will receive five homework scores and an overall quiz score. The lowest of these scores will be dropped.

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to
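
A table like the one above could be produced roughly as follows. This is only a minimal sketch: the token pattern (an XML tag, or a run of alphanumeric characters, with words lowercased) is an assumption chosen to match the tokens shown in the table, and the names `TOKEN_RE`, `tokenize`, `rank_frequency`, and the one-line `sample` string are made up for illustration.

```python
import re
from collections import Counter

# Assumed token pattern: an XML tag, or a run of alphanumeric characters.
TOKEN_RE = re.compile(r"</?[A-Za-z][^>]*>|[A-Za-z0-9]+")


def tokenize(text):
    """Return the tokens of text: tags kept as-is, words lowercased."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        tokens.append(tok if tok.startswith("<") else tok.lower())
    return tokens


def rank_frequency(text, top=10):
    """Return (rank, frequency, token) triples for the most common tokens."""
    counts = Counter(tokenize(text))
    return [
        (rank, freq, tok)
        for rank, (tok, freq) in enumerate(counts.most_common(top), start=1)
    ]


# Hypothetical one-speech sample in the style of the XML-tagged plays.
sample = (
    "<SPEECH><SPEAKER>HAMLET</SPEAKER>"
    "<LINE>To be, or not to be,--that is the question:--</LINE></SPEECH>"
)

print("Rank Frequency Token")
for rank, freq, tok in rank_frequency(sample, top=5):
    print(f"{rank:<5} {freq:>7,}  {tok}")
```

Run over the full tagged collection instead of `sample`, the same loop would list the tags and the most common words together, since the tags are counted as ordinary tokens.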

Language Modeling

Unknown Shakespeare

Higher-order Models

Example

Smoothing