Text Formats, Tokenization, Term Distributions, Language Models




CS267

Chris Pollett

Aug 27, 2018

Outline

Introduction

Documents and the Update Model

Performance Evaluation

Probability Ranking Principle

The fundamental goal of relevance ranking is often expressed in terms of the Probability Ranking Principle:

If an IR system's response to each query is a ranking of the documents 
in the collection in order of decreasing probability of relevance, then 
the overall effectiveness of the system to its users will be maximized.
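As a minimal sketch of what the principle asks for, the snippet below orders documents by decreasing estimated probability of relevance. The function name `rank_by_relevance`, the document ids, and the probability values are hypothetical; estimating those probabilities is the hard part, and it is not shown here.

```python
# Hypothetical estimates of P(relevant | document, query).
# The numbers are invented for illustration only.
estimates = {"doc_a": 0.15, "doc_b": 0.80, "doc_c": 0.45}


def rank_by_relevance(prob_relevant):
    """Return document ids in order of decreasing estimated relevance."""
    return sorted(prob_relevant, key=lambda doc_id: prob_relevant[doc_id], reverse=True)


ranking = rank_by_relevance(estimates)
print(ranking)
```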

Quiz

Which of the following is true?

  1. IR is concerned with representing, searching, and manipulating large collections of text and data.
  2. A major task of a search engine is to maintain and provide access to a circular index.
  3. There will be six homeworks this semester for this class.

Text Formats

Example HTML page
<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a 
href="http://www.gutenberg.org/cache/epub/1524/pg1524.html"
>Full text of Hamlet</a></p>
</body>
</html>
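One way a search engine might pull indexable text and links out of such a page is sketched below using `html.parser` from Python's standard library. The class name `TextAndLinkExtractor` is an assumption made for illustration, and `page` is a copy of the example with the anchor closed by `</a>`; a real crawler would need to handle malformed markup, scripts, and character encodings.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collect visible text fragments and href targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor that carries an href attribute.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-empty text; whitespace between tags is dropped.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


page = """<html>
<head><title>A Line of Hamlet</title></head>
<body>
<h1>Hamlet being serious</h1>
<p>To be, or not to be,--that is the question:--</p>
<p><a href="http://www.gutenberg.org/cache/epub/1524/pg1524.html">Full text of Hamlet</a></p>
</body>
</html>"""

extractor = TextAndLinkExtractor()
extractor.feed(page)
extractor.close()
print(extractor.text_parts)
print(extractor.links)
```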

XML

More Text Formats

Here is a brief description of some other text formats:

Tokenization of English Text

Term Distributions

Rank Frequency Token
1     107,833  <LINE>
2     107,833  </LINE>
3      31,081  <SPEAKER>
4      31,081  </SPEAKER>
5      31,028  <SPEECH>
6      31,028  </SPEECH>
7      28,317  the
8      26,022  and
9      22,639  i
10     19,898  to
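A table like the one above could be produced by counting tokens, as in the rough sketch below. It assumes that XML tags count as tokens in their own right and that words are lowercased, which matches the look of the table but is not necessarily the exact tokenization rule used to build it. The sample text and the helper name `tokenize` are made up for illustration.

```python
import re
from collections import Counter

# A tiny made-up sample in the style of an XML-encoded play.
sample = """<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<LINE>To be, or not to be: that is the question:</LINE>
</SPEECH>"""


def tokenize(text):
    """Split text into tags (kept as-is) and lowercased alphanumeric words."""
    raw_tokens = re.findall(r"</?\w+>|\w+", text)
    return [tok if tok.startswith("<") else tok.lower() for tok in raw_tokens]


counts = Counter(tokenize(sample))

# Rank by decreasing frequency, breaking ties alphabetically.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for rank, (token, freq) in enumerate(ranked[:5], start=1):
    print(f"{rank:<6}{freq:>7}  {token}")
```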

Language Modeling

Unknown Shakespeare

Higher-order Models