HW#1 --- last modified February 14 2022 22:52:08.

Solution set.

Due date: Feb 14

Files to be submitted:
  Hw1.zip

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will give the names of the files where I want you to put your work. After you are done with the homework, put all the files you have created into a folder Hw1, zip it to make a file Hw1.zip, and submit that. Make sure this folder is smaller than 10MB so it can be uploaded to my homework submission system. The first part consists of the following exercises, which should be included in a file Hw1.pdf.

  1. Pick an author from among those listed on VictoriaStudies.net. Pick three of your favorite works by this author. Treat this as your corpus. (a) Using searches on this site, compute the MLE with respect to this three-document corpus for each of the terms: (i) it, (ii) was, (iii) the, (iv) best, (v) of, (vi) times, (vii) worst. (b) Then use these estimates to compute the likelihood of the phrase: It was the best of times, it was the worst of times. (The first sketch after this list shows the shape of this arithmetic.)
  2. Consider the fourth paragraph of Section 1.1 of the article Logic of Belief Revision from the Stanford Encyclopedia of Philosophy. Determine the frequency of each term in this paragraph. Assume case normalization, but no stemming. Create a log-scale plot of frequency versus rank. For this small a sample size, how well is Zipf's Law followed? (Do curve fitting and compute a `chi^2` statistic to estimate; the second sketch after this list shows one way to set this up.)
  3. Do Exercise 1.8 out of the book.
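
For exercise 1, the MLE of a term under a unigram model is just its count in the corpus divided by the total number of tokens, and the phrase likelihood in part (b) is the product of the term probabilities. The following minimal sketch in Python shows the shape of this arithmetic; the counts and corpus size below are hypothetical placeholders, not data from any real corpus.

from math import prod

# Hypothetical counts from a three-document corpus; substitute the
# counts you obtain from searches on VictoriaStudies.net.
term_counts = {"it": 120, "was": 300, "the": 2500, "best": 15,
               "of": 1400, "times": 40, "worst": 8}
total_tokens = 50_000   # hypothetical total token count of the corpus

# MLE of each term: count / total tokens.
mle = {t: c / total_tokens for t, c in term_counts.items()}

# Likelihood of the phrase under the unigram model: product of term MLEs.
phrase = "it was the best of times it was the worst of times".split()
print(prod(mle[t] for t in phrase))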
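
For exercise 2, one rough way to do the curve fitting is a least-squares fit of log frequency against log rank, followed by a Pearson `chi^2` comparison of the observed and fitted frequencies. Here is a sketch in Python using numpy; the frequency values are placeholders, not counts from the actual paragraph.

import numpy as np

# Placeholder term frequencies in decreasing order; replace these with
# the counts from the fourth paragraph of Section 1.1.
freqs = np.array([9.0, 7.0, 5.0, 4.0, 3.0, 3.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0])
ranks = np.arange(1, len(freqs) + 1, dtype=float)

# Zipf's Law predicts f = C / r^s, i.e., a straight line in log-log space.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
fitted = np.exp(intercept) * ranks ** slope

# Pearson chi-squared between observed and fitted frequencies.
chi2 = np.sum((freqs - fitted) ** 2 / fitted)
print(f"s = {-slope:.2f}, chi^2 = {chi2:.2f}")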

For the second part of the homework, I want you to download and install Yioop as described in class. Consider the web pages: the Amazon landing page, the Rotten Tomatoes (Movie Reviews) landing page, and The Sweet (Glam Rock Band) Wikipedia page. Without looking at these pages, identify three information needs you think each page would satisfy, in descending order of importance. For each of these information needs, come up with a query you might type into a search engine to obtain this information.

Next, look at each of these pages, both normally in a browser and by viewing the source HTML. For each page, make by hand a five-sentence summary of the contents of the page, using only phrases appearing on the page. For each of the queries you made, calculate by hand the number of the summaries you created that contain all the query words. (A tiny sketch of this all-terms check follows below.)

Log in to the root account of your copy of Yioop and go to the Page Options activity. Here you can select between several summarizers for a web crawl. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options, go to the Test Options tab, choose by URI as your method of testing, and then test each of these web pages. By looking at the output of these tests, determine what Yioop thought were the top 5 sentences for each page (look at both the description and the description scores to figure this out). Then, for Yioop's five-sentence summaries, determine for each query which summaries had all the query terms. How did Yioop's summarizers compare with the human results? Write your analysis in summaries.pdf and include this in your Hw1.zip file.
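
The all-query-terms check is mechanical enough that you may want to script it once you have done it by hand. Here is a tiny sketch, assuming lowercase alphanumeric tokenization; the query and summary strings are placeholders, not the actual pages' text.

import re

def contains_all_terms(query: str, summary: str) -> bool:
    # Tokenize both strings into lowercase alphanumeric words.
    words = set(re.findall(r"[a-z0-9]+", summary.lower()))
    return all(term in words for term in re.findall(r"[a-z0-9]+", query.lower()))

# Placeholder strings, not the real summaries.
print(contains_all_terms("glam rock band", "The Sweet were a British glam rock band."))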

For Part 3, I want you to code a simple inverted index printer for character n-grams. You have some flexibility in the choice of the language you use to code this project (see below). It will be run from the command line with a line like:

aux_program CharGramIndexPrinter corpus_file n

Here aux_program is either empty (if you decide to use C, C++, or Rust) or java, node, php, or python (your choice, if you decide to code in one of these languages). For node, php, or python, CharGramIndexPrinter should be CharGramIndexPrinter.js, CharGramIndexPrinter.php, or CharGramIndexPrinter.py respectively. Your program will be compiled on a Mac running macOS Monterey, with the compiler/interpreter installed using brew. An example input line I might use in running the program is:

php CharGramIndexPrinter.php some-corpus.txt 4

The corpus_file is assumed to be an ASCII text file in which \n indicates a newline. Within it, a line beginning with --- (three hyphens) indicates the start of a new document. A "term" within a document is any sequence of `n` characters, where `n` comes from the command line. All characters are converted to lower case before processing. Each maximal run of whitespace characters is converted to a single underscore _ (leading and trailing whitespace in a document is simply dropped), and each document is prepended and appended with `n-1` underscores before processing. So, for example, if some-corpus.txt was:

The quick    or  
the brown fox
---
jumped over sally
---
and the brilliant bob fox
---
but not the brown peter bunny

The first document before processing would be converted to the string:

___the_quick_or_the_brown_fox___

The character 4 grams in this document would be:

___t
__th
_the
the_
he_q
e_qu
_qui
quic
uick
ick_
ck_o
k_or
_or_
or_t
r_th
_the
the_
he_b
e_br
_bro
brow
rown
own_
wn_f
n_fo
_fox
fox_
ox__
x___

The corpus would be considered to have four documents: (1) The quick or the brown fox, (2) jumped over sally, (3) and the brilliant bob fox, (4) but not the brown peter bunny. When run, your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout the term on a line by itself. On the next line, it should output the number of documents the term appeared in, a comma, the number of occurrences of the term across all documents, a comma, and then a sequence of pairs of the form (doc_num, character position within the document at which the term appears). For example, the entry for _jum might look like:

_jum
1,1,(1,2)

Notice the first document is treated as document 0, the second as 1; similarly, the first position within a document is treated as 0, the next as 1, etc. Once done outputting this data for every term, your program should stop. That is all there is to Part 3.
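
To make the expected behavior concrete, here is a minimal sketch of the whole program in Python (one of the allowed languages). It follows the spec above; the one detail it has to guess at is the separator between posting pairs, since the example entry shows only a single pair, so the commas between pairs are an assumption.

#!/usr/bin/env python3
# CharGramIndexPrinter.py -- a sketch, not a reference solution.
# Usage: python3 CharGramIndexPrinter.py some-corpus.txt 4
import re
import sys
from collections import defaultdict

def read_documents(path):
    # Documents are separated by lines beginning with --- (three hyphens).
    docs, current = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("---"):
                docs.append("".join(current))
                current = []
            else:
                current.append(line)
    docs.append("".join(current))
    return docs

def normalize(doc, n):
    # Lowercase, collapse each run of whitespace to one underscore
    # (dropping leading/trailing whitespace), then pad both ends
    # with n-1 underscores.
    pad = "_" * (n - 1)
    return pad + re.sub(r"\s+", "_", doc.strip().lower()) + pad

def main():
    corpus_file, n = sys.argv[1], int(sys.argv[2])
    postings = defaultdict(list)   # term -> list of (doc_num, position)
    for doc_num, doc in enumerate(read_documents(corpus_file)):
        text = normalize(doc, n)
        for pos in range(len(text) - n + 1):
            postings[text[pos:pos + n]].append((doc_num, pos))
    for term in sorted(postings):
        entries = postings[term]
        num_docs = len({d for d, _ in entries})
        pairs = ",".join(f"({d},{p})" for d, p in entries)
        print(term)
        print(f"{num_docs},{len(entries)},{pairs}")

if __name__ == "__main__":
    main()

On the sample corpus above with n equal to 4, this prints the entry for _jum as 1,1,(1,2), matching the example.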

Point Breakdown

Part 1 (each problem 1pt: 1 if fully correct, 0.5 if any defect or incomplete, 0 if not done or wrong) 3pts
summaries.pdf contains human experiments (1pt), Yioop experiment results (have at least one screenshot) (1pt), comparison (1pt) 3pts
Code runs as described and reads the documents in the file into a reasonable data structure (1/2pt each) 1pt
Program creates an inverted index in memory for n-chargrams (1pt keeps count of num_docs and frequency for each chargram, 1pt keeps track of information needed for each record associated with an entry). 2pts
Program outputs the inverted index as described. 1pt
Total 10pts