
HW#1 --- last modified August 28 2023 07:06:15.

Solution set.

Due date: Sep 11

Files to be submitted:
  Hw1.zip

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index; to gain experience using an open-source crawler; and to practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say which file names I want you to use for your answers. After you have finished the homework, take all the files you have created, put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that. Remember to also include a readme.txt file listing the teammates in your group, as well as any code set-up or other details you would like to mention to help me grade your homework.
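
When you are ready to submit, one way to produce the archive, on a Mac or Linux machine, is with the standard zip tool, run from the directory containing your Hw1 folder:

zip -r Hw1.zip Hw1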

The first part consists of the following exercises, which should be included in a file Hw1.pdf.

  1. The site The Victoria Studies Archive can be used to compute concordance information about various Victorian-era authors. Pick your favorite Victorian author (for example, Charles Dickens). Pick three books by that author (for example, Oliver Twist, David Copperfield, Nicholas Nickleby). Using these three works as your corpus and using searches on the site, compute the MLE (maximum likelihood estimate) of each of the following terms: (i) my, (ii) cat, (iii) likes, (iv) dog, (v) food. Then, using these estimates, compute the likelihood of the phrase: My cat likes dog food. (A sketch of the computations for this exercise and the next appears after this list.)
  2. Pick a paragraph from one of the works you chose for the first exercise. For each term in this paragraph, determine its frequency. Create a log-scale plot of frequency versus rank. For this small a sample size, how well is Zipf's Law followed?
  3. Assume we start in state 1 of the Hidden Markov Model example diagram from class. What are all the phrases of length 3 that could be generated, and what is the probability of each? Show how these probabilities could be computed using matrix multiplication. (See the second sketch after this list.)
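
For exercises 1 and 2, the following minimal Python sketch shows the shape of the computations (Python being one of the languages allowed in Part 3). The corpus size, term counts, and file name are hypothetical placeholders; substitute the values you obtain from your concordance searches and your chosen paragraph.

  import math
  import re
  from collections import Counter

  # Exercise 1: unigram MLE estimates. The corpus size and counts below
  # are made up; use the numbers from your concordance searches.
  corpus_size = 500000
  counts = {"my": 2500, "cat": 40, "likes": 120, "dog": 90, "food": 200}
  mle = {term: c / corpus_size for term, c in counts.items()}
  # Under a unigram model, the likelihood of the phrase is the product
  # of the individual term probabilities.
  phrase = ["my", "cat", "likes", "dog", "food"]
  print(mle)
  print(math.prod(mle[t] for t in phrase))

  # Exercise 2: frequency versus rank for a paragraph, in log coordinates.
  # "paragraph.txt" is a placeholder for your chosen paragraph.
  text = open("paragraph.txt").read().lower()
  freqs = Counter(re.findall(r"[a-z]+", text))
  ranked = sorted(freqs.items(), key=lambda kv: -kv[1])
  for rank, (term, freq) in enumerate(ranked, 1):
      print(rank, term, freq, math.log(rank), math.log(freq))
  # Zipf's Law predicts log(freq) is roughly log(C) - log(rank),
  # i.e., the points should fall near a line of slope -1.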
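
For exercise 3, the actual transition and emission probabilities come from the class diagram, which is not reproduced here, so the matrices below are made-up stand-ins for a two-state, two-word model; only the matrix-multiplication pattern matters. Assuming the usual convention that one word is emitted per state visited, the probability of a length-3 phrase w1 w2 w3 when we start in state 1 is start * D(w1) * A * D(w2) * A * D(w3) * ones, where A is the transition matrix and D(w) is the diagonal matrix of each state's probability of emitting w.

  import itertools
  import numpy as np

  # Hypothetical stand-ins for the class diagram's values: replace A, B,
  # and words with the actual transition and emission probabilities.
  A = np.array([[0.7, 0.3],     # A[i][j] = P(next state j | state i)
                [0.4, 0.6]])
  words = ["cat", "dog"]
  B = np.array([[0.9, 0.1],     # B[i][k] = P(emit words[k] | state i)
                [0.2, 0.8]])
  start = np.array([1.0, 0.0])  # we are told we begin in state 1

  def D(k):
      # Diagonal matrix of each state's probability of emitting words[k].
      return np.diag(B[:, k])

  ones = np.ones(len(A))
  for w1, w2, w3 in itertools.product(range(len(words)), repeat=3):
      p = start @ D(w1) @ A @ D(w2) @ A @ D(w3) @ ones
      print(words[w1], words[w2], words[w3], p)
  # The probabilities printed over all length-3 phrases sum to 1.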

For Part 2 of the homework, I want you to download and install Yioop as described in class. Consider the web pages: the Etsy landing page, the Cinequest Film Festival page, and the Halifax Explosion Wikipedia page (since the Oppenheimer movie just came out, and I was born in Halifax). Without looking at these pages, identify three information needs you think each page would provide, in descending order of importance. For each of these information needs, come up with a query you might type into a search engine to satisfy that information need. Next, look at each of these pages, both normally in a browser and by viewing the source HTML. For each page, make, by hand, a five-sentence summary of the contents of the page using only phrases appearing on the page. For each of the queries you made, calculate by hand the number of the summaries you created that contain all the query words.

Then log in to the root account of your copy of Yioop and go to the Page Options activity. Here you can select between several summarizers for a web crawl. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options, go to the Test Options tab, choose by URI as your method of testing, and test each of these web pages. By looking at the output of these tests, determine what Yioop thought were the top 5 sentences for each page (look at both the description and the description scores to figure this out). Then, for Yioop's five-sentence summaries, determine for each query which summaries had all the query terms.

How did Yioop's summarizers compare with the human results? Write your analysis in summaries.pdf and include this file in your Hw1.zip.
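
The coverage counts in Part 2 are meant to be computed by hand, but if you want to sanity-check them, a few lines of Python like the following will do; the queries and summary texts shown are placeholders for your own.

  import re

  queries = ["halifax explosion death toll"]          # your queries
  summaries = {"human": "...", "yioop-basic": "..."}  # your five-sentence summaries

  def words(text):
      # The set of lower-cased alphabetic words in a text.
      return set(re.findall(r"[a-z]+", text.lower()))

  for q in queries:
      needed = words(q)
      hits = [name for name, s in summaries.items() if needed <= words(s)]
      print(q, "->", len(hits), "summaries contain all query words:", hits)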

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter filename

Here aux_program depends on which language you choose for your solution: C/C++, Java, PHP, or Python (your choice). For PHP or Python, IndexPrinter should be either IndexPrinter.php or IndexPrinter.py, respectively. As an example, I might type

java IndexPrinter foo.txt

If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled/run on a Mac, with brew used to install the compiler or interpreter. filename is the name of some txt document. This document is assumed to contain only ASCII characters, with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. For example, if foo.txt contained:

this is a first document

awesome second document

a third document

Then it would represent a corpus of three documents. A "term" is any maximal sequence of characters that includes no white-space characters, decimal digits, or punctuation. Once run, your code should read this file into a data structure of your choice in memory. We assume for this homework that the file is small enough to be read into memory. Your program should sort all terms seen in any document. It should then output three lines, each delimited by a newline character. The first line should consist of a sequence of four-digit decimal numbers (in a real index we might use ints):

num_terms_in_vocabulary position_of_start_of_first_term position_of_start_of_second_term ...

The second line should consist of the alphabetically ordered terms, each followed by a four-digit decimal offset into the third line. Finally, the third line consists of a sequence of comma-separated decimal numbers giving, for each term, the documents that term appeared in. The three lines for the foo.txt document presented above should look like:

000800000005001600280037004300530062
a0000awesome0004document0006first0012is0014second0016third0018this0020
1,3,2,1,2,3,1,1,2,3,1

In the example above, 0008 in the first line indicates that the corpus has eight distinct terms in its vocabulary. The 0016 in the first line points to the position of document0006 in the second line. Between offsets 0006 and 0012 of the third line are the documents that the term "document" appears in, i.e., 1,2,3 (the first, second, and third documents). Once done outputting this data, your program should stop. That is all there is to Part 3.
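
To make the output format concrete, here is a minimal Python sketch of one possible IndexPrinter. It is a starting point written against the assumptions above (ASCII input, '\n\n' as the document separator, terms as runs of letters), not a reference solution.

  import re
  import sys
  from collections import defaultdict

  def main():
      with open(sys.argv[1]) as f:
          text = f.read()
      # The sequence '\n\n' separates documents within the file.
      docs = text.split("\n\n")
      # A term is a maximal run of characters with no whitespace, digits,
      # or punctuation; for ASCII input, that means a run of letters.
      postings = defaultdict(list)
      for doc_num, doc in enumerate(docs, 1):
          for term in re.findall(r"[A-Za-z]+", doc):
              if doc_num not in postings[term]:
                  postings[term].append(doc_num)
      terms = sorted(postings)
      # Build line 2 (each term followed by a 4-digit offset into line 3)
      # and line 3 (comma-separated document numbers) together.
      line2 = ""
      line3_parts = []
      term_starts = []      # where each term begins in line 2, for line 1
      offset = 0            # character offset into line 3
      for term in terms:
          docs_str = ",".join(str(d) for d in postings[term])
          term_starts.append(len(line2))
          line2 += term + "%04d" % offset
          line3_parts.append(docs_str)
          offset += len(docs_str) + 1   # +1 for the joining comma
      line1 = "%04d" % len(terms) + "".join("%04d" % p for p in term_starts)
      print(line1)
      print(line2)
      print(",".join(line3_parts))

  if __name__ == "__main__":
      main()

Run on the foo.txt above with a line like python IndexPrinter.py foo.txt, this sketch reproduces the three example lines shown earlier.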

Point Breakdown:

Part 1 (each problem 1pt: 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 3pts
summaries.pdf contains human experiments (1pt), Yioop experiment results (including at least one screenshot) (1pt), and a comparison (1pt) 3pts
Code runs as described (1/2pt) and (1/2pt) 1pt
Program outputs on all test cases the appropriate three lines using the formats described above (1pt/line) 3pts
Total: 10pts