HW#1 --- last modified September 23 2019 22:09:27.

Solution set.

Due date: Sep 11

Files to be submitted:
  Hw1.zip

Purpose: To gain some experience with language modeling, dust off our programming skills, and start work on coding an inverted index; to gain experience using an open source crawler; and to practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say which files I want you to put your work in. After you are done with the homework, take all the files you have created, put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that. Make sure this folder is smaller than 10MB so it can be uploaded to my homework submission system.

The first file I want you to make is problems.pdf. In it, put your answers to the following questions:

  1. Pick an author from among those listed on VictoriaStudies.net. Pick three of their favorite works. Treat this as your corpus. (a) Compute, by using searches on this site, the MLE with respect to this three document corpus for the terms: (i) it, (ii) was, (iii) a, (iv) dark, (v) and, (vi) stormy, (vii) night. (b) Then use these to compute the likelihood of the phrase: It was a dark and stormy night. (A sketch of the arithmetic involved appears after this problem list.)
  2. Take the fourth paragraph from the Skeptic's Dictionary Zombie Page (the paragraph on philosophical zombies). For each term in this paragraph, determine its frequency. Assume case normalization, but no stemming. Create a log scale plot of frequency versus rank. For this small a sample size, how well is Zipf's Law followed? (Do curve fitting and compute a `Chi^2` to estimate; a code sketch of this computation also appears after the problem list.)
  3. Do exercise 1.7 out of the book. Then get a good upper bound on the number of states one needs to add as a function of the number of states and transitions in the input model.
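
For problem 1, the arithmetic amounts to the following (a sketch in LaTeX notation; here count(t) is the number of occurrences of term t in your three-work corpus and N is the total number of term occurrences in it):

\hat{P}_{\mathrm{MLE}}(t) = \frac{\mathrm{count}(t)}{N}

and, under the unigram model these estimates define,

\hat{P}(\text{it was a dark and stormy night}) = \prod_{i=1}^{7} \hat{P}_{\mathrm{MLE}}(w_i)

where w_1, ..., w_7 are the seven terms from part (a).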
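
For problem 2, the fit and `Chi^2` computation can be sketched in Python (one of the allowed languages for Part 3) as below. The file name zombie.txt is a placeholder for wherever you saved the paragraph, and fitting log f = a - s log r by least squares is one reasonable choice, not the only one:

import math
from collections import Counter

# Case-normalized term frequencies, sorted into rank order; you may want
# to strip punctuation before splitting, depending on your term definition
text = open("zombie.txt").read().lower()
freqs = sorted(Counter(text.split()).values(), reverse=True)
ranks = range(1, len(freqs) + 1)

# Least-squares fit of log(freq) = a - s*log(rank), i.e., freq ~ C / rank^s
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
s, C = -slope, math.exp(my - slope * mx)

# Pearson Chi^2 of the observed frequencies against the fitted curve
chi2 = sum((f - C / r ** s) ** 2 / (C / r ** s) for r, f in zip(ranks, freqs))
print(f"fitted s = {s:.3f}, C = {C:.2f}, Chi^2 = {chi2:.2f}")

The log scale plot of frequency versus rank still needs to be produced separately (for instance, with matplotlib); the fitted curve above is the one to compare against.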

For the second part of the homework, I want you to download and install Yioop as described in class. Consider the web pages: the Yahoo landing page, the Internet Movie Database landing page, and the Wikipedia Canada page. Without looking at these pages, identify three information needs you think each page would satisfy, in descending order of importance. For each of these information needs, come up with a query you might type into a search engine to satisfy it. Next, look at each of these pages, both normally in a browser and by viewing the source HTML. For each page, make by hand a five sentence summary of the contents of the page using only phrases appearing on the page. For each of the queries you made, calculate by hand the number of your summaries that contain all the query words (a small helper sketch for this tallying appears below).

Log in to the root account of your copy of Yioop and go to the Page Options activity. Here you can select between several summarizers for a web page. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options, go to the Test Options tab, choose by URI as your method of testing, and then test each of these web pages. By looking at the output of these tests, determine what Yioop thought were the top 5 sentences for each page (look at both the description and the description scores to figure this out). Then, for Yioop's five sentence summaries, determine for each query which summaries had all the query terms. How did Yioop's summarizers compare with the human results? Write your analysis in summaries.pdf and include this in your Hw1.zip file.
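
To double check the by-hand tallies in this part, something like the following Python sketch can be used. The file names and queries here are hypothetical placeholders for your own summaries and queries, not part of the assignment:

from pathlib import Path

# Hypothetical placeholders: one text file per five sentence summary,
# and the queries you came up with for your information needs.
summary_files = ["yahoo.txt", "imdb.txt", "wikipedia_canada.txt"]
queries = ["canada official languages", "movie release dates", "yahoo news headlines"]

# A summary "matches" a query if it contains every query word,
# after case normalization and whitespace tokenization.
summaries = [set(Path(p).read_text().lower().split()) for p in summary_files]
for query in queries:
    words = set(query.lower().split())
    hits = sum(1 for terms in summaries if words <= terms)
    print(f"{query!r}: all words appear in {hits} of {len(summaries)} summaries")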

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter corpus_file exclude_term1 exclude_term2 ...

Here your language choice is one of C, C++, Java, PHP, or Python, and aux_program is the corresponding interpreter, if any. For PHP or Python, IndexPrinter should be IndexPrinter.php or IndexPrinter.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Mojave, on which brew has been used to install the compiler/interpreter. An example command line I might use to run the program is:

php IndexPrinter.php some-corpus.txt bob sally

The corpus_file is assumed to be an ASCII text file in which \n is used to indicate a newline. Within it, a line beginning with --- (three hyphens) indicates the start of a new document. A "term" within a document is any maximal sequence of characters that does not include a white-space character. So for example, if some-corpus.txt was:

The quick or the brown fox
---
jumped over sally
---
and the brilliant bob fox
---
but not the brown peter bunny 
then the corpus would be considered to have four documents: (1) The quick or the brown fox, (2) jumped over sally, (3) and the brilliant bob fox, (4) but not the brown peter bunny. The first of these documents has six terms. You should do case normalization on terms but not stem them. So "The" and "the" are the same term, but "run" and "runs" aren't.

When run, your program should sort all terms seen in any document that does not contain any of the exclude terms; in this example, that means documents (1) and (4). For each term in this sorted order, your program should output to stdout the term on a line by itself. On the next line it should output the number of documents the term appeared in, a comma, the number of occurrences of the term across all such documents, a comma, and then a sequence of pairs of the form (doc_num, position of the term within the document), where, as the sample output below shows, positions count terms starting from 1, not characters. For the example above, the output should look like:
brown
2,2,(1,5),(4,4)
bunny
1,1,(4,6)
but
1,1,(4,1)
fox
1,1,(1,6)
not
1,1,(4,2)
or
1,1,(1,3)
peter
1,1,(4,5)
quick
1,1,(1,2)
the
2,3,(1,1),(1,4),(4,3)

Once done outputting this data, your program should stop. That is all there is to Part 3.
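
To make the expected behavior concrete, here is a minimal sketch of an IndexPrinter in Python, one of the allowed language choices. It follows the assumptions above (whitespace tokenization, case normalization without stemming, 1-indexed term positions) and is meant as a starting point rather than a definitive implementation:

import sys

def main():
    corpus_file = sys.argv[1]
    exclude = set(term.lower() for term in sys.argv[2:])
    # Split the corpus into documents on lines beginning with ---
    docs, current = [], []
    with open(corpus_file) as f:
        for line in f:
            if line.startswith("---"):
                docs.append(current)
                current = []
            else:
                current.extend(line.split())
    docs.append(current)
    # Build the inverted index, skipping documents that contain an exclude term
    index = {}  # term -> list of (doc_num, term_position) postings
    for doc_num, doc in enumerate(docs, start=1):
        terms = [term.lower() for term in doc]  # case normalize, no stemming
        if exclude & set(terms):
            continue
        for position, term in enumerate(terms, start=1):
            index.setdefault(term, []).append((doc_num, position))
    # For each term in sorted order, print the term, then its document count,
    # total number of occurrences, and the list of postings
    for term in sorted(index):
        postings = index[term]
        num_docs = len(set(d for d, _ in postings))
        pairs = ",".join(f"({d},{p})" for d, p in postings)
        print(term)
        print(f"{num_docs},{len(postings)},{pairs}")

if __name__ == "__main__":
    main()

Run as python IndexPrinter.py some-corpus.txt bob sally on the example corpus, this prints exactly the sample output above.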

Point Breakdown

Part 1 (each problem graded: 1pt fully correct, 0.5pt any defect or incomplete, 0pt not done or wrong) - 3pts
summaries.pdf contains the human experiments (1pt), the Yioop experiment results, with at least one screenshot (1pt), and the comparison (1pt) - 3pts
Code runs as described and reads the documents in the file into a reasonable data structure (1/2pt each) - 1pt
Program creates an inverted index in memory (1pt for keeping the count of num_docs and the frequency for each term; 1pt for keeping track of the information needed for each record associated with an entry) - 2pts
Program outputs the inverted index as described, excluding documents as described - 1pt
Total - 10pts