CS267

HW#3 --- last modified October 14 2019 22:13:52.

Solution set.

Due date: Oct 21

Files to be submitted:
  Hw3.zip

Purpose: To gain familiarity with static inverted index construction techniques. To experiment with Yioop as an IR library.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: An exercise part, a coding part, and an evaluation part. For the exercise part of the assignment I want you to do the following book problems or variants of book problems:

  1. Do Exercise 4.5. Implement with pseudocode. Rather than just support prefix/begins with queries like foo*, you should also support more general queries like *foo (ends with foo), and *foo* (contains foo).
  2. Suppose we were using sort based indexing. After index time we want to support adding new documents to the corpus. Suggest how to do this, give pseudo-code. Analyze the performance of your proposal.
  3. Do Exercise 5.1.
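To pin down what the three wildcard forms in problem 1 are meant to match, here is a naive reference matcher written in Python for illustration only. It scans the whole vocabulary, so it is not the index-based solution the exercise asks for; the function name match_wildcard and the sample vocabulary are hypothetical.

```python
def match_wildcard(pattern, vocabulary):
    """Return the sorted vocabulary terms matching foo*, *foo, or *foo*.

    Brute-force reference semantics only: every term is tested, so this
    says nothing about how a dictionary structure should answer the query.
    """
    core = pattern.strip("*")
    if not core or "*" in core:
        raise ValueError("expected a pattern of the form foo*, *foo, or *foo*")
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    if leading and trailing:      # *foo*  (contains foo)
        test = lambda term: core in term
    elif trailing:                # foo*   (begins with foo)
        test = lambda term: term.startswith(core)
    elif leading:                 # *foo   (ends with foo)
        test = lambda term: term.endswith(core)
    else:                         # no wildcard: exact match
        test = lambda term: term == core
    return sorted(term for term in vocabulary if test(term))


vocabulary = ["bar", "buffoon", "foo", "food", "tofoo"]
print(match_wildcard("foo*", vocabulary))
print(match_wildcard("*foo", vocabulary))
print(match_wildcard("*foo*", vocabulary))
```

A matcher like this can serve as a check on the results your pseudocode should produce for small vocabularies.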

For the coding part of the homework, I want you to write a PHP program to compare using char-gramming versus using stemming. Your program should make use of Yioop as a library using Composer. After unzipping your Hw3.zip file and changing directory into the resulting folder, I will at the command line switch into a subfolder code. This should have a file composer.json, so that I can install any dependencies your code has by typing:

composer install

Before I type this, your submitted project should have an empty vendor subfolder. Your program will then be run by typing a command with the format:

php term_preprocess_index.php some_file_name pre_process_method

For example,

php term_preprocess_index.php exciting_urls.txt char

In the above, some_file_name should be the name of a file containing urls, with one url per line. pre_process_method should have one of the values none, char, or stem. Your program should fetch the page for each url given (we will assume the urls are for HTML pages only), split the document into terms, then either do no processing on these terms, char-gram the terms, or stem the terms. Finally, it should output a sorted list in the same format as homework 1. The docid for this homework is the number of the line on which the url appeared in the some_file_name file.
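The command-line contract just described can be sketched as follows. This is a Python illustration only (the submitted program must be PHP); the helper names parse_args and number_urls are hypothetical, and numbering lines from 1 is an assumption, so use whatever convention matches your homework 1 output.

```python
VALID_METHODS = ("none", "char", "stem")


def parse_args(argv):
    """Validate [program, some_file_name, pre_process_method]."""
    if len(argv) != 3 or argv[2] not in VALID_METHODS:
        raise SystemExit(
            "usage: term_preprocess_index some_file_name none|char|stem")
    return argv[1], argv[2]


def number_urls(lines):
    """Map docid -> url, where docid is the line the url appeared on.

    Assumption: lines are counted from 1, and blank lines still consume
    a line number but produce no document.
    """
    docs = {}
    for line_number, line in enumerate(lines, start=1):
        url = line.strip()
        if url:
            docs[line_number] = url
    return docs


file_name, method = parse_args(
    ["term_preprocess_index", "exciting_urls.txt", "char"])
print(file_name, method)
print(number_urls(["https://a.example/\n", "https://b.example/\n"]))
```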

To download the urls, I want you to use the method seekquarry\yioop\library\FetchUrl::getPages(). Except for the first argument to this function, you can leave all other arguments at their default values. I want you to get all the urls using just one call to this method. To do the stemming, I want you to use Yioop as in the example of using composer from class. For char-gramming, I want you to use the seekquarry\yioop\library\PhraseParser::charGramsTerms() method. In addition to using Yioop for stemming/char-gramming, I want you to use the seekquarry\yioop\library\processors\HtmlProcessor class to do the text extraction from the HTML documents. To do this, create an instance of HtmlProcessor with max description set at 20000, the summarizer as CrawlConstants::CENTROID_WEIGHTED_SUMMARIZER, and everything else at its default value. Then call the process() method with the read-in contents of an HTML file and the url that the page came from. This method should, as part of its returned value, provide a locale that can be used with the stemmer/char-grammer. To better understand the returned components of this method, it helps to look at src/library/CrawlConstants.php.
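As a reminder of what char-gramming does to a term, here is a small Python sketch of overlapping character n-grams. It does not reproduce the Yioop method: the function name char_grams, the gram length n=3, and the choice to keep short terms whole are all assumptions made for illustration, and Yioop's gram length and edge handling may differ by locale.

```python
def char_grams(term, n=3):
    """Return the overlapping character n-grams of a term.

    Assumption: a term no longer than n is kept as a single gram.
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


for term in "I once knew a jumpy cat".split():
    print(term, char_grams(term))
```

Because each term of length m > n contributes m - n + 1 grams, a char-grammed index typically has more postings than an unprocessed one, which is worth keeping in mind when comparing index sizes.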

In addition to term_preprocess_index.php write another short program, term_preprocess.php, and include it in the same folder. When run with a command line argument in quotes followed by char or stem as an additional argument, it should just output the result of char-gramming or stemming each term in the argument according to the en-US locale. For example,

php term_preprocess.php "I once knew a jumpy cat" stem

would output:

I onc knew a jumpi cat

For the experiments portion of the homework, I want you to use the three urls you used in HW1 and put them in an input text file. Also, you will use the queries you came up with for Hw1.

Assume each of your queries is a disjunctive query and that we are ranking results using TF-IDF scores and the vector space model. Compute by hand, showing work, the MAP scores for your topics in each of the three cases: the terms in the corpus and query were char-grammed, stemmed, or left unprocessed. What are your conclusions about the effectiveness of char-gramming and stemming? Also compare the relative sizes of the inverted indexes in the char-grammed, stemmed, and unprocessed cases. Put all your work for these experiments in the Hw3.pdf file you submit with the homework.
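The hand computation can be cross-checked with a short script. The Python sketch below is one possible reading, under stated assumptions: TF is taken as log2(f) + 1, IDF as log2(N / N_t), scores are cosine similarities, ties are broken by docid, and only documents containing at least one query term are returned. The function names and the toy corpus are hypothetical; if the class used a different TF-IDF variant, substitute it.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs maps docid -> list of terms; returns (vectors, doc_freq)."""
    n_docs = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    vectors = {
        docid: {t: (math.log2(f) + 1) * math.log2(n_docs / doc_freq[t])
                for t, f in Counter(terms).items()}
        for docid, terms in docs.items()
    }
    return vectors, doc_freq


def rank(query_terms, docs):
    """Rank docs for a disjunctive query by cosine similarity."""
    vectors, doc_freq = tf_idf_vectors(docs)
    n_docs = len(docs)
    query = {t: (math.log2(f) + 1) * math.log2(n_docs / doc_freq[t])
             for t, f in Counter(query_terms).items() if t in doc_freq}
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = {}
    for docid, vec in vectors.items():
        dot = sum(w * vec.get(t, 0.0) for t, w in query.items())
        d_norm = math.sqrt(sum(w * w for w in vec.values()))
        if dot > 0 and q_norm > 0 and d_norm > 0:
            scores[docid] = dot / (q_norm * d_norm)
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k holding a relevant doc."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(topics):
    """topics is a list of (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


docs = {1: ["cat", "cat", "dog"], 2: ["dog", "fish"], 3: ["fish", "bird"]}
ranked = rank(["dog", "fish"], docs)
print(ranked, average_precision(ranked, {1, 2}))
```

Running the same functions on the char-grammed, stemmed, and unprocessed term lists gives three MAP values to compare against your hand calculations.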

Point Breakdown
Book Problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 3pts
Code is reasonably well-commented and structured/code folder is as described and composer install does install what's needed for this assignment. 1pt
term_preprocess.php operates as described above. 1pt
term_preprocess_index.php makes use of seekquarry\yioop\library\FetchUrl::getPages as described. 1pt
term_preprocess_index.php makes use of seekquarry\yioop\library\PhraseParser::stemTerms and charGramsTerms as described. 1pt
term_preprocess_index.php makes use of seekquarry\yioop\library\processors\HtmlProcessor as described. 1pt
Program produces correct output on test files. 1pt
MAP scores computed correctly and experiment conclusions seem reasonable. 1pt
Total 10pts