Chris Pollett> CS267

HW#3 --- last modified March 09 2022 20:56:15.

Solution set.

Due date: Apr 4

Files to be submitted:
  Hw3.zip

Purpose: To gain familiarity with static inverted index construction techniques. To experiment with Yioop as an IR library.

Related Course Outcomes:

CLO2 -- Be able to calculate by hand on small examples precision (fraction of returned results that are relevant), recall (fraction of relevant results that are returned), and other IR statistics. (For this assignment, we are computing statistics about index size.)

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: an exercise part, a coding part, and an evaluation part. For the exercise part, I want you to do the following exercises from the book: Exercise 4.1, modified so that you now have 48 million postings, each of which is 8 bytes long; Exercise 4.4; and Exercise 4.5 (here, rather than supporting only prefix/begins-with queries like foo*, you should also support more general queries like f?oo*, where ? can match any single character).
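
To pin down what the wildcard patterns in the modified Exercise 4.5 are meant to match, here is a minimal sketch in Python (for illustration only; it is not a solution to the exercise). It assumes the dictionary is just an in-memory list of terms and scans it linearly using the standard library's fnmatch module, in which ? matches exactly one character and * matches any run of characters. The function name match_terms and the sample terms are made up for this example.

```python
from fnmatch import fnmatchcase

def match_terms(dictionary, pattern):
    """Return, in sorted order, the terms that match a ?/* wildcard pattern.

    This is a brute-force scan used only to illustrate the query semantics;
    it says nothing about how an index should support such queries.
    Note that fnmatch also gives '[' a special meaning, so this assumes
    terms and patterns contain no square brackets.
    """
    return [term for term in sorted(dictionary) if fnmatchcase(term, pattern)]

# Hypothetical dictionary terms, chosen only for this example.
terms = ["fool", "fooled", "flood", "flooding", "frooty", "foo", "afoot"]

print(match_terms(terms, "foo*"))   # prefix query
print(match_terms(terms, "f?oo*"))  # one arbitrary character after the f
```

With these sample terms, foo* matches foo, fool, and fooled, while f?oo* matches flood, flooding, and frooty but not fool, because the character after f must be followed by oo.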

For the coding part of the assignment, I want you to write code that will allow you to conduct experiments on using the wrong language's stemmer when processing a web page. First, I want you to go to a website that has statistics for popular web search engine queries, such as Mondovo. From these, pick a list of 5 multi-word queries you find interesting. Next, search for each of these queries on at least Google and Bing and obtain the urls of the top five results for each. The web pages associated with these urls will be your corpus.

Your program to conduct the experiments should be in a sub-folder named code of the Hw3.zip file you submit. After unzipping your Hw3.zip file and changing directory into the resulting folder, I will switch into the code subfolder at the command line. This subfolder should have a file composer.json so that I can install any dependencies your code has by typing:

composer install

Before I type this, your submitted project should have an empty vendor subfolder. Your program will then be run by typing a command with the format:

php term_preprocess.php some_file_name locale_or_none

For example, I might type:

php term_preprocess.php test_urls.txt fr-FR

In the above, some_file_name should be the name of a file containing urls, with one url per line. locale_or_none should be either none or the name of a locale for which Yioop has a stemmer. Your program should fetch the page for each url given (we will assume the urls are for HTML pages only), extract text as described below, split the resulting text into terms, then stem the terms using the stemmer for the given locale (or do no stemming in the case of none). For each url in some_file_name, your program should output the url, followed by a sorted list of the appropriately stemmed words found, followed by a blank line. For example, the following output might occur if there were two urls:

https://some_made_up_site1.com/
I
knew
onc

https://some_made_up_site2.com/
a
cat
jumpi
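
The output layout above can be sketched as follows. This is a minimal illustration in Python only (the submitted program must be PHP), and the function name format_report is made up for this example. It assumes the terms for each url have already been extracted and stemmed, and it sorts by plain character code, which is why the capitalized I comes before the lowercase words. The spec does not say whether repeated terms should be collapsed, so the sketch leaves each list as given.

```python
def format_report(terms_by_url):
    """Build the report text: each url, its sorted terms, then a blank line."""
    parts = []
    for url, terms in terms_by_url.items():
        lines = [url] + sorted(terms)
        # One line per item, plus an empty line closing this url's block.
        parts.append("\n".join(lines) + "\n\n")
    return "".join(parts)

# Made-up input mirroring the example output above.
sample = {
    "https://some_made_up_site1.com/": ["onc", "I", "knew"],
    "https://some_made_up_site2.com/": ["jumpi", "a", "cat"],
}
print(format_report(sample), end="")
```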

To download the urls, I want you to use the method seekquarry\yioop\library\FetchUrl::getPages(). Except for the first argument to this function, you can leave all other arguments at their default values. I want you to get all the urls using just one call to this method. To do the stemming, I want you to use Yioop as in the example of using composer from class. I want you to use the seekquarry\yioop\library\processors\HtmlProcessor class to do the text extraction from the HTML documents. To do this, create an instance of HtmlProcessor with max description set at 20000, the summarizer set to CrawlConstants::CENTROID_WEIGHTED_SUMMARIZER, and everything else at its default value. Then call the process() method with the contents of a downloaded HTML page and the url that the page came from. This method should, as part of its returned value, provide a guessed locale. As a sanity check, you can note whether it succeeded in guessing correctly. To better understand the returned components of this method, it helps to look at src/library/CrawlConstants.php.

To conduct your experiments using this program, make a file my_urls.txt with the list of urls prepared above. Make another file queries with the queries you came up with. Include these as part of your Hw3.zip file. Run your program on these URLs using the locale none, the correct locale, and some other locale for which there is a stemmer.

Assume each of your queries is a disjunctive query and that we are ranking results using TF-IDF scores and the vector space model. Compute by hand (script augmented if desired), showing work, the MAP scores for your topics in three cases: the terms in the corpus and query were not stemmed, were stemmed correctly, and were stemmed incorrectly. What are your conclusions about the effect of stemming correctly versus not stemming versus stemming incorrectly? Take into consideration how likely it is that the locale was guessed correctly. Put all your work for these experiments and your conclusions about them in the Hw3.pdf file you submit with the homework.
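
If you do augment the hand computation with a script, the calculation might look like the following minimal Python sketch. Several things here are assumptions rather than part of the spec: the TF-IDF variant (weight (1 + ln tf) * ln(N / df), with cosine similarity between query and document vectors), the treatment of a disjunctive query as returning every document that contains at least one query term, and all function names. Relevance judgments are taken as given inputs; how you decide which pages are relevant to a query is up to your write-up.

```python
import math
from collections import Counter

def tf_idf_vector(term_counts, doc_freq, num_docs):
    """Weight terms by (1 + ln tf) * ln(N / df); terms unseen in the corpus are dropped."""
    return {t: (1 + math.log(f)) * math.log(num_docs / doc_freq[t])
            for t, f in term_counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_terms, docs):
    """docs maps doc id -> list of terms. Returns matching doc ids, best first."""
    num_docs = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    q_vec = tf_idf_vector(Counter(query_terms), doc_freq, num_docs)
    scores = {}
    for doc_id, terms in docs.items():
        if set(query_terms) & set(terms):  # disjunctive: any query term matches
            d_vec = tf_idf_vector(Counter(terms), doc_freq, num_docs)
            scores[doc_id] = cosine(q_vec, d_vec)
    # Ties are broken by doc id so the ranking is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k at which a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments):
    """rankings and judgments are both keyed by query; returns the mean AP."""
    return sum(average_precision(rankings[q], judgments[q])
               for q in rankings) / len(rankings)
```

For example, a ranking d1, d2, d3, d4 in which only d1 and d3 are relevant has average precision (1/1 + 2/3) / 2 = 5/6. Running the same computation on the unstemmed, correctly stemmed, and incorrectly stemmed term lists gives the three MAP scores to compare.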

Point Breakdown

Book Problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 3pts
Code is reasonably well-commented and structured/code folder is as described and composer install does install what's needed for this assignment. 1pt
term_preprocess.php operates as described above. 1pt
term_preprocess.php makes use of seekquarry\yioop\library\FetchUrl::getPages as described. 1pt
term_preprocess.php makes use of seekquarry\yioop\library\PhraseParser::stemTerms as described. 1pt
term_preprocess.php makes use of seekquarry\yioop\library\processors\HtmlProcessor as described. 1pt
Experiment work (1pt) and write up of experiments conducted (1pt). 2pts
Total 10pts