HW#3 --- last modified April 12 2021 17:27:05.

Solution set.

Due date: Apr 16

Files to be submitted:
  Hw3.zip

Purpose: To gain familiarity with static inverted index construction techniques. To experiment with Yioop as an IR library.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO2 -- Be able to calculate by hand on small examples precision (fraction of returned results which are relevant), recall (fraction of relevant results which are returned), and other IR statistics. (For this assignment, we are computing statistics about index size.)

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Description:

This homework consists of three parts: an exercise part, a coding part, and an evaluation part. For the exercise part, I want you to do the following exercises out of the book: Exercise 4.1 (modified so that you now have 96 million postings, each of which is 2 bytes long), Exercise 4.3, and Exercise 4.4.

For the coding part of the homework, I want you to write a PHP program that makes use of Yioop as a library using Composer. After unzipping your Hw3.zip file and changing directory into the resulting folder, I will switch at the command line into a subfolder named code. This subfolder should have a file composer.json, so that I can install any dependencies your code has by typing:

composer install

Before I type this, your submitted project should have an empty vendor subfolder. Your program will then be run by typing a command with the format:

php index_stats.php some_file_name preprocess_type

For example,

php index_stats.php exciting_urls.txt stem_and_stop

In the above, some_file_name should be the name of a file containing urls, with one url per line, and preprocess_type should be one of the four values: plain, stop, stem, stem_and_stop. Your program should fetch the page for each url given (we will assume the urls are for English HTML pages only). It should run a summarizer on these pages, split the resulting document into terms, then either do no processing on these terms (if preprocess_type was plain), eliminate stop terms (if preprocess_type was stop), stem all the terms (if preprocess_type was stem), or eliminate stop terms and then stem the remaining terms (if preprocess_type was stem_and_stop). Finally, it should output some_file_name, the preprocess_type, the total number of distinct terms across all downloaded urls (a proxy for dictionary size), and the total posting list size (assume postings are for a positional index) for all these terms. An example output might look like:

File: wikipedia_urls.txt
Preprocess Type: stem_and_stop
Terms found: 5678
Posting List Size: 78912
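
To make the two counts concrete, here is a minimal sketch of the bookkeeping only, written in Python rather than PHP so that it stays independent of the Yioop API. It rests on several assumptions that are not part of the assignment: STOP_WORDS and toy_stem are hypothetical stand-ins for a real stop list and stemmer, documents are plain strings split on whitespace, and "posting list size" is taken to mean the total number of (document, position) entries in a positional index. Your submitted program must still be PHP and must use the Yioop classes described in this assignment.

```python
# Hypothetical stop list and suffix stripper. They stand in for a real
# stop-word remover and stemmer and do not reflect Yioop's behavior.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}
PREPROCESS_TYPES = ("plain", "stop", "stem", "stem_and_stop")


def toy_stem(term):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def preprocess(terms, preprocess_type):
    """Apply plain, stop, stem, or stem_and_stop to a list of terms."""
    if preprocess_type not in PREPROCESS_TYPES:
        raise ValueError("unknown preprocess_type: " + preprocess_type)
    if preprocess_type in ("stop", "stem_and_stop"):
        terms = [t for t in terms if t not in STOP_WORDS]
    if preprocess_type in ("stem", "stem_and_stop"):
        terms = [toy_stem(t) for t in terms]
    return terms


def index_stats(documents, preprocess_type):
    """Return (number of distinct terms, total positional postings)."""
    postings = {}  # term -> list of (doc_id, position) entries
    for doc_id, text in enumerate(documents):
        terms = preprocess(text.lower().split(), preprocess_type)
        for position, term in enumerate(terms):
            postings.setdefault(term, []).append((doc_id, position))
    total = sum(len(entries) for entries in postings.values())
    return len(postings), total


def report(file_name, preprocess_type, documents):
    """Format the four output lines in the layout shown above."""
    num_terms, size = index_stats(documents, preprocess_type)
    return "\n".join([
        "File: " + file_name,
        "Preprocess Type: " + preprocess_type,
        "Terms found: " + str(num_terms),
        "Posting List Size: " + str(size),
    ])


if __name__ == "__main__":
    docs = ["the cats and the dog", "a dog chased a cat"]
    for kind in PREPROCESS_TYPES:
        print(report("toy_urls.txt", kind, docs))
```

On the two toy documents, stopping shrinks both counts, while stemming shrinks only the number of distinct terms (cats and cat merge) and leaves the number of positional entries unchanged. Whether the same pattern holds on real pages is what the experiments are meant to find out.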

To download the urls, I want you to use the method seekquarry\yioop\library\FetchUrl::getPages(). Except for the first argument to this function, you can leave all other arguments at their default value. I want you to get all the urls using just one call to this method.

To create summaries of downloaded pages, use the seekquarry\yioop\library\processors\HtmlProcessor class to do the text extraction from the HTML documents. To do this, create an instance of HtmlProcessor with max description set at 20000, the summarizer as CrawlConstants::CENTROID_SUMMARIZER, and everything else at its default value. The processing method you then call on this instance should, as part of its returned value, provide a locale that can be used with the stemmer/stopper. To better understand the returned components of this method, it helps to look at src/library/CrawlConstants.php.

To do stemming and stopping, you can make use of a locale's Tokenizer class. To get an instance of the appropriate Tokenizer, you can call seekquarry\yioop\library\PhraseParser's getTokenizer($locale_name) method. Once you have a Tokenizer, you can call its stopwordsRemover($data) method to remove stop words, and its stem($word) method to do stemming.

For the experiments portion of the homework, I want you to try a variety of websites using each of the four preprocess_types to try to determine the effects of stemming and stopping on dictionary size and on overall index size. Be quantitative in your analysis, estimate the effects of size differences on things like query performance, and draw concrete conclusions. Give some small concrete numerical examples where stemming improves recall, and where it decreases precision. Write up your experiments in the Hw3.pdf file you submit with the homework.
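
As a reminder of how such numerical examples are computed, the following Python sketch evaluates precision and recall for one made-up query. Everything in it is hypothetical: the document ids, the relevance judgments, and the two result sets are invented purely to show the arithmetic and are not taken from any real collection. Your write-up should use examples of your own.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for sets of returned and relevant doc ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical judgments: documents 1-4 are relevant to the query.
    relevant = {1, 2, 3, 4}
    # Hypothetical results without stemming: only exact term matches.
    unstemmed = {1, 2}
    # Hypothetical results with stemming: more matches, some not relevant.
    stemmed = {1, 2, 3, 5, 6}
    print(precision_recall(unstemmed, relevant))
    print(precision_recall(stemmed, relevant))
```

In this invented case the unstemmed run returns 2 documents, both relevant, so precision is 2/2 and recall is 2/4. The stemmed run returns 5 documents, 3 of them relevant, so recall rises to 3/4 while precision falls to 3/5.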

Point Breakdown
Book Problems (each problem 1pt: 1 fully correct, 0.5 any defect or incomplete, 0 not attempted or wrong) -- 3pts
Code is reasonably well-commented and structured, the code folder is as described, and composer install installs what's needed for this assignment -- 1pt
index_stats.php operates as described above -- 1pt
index_stats.php makes use of seekquarry\yioop\library\FetchUrl::getPages as described -- 1pt
index_stats.php makes use of seekquarry\yioop\library\processors\HtmlProcessor as described -- 1pt
index_stats.php instantiates Tokenizer as described, and stopwordsRemover($data) and stem($word) are called as described -- 1pt
Program produces correct output on test files -- 1pt
Experiment write-up -- 1pt
Total -- 10pts