Chris Pollett> CS267

HW#3 --- last modified November 01 2023 16:42:30.

Solution set.

Due date: Oct 27

Files to be submitted:
  Hw3.zip

Purpose: To gain familiarity with static inverted index construction techniques. To experiment with Yioop as an IR library.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO2 -- Be able to calculate by hand on small examples precision (the fraction of returned results that are relevant), recall (the fraction of relevant results that are returned), and other IR statistics. (For this assignment, we are computing statistics about index size.)

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: an exercise part, a coding part, and an evaluation part. For the exercise part, I want you to do the following exercises: Exercise 4.1, modified so that you now have 96 million postings, each of which is 2 bytes long, and Exercise 4.3 from the book. Then I want you to read the paper:

Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. Inverted indexes for phrases and strings. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp 555--564. 2011.

It proposes a suffix-tree approach that allows exact phrase search without increasing the index size too much over a single-term index. As the third written exercise, I want you to write a one-page summary of their approach. It should explain which phrases are stored in the dictionary, how they are chosen, and how this allows for exact phrase search.

For the coding portion of your homework, I want you to use the classes in the Yioop library to implement the dictionary-as-a-string approach to sorted index construction, then conduct experiments on document retrieval from your index using some of the IR evaluation techniques we have considered. First, I want you to go to a website that has statistics for popular web search engine queries, such as Mondovo. From these, pick a list of 5 multi-word queries you find interesting. Next, search on these queries on at least Google and Bing, and obtain the urls of the top five results for each. The web pages associated with these urls will be your corpus.

Your programs to conduct the experiments should be in a sub-folder named code of the Hw3.zip file you submit. You will write two main programs. After unzipping your Hw3.zip file and changing directory into the resulting folder, I will, at the command line, switch into the code subfolder. This folder should have a file composer.json, so that I can install any dependencies your code has by typing:

composer install
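For reference, a minimal composer.json along these lines might look like the sketch below. The package name and version constraint are assumptions you should verify against the Yioop documentation before submitting:

```json
{
    "require": {
        "seekquarry/yioop": "*"
    }
}
```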

Before I type this, your submitted project should have an empty vendor subfolder. Your index constructing program will then be run by typing a command with the format:

php dictionary_as_string_indexer.php seed_urls_file index_file

For example, I might type:

php dictionary_as_string_indexer.php my_seeds.txt my_index.txt

You can assume that the file seed_urls_file consists of a sequence of urls, 1 url per line. For your experiments, the urls should be the ones obtained from the popular queries as described above. dictionary_as_string_indexer.php should fetch the page for each url given (we will assume the urls are for HTML pages only), extract text as described below, split the resulting text into terms, then stem the terms using the en-US stemmer (or do nothing in the case of no stemmer data). Given these processed documents, your program should output a dictionary-as-a-string index to the file index_file. Your index only has to be a docid index, and to keep things simple, we assume the document id of a downloaded page is the line number in seed_urls_file of the url it corresponds to.
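To make the dictionary-as-a-string layout concrete, here is a small self-contained PHP sketch. The terms and postings are made up for illustration, and this plain-array version is not a substitute for the Yioop-based output format required below; it only shows the idea of one concatenated term string plus an offset array used for binary search:

```php
<?php
// Sketch: dictionary-as-a-string over a sorted term list.
// Terms are concatenated into one string; a parallel array of integer
// offsets marks where each term starts; a postings array holds the
// docid lists. (Hypothetical data, not the assignment corpus.)
$postingsByTerm = [
    "harri"  => [7, 10, 13],
    "potter" => [7, 13, 15],
    "stem"   => [10],
];
ksort($postingsByTerm);                // dictionary terms must be sorted

$dictString = "";
$offsets    = [];
$postings   = [];
foreach ($postingsByTerm as $term => $docids) {
    $offsets[]   = strlen($dictString); // start of this term in the string
    $dictString .= $term;
    $postings[]  = $docids;
}
$offsets[] = strlen($dictString);       // sentinel: end of the last term

// Binary search for a term using the offsets into the string.
function lookupTerm(string $dict, array $offsets, array $postings, string $term): ?array
{
    $lo = 0;
    $hi = count($postings) - 1;
    while ($lo <= $hi) {
        $mid  = intdiv($lo + $hi, 2);
        $cand = substr($dict, $offsets[$mid], $offsets[$mid + 1] - $offsets[$mid]);
        $cmp  = strcmp($term, $cand);
        if ($cmp == 0) {
            return $postings[$mid];
        } elseif ($cmp < 0) {
            $hi = $mid - 1;
        } else {
            $lo = $mid + 1;
        }
    }
    return null;                        // term not in the dictionary
}

print_r(lookupTerm($dictString, $offsets, $postings, "potter")); // [7, 13, 15]
```

Because the offsets delimit variable-length terms inside one string, the dictionary avoids the per-entry pointer overhead of storing each term separately, which is the point of the technique.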

Your second program handles look-up into your index and will be run using a line like:

php lookup_index.php index_file query

The index_file should be a file produced by the dictionary_as_string_indexer.php program on some input, and query should be a string of space-separated terms that should be interpreted as a disjunctive query. For example,

php lookup_index.php my_index.txt "harry potter"

On such an input, your program should output a sequence of (docid, cosine similarity score) tuples, one per line, that satisfy the query. For example,

(13,.622)
(7,.399)
(10,.13)
(15,.0255)
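To make the expected scoring concrete, here is a plain-PHP sketch of cosine similarity between a query vector and a document vector of term weights. The assignment itself requires Yioop's LinearAlgebra class for this step (constraint 7 below), and the weights here are hypothetical, not computed from any real corpus:

```php
<?php
// Cosine similarity between two term-weight vectors keyed by term.
// Plain-PHP illustration only; your submitted lookup_index.php should
// instead use seekquarry\yioop\library\LinearAlgebra for vector work.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term]; // only shared terms contribute
        }
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;                      // empty vector: define score as 0
    }
    return $dot / ($normA * $normB);
}

// Hypothetical TF-IDF weights for the query "harry potter" and one document.
$query = ["harri" => 1.0, "potter" => 1.0];
$doc13 = ["harri" => 0.8, "potter" => 0.6, "wizard" => 0.2];
printf("(13,%.3f)\n", cosineSimilarity($query, $doc13)); // prints (13,0.971)
```

For a disjunctive query, any document containing at least one query term gets a nonzero score this way, and sorting the (docid, score) pairs in decreasing score order gives output in the format shown above.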

I will be reasonably flexible on the implementation details of these programs, but I want each program to make use of Yioop as a library to do certain things:

  1. To download the urls, I want you to use the method seekquarry\yioop\library\FetchUrl::getPages(). Except for the first argument to this function, you can leave all other arguments at their default value. I want you to get all the urls using just one call to this method.
  2. To do the stemming I want you to use Yioop as in the example of using composer from class.
  3. I want you to use the seekquarry\yioop\library\processors\HtmlProcessor class to do the text extraction from the HTML documents. To do this, create an instance of HtmlProcessor with max description set to 20000, the summarizer set to CrawlConstants::CENTROID_WEIGHTED_SUMMARIZER, and everything else at its default value. Then call the process() method with the read-in contents of an HTML file and the url that the page came from. This method should, as part of its returned value, provide a guessed locale. As a sanity check, you can note whether it succeeded in guessing correctly. To better understand the returned components of this method, it helps to look at src/library/CrawlConstants.php.
  4. The output index file should begin with a record, packed using seekquarry\yioop\library\PackedTableTools as three int's, specifying the length of each of the three lines of our inverted index.
  5. The first line of the inverted index, consisting of a primary array of int's, should be output (by dictionary_as_string_indexer.php) and later decoded (by lookup_index.php) using PHP's pack and unpack functions.
  6. The second and third lines should be output and read using seekquarry\yioop\library\PackedTableTools.
  7. The lookup_index.php program should use seekquarry\yioop\library\LinearAlgebra to maintain vectors and to compute cosine similarity.
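For constraint 5, PHP's built-in pack() and unpack() are all you need for the primary int array. The sketch below uses the "N*" format code (32-bit unsigned, big-endian) as one reasonable choice; any fixed-width integer format works as long as both programs agree on it:

```php
<?php
// Round-tripping an array of ints with PHP's pack()/unpack(), as
// constraint 5 requires for the primary array of the inverted index.
// Format "N*": each int becomes 4 big-endian bytes; "*" repeats for
// every remaining argument. (Example values are arbitrary.)
$primaryArray = [3, 15, 42, 96];

$binary  = pack("N*", ...$primaryArray);         // 4 bytes per int
$decoded = array_values(unpack("N*", $binary));  // unpack() is 1-indexed,
                                                 // so renumber the keys
var_dump($decoded === $primaryArray);            // bool(true): lossless
```

The same $binary string is what dictionary_as_string_indexer.php would write into index_file and lookup_index.php would read back.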

To conduct your experiments using this program, make a file my_urls.txt with the list of urls prepared above. Make another file queries with the queries you came up with. Include these as part of your Hw3.zip file. Run your dictionary_as_string_indexer.php on these URLs to make an index file my_index.txt.

Assume each of your queries is a disjunctive query and that we are ranking results using TF-IDF scores and the vector space model. Compute by hand (script-assisted if desired), showing your work, the MAP scores for your topics using the output of lookup_index.php. Do the same for Google and Bing. Put all your work for this experiment and your conclusions about it in the Hw3.pdf file you submit with the homework.
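As a reminder of the computation being asked for, here is a short worked example of average precision and MAP on made-up relevance judgments (your actual judgments come from inspecting the returned pages):

```php
<?php
// Average precision for one ranked result list, given binary relevance
// judgments in rank order (true = relevant). MAP is the mean of the
// per-query average precisions. The judgments below are hypothetical.
function averagePrecision(array $relevances): float
{
    $numRelevant  = 0;
    $sumPrecision = 0.0;
    foreach ($relevances as $rank => $isRelevant) { // $rank starts at 0
        if ($isRelevant) {
            $numRelevant++;
            $sumPrecision += $numRelevant / ($rank + 1); // precision@rank
        }
    }
    return $numRelevant > 0 ? $sumPrecision / $numRelevant : 0.0;
}

// Query 1: relevant results at ranks 1 and 3 => (1/1 + 2/3)/2 = 5/6.
// Query 2: one relevant result at rank 2     => (1/2)/1 = 1/2.
$ap1 = averagePrecision([true, false, true, false, false]);
$ap2 = averagePrecision([false, true, false, false, false]);
$map = ($ap1 + $ap2) / 2;
printf("MAP = %.3f\n", $map);           // prints MAP = 0.667
```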

Point Breakdown:

Written Exercises (each problem 1pt -- 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) -- 3pts
Each of the 7 constraints listed above is satisfied (0.5 each) -- 3.5pts
dictionary_as_string_indexer.php runs and outputs as described -- 1pt
lookup_index.php runs and outputs as described -- 0.5pt
Experiment work (1pt) and write-up of experiments conducted (1pt) -- 2pts
Total -- 10pts