Chris Pollett > CS267
HW#3 --- last modified November 01 2023 16:42:30

Due date: Oct 27

Files to be submitted:

Purpose: To gain familiarity with static inverted index construction techniques. To experiment with Yioop as an IR library.

Related Course Outcomes: The main course outcomes covered by this assignment are:

CLO2 -- Be able to calculate by hand on small examples precision (the fraction of returned results which are relevant), recall (the fraction of relevant results which are returned), and other IR statistics. (For this assignment, we are computing statistics about index size.)

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification: This homework consists of three parts: an exercise part, a coding part, and an evaluation part. For the exercise part, I want you to do the following exercises: Exercise 4.1, modified so that you now have 96 million postings, each of which is 2 bytes long, and Exercise 4.3 from the book. Then I want you to read the paper:
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. Inverted indexes for phrases and strings. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 555--564. 2011.
It proposes a suffix tree approach that allows for exact phrase search without increasing the index size too much over single term indexes. As the third written exercise, I want you to write a 1 page summary of their approach. It should explain what phrases are stored in the dictionary, how they are chosen, and how this allows for exact phrase search.

For the coding portion of your homework, I want you to use the classes in the Yioop library to implement the dictionary-as-a-string approach to sorted index construction, then conduct experiments on document retrieval from your index using some of the IR evaluation techniques we have considered. First, go to a website that has statistics for popular web search engine queries, such as Mondovo. From these, pick a list of 5 multi-word queries you find interesting. Next, on at least Google and Bing, search on these queries and obtain the urls of the top five results for each. The web pages associated with these urls will be your corpus.

Your programs to conduct the experiments should be in a sub-folder named code of the Hw3.zip file you submit. You will write two main programs. After unzipping your Hw3.zip file and changing directory into the resulting folder, I will at the command line switch into the code subfolder. This subfolder should have a file composer.json, so that I can install any dependencies your code has by typing:

composer install

Before I type this, your submitted project should have an empty vendor subfolder. Your index constructing program will then be run by typing a command with the format:

php dictionary_as_string_indexer.php seed_urls_file index_file

For example, I might type:

php dictionary_as_string_indexer.php my_seeds.txt my_index.txt

You can assume that the file seed_urls_file consists of a sequence of urls, 1 url per line. For your experiments, the urls should be obtained from the popular queries as described above.
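To make the dictionary-as-a-string idea concrete before coding it in PHP with Yioop, here is a minimal sketch of the layout: all terms are concatenated, in sorted order, into a single string, and a parallel table of fixed-size records holds each term's offset, length, and postings, so term look-up is a binary search over the table. The sketch below is illustrative Python only; every name in it is hypothetical and not part of Yioop's API, and your actual submission must use PHP and the Yioop library classes.

```python
# Hypothetical sketch of dictionary-as-a-string index construction and look-up.
# (Python for brevity; the assignment itself must be PHP using Yioop classes.)

def build_dictionary_as_string(postings_lists):
    """postings_lists: dict mapping term -> sorted list of docids."""
    dict_string = ""   # every term concatenated, in sorted order, into one string
    entries = []       # per-term records: (offset into string, term length, postings)
    for term in sorted(postings_lists):
        entries.append((len(dict_string), len(term), postings_lists[term]))
        dict_string += term
    return dict_string, entries

def lookup(dict_string, entries, term):
    """Binary search the sorted entry table, reading each term out of the string."""
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        offset, length, postings = entries[mid]
        candidate = dict_string[offset:offset + length]
        if candidate == term:
            return postings
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return []

index = build_dictionary_as_string({"cat": [1, 3], "dog": [2], "ant": [1]})
print(lookup(*index, "dog"))   # [2]
```

The payoff of this layout is that the dictionary needs no per-term string objects: one long string plus a fixed-width table, which is what makes it compact enough to keep in memory during sorted index construction.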
dictionary_as_string_indexer.php should fetch the page for each url given (we will assume the urls are for HTML pages only), extract its text as described below, split the resulting text into terms, then stem the terms using the en-US stemmer (or do nothing in the case of no data). Given these processed documents, your program should output a dictionary-as-a-string index to the file index_file. Your index only has to be a docid index, and to keep things simple, we assume the document id of a downloaded page is the line number in seed_urls_file of the url it corresponds to.

Your second program handles look-up into your index and will be run using a line like:

php lookup_index.php index_file query

Here index_file should be a file produced by the dictionary_as_string_indexer.php program on some input, and query should be a string consisting of space separated terms that should be interpreted as a disjunctive query. For example,

php lookup_index.php my_index.txt "harry potter"

On such an input, your program should output a sequence of (docid, cosine similarity score) tuples, one per line, that satisfy the query. For example,

(13,.622)
(7,.399)
(10,.13)
(15,.0255)

I will be reasonably flexible on the implementation details of these programs, but I want each program to make use of Yioop as a library to do certain things:
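As a reference for the scoring lookup_index.php must do, here is one way the (docid, cosine similarity) tuples can be computed under the TF-IDF vector space model: weight each term by tf times log(N/df), take the dot product of the document and query vectors, and divide by the product of their norms, keeping only documents that match at least one query term (the disjunctive interpretation). This is an illustrative Python sketch under those assumptions; none of the names come from Yioop, and the tf and df values would be read from your index_file.

```python
import math

# Hypothetical TF-IDF cosine scoring for a disjunctive query.
# doc_term_freqs: docid -> {term: tf};  doc_freqs: term -> document frequency.

def cosine_scores(query_terms, doc_term_freqs, num_docs, doc_freqs):
    results = {}
    # Query vector: idf weight for each distinct query term seen in the corpus.
    q_vec = {t: math.log(num_docs / doc_freqs[t])
             for t in set(query_terms) if t in doc_freqs}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    for docid, tfs in doc_term_freqs.items():
        d_vec = {t: tf * math.log(num_docs / doc_freqs[t]) for t, tf in tfs.items()}
        dot = sum(d_vec.get(t, 0.0) * w for t, w in q_vec.items())
        if dot == 0:
            continue  # disjunctive query: skip docs matching no query term
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        results[docid] = dot / (d_norm * q_norm)
    # Highest-scoring documents first, as in the sample output above.
    return sorted(results.items(), key=lambda pair: -pair[1])

docs = {1: {"harry": 2, "potter": 1}, 2: {"harry": 1, "wizard": 3}, 3: {"wizard": 1}}
dfs = {"harry": 2, "potter": 1, "wizard": 2}
ranked = cosine_scores(["harry", "potter"], docs, 3, dfs)
# doc 1 (contains both query terms) ranks first; doc 3 is excluded entirely
```

Note that because doc 3 contains no query term its dot product is zero and it is dropped, matching the requirement that only documents satisfying the query are output.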
To conduct your experiments using these programs, make a file my_urls.txt with the list of urls prepared above. Make another file, queries, with the queries you came up with. Include both files as part of your Hw3.zip file. Run your dictionary_as_string_indexer.php on these urls to make an index file my_index.txt. Assume each of your queries is a disjunctive query and that we are ranking results using TF-IDF scores and the vector space model. Compute by hand (script-assisted if desired), showing your work, the MAP scores for your queries on the output of lookup_index.php. Do the same for Google and Bing. Put all your work for this experiment and your conclusions about it in the Hw3.pdf file you submit with the homework.
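As a sanity check for the by-hand MAP computation: average precision for one query is the mean, over the relevant documents, of the precision at each rank where a relevant document appears, and MAP is the mean of these across queries. The sketch below uses hypothetical relevance judgments; your own judgments of the lookup_index.php, Google, and Bing result lists would replace them.

```python
# Worked MAP example with made-up relevance judgments (0/1 flags in ranked order).

def average_precision(ranked_rels, num_relevant):
    """ranked_rels: list of 0/1 relevance flags for a result list, best rank first."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each rank holding a relevant doc
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query):
    """per_query: list of (ranked_rels, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)

# Two queries with their top-5 results judged:
ap1 = average_precision([1, 0, 1, 0, 0], 2)   # (1/1 + 2/3) / 2 = 0.8333...
ap2 = average_precision([0, 1, 0, 0, 1], 2)   # (1/2 + 2/5) / 2 = 0.45
print(mean_average_precision([([1, 0, 1, 0, 0], 2), ([0, 1, 0, 0, 1], 2)]))
```

Showing the per-rank precision fractions, as in the comments above, is the kind of "showing work" the hand computation calls for.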