Chris Pollett>
Old Classses > |
HW#2 --- last modified September 16 2019 19:35:19.Due date: Oct 2 Files to be submitted: Purpose: Related Course Outcomes: The main course outcomes covered by this assignment are: CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries. CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics. CLO6 -- Be able to evaluate search results by hand and using TREC eval software. Specification: This homework consists of two parts a coding part and an experiment part. Submit all code and documents in the file Hw2.zip that you submit. For the coding part, I want you to write a program which is capable of computing the results of positive boolean queries on a corpus using the docLeft, docRight approach from class. Your code should rank result according to the vector space model. Your program will be run from the command line with a line like: aux_program PositiveRank some_file num_results query Here aux_program is at most one of java, php, python (your choice). For php or python, PositiveRank should be either PositiveRank.php or PositiveRank.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Mojave, having used brew to install the compiler. filename is the name of some text document. This document is assumed to contain only ASCII characters with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. In the above, num_results indicates the top how many of the results satisfying the query should be returned. The query is a sequence of command line arguments in Polish notation. Each command line argument in the query is either _AND, _OR, or a term. Here a term is defined to be an alphanumeric string not containing underscore. For this homework, two terms are the same if they are the same after converting them to lower case. An example of running the program might be: python PositiveRank.py my_corpus.txt 5 "_OR _AND good dog _AND bad cat" To compute VSM scores use all the terms from the query. I.e., for the above these would be good, dog, bad, cat. The output should be a line with DocId Score on it, followed by a sequence of num_result lines with this information for the top num_results many documents. For example, DocId Score 7 .65 2 .51 3 .23 11 .0012 You should use the ADT from the book for inverted index retrieval and implement your program using calls to the first, last, prev, next, docRight, docLeft, etc methods. You should be using galloping search for the implementation of the relevant methods. Once you have coded your program, I want you to come up with 10 paragraphs from Wikipedia pages of your choice. These will be your corpus. Decide on two phrases your would like to search about involving at least three terms. Decide by hand which documents you feel are relevant not relevant to your queries. Consider formulating a search for each of these phrases using either a purely conjunctive query or a purely disjunctive query. For each search phrase determine how your program would ranks results for the purely conjunctive and purely disjunctive query variation and determine the top 5 documents returned for each of these cases. Then by hand compute the precision@5 and recall@5 scores. Write up your results and say which kind of query performed the best.
|