CS267 Fall 2019

CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

This homework consists of two parts a coding part and an experiment part. Submit all code and documents in the file Hw2.zip that you submit. For the coding part, I want you to write a program which is capable of computing the results of positive boolean queries on a corpus using the docLeft, docRight approach from class. Your code should rank result according to the vector space model. Your program will be run from the command line with a line like:

Here aux_program is at most one of java, php, python (your choice). For php or python, PositiveRank should be either PositiveRank.php or PositiveRank.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Mojave, having used brew to install the compiler. filename is the name of some text document. This document is assumed to contain only ASCII characters with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. In the above, num_results indicates the top how many of the results satisfying the query should be returned. The query is a sequence of command line arguments in Polish notation. Each command line argument in the query is either _AND, _OR, or a term. Here a term is defined to be an alphanumeric string not containing underscore. For this homework, two terms are the same if they are the same after converting them to lower case. An example of running the program might be:

To compute VSM scores use all the terms from the query. I.e., for the above these would be good, dog, bad, cat. The output should be a line with DocId Score on it, followed by a sequence of num_result lines with this information for the top num_results many documents. For example,

You should use the ADT from the book for inverted index retrieval and implement your program using calls to the first, last, prev, next, docRight, docLeft, etc methods. You should be using galloping search for the implementation of the relevant methods.

Once you have coded your program, I want you to come up with 10 paragraphs from Wikipedia pages of your choice. These will be your corpus. Decide on two phrases your would like to search about involving at least three terms. Decide by hand which documents you feel are relevant not relevant to your queries. Consider formulating a search for each of these phrases using either a purely conjunctive query or a purely disjunctive query. For each search phrase determine how your program would ranks results for the purely conjunctive and purely disjunctive query variation and determine the top 5 documents returned for each of these cases. Then by hand compute the precision@5 and recall@5 scores. Write up your results and say which kind of query performed the best.

Point Breakdown
Code is reasonably well-commented and structured	1pt
Code compiles and outputs as described on test corpus	1pt
Can find in code where Inverted Index ADT implemented (1pt) and next and prev implemented using galloping search. (1pt)	2pts
Positive queries implemented using books algorithm, docRight, docLeft are coded as per book (1pt each)	2pts
VSM similarity scores computed correctly and results sorted by them	1pts
Code correctly retrieves docs on test cases and ranks them according to VSM model.	1pt
Experiments and write-up in a file Experiments.pdf	2pts
Total	10pts

HW#2 --- last modified September 16 2019 19:35:19.