HW#2 --- last modified April 05 2021 18:17:18.

Solution set.

Due date: Mar 17

Files to be submitted:
  Hw2.zip

Purpose: To get familiarity with ranking methods and coding inverted index methods over an underlying data structure for the inverted index.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Description: This homework consists of two parts a coding part and an experiment part. Submit all code and documents in the file Hw2.zip that you submit. Before doing the coding part, I want you to modify the definition of docRight and docLeft from class so that in addition to the base case of single words, it can handle a base case with exact phrase matches and also positions within documents. Put your modified definitions in a file positive.txt . Next I want you to code I want you to write a program which is capable of computing the results of positive boolean queries with exact phrases on a corpus using your modified docLeft, docRight and the approach from class. Your code should rank result according to the proximity model. Your program will be run from the command line with a line like:

aux_program PositivePhraseRank folder num_results query

Here aux_program is at most one of java, php, python (your choice). For php or python, PositivePhraseRank should be either PositivePhraseRank.php or PositivePhraseRank.py, respectively. If you choose to do this project in C or C++, aux_program is optional. If you use C/C++ or Rust, aux_program will be empty. Your program will be compiled on a Mac running MacOS Big Sur, having used brew to install the compiler. folder is the name of some folder on my hard drive. This folder will contain one or more .txt documents named in numeric order 1.txt, 2.txt, 3.txt ... which together will be considered the corpus for this assignment. These files will contain only ASCII characters. In these files, a "term" is any alpha-numeric sequence of characters that does not include a non-alpha-numeric character. For this homework, two terms are the same if they are the same after converting them to lower case. In the above, num_results indicates the top how many of the results satisfying the query should be returned. The query is a sequence of command line arguments in Polish notation. Each command line argument in the query is either _AND, _OR, a term, or a phrase. Here a term is defined to be an alphanumeric string not containing underscore. A phrase is concatenation of terms using _ as a separator. For example, L33t_Sp34K would be the phrase consisting of the two terms l33t and sp34k. An example of running the program might be:

python PositivePhraseRank.py my_corpus 5 "_OR good_dog _AND bad cat"

A cover for a positive boolean query is a sequence of terms from a document that would satisfy the query such that no smaller sequence would also satisfy the same query. So for a document "bad good dog good cat dog bad" the sub-sequences "good dog" and "cat dog bad" would be covers for "_OR good_dog _AND bad cat". The output of your program should be a line with DocId Score on it, followed by a sequence of num_results lines with this information for the top num_results proximity scoring many documents. For example,

DocId Score
7 .65
2 .51
3 .23
11 .0012

You should use the ADT from the book for inverted index retrieval and implement your program using calls to the first, last, prev, next, nextPhrase, prevPhrase, docRight, docLeft, etc methods. You should be using galloping search for the implementation of the relevant methods.

Once you have coded your program, I want you to come up with 10 paragraphs from Wikipedia pages of your choice. These will be your corpus. Decide on two phrases your would like to search about involving at least three terms. Decide by hand which documents you feel are relevant not relevant to your queries. Consider formulating a search for each of these phrases using either a purely conjunctive query or a purely disjunctive query. For each search phrase determine how your program would ranks results for the purely conjunctive and purely disjunctive query variation and determine the top 5 documents returned for each of these cases. Then by hand compute the precision@5 and recall@5 scores. Write up your results and say which kind of query performed the best.

Point Breakdown
positive.txt has descriptions of how to extend docRight and docLeft1pt
Code compiles and outputs as described on test corpus1pt
Can find in code where Inverted Index ADT implemented (1pt) and next and prev implemented using galloping search. (1pt)2pts
Positive phrase queries implemented using book's algorithms for nextPhrase and modified docRight, docLeft(1pt each)2pts
Modified proximity scores computed correctly and results sorted by them1pts
Code correctly retrieves docs on test cases and ranks them according to proximity scoring.1pt
Experiments and write-up in a file Experiments.pdf2pts
Total10pts