HW#2 --- last modified September 27 2023 18:57:23.

Solution set.

Due date: Oct 6

Files to be submitted:
  Hw2.zip

Purpose: To gain familiarity with ranking methods and with coding inverted index methods over an underlying data structure for the inverted index.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (the fraction of returned results that are relevant), recall (the fraction of relevant documents that are returned), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of two parts: a coding part and an experiment part. Submit all code and documents in the Hw2.zip file that you turn in.

For the coding part, we use the inverted index of Hw1 to do query processing; however, we make it fully positional. Where before we had, say, in the third line a number followed by a comma, such as 3, to indicate that a term might be in the third document, we now have docid:pos1:pos2:pos3, for example, 3:2:9, to indicate the document and the positions of occurrences within that document. In this example, the term appears in document 3 at term locations 2 and 9.

You should first modify the pseudo-code of nextPhrase. The input remains a sequence of terms t[1],...,t[n] and a position, for example, "The", "quick", "brown", "fox", 57. Now, however, we assume that first, next, prev, last operate on an inverted index as described above (a positional index as opposed to a schema-independent one), and that positions are pairs u:v, where u is a doc_id and v is a character offset into the document. Strings of only whitespace, punctuation, numbers, etc. are treated as in Hw1 (so "Hi9.. There" would be converted to "hi there" before additional processing). Put pseudo-code for this revised nextPhrase (i.e., revised so that it now works with document boundaries) in a file next_phrase.txt that you submit along with the homework.

Next, I want you to modify the definitions of docRight and docLeft from class so that, in addition to the base case of single words, they can handle a base case with exact phrase matches, assuming that first, next, prev, last are as we just described. Put your modified definitions in a file positive.txt.

Given this pseudo-code, I then want you to write actual code capable of computing the results of conjunctive queries with exact phrases on a corpus, using your modified docLeft, docRight, and the approach from class. Your code should rank results according to the vector space model, using TF-IDF scores as presented in class. Your program will be run from the command line with a line like:

aux_program ConjunctiveRank filename num_results query

Here aux_program is at most one of java, php, or python (your choice). For php or python, ConjunctiveRank should be ConjunctiveRank.php or ConjunctiveRank.py, respectively. If you use C/C++ or Rust, aux_program will be empty. Your program will be compiled on a Mac running the most recent macOS, with the compiler installed using brew. filename is the name of some file on my hard drive in which documents are indicated as per Hw1. This file will contain only ASCII characters. In this file, a "term" is defined in the same way as in Hw1, except that two terms are the same if they are the same after converting them to lower case. In the above, num_results indicates how many of the top results satisfying the query should be returned. The query is a sequence of command line arguments, each of which will be either a term or a phrase. Here a term is defined to be an alphanumeric string not containing an underscore, and a phrase is a concatenation of terms using _ as a separator. For example, leet_speak would be the phrase consisting of the two terms leet and speak. An example of running the program might be:

python ConjunctiveRank.py my_corpus.txt 5 good_dog bad cat

This should return the top 5 documents by rank containing the phrase "good dog" and the terms "bad" and "cat".
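
To make the argument conventions concrete, here is a minimal sketch, in Python, of how the query arguments might be split into phrases and single terms and how a positional postings entry such as 3:2:9 might be parsed. The function names and layout are illustrative assumptions only, not a required design.

import sys

def parse_query(args):
    """Split query arguments into phrases (lists of terms) and single
    terms. A phrase uses '_' as a separator, e.g. good_dog."""
    phrases, terms = [], []
    for arg in args:
        if "_" in arg:
            phrases.append([t.lower() for t in arg.split("_") if t])
        else:
            terms.append(arg.lower())
    return phrases, terms

def parse_postings_entry(entry):
    """Parse one positional postings entry like '3:2:9' into
    (doc_id, [positions])."""
    parts = [int(p) for p in entry.split(":")]
    return parts[0], parts[1:]

if __name__ == "__main__":
    # e.g. python ConjunctiveRank.py my_corpus.txt 5 good_dog bad cat
    filename, num_results = sys.argv[1], int(sys.argv[2])
    phrases, terms = parse_query(sys.argv[3:])
    print(phrases, terms)  # [['good', 'dog']] ['bad', 'cat']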

The output of your program should be a line with DocId Score on it, followed by num_results lines giving this information for the documents with the top num_results cosine similarity scores. For example,

DocId Score
7 .65
2 .51
3 .23
9 .11
6 .06
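
For reference, here is a minimal sketch, in Python, of one way the cosine similarity scores could be computed, assuming you already have term frequencies for each candidate document and document frequencies from your index. The weighting shown is one standard TF-IDF variant; use whichever variant was presented in class, and treat the names below as hypothetical.

import math
from collections import Counter

def tf_idf(tf, df, num_docs):
    """Log-scaled term frequency times inverse document frequency."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(num_docs / df)

def cosine_score(query_terms, doc_term_freqs, doc_freqs, num_docs):
    """Cosine similarity between the query vector and one document.
    query_terms:    all query terms (the terms of a phrase contribute too)
    doc_term_freqs: term -> frequency of the term in this document
    doc_freqs:      term -> number of documents containing the term
    """
    q_vec = {t: tf_idf(c, doc_freqs.get(t, 0), num_docs)
             for t, c in Counter(query_terms).items()}
    d_vec = {t: tf_idf(f, doc_freqs.get(t, 0), num_docs)
             for t, f in doc_term_freqs.items()}
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

The documents satisfying the conjunctive query would then be sorted by this score in descending order and the top num_results printed as shown above.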

You should use the ADT consisting of your implementations of first, last, prev, next, nextPhrase, prevPhrase, docRight, docLeft, etc., based on the pseudo-code you made above. You should use galloping search for the implementation of prev and next.
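
As a reminder of the galloping (exponential) search idea for next, here is a minimal sketch of a cursor over one sorted postings list; prev is symmetric. It assumes the postings are held in a Python list and caches the index returned by the previous call, as in the version presented in class; the class and method names are only for illustration.

from bisect import bisect_right

class PostingsCursor:
    """Galloping-search cursor over one term's sorted postings list."""

    def __init__(self, postings):
        self.postings = postings   # sorted positions (or doc ids)
        self.cached = 0            # index returned by the previous call

    def next(self, current):
        """Smallest posting strictly greater than current, or None."""
        p, n = self.postings, len(self.postings)
        if n == 0 or p[-1] <= current:
            return None
        if p[0] > current:
            self.cached = 0
            return p[0]
        # Maintain the invariant p[low] <= current while galloping.
        low = self.cached - 1 if self.cached > 0 and p[self.cached - 1] <= current else 0
        jump, high = 1, low + 1
        while high < n and p[high] <= current:
            low, jump = high, jump * 2
            high = low + jump
        high = min(high, n - 1)
        # Binary search in (low, high] for the first posting > current.
        idx = bisect_right(p, current, low, high + 1)
        self.cached = idx
        return p[idx]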

Once you have coded your program, I want you to come up with 10 paragraphs from Wikipedia pages of your choice. These will be your corpus. Decide on two queries you would like to search for, each involving at least one phrase and two additional terms. Decide by hand which documents you feel are relevant or not relevant to each query. For each of these searches, determine how your program would rank results and determine the top 5 documents returned for each case. Then compute the precision@5 and recall@5 scores by hand. Write up your results and say which kind of query performed best.
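
For the precision@5 and recall@5 computations, recall that precision@k is the fraction of the top k returned documents that are relevant, while recall@k is the fraction of all relevant documents that appear in the top k. A small sketch, with purely made-up document ids, of the arithmetic you would do by hand:

def precision_at_k(returned, relevant, k=5):
    """Fraction of the top k returned documents that are relevant."""
    top_k = returned[:k]
    return sum(1 for d in top_k if d in relevant) / k

def recall_at_k(returned, relevant, k=5):
    """Fraction of all relevant documents appearing in the top k."""
    top_k = returned[:k]
    return sum(1 for d in top_k if d in relevant) / len(relevant)

# Hypothetical judgments: your program ranked these documents, and you
# judged documents 2, 3, 5, and 9 relevant by hand.
returned = [7, 2, 3, 9, 6]
relevant = {2, 3, 5, 9}
print(precision_at_k(returned, relevant))  # 3/5 = 0.6
print(recall_at_k(returned, relevant))     # 3/4 = 0.75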

Point Breakdown:

next_phrase.txt has working pseudo-code for your modified nextPhrase (1pt)
positive.txt has descriptions of how to extend docRight and docLeft to the nextPhrase situation (1pt)
Code compiles and outputs as described on test corpus (1pt)
Can find in code where the Inverted Index ADT is implemented (1pt) and where next and prev are implemented using galloping search (1pt) (2pts)
Conjunctive query results correctly computed using your modified nextPhrase, docRight, and docLeft (1pt)
Vector space scores computed correctly and results sorted by them (1pt)
Code correctly retrieves docs on test cases and ranks them according to cosine similarity scoring (1pt)
Experiments and write-up in a file Experiments.pdf (2pts)
Total: 10pts