Chris Pollett> CS267
( Print View )

Student Corner:
[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#2 --- last modified March 14 2022 09:26:27.

Solution set.

Due date: Mar 7

Files to be submitted:
Hw2.zip

Purpose: To get familiarity with ranking methods and coding inverted index methods over an underlying data structure for the inverted index.

Related Course Outcomes:

CLO1 (Course Learning Outcome 1) -- Code a basic inverted index capable of performing conjunctive queries.

CLO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:This homework consists of two parts: a coding part and an experiment part. Submit all code and documents in the file Hw2.zip that you submit. For the coding part, we use the character n-gram based inverted index of HW1 to do query processing. To this end, you should first modify the pseudo-code of nextPhrase, the input remains a sequence of terms t[1],...,[n] and a position, for example, "The", "quick", "brown", "fox", 57. Now, however, we assume first, next, prev, last, operate on an inverted index made of character m-grams for some fixed `m > 2`, that and positions are pairs u:v where u is a doc_id and v is a character offset into the document where strings of only whitespace and punctuation are treated as a single space (So."Hi.. There" would be converted to "hi there" before additional processing). For example, the index might have an entry "_bro" and a posting list for all occurrences of _bro. Put pseudo-code for this revised nextPhrase in a file next_phrase.txt that submit along with the homework. Next I want you to modify the definition of docRight and docLeft from class so that in addition to the base case of single words, it can handle a base case with exact phrase matches and it assumes that first, next, prev, last, are as we just described. Put your modified definitions in a file positive.txt . Given you have developed this pseudo-code, I then want you to write actual code capable of computing the results of conjunctive queries with exact phrases on a corpus using your modified docLeft, docRight and the approach from class. Your code should rank result according to the proximity model. Your program will be run from the command line with a line like:

aux_program ConjunctiveGramRank folder m_for_char_gram num_results query

Here aux_program is at most one of java, php, python (your choice). For php or python, ConjunctiveGramRank should be either ConjunctiveGramRank.php or ConjunctiveGramRank.py, respectively. . If you use C/C++ or Rust, aux_program will be empty. Your program will be compiled on a Mac running MacOS Monterey, having used brew to install the compiler. m_for_char_gram is how many characters should be in a single char gram. folder is the name of some folder on my hard drive. This folder will contain one or more .txt documents named in numeric order 1.txt, 2.txt, 3.txt ... which together will be considered the corpus for this assignment. These files will contain only ASCII characters. In these files, a "term" is any alpha-numeric sequence of characters that does not include a non-alpha-numeric character. For this homework, two terms are the same if they are the same after converting them to lower case. In the above, num_results indicates the top how many of the results satisfying the query should be returned. The query is a sequence of command line arguments. Each such argument will be either a term or a phrase. Here a term is defined to be an alphanumeric string not containing underscore. A phrase is concatenation of terms using _ as a separator. For example, L33t_Sp34K would be the phrase consisting of the two terms l33t and sp34k. An example of running the program might be:

python ConjunctiveGramRank.py my_corpus 4 5 good_dog bad cat

This should return the top 5 documents by rank containing the phrase "good dog" and the terms "bad" and "cat".

The output of your program should be a line with DocId Score on it, followed by a sequence of num_results lines with this information for the top num_results proximity scoring many documents. For example,

DocId Score
7 .65
2 .51
3 .23
9 .11
6 .06

To calculate proximity scoring, a cover for such query is a sequence of terms and phrases from a document that would satisfy the query such that no smaller sequence would also satisfy the same query. So for a document "bad good dog good cat dog bad" only the sub-sequences "bad good dog good cat" would be a cover.

You should use the ADT consisting of your implementation of first, last, prev, next, nextPhrase, prevPhrase, docRight, docLeft, etc based on the pseudo-code you made above. You should use galloping search for the implementation of prev, next.

Once you have coded your program, I want you to come up with 10 paragraphs from IMDB pages of your choice. These will be your corpus. Decide on two queries you would like to search about each involving at least one phrase and two additional terms. Decide by hand which documents you feel are relevant not relevant to your queries. For each of these searches determine how your program would ranks results and determine the top 5 documents returned for each of these cases. Then by hand compute the precision@5 and recall@5 scores. Write up your results and say which kind of query performed the best.

Point Breakdown

next_phrase.txt has working pseudo-code for your char-gram modified nextPhrase	1pt
positive.txt has descriptions of how to extend docRight and docLeft to the char gram, nextPhrase situation	1pt
Code compiles and outputs as described on test corpus	1pt
Can find in code where Inverted Index ADT implemented (1pt) and next and prev implemented using galloping search. (1pt)	2pts
Conjunctive queries results correctly computed using your modified nextPhrase, docRight, and docLeft.	1pt
Modified proximity scores computed correctly and results sorted by them	1pt
Code correctly retrieves docs on test cases and ranks them according to proximity scoring.	1pt
Experiments and write-up in a file Experiments.pdf	2pts
Total	10pts