Chris Pollett> CS267
( Print View )

Student Corner:
[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#4 --- last modified May 04 2022 19:29:11.

Solution set.

Due date: Apr 25

Files to be submitted:
  Hw4.zip

Purpose: To code and experiment with index compression and incremental index construction techniques. To evaluate experiment using trec_eval.

Related Course Outcomes:

CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework will consist of three parts an exercise part, a coding part, and an experiments part. For the exercise part, I want you to write up solutions to the following problems in a file Hw4.pdf that your include in your submitted Hw4.zip file.

  1. Exercise 5.7 (a) where change `rho` to `tau`, (b) where change `tau` to `tau'`, and (f) where change `tau'` to `rho` out of the book.
  2. Exercise 6.1 where Pr["a"] = 0.90 and Pr["b"] = 0.1.
  3. Exercise 6.6
  4. Suppose we are using incremental indexing and we have currently generations 1, 2, 5. For the next three new partitions to be written to disk, what merging will occur? What will be the new generations after the incremental merging has been done? (Show after each new partition).

For the coding part of the HW, I'd like you to code two programs: One that implements rankBM25_DocumentAtATime_WithHeaps and one that implements the rankBM25_TermAtATime algorithm but where you use the CONTINUE strategy from class. Your programs will be run from the command line with a line like:

aux_program Bm25DocAtATime folder num_results query 

aux_program Bm25TermAtATime folder num_results num_accum query 

For example,

python Bm25TermAtATime.py my_corpus_folder 5 50 "hi there you"

In the above, aux_program is at most one of java, php, python (your choice). For php or python, you can add the appropriate extension .php or .py to the name of the file, respectively. If you choose to do this project in C, Rust, or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Monterey, having used brew to install the compiler. folder is the name of folder containing your corpus in .txt files as per Hw2, num_results is the desired number of search results to be returned by your program, num_accum is the number of accumulators your program should use for processing, and query is a white-space separated list of query terms on which to compute disjunctive query results. You can reuse earlier code for the storage/creation of the inverted index you used, you no longer need to do chargramming or stemming, but you should do case normalization in making terms. Your programs should output results in the format of a trec_top_file that could be used by the trec_eval program. You can have it set the qid to 0 or to some pre-defined value at the top of your program. Alternative, you can add a qid to the end of your command line parameters. You should also have a flag at the top of your programs to toggle information about how quickly the query result is calculated.

For the experiments, use the same IMDB corpus you had in HW2. Come up with at least two three word disjunctive queries. Make a trec_rel_file based on your judgements of which of the IMDB documents are relevant to these queries. Then run the trec_eval software on the output of your programs. Include a transcript.txt file showing these runs. Do some experiments to see effect of changing the accumulator number on your term at a time results. Do some other experiments to see the relative speeds of Bm25PDocAtATime and Bm25TermAtATime on the same queries. Put the write ups of your experiments in Hw4.pdf.

Point Breakdown

Exercise Problems (1pts each grades in 0.5pts increments depending on how correct)4pts
Each program compiles and runs correctly on test inputs (0.5pts each)1pt
Bm25DocAtATime implementation follows the rankBM25_DocumentAtATime_WithHeaps pseudo code from class (implement MaxScore optional)1pt
Bm25TermAtATime implementation follows the method outlined in class for the CONTINUE strategy1pt
transcripts.txt from trec_eval experiment seems reasonable1pt
Write up of accumulator varying experiment1pt
Write up of relative speed experiment1pt
Total 10pts