Chris Pollett> Old Classses >
CS267

( Print View )

Student Corner:
[Final-PDF]

[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#4 --- last modified May 05 2021 18:43:42.

Solution set.

Due date: Apr 30

Files to be submitted:
  Hw4.zip

Purpose: To code and experiment with index compression and incremental index construction techniques. To evaluate experiment using trec_eval.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Description: This homework will consist of three parts an exercise part, a coding part, and an experiments part. For the exercise part, I want you to write up solutions to the following problems in a file Hw4.pdf that your include in your submitted Hw4.zip file.

  1. Exercise 5.7 (a) where change `rho` to `rho'`, (e), (f) out of the book.
  2. Exercise 6.1 where Pr["a"] = 0.7 and Pr["b"] = 0.3.
  3. Show the `delta` code is prefix free.
  4. Exercise 6.9
  5. Suppose we are using incremental indexing and we have currently generations 1, 3, 5. For the next three new partitions to be written to disk, what merging will occur? What will be the new generations after the incremental merging has been done? (Show after each new partition).

For the coding part of the HW, I'd like you to code a program that implements the rankBM25_TermAtATime_WithPruning algorithm from class Your program will be run from the command line with a line like:

aux_program Bm25Pruning folder num_results num_accum update_freq query 

For example,

python Bm25Pruning.py my_corpus_folder 5 50 100 "hi there you"

In the above, aux_program is at most one of java, php, python (your choice). For php or python, Bm25Pruning should be either Bm25Pruning.php or Bm25Pruning.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Big Sur, having used brew to install the compiler. folder is the name of folder containing your corpus in .txt files as per Hw2, num_results is the desired number of search results to be returned by your program, num_accum is the number of accumulators your program should use for processing, update_freq is the frequency at which the quota threshold should be updated and query is a white-space separated list of query terms on which to compute disjunctive query results. Your program should output results in the format of a trec_top_file that could be used by the trec_eval program.

Once you have completed your program, I want you to conduct some experiments. I'd like you to run your program on the disjunctive queries you used for HW2 and the same corpus. Make a trec_rel_file based on your judgements from HW2 (include this with your HW4). Then run the trec_eval software on your output. Include a transcript.txt file showing a run of this.

Point Breakdown
Exercise Problems (1pts each grades in 0.5pts increments depending on how correct)5pts
Bm25Pruning program compiles and runs correctly on test inputs1pt
Term-at-a-time processing implementation in Bm25Pruning is as in pseudo-code from class for rankBM25_TermAtATime_WithPruning1pt
Accumulators used is as per command line argument, the pruning mechanism via thresholding is correctly implemented1pt
Files requested are present and write-up of trec_eval experiment is as described1pt
Total10pts