Chris Pollett>
Old Classses > |
HW#4 --- last modified May 05 2021 18:43:42.Due date: Apr 30 Files to be submitted: Purpose: To code and experiment with index compression and incremental index construction techniques. To evaluate experiment using trec_eval. Related Course Outcomes: The main course outcomes covered by this assignment are: CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from. CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes. CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging. CLO6 -- Be able to evaluate search results by hand and using TREC eval software. Description: This homework will consist of three parts an exercise part, a coding part, and an experiments part. For the exercise part, I want you to write up solutions to the following problems in a file Hw4.pdf that your include in your submitted Hw4.zip file.
For the coding part of the HW, I'd like you to code a program that implements the rankBM25_TermAtATime_WithPruning algorithm from class Your program will be run from the command line with a line like: aux_program Bm25Pruning folder num_results num_accum update_freq query For example, python Bm25Pruning.py my_corpus_folder 5 50 100 "hi there you" In the above, aux_program is at most one of java, php, python (your choice). For php or python, Bm25Pruning should be either Bm25Pruning.php or Bm25Pruning.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Big Sur, having used brew to install the compiler. folder is the name of folder containing your corpus in .txt files as per Hw2, num_results is the desired number of search results to be returned by your program, num_accum is the number of accumulators your program should use for processing, update_freq is the frequency at which the quota threshold should be updated and query is a white-space separated list of query terms on which to compute disjunctive query results. Your program should output results in the format of a trec_top_file that could be used by the trec_eval program. Once you have completed your program, I want you to conduct some experiments. I'd like you to run your program on the disjunctive queries you used for HW2 and the same corpus. Make a trec_rel_file based on your judgements from HW2 (include this with your HW4). Then run the trec_eval software on your output. Include a transcript.txt file showing a run of this.
|