Chris Pollett>
CS267 |
HW#4 --- last modified November 29 2023 19:18:33.Due date: Nov 17 Files to be submitted: Purpose: To code and experiment with index compression and incremental index construction techniques. To evaluate experiment using trec_eval. Related Course Outcomes:The main course outcomes covered by this assignment are: CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from. CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes. CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging. CLO6 -- Be able to evaluate search results by hand and using TREC eval software. Specification: This homework will consist of three parts an exercise part, a coding part, and an experiments part. For the exercise part, I want you to write up solutions to the following problems in a file Hw4.pdf that your include in your submitted Hw4.zip file.
For the coding part of the HW, I'd like you to code two programs: One that implements rankBM25_DocumentAtATime_WithHeapsMS (MS == MaxScore, so modify code to use book alg for MaxScore) and one that implements the rankBM25_TermAtATime algorithm but where you use the QUIT strategy from class. These programs can leverage off your Hw2 code. Your programs will be run from the command line with a line like: aux_program rankBM25_DocumentAtATime_WithHeapsMS corpus_file num_results query and aux_program rankBM25_TermAtATime corpus_file num_results num_accum query For example, python rankBM25_TermAtATime.py my_corpus.txt 5 50 "hi there you" In the above, aux_program is at most one of java, php, python (your choice). For php or python, you can add the appropriate extension .php or .py to the name of the file, respectively. If you choose to do this project in C, Rust, or C++, aux_program is optional. corpus_file is the name of file containing your corpus as per Hw2 format, num_results is the desired number of search results to be returned by your program, num_accum is the number of accumulators your program should use for processing, and query is a white-space separated list of query terms on which to compute disjunctive query results. You can reuse earlier code for the storage/creation of the inverted index you used, you no longer need to do chargramming or stemming, but you should do case normalization in making terms. Your programs should output results in the format of a trec_top_file that could be used by the trec_eval program. You can have it set the qid to 0 or to some pre-defined value at the top of your program. Alternative, you can add a qid to the end of your command line parameters. You should also have a flag at the top of your programs to toggle information about how quickly the query result is calculated. Once you have completed your program, I want you to conduct some experiments. Use your Wikipedia corpus from HW2. Come up with at least two 3-word disjunctive queries. Make a trec_rel_file based on your judgements of which of the Wikipedia documents are relevant to these queries. Then run the trec_eval software on the output of your programs. Include a transcript.txt file showing these runs. Do some experiments to see effect of changing the accumulator number on your term at a time results. Do some other experiments to see the relative speeds ofrankBM25_DocumentAtATime_WithHeapsMS and rankBM25_TermAtATime on the same queries. Put the write ups of your experiments in Hw4.pdf.
|