Chris Pollett> CS267
( Print View )

Student Corner:
[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#4 --- last modified November 29 2023 19:18:33.

Solution set.

Due date: Nov 17

Files to be submitted:
  Hw4.zip

Purpose: To code and experiment with index compression and incremental index construction techniques. To evaluate experiment using trec_eval.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework will consist of three parts an exercise part, a coding part, and an experiments part. For the exercise part, I want you to write up solutions to the following problems in a file Hw4.pdf that your include in your submitted Hw4.zip file.

  1. Exercise 5.7 (c) and (d) where you change `tau` to `tau'`, (e) where you change `rho` to `rho'`.
  2. Exercise 6.1 where Pr["a"] = 0.75 and Pr["b"] = 0.25.
  3. Show the `omega` code is prefix free (use induction).
  4. Exercise 6.7
  5. Suppose we are using incremental indexing and we have currently generations 1, 2, 4. For the next three new partitions to be written to disk, what merging will occur? What will be the new generations after the incremental merging has been done? (Show after each new partition).

For the coding part of the HW, I'd like you to code two programs: One that implements rankBM25_DocumentAtATime_WithHeapsMS (MS == MaxScore, so modify code to use book alg for MaxScore) and one that implements the rankBM25_TermAtATime algorithm but where you use the QUIT strategy from class. These programs can leverage off your Hw2 code. Your programs will be run from the command line with a line like:

aux_program rankBM25_DocumentAtATime_WithHeapsMS corpus_file num_results query 

 and

aux_program rankBM25_TermAtATime corpus_file num_results num_accum query 

For example,

python rankBM25_TermAtATime.py my_corpus.txt 5 50 "hi there you"

In the above, aux_program is at most one of java, php, python (your choice). For php or python, you can add the appropriate extension .php or .py to the name of the file, respectively. If you choose to do this project in C, Rust, or C++, aux_program is optional. corpus_file is the name of file containing your corpus as per Hw2 format, num_results is the desired number of search results to be returned by your program, num_accum is the number of accumulators your program should use for processing, and query is a white-space separated list of query terms on which to compute disjunctive query results. You can reuse earlier code for the storage/creation of the inverted index you used, you no longer need to do chargramming or stemming, but you should do case normalization in making terms. Your programs should output results in the format of a trec_top_file that could be used by the trec_eval program. You can have it set the qid to 0 or to some pre-defined value at the top of your program. Alternative, you can add a qid to the end of your command line parameters. You should also have a flag at the top of your programs to toggle information about how quickly the query result is calculated.

Once you have completed your program, I want you to conduct some experiments. Use your Wikipedia corpus from HW2. Come up with at least two 3-word disjunctive queries. Make a trec_rel_file based on your judgements of which of the Wikipedia documents are relevant to these queries. Then run the trec_eval software on the output of your programs. Include a transcript.txt file showing these runs. Do some experiments to see effect of changing the accumulator number on your term at a time results. Do some other experiments to see the relative speeds ofrankBM25_DocumentAtATime_WithHeapsMS and rankBM25_TermAtATime on the same queries. Put the write ups of your experiments in Hw4.pdf.

Exercise Problems (1pts each grades in 0.5pts increments depending on how correct)5pts
Each program compiles and runs correctly on test inputs (0.5pts each)1pt
rankBM25_DocumentAtATime_WithHeapsMS implementation follows the rankBM25_DocumentAtATime_WithHeaps pseudo code from class modified to do the MaxScore strategy1pt
rankBM25_TermAtATime implementation follows the method outlined in class for the QUIT strategy1pt
transcripts.txt from trec_eval experiment seems reasonable1pt
Write up of accumulator varying experiment0.5pt
Write up of relative speed experiment0.5pt
Total10pts