Chris Pollett> Old Classses >
CS267

( Print View )

Student Corner:
[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#4 --- last modified October 23 2019 21:54:08.

Solution set.

Due date: Nov 13

Files to be submitted:
  Hw4.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework will consist of a written part and a coding part. For the written part I want you to do the following problems:

  1. Exercise 5.7 (a), (b), (c) out of the book.
  2. Exercise 6.1 where Pr["a"] = 0.55 and Pr["b"] = 0.45.
  3. Exercise 6.2.
  4. Exercise 6.6 where `N_T/N = 0.05` and `M=2^6`.
  5. Come up with a variation on Simple-9 that uses only a 3-bit selector rather than a 4 bit selector. Give examples of gap sequences to compress that it would be better suited towards, give example distributions of gaps where Simple-9 would be better.

For the coding part of the HW, I'd like you to code a program that implements the rankBM25_DocumentAtATime_WithHeaps algorithm from class together with the MaxScore heuristic. Your program will be run from the command line with a line like:

aux_program Bm25MaxScore some_file num_results query

For example,

python Bm25MaxScore.py my_corpus.txt 5 "hi there you"

In the above, aux_program is at most one of java, php, python (your choice). For php or python, Bm25MaxScore should be either Bm25MaxScore.php or Bm25MaxScore.py, respectively. If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS Catalina, having used brew to install the compiler. filename is the name of some text document, num_results is the desired number of search results to be returned by your program, and query is a white-space separated list of query terms on which to compute disjunctive query results. Within the file filename, we assume corpus documents are determined in the same way as HW2. Your program should output results in the format of a trec_top_file that could be used by the trec_eval program.

Once you have completed your program, I want you to conduct some experiments. First I'd like you to run your program on the disjunctive queries you used for HW2 and the same corpus. Make a trec_rel_file based on your judgements from HW2 (include this with your HW4). Then run the trec_eval software on your output. Include a transcript.txt file showing a run of this.

Next I would like to design an experiment to compare the speed of your PositiveRank program from Hw2 and your Bm25MaxScore program from this HW for a large corpus. Describe your experiments and results in a file timing_tests.txt.

Point Breakdown
Modified Book Problems (1pts each grades in 0.5pts increments depending on how correct)5pts
Bm25MaxScore program compiles and runs correctly on test inputs1pt
Heap implementation in Bm25MaxScore is as in pseudo-code from class for rankBM25_DocumentAtATime_WithHeaps1pt
MaxScore heuristic is implemented (and easily found) in Bm25MaxScore program1pt
Files requested are present and write-up of trec_eval experiment is as described1pt
Timing test write-up1pt
Total10pts