Chris Pollett> CS267
( Print View )

Student Corner:
[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#5 --- last modified December 06 2023 21:44:38.

Solution set.

Due date: Dec 9

Files to be submitted:
  Hw5.zip

Purpose: To gain experience implementing index compression methods and incremental indexing techniques. To learn about global scoring functions and to study map reduce algorithms.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO3 -- Be able to explain where BM25, BM25F and divergence-from-randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO7 -- Know at least one Map Reduce algorithm (for example to calculate page rank).

Specification:

As usual, this homework has a written part and a coding part. For the written part, do the following problems and put them in a file Problems.pdf, which you submit in your Hw5.zip file:

  1. Exercise 7.2 out of the book, but where `k=4` and postings are all 2 bytes.
  2. Come up with pseudo-code for firstDoc, lastDoc, nextDoc, and prevDoc that works with an index stored using logarithmic merging.
  3. In a plain text ASCII document, define a paragraph to be any block of text preceded by a start of document or two blank lines and continuing until an end of document or two blank lines. Consider the punctuation symbols period, comma, and semi-colon. Write a map reduce algorithm to compute the average number of occurrences of these symbols in a paragraph for a corpus of plain text documents.
  4. For a corpus of plain text documents, write a map reduce algorithm which computes, for every term `t`, the conditional probability of `t` occurring in a paragraph given that the paragraph contains the term 'calvacade'; i.e., p(t in paragraph | 'calvacade' in paragraph).
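For problems 3 and 4, your answers should follow the usual map reduce shape: mappers emit key-value pairs, a shuffle step groups values by key, and reducers aggregate each group. As a reminder of that shape (not a solution to either problem), here is a minimal in-memory word-count sketch; the single-machine `run_mapreduce` driver and the sample corpus are illustrative only:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit a (key, value) pair per token; for problems 3 and 4 you would
    # substitute your own keys and values here.
    for term in text.lower().split():
        yield (term, 1)

def reduce_phase(term, values):
    # Aggregate all values that were emitted for one key.
    return (term, sum(values))

def run_mapreduce(corpus):
    # Simulate the shuffle step: group mapper output by key.
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

corpus = {1: "the cat sat", 2: "the dog sat down"}
print(run_mapreduce(corpus))
```

In a real map reduce setting the shuffle is done by the framework; writing your pseudo-code as separate map and reduce functions like this makes the data flow explicit.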

For the coding part of your homework, I want you to put together and extend a few different aspects of the code that you have been developing since HW2. Like HW3, you will write two programs: an IndexProgram, which writes an index file, and a QueryProgram, which, given a query and an index file, produces a ranked list of documents in a format suitable for trec_eval.

The IndexProgram should now store posting lists using some of the coding formats we've learned, using code that you write rather than a library. It will be tested from the command line with a syntax like:

aux_program IndexProgram corpus_filename index_filename

When run, this program should read in corpus_filename, which will be in the same format as in Hw2. It should then write two files, index_filename.dic and index_filename.txt.

The file index_filename.dic begins with the number of documents found in corpus_filename, stored using a `gamma` code (make sure all outputs are binary -- don't write ASCII characters for 0 and 1!). This is followed by a bit length `m`, stored as a `gamma` code, such that the number of terms in each document can be written down in fewer than `m` bits. Finally, the rest of the file is a sequence of `m`-bit numbers, one for each document in the corpus in the order they appeared (i.e., the `i`th number is the number of terms in the `i`th document in the corpus).

The file index_filename.txt should use the following dictionary-as-a-string format: It begins with two `gamma`-encoded offsets: the first offset points to where the (term, num_docs_with_term, offset) tuples are stored, and the first offset plus the second offset points to where the posting lists begin. Next comes a `gamma`-coded number `w`, chosen large enough that any integer offset into the list of (term, num_docs_with_term, offset) tuples can be encoded in fewer than `w` bits. The number `w` is followed by the `w`-bit integer offsets into that tuple list. After these offsets comes the list of (term, vByte-coded number of documents containing the term, vByte-coded posting list offset) tuples. Finally, each posting list should be in the format: `gamma`-encoded number of documents containing the term, followed by a modulus to use with a Rice code for the posting gaps, followed by a sequence of (`gamma`-encoded frequency of the term in the doc, Rice-coded gap from the previous doc) pairs, where the modulus used is the one at the start of the posting list.
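The index format above relies on three codes: `gamma`, Rice, and vByte. Below is a minimal sketch of each in Python, returning bit strings as text for readability only; your actual index must pack the bits into real bytes. Note the conventions here (coding gaps as n-1 in the Rice code, and the high bit marking the final vByte byte) are one common choice and may differ from the ones in the book:

```python
def gamma_encode(n):
    # Elias gamma code for n >= 1: a unary length prefix of zeros,
    # then the binary digits of n (which start with a 1 bit).
    assert n >= 1
    body = bin(n)[2:]                  # e.g. n=9 -> '1001'
    return '0' * (len(body) - 1) + body  # -> '0001001'

def rice_encode(n, modulus_bits):
    # Rice code with modulus 2**modulus_bits: quotient in unary
    # (terminated by a 0), remainder in modulus_bits binary digits.
    # Gaps n >= 1 are coded as n - 1 so the smallest gap is cheapest.
    q, r = divmod(n - 1, 1 << modulus_bits)
    return '1' * q + '0' + format(r, '0{}b'.format(modulus_bits))

def vbyte_encode(n):
    # vByte: 7 payload bits per byte, least significant first,
    # with the high bit set on the final byte as a terminator.
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80
    return bytes(out)

print(gamma_encode(9), rice_encode(7, 2), list(vbyte_encode(300)))
```

Working these out by hand on a short posting list before writing the bit-packing code is a good way to debug your index output.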

Your QueryProgram can use your choice of the rankBM25_DocumentAtATime_WithHeapsMS or rankBM25_TermAtATime algorithms from class, which you developed in Hw4. Your query program will be run from the command line like:

aux_program QueryProgram index_filename query relevance_measure

Here index_filename is the prefix name of the two index files produced by your IndexProgram, query is a query to run against this index, and relevance_measure is either DFR or your choice of LMJM or LMD. You should have a readme.txt file which, besides listing your team members, says which of these relevance measures you chose. On the above input, your program should use the index to compute a conjunctive query of the terms in the query and score the resulting documents using the provided relevance measure. It should then output these documents in descending score order in a format usable by trec_eval. Next, test your programs using the corpus you had for Hw2. Do some experiments comparing the measure you chose against DFR using trec_eval. Write up your results in Experiments.pdf.
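A trec_eval run file has one whitespace-separated line per retrieved document: query id, the literal string Q0, document id, rank, score, and a run tag. A minimal sketch of producing such lines (the doc ids, scores, and run tag below are made up for illustration):

```python
# Hypothetical ranked results for query 1, already sorted by descending score.
results = [("doc42", 7.31), ("doc7", 5.02), ("doc13", 2.44)]

# trec_eval line format: query_id Q0 doc_id rank score run_tag
lines = ["{} Q0 {} {} {:.4f} myrun".format(1, doc_id, rank, score)
         for rank, (doc_id, score) in enumerate(results, start=1)]
print("\n".join(lines))
```

The document ids here must match those in your qrels file, or trec_eval will score every retrieved document as non-relevant.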

Point Breakdown

Problems (each problem 1pt: 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) -- 4pts
IndexProgram runs on the command line inputs as described, reading the provided corpus file and producing an inverted index output file -- 0.5pts
Output format of the document map (the .dic file) as described -- 0.5pts
Dictionary in output file as described (0.5pts for first sequence of ints as described, 0.5pts for term, num_docs_with_term, offset tuples format as described) -- 1pt
Postings in output file as described (0.5pts Rice coded gaps, 0.5pts gamma coded frequencies) -- 1pt
QueryProgram runs as described, reading the inverted index file only once when computing the correct output documents -- 1pt
QueryProgram ranks output according to the provided relevance measure -- 1pt
Experiments (0.5pts) and write-up (0.5pts) -- 1pt
Total -- 10pts