Chris Pollett> CS267
( Print View )

Student Corner:
[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#5 --- last modified May 06 2022 21:50:16.

Solution set.

Due date: May 16

Files to be submitted:
Hw5.zip

Purpose: To gain experience implementing index compression methods and incremental indexing techniques. To learn about global scoring functions and to study map reduce algorithms.

Related Course Outcomes:

CLO3 -- Be able to explain where BM25, BM25F and divergence-from-randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO7 -- Know at least one Map Reduce algorithm (for example to calculate page rank).

Specification:

As usual this homework has a written and a coding part. For the written part, do the following problems and put them in a file Problems.pdf which you submit in your Hw5.zip file:

Problem 7.1 out of the book, but where `alpha=1.2`.
Suppose we are evaluating queries using rankBM25_TermAtATimeWithPruning in a setting where we are also using logarithmic merging for incremental updates. Where does the the logarithmic merging come into play in the rankBM25_TermAtATimeWithPruning algorithm? How might this affect one of the advantages of term at time processing over document at a time processing?
In a plain text ASCII document, define a paragraph to be any block of text preceeded by a start of document or two blanks line and continuing until an end of document or two blank lines. Define a sentence to be a substring of a paragraph beginning with the start of a paragraph or the character after a period and ending with an end of paragraph or a period. Write a map reduce algorithm for computing the median number of sentences in a paragraph for a corpus of plain text documents.
For a corpus of plain text documents, write a map reduce algorithm which computes for all terms `t` the conditional probability of t occurring in a sentence given that the sentence has a term 'Great' in it. i.e., p(t in sentence |'Great' in sentence).

For the coding part of the homework, I want you to split indexing query program we have been developing up till HW4 into two parts, an indexing part, IndexProgram (name is allowed to have an extension like .py or .php), and a search program, QueryProgram (name is allowed to have an extension like .py or .php). The first program will be tested from the command line with a syntax like:

aux_program IndexProgram path_to_folder_to_index index_filename

When run this program should read each .txt file in path_to_folder_to_index in increasing alphabetical order. It should then write two files index_filename.dic and index_filename.txt. The file index_filename.dic begins first with the number of .txt files found in path_to_folder_to_index, on a line by itself, followed for each .txt file on its own line, its number in the alphabetical order colon the name of the file, colon the total number of terms in the file. For example, this portion of index_filename.dic might look like:

3
0:aardvark.txt:247
1:sally_loves_bob.txt:1501
2:zebra.txt:56

This portion of index_filename.dic can be viewed as a document map. The number before the colon is the doc id of the filename after the colon. So 2 is the doc id of zebra.txt above and it has a total of 56 terms. The rest of index_filename.dic should contain an inverted index of the .txt documents found, making use of these doc ids. For this index, the dictionary should only store character 4-grams, which are case normalized, with any punctuation stripped. Each term should be followed by a `gamma`-coded offset into the file index_filename.txt where the posting list for the term is stored. If the `gamma`-code for this offset has length not divisible by eight, pad it with 0 bits to fill out the last byte used to store it. Posting lists in the file index_filename.txt should be stored as follows: First code using vByte the modulus of Rice code for the doc id gaps. Then a first pair (`gamma`-code of first id, `gamma` code of term frequency in that doc), followed by subsequent pairs (Rice code of doc id gap, `gamma` code of term frequency).

Your query program will be run from the command line like:

aux_program QueryProgram index_filename query relevance_measure

Here index_filename is the prefix name of the two index files produced by your IndexProgram, query is some query to run against this index file, and relevance_measure is one of DFR and your choice of LMJM, LMD. You should have a readme.txt file which besides listing your team members says which of these relevance measures you chose. Your program on the above input should use the index and compute a conjunctive query of the terms in the query (remember you need to map these to char-grams!) and score the resulting documents using the provided relevance measure (I'll accept scoring pretend that the char-gram are the terms). It should then output these documents in a descending order in a format usable by trec_eval. You should include with your project a test subfolder, which should have plain text documents with names as described above. Using this test set do some experiments to compare the measure you chose against DFR using trec_eval. Write up your results in Experiments.pdf.

Point Breakdown

Point Breakdown
Problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong)	4pts
IndexProgram runs on the command line inputs as described, reading the provided directory, and producing an inverted index output file.	1pt
Dictionary in output file as described (stores case normalized char grams) (0.5pts) and document map as described (0.5pts).	1pt
Postings in output file as described (0.5pts rice coded gaps, 0.5pts gamma coded frequencies)	1pt
QueryProgram runs as described, reading the inverted index file only once when computing the correct output documents	1
QueryProgram ranks output according to provided relevance measure	1pt
Experiment.pdf as described.	1pt
Total	10pts