Chris Pollett > CS267
HW#5 --- last modified May 07 2022 04:50:16.

Due date: May 16

Files to be submitted:

Purpose: To gain experience implementing index compression methods and incremental indexing techniques, to learn about global scoring functions, and to study map reduce algorithms.

Related Course Outcomes:

CLO3 -- Be able to explain where BM25, BM25F and divergence-from-randomness statistics come from.

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO7 -- Know at least one Map Reduce algorithm (for example, to calculate page rank).

Specification: As usual, this homework has a written part and a coding part. For the written part, do the following problems and put them in a file Problems.pdf, which you submit in your Hw5.zip file:
For the coding part of the homework, I want you to split the indexing/query program we have been developing up till HW4 into two parts: an indexing program, IndexProgram (the name is allowed to have an extension like .py or .php), and a search program, QueryProgram (the name is likewise allowed to have an extension like .py or .php). The first program will be tested from the command line with a syntax like:

aux_program IndexProgram path_to_folder_to_index index_filename

When run, this program should read each .txt file in path_to_folder_to_index in increasing alphabetical order. It should then write two files, index_filename.dic and index_filename.txt. The file index_filename.dic begins with the number of .txt files found in path_to_folder_to_index, on a line by itself, followed, for each .txt file on its own line, by its number in the alphabetical order, a colon, the name of the file, a colon, and the total number of terms in the file. For example, this portion of index_filename.dic might look like:

3
0:aardvark.txt:247
1:sally_loves_bob.txt:1501
2:zebra.txt:56

This portion of index_filename.dic can be viewed as a document map. The number before the colon is the doc id of the filename after the colon. So 2 is the doc id of zebra.txt above, and that file has a total of 56 terms. The rest of index_filename.dic should contain an inverted index of the .txt documents found, making use of these doc ids. For this index, the dictionary should only store character 4-grams, which are case normalized, with any punctuation stripped. Each term should be followed by a `gamma`-coded offset into the file index_filename.txt where the posting list for the term is stored. If the `gamma`-code for this offset has a length not divisible by eight, pad it with 0 bits to fill out the last byte used to store it. Posting lists in the file index_filename.txt should be stored as follows: first, code using vByte the modulus of the Rice code for the doc id gaps.
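As a starting point for the `gamma`-coded, byte-padded offsets described above, here is a minimal sketch in Python; the function names are my own, not part of the assignment, and the bit strings would still need to be packed into actual bytes when writing the file:

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1' bits:
    (number of binary digits of n) - 1 zeros, then the binary digits of n."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary representation without leading zeros
    return "0" * (len(binary) - 1) + binary

def pad_to_byte(bits):
    """Pad a bit string with 0 bits so its length is a multiple of eight."""
    return bits + "0" * (-len(bits) % 8)
```

For example, `gamma_encode(9)` gives the seven bits `0001001`, and `pad_to_byte` extends that to the full byte `00010010` before it is stored.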
Then a first pair (`gamma`-code of the first doc id, `gamma`-code of the term frequency in that doc), followed by subsequent pairs (Rice code of the doc id gap, `gamma`-code of the term frequency). Your query program will be run from the command line like:

aux_program QueryProgram index_filename query relevance_measure

Here index_filename is the prefix name of the two index files produced by your IndexProgram, query is some query to run against this index, and relevance_measure is one of DFR and your choice of LMJM or LMD. You should have a readme.txt file which, besides listing your team members, says which of these relevance measures you chose. On the above input, your program should use the index to compute a conjunctive query of the terms in the query (remember you need to map these to char-grams!) and score the resulting documents using the provided relevance measure (I'll accept scoring that pretends the char-grams are the terms). It should then output these documents in descending order in a format usable by trec_eval. You should include with your project a test subfolder, which should have plain text documents with names as described above. Using this test set, do some experiments to compare the measure you chose against DFR using trec_eval. Write up your results in Experiments.pdf.

Point Breakdown
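The Rice and vByte codes used for the posting lists could be sketched as follows. This is only one possible set of bit conventions (texts differ on whether unary uses ones or zeros, and on whether the vByte high bit marks continuation or termination), so treat it as an illustration rather than the required encoding:

```python
def rice_encode(n, k):
    """Rice code of n >= 0 with modulus M = 2**k: the quotient n // M in
    unary (here, that many 1 bits followed by a terminating 0), then the
    remainder n % M in exactly k binary digits."""
    quotient, remainder = n >> k, n & ((1 << k) - 1)
    return "1" * quotient + "0" + format(remainder, "0{}b".format(k))

def vbyte_encode(n):
    """vByte code of n >= 0: seven payload bits per byte, least significant
    group first; here the high bit is set on the final byte."""
    out = bytearray()
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80  # mark the terminating byte
    return bytes(out)
```

For example, with modulus 2**2, `rice_encode(9, 2)` splits 9 into quotient 2 and remainder 1, giving `11001`, while `vbyte_encode(200)` produces the two bytes 0x48 and 0x81 (72 + 1 * 128 = 200).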