CS267

HW#5 --- last modified November 18 2019 21:19:30.

Solution set.

Due date: Dec 9

Files to be submitted:
  Hw5.zip

Purpose: To gain experience implementing index compression methods and incremental indexing techniques. To learn about global scoring functions and to study MapReduce algorithms.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes (a small worked example is given after these outcomes).

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO7 -- Know at least one MapReduce algorithm (for example, to calculate PageRank).
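
As a small worked illustration of CLO4 (the numbers here are made up): the posting list (3, 7, 11, 23) can be rewritten as the difference list (3, 4, 4, 12), since storing gaps keeps the integers small. Using the common convention in which the gamma code of a positive integer is a run of zeros, one fewer than its binary length, followed by its binary representation, the gaps encode as

    3 -> 011    4 -> 00100    4 -> 00100    12 -> 0001100

so the whole list compresses to the bit string 011 00100 00100 0001100.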

Specification:

Do the following problems and submit them in Problems.pdf, which you should include in your Hw5.zip folder:

  1. Do Exercise 7.2 out of the book, but where `k=3` and all postings have size 8 bytes.
  2. Suppose we are doing queries on a document-oriented index that uses logarithmic merging for incremental updates. Give pseudo-code for the basic inverted index ADT operations first, last, next, prev (maybe also firstDoc, lastDoc, nextDoc, prevDoc) suitable for such an index.
  3. In a plain text ASCII document, define a paragraph to be any block of text preceded by a start of document or two blank lines and continuing until an end of document or two blank lines. Define a sentence to be a substring of a paragraph beginning with the start of the paragraph or the character after a period and ending with an end of paragraph or a period. Write a MapReduce algorithm for computing the average number of sentences in a paragraph for a corpus of plain text documents. (One possible reading of these paragraph and sentence definitions is sketched after this list.)
  4. For a corpus of plain text documents, write a MapReduce algorithm which computes the conditional probability of a term `t` occurring in a sentence given that the sentence contains a term `t'`, i.e., p(t in sentence | t' in sentence).
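
For problems 3 and 4, one possible reading of the paragraph and sentence definitions above is sketched below in Python. The helper names are just illustrative, and the two-blank-line boundary is approximated as a run of three or more newlines; you are free to parse the text however you like, as long as it matches the definitions.

    import re

    def paragraphs(text):
        # A paragraph is bounded by the start/end of the document or by two
        # blank lines; here two blank lines are approximated as a run of
        # three or more newlines, possibly with spaces or tabs between them.
        blocks = re.split(r'\n[ \t]*\n[ \t]*\n+', text)
        return [b for b in blocks if b.strip()]

    def sentences(paragraph):
        # A sentence runs from the start of the paragraph or the character
        # after a period up to the next period or the end of the paragraph.
        return [s for s in paragraph.split('.') if s.strip()]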

For the coding part of the homework, I want you to split the indexing and query program we have been developing up through HW4 into two parts: an indexing part, IndexProgram (the name is allowed to have an extension like .py or .php), and a search program, QueryProgram (likewise allowed to have an extension like .py or .php). The first program will be tested from the command line with a syntax like:

aux_program IndexProgram path_to_folder_to_index index_filename

When run, this program should read each .txt file in path_to_folder_to_index in increasing alphabetical order. It should then write to the file index_filename: first, the number of .txt files found, on a line by itself; then, for each .txt file on its own line, its number in the alphabetical order, a colon, the name of the file, a colon, and the total number of terms in the file. For example, this portion of index_filename might look like:

3
0:aardvark.txt:247
1:sally_loves_bob.txt:1501
2:zebra.txt:56

This portion of index_filename can be viewed as a document map. The number before the colon is the doc id of the filename after the colon. So 2 is the doc id of zebra.txt above, and it has a total of 56 terms. The rest of index_filename should contain an inverted index of the .txt documents found, making use of these doc ids. For this index, the dictionary should only store case-normalized, punctuation-stripped versions of terms, and should take a dictionary-as-string approach to its layout. Posting lists should be stored as delta lists of pairs of the form (Golomb-code compressed id gap, gamma-coded frequency of that term in the given document).
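
The following is a minimal sketch of the two codes involved, assuming bit strings are kept as Python strings of '0'/'1' characters and using one common convention for the unary part; the function names are illustrative, and the choice of the Golomb modulus m, as well as how the bits are actually packed into index_filename, is left to your implementation.

    import math

    def unary(q):
        # q one-bits followed by a terminating zero-bit.
        return '1' * q + '0'

    def gamma_encode(x):
        # Elias gamma code of x >= 1: (binary length - 1) zero-bits, then the
        # binary representation of x itself.
        b = bin(x)[2:]
        return '0' * (len(b) - 1) + b

    def golomb_encode(gap, m):
        # Golomb code of gap >= 1 with modulus m: quotient in unary, remainder
        # in truncated binary (this reduces to a Rice code when m is a power of 2).
        q, r = (gap - 1) // m, (gap - 1) % m
        if m == 1:
            return unary(q)
        b = math.ceil(math.log2(m))
        cutoff = (1 << b) - m              # remainders below this use b-1 bits
        if r < cutoff:
            return unary(q) + format(r, '0{}b'.format(b - 1))
        return unary(q) + format(r + cutoff, '0{}b'.format(b))

    def encode_postings(postings, m):
        # postings is a list of (doc_id, frequency) pairs sorted by doc_id.
        # Each doc id is stored as a Golomb-coded gap from the previous id
        # (an implicit previous id of -1 makes the first gap doc_id + 1), and
        # each frequency as a gamma code, as the assignment asks.
        bits, prev = [], -1
        for doc_id, freq in postings:
            bits.append(golomb_encode(doc_id - prev, m))
            bits.append(gamma_encode(freq))
            prev = doc_id
        return ''.join(bits)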

Your query program will be run from the command line like:

aux_program QueryProgram index_filename query relevance_measure

Here index_filename is the name of an index file that might have been produced by your IndexProgram, query is some query to run against this index file, and relevance_measure is either DFR or your choice of one of LMJM or LMD. You should have a readme.txt file which, besides listing your team members, says which of these relevance measures you chose. On the above input, your program should use the index to compute a conjunctive query of the terms in query and score the resulting documents using the provided relevance measure. It should then output these in descending score order in a format usable by trec_eval. You should include with your project a test subfolder, which should have plain text documents with names as described above. Using this test set, do some experiments to compare the measure you chose against DFR using trec_eval. Write up your results in Experiments.pdf.
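
As a rough sketch of the scoring and output side (the function names and the lambda value below are illustrative, and LMJM is shown only because it is one of the allowed measures), one common formulation of language modeling with Jelinek-Mercer smoothing mixes the document and corpus term distributions, and trec_eval expects one whitespace-separated line per retrieved document:

    import math

    def lmjm_score(query_terms, doc_tf, doc_len, corpus_tf, corpus_len, lam=0.5):
        # One common formulation of LM with Jelinek-Mercer smoothing:
        #   score(q, d) = sum over query terms t of
        #                 log((1 - lam) * f_td / |d| + lam * l_t / l_C)
        # where f_td is t's frequency in the document, |d| the document length,
        # l_t t's frequency in the whole corpus, and l_C the total corpus length.
        score = 0.0
        for t in query_terms:
            p = ((1 - lam) * doc_tf.get(t, 0) / doc_len
                 + lam * corpus_tf.get(t, 0) / corpus_len)
            if p > 0:
                score += math.log(p)
        return score

    def write_trec_eval_run(query_id, ranked_docs, run_tag, out_file):
        # trec_eval's results format: query id, the literal "Q0", the document
        # name, its rank, its score, and a run tag, one document per line.
        # ranked_docs is assumed to hold (doc_name, score) pairs for only the
        # documents matching the conjunctive query, in descending score order.
        for rank, (doc_name, score) in enumerate(ranked_docs, start=1):
            out_file.write("{} Q0 {} {} {:.4f} {}\n".format(
                query_id, doc_name, rank, score, run_tag))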

Point Breakdown
Problems (each problem 1pt: 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) - 4pts
IndexProgram runs on the command line inputs as described, reading the provided directory and producing an inverted index output file - 1pt
Dictionary in output file as described (stores case-normalized, punctuation-stripped terms using dictionary as a string) (0.5pts) and document map as described (0.5pts) - 1pt
Postings in output file as described (0.5pts Golomb-coded gaps, 0.5pts gamma-coded frequencies) - 1pt
QueryProgram runs as described, reading the inverted index file only once when computing the correct output documents - 1pt
QueryProgram ranks output according to the provided relevance measure - 1pt
Experiments.pdf as described - 1pt
Total - 10pts