Chris Pollett > Old Classes >
CS257


HW#3 --- last modified October 27 2020 00:08:28.

Solution set.

Due date: Oct 30

Files to be submitted:
  Hw3.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO3 -- Be able to code an application that makes use of a database API to a NoSQL store.

CLO4 -- Code a basic inverted index capable of performing conjunctive queries.

CLO8 -- Be able to implement in a big data stack one non-trivial Map Reduce algorithm (for example to calculate page rank).

Description:

This homework consists of the following parts: a map-reduce app to build inverted files based on RSS feeds, a query tool for your inverted file, and some experiments on B+-trees done using Sqlite.

For the map-reduce portion, I want you to write a program InvertNews.java (modify the extension if you choose a language other than Java). I will compile this program from the command line using the syntax:

javac InvertNews.java

This program can be run in two ways:

java InvertNews ingest some_folder

or

java InvertNews query "some_query"

The ingest command should connect to a MongoDB database named news. Any additional connection configuration should be in constants at the top of the file and explained in a readme.txt. The command should then look in some_folder at each .rss file in that folder and insert any RSS feed items found into a feed_items collection. Each item in the feed_items collection should have a unique integer timestamp field based on the Unix timestamp corresponding to the feed item's publication date. If two items would have the same timestamp, only keep the first. After inserting items, the ingest command should then run a MongoDB map-reduce job that outputs a collection inverted_file. This collection should consist of documents with two fields: term, the key, whose value is a term (an alphanumeric sequence without whitespace); and postings, an array, in descending order of timestamp, of the timestamps of all the feed items that contained that term in either their title or description.
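The ingest step above can be sketched as follows. This is a minimal illustration, not the required implementation: the class name InvertNewsSketch and the helper names are assumptions, and the two JavaScript strings show the kind of map and reduce functions one would hand to MongoDB's mapReduce command through the Java driver.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class InvertNewsSketch {
    // Hypothetical JavaScript map function for MongoDB's mapReduce:
    // emits (term, {postings: [timestamp]}) once per distinct term
    // appearing in a feed item's title or description.
    static final String MAP_JS =
        "function() {" +
        "  var text = this.title + ' ' + this.description;" +
        "  var terms = text.toLowerCase().match(/[a-z0-9]+/g) || [];" +
        "  var seen = {};" +
        "  for (var i = 0; i < terms.length; i++) {" +
        "    if (!seen[terms[i]]) { seen[terms[i]] = true;" +
        "      emit(terms[i], { postings: [this.timestamp] }); }" +
        "  }" +
        "}";

    // Hypothetical reduce function: merge the postings arrays for one
    // term and keep them in descending timestamp order.
    static final String REDUCE_JS =
        "function(term, values) {" +
        "  var all = [];" +
        "  values.forEach(function(v) { all = all.concat(v.postings); });" +
        "  all.sort(function(a, b) { return b - a; });" +
        "  return { postings: all };" +
        "}";

    // Java-side tokenizer matching the map function's notion of a term:
    // maximal alphanumeric runs, lowercased.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                terms.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) terms.add(cur.toString());
        return terms;
    }

    // RSS pubDate values follow RFC 1123; convert one to a Unix timestamp.
    static long toUnixTimestamp(String pubDate) {
        return ZonedDateTime.parse(pubDate, DateTimeFormatter.RFC_1123_DATE_TIME)
                            .toEpochSecond();
    }
}
```

The map function deduplicates terms within a document so each feed item contributes its timestamp at most once per term; the reduce then only has to merge and re-sort postings arrays.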

The InvertNews query command should take the terms listed in the string some_query and, using the inverted_file collection, output (suitably pretty-printed) all the documents from the feed_items collection that contain all of the query terms in some_query. It should output the results in descending order of timestamp.
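Because each postings array is already sorted in descending timestamp order, the conjunctive query reduces to intersecting sorted lists, one per query term. A minimal sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;

public class ConjunctiveQuerySketch {
    // Intersect two postings lists that are sorted in descending
    // timestamp order, as the inverted_file collection stores them.
    static List<Long> intersect(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            long x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; }
            else if (x > y) i++;   // descending order: advance past the larger
            else j++;
        }
        return out;
    }

    // Fold intersect over one postings list per query term; the result
    // is the timestamps of feed items containing every term, still in
    // descending order, ready to look up in feed_items for printing.
    static List<Long> conjunctiveQuery(List<List<Long>> postingsPerTerm) {
        if (postingsPerTerm.isEmpty()) return new ArrayList<>();
        List<Long> result = postingsPerTerm.get(0);
        for (int k = 1; k < postingsPerTerm.size(); k++) {
            result = intersect(result, postingsPerTerm.get(k));
        }
        return result;
    }
}
```

Since the timestamps are unique by construction, the surviving timestamps identify exactly the matching feed_items documents.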

For the Sqlite portion of the homework, recall from class that when you are doing right-only appends to a B+-tree you can speed up insertion by skipping the read down the tree to find the insert location. As pretty much everything in Sqlite is stored in a B+-tree, this seems like an ideal place to test whether this is faster (and whether right-only append optimizations are implemented). I would like you to create three tables foo(a int autoincrement primary key, b int), goo(a int primary key, b int), and moo(a int primary key, b int). Time the following operations: (a) insert 100,000 rows into foo where the rows use the autoincremented value for a and a random int for b; (b) insert 100,000 rows into goo where the rows use increasing values for a (1,...,100000) and a random int for b; (c) insert 100,000 rows into moo where both a and b are chosen randomly. Put your three tests in a program BtreeAppendExperiment.java, which I can compile and run using the command lines:

javac BtreeAppendExperiment.java
java BtreeAppendExperiment name_of_sqlite_file

The latter command should create the files "foo" + name_of_sqlite_file (this will have table foo), "goo" + name_of_sqlite_file (this will have table goo), and "moo" + name_of_sqlite_file (this will have table moo) if they don't exist, and replace them otherwise. It should create the corresponding tables within these files, then run the three timing tests. If you choose a programming language other than Java, I expect the file name to be similar to the above but with that language's file extension.
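One way the timing harness might be structured is sketched below, assuming the xerial sqlite-jdbc driver is on the classpath; the class and helper names are hypothetical, and only the goo case is shown since foo and moo follow the same pattern. One wrinkle worth knowing: SQLite only honors AUTOINCREMENT on a column declared exactly integer primary key autoincrement, so foo's declaration needs that form.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;
import java.util.function.IntUnaryOperator;

public class BtreeAppendTimingSketch {
    // Build the CREATE TABLE statement for one of the three tables.
    // SQLite requires "integer primary key autoincrement" verbatim
    // for the autoincrement behavior, hence the adjusted DDL for foo.
    static String createTableSql(String table, boolean autoincrement) {
        return "CREATE TABLE " + table + " (a integer primary key"
                + (autoincrement ? " autoincrement" : "") + ", b int)";
    }

    // Insert 100,000 rows inside one transaction so per-statement
    // commits don't dominate the timing; keyOf supplies column a
    // (i -> i for goo, i -> rand.nextInt() for moo).
    static void insertRows(Connection conn, String table, IntUnaryOperator keyOf)
            throws SQLException {
        Random rand = new Random();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO " + table + " (a, b) VALUES (?, ?)")) {
            for (int i = 1; i <= 100_000; i++) {
                ps.setInt(1, keyOf.applyAsInt(i));
                ps.setInt(2, rand.nextInt());
                ps.addBatch();
            }
            ps.executeBatch();
        }
        conn.commit();
    }

    public static void main(String[] args) throws Exception {
        String name = args[0];
        // The goo case: sequential keys 1..100000 in its own database
        // file, as the assignment specifies one file per table.
        try (Connection conn =
                DriverManager.getConnection("jdbc:sqlite:goo" + name)) {
            conn.createStatement().execute("DROP TABLE IF EXISTS goo");
            conn.createStatement().execute(createTableSql("goo", false));
            long start = System.nanoTime();
            insertRows(conn, "goo", i -> i);
            System.out.println("goo: "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}
```

Batching all inserts in one transaction is a deliberate choice: with autocommit on, each insert pays an fsync, which would swamp any difference between right-only and random B+-tree insertion.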

Interpret the results of your experiments in at least a couple of paragraphs with a well-drawn conclusion. Try to include in your analysis observations of the resulting B+-trees as stored in the files themselves, viewed in a hex editor.

Point Breakdown
BtreeAppendExperiment.java compiles; when run it creates the three tables. 1pt
InvertNews.java compiles and runs as described. 1pt
InvertNews ingest reads feed items from .rss files in some_folder into a feed_items collection as described. 1pt
InvertNews ingest uses a MongoDB map-reduce job to create the inverted_file collection as described (1pt set-up and invocation of the job, 0.5pt map function, 0.5pt reduce function). 2pts
InvertNews query command outputs only the feed_items documents, in descending order of timestamp, that contain all the terms in some_query (0.5pt), and it uses the inverted_file collection to do this (0.5pt). 1pt
Timing experiments conducted as described above. 1pt
Experiment write-up (1pt) with well-reasoned conclusion (1pt) 2pts
Experiment analysis includes your observations of the resulting B+-tree stored in the files themselves. 1pt
Total 10pts