Chris Pollett> Old Classses >
CS257
( Print View )

Student Corner:
[Final Exam-PDF]

[Submit Sec1]
[Grades Sec1]

[Lecture Notes]
[Discussion Board]

Course Info:
[Texts & Links]
[Description]
[Course Outcomes]
[Outcomes Matrix]
[Course Schedule]
[Grading]
[Requirements/HW/Quizzes]
[Class Protocols]
[Exam Info]
[Regrades]
[University Policies]
[Announcements]

HW Assignments:
[Hw1] [Hw2] [Hw3]
[Hw4] [Hw5] [Quizzes]

Practice Exams:
[Midterm] [Final]

HW#4 --- last modified December 03 2020 09:48:08.

Solution set.

Due date: Nov 23

Files to be submitted:
Hw4.zip

Purpose: To gain familiarity with advanced DBMS data structures, to learn about OLAP databases.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO4 -- Code a basic inverted index capable of performing conjunctive queries or code another advanced DBMS data structure such as Bloom Filters, or Log-structured Merge Trees.

CLO5 -- Be able to deploy a concrete business intelligence system making use of OLAP features in SQL.

Description:

This homework consists of two parts, for the first part I'd like you to attempt to code a Bloom filter and Log-Structure Merge Tree for the purposes of handling news feeds. For the second part of the homework I want you to create an OLAP Regional Health App.

Your first app should be called BloomLogInvertNews.java (or appropriate extension if you decide on a different language). It will be compiled from the command line using the command:

javac BloomLogInvertNews.java

Just as in Hw3, this program will be run in two ways:

java BloomLogInvertNews ingest some_folder

java BloomLogInvertNews query "some_query"

These commands will have similar behavior to Hw3 except you will not use MongoDB to store any news data. Instead, news item data will be stored as a concatenated sequence of (length, JSON object) pairs in a file news_data.txt. Here length is the length of the particular JSON object. The JSON object should have in addition to the item data, a hash property which contains the md5 hash of the title of the new item concatenated with its description. Before adding an item to the news_data.txt file you should use a Bloom Filter to check if an item with that hash has already been added or not. Only add a news item if its hash has not appeared before. If the ingest command detects a previously seen news item it should output its title and say "was previously seen" rather than add it to news_data.txt. The inverted index for your app will be stored in a sequence of files: tier_some_number_A_or_B.txt . For example, tier_1_A.txt. Such a file consists of a sequence of newline delimited lines, sorted by term of the format:

term_1:offset_1,offset_2,...
term_2:offset_1, offset_2

For example:

biden:0,255,1024
trump:4096,7511

Here 4096 next to trump above indicates there is a news item in news_data.txt starting at byte offset 4096 that contains the word trump. When we look up that byte offset we could read the pair (length, JSON object) to determine the length of the JSON object, read that many bytes and decode the object to something we can pretty print. We assume that in memory we can hold an inverted file with at most three news items worth of data. After which we need to write a tier_0 inverted file. If we ever have two inverted files of the same tier level we merge them to produce a file of one level higher. After running the ingest command on a folder you could potentially end up with many tier files. Do not merge these together. When you do a query command, check each tier file to check for results.

For the second part of the homework, imagine you are creating an app called HealthOlap.java for a regional government agency that is receiving health data from phone health apps were people opted-in to a share their data for research feature. It will be compiled from the command line with a command:

javac HealthOlap.java

Such a health agency might receive data into a WORK_OUT (longitude:float, latitude:float, workout_type:varchar(5), start-year:int, start-month:int, start-day:int, start-hour:int, start-minutes:int duration:int) table (have a file health.sql which creates such a table in a EXERCISE database and has at least 20 example inserts for a Mysql Database). An example row might be, (37.33939,-121.89496, 'swim', 2020, 11, 30, 23, 10, 60). Here longitude and latitude represent where an exercise workout occurred, the various start-components represent the time of the start of a work out, duration is how many minutes the workout lasted.

Your HealthOlap program should be run from the command line with a command like:

java HealthOlap region workout_type time_period

It should then output the average workout time for that region and time period. Further it should also output the rollup (your query should have a WITH ROLLUP as part of its GROUP UP) of these value. Here region is of the form lat1-long1:lat2-long2 representing a rectangle and time period can be YYYY or YYYY-mm, or YYYY-mm-dd. For example,

java HealthOlap 35-115:45-125 Bike 2020-10

might use a where clause for the region, but do a group by with roll up on the remaining parameters to output:

workout year month avg
Bike 2020   10   99.30
Bike 2020  NULL  80.10
Bike NULL  NULL 100.31
NULL NULL  NULL 100.50

This completes the description of what your programs need to do for this homework, below is the grading scheme.

Point Breakdown
BloomLogInvertNews.java compiles and runs as described, ingest can be used to create files based on a folder as described	1pt
Bloom filter (1.5pt) and hash function (0.5pt) are used as described to control whether news inserted	2pt
News items stored in news_data.txt as described	1pt
Log Structured Merge tree built for inverted index as described	2pts
BloomLogInvertNews query functionality works and search relevant tier files of inverted index to produce correct output news items	2pts
health.sql is as described	1pt
Olap program works as described using tables as described	1pt
Total	10pts