Chris Pollett>
Old Classses > |
HW#4 --- last modified December 03 2020 17:48:08.Due date: Nov 23 Files to be submitted: Purpose: To gain familiarity with advanced DBMS data structures, to learn about OLAP databases. Related Course Outcomes: The main course outcomes covered by this assignment are: CLO4 -- Code a basic inverted index capable of performing conjunctive queries or code another advanced DBMS data structure such as Bloom Filters, or Log-structured Merge Trees. CLO5 -- Be able to deploy a concrete business intelligence system making use of OLAP features in SQL. Description: This homework consists of two parts, for the first part I'd like you to attempt to code a Bloom filter and Log-Structure Merge Tree for the purposes of handling news feeds. For the second part of the homework I want you to create an OLAP Regional Health App. Your first app should be called BloomLogInvertNews.java (or appropriate extension if you decide on a different language). It will be compiled from the command line using the command: javac BloomLogInvertNews.java Just as in Hw3, this program will be run in two ways: java BloomLogInvertNews ingest some_folder or java BloomLogInvertNews query "some_query" These commands will have similar behavior to Hw3 except you will not use MongoDB to store any news data. Instead, news item data will be stored as a concatenated sequence of (length, JSON object) pairs in a file news_data.txt. Here length is the length of the particular JSON object. The JSON object should have in addition to the item data, a hash property which contains the md5 hash of the title of the new item concatenated with its description. Before adding an item to the news_data.txt file you should use a Bloom Filter to check if an item with that hash has already been added or not. Only add a news item if its hash has not appeared before. If the ingest command detects a previously seen news item it should output its title and say "was previously seen" rather than add it to news_data.txt. The inverted index for your app will be stored in a sequence of files: tier_some_number_A_or_B.txt . For example, tier_1_A.txt. Such a file consists of a sequence of newline delimited lines, sorted by term of the format: term_1:offset_1,offset_2,... term_2:offset_1, offset_2 For example: biden:0,255,1024 trump:4096,7511 Here 4096 next to trump above indicates there is a news item in news_data.txt starting at byte offset 4096 that contains the word trump. When we look up that byte offset we could read the pair (length, JSON object) to determine the length of the JSON object, read that many bytes and decode the object to something we can pretty print. We assume that in memory we can hold an inverted file with at most three news items worth of data. After which we need to write a tier_0 inverted file. If we ever have two inverted files of the same tier level we merge them to produce a file of one level higher. After running the ingest command on a folder you could potentially end up with many tier files. Do not merge these together. When you do a query command, check each tier file to check for results. For the second part of the homework, imagine you are creating an app called HealthOlap.java for a regional government agency that is receiving health data from phone health apps were people opted-in to a share their data for research feature. It will be compiled from the command line with a command: javac HealthOlap.java Such a health agency might receive data into a WORK_OUT (longitude:float, latitude:float, workout_type:varchar(5), start-year:int, start-month:int, start-day:int, start-hour:int, start-minutes:int duration:int) table (have a file health.sql which creates such a table in a EXERCISE database and has at least 20 example inserts for a Mysql Database). An example row might be, (37.33939,-121.89496, 'swim', 2020, 11, 30, 23, 10, 60). Here longitude and latitude represent where an exercise workout occurred, the various start-components represent the time of the start of a work out, duration is how many minutes the workout lasted. Your HealthOlap program should be run from the command line with a command like: java HealthOlap region workout_type time_period It should then output the average workout time for that region and time period. Further it should also output the rollup (your query should have a WITH ROLLUP as part of its GROUP UP) of these value. Here region is of the form lat1-long1:lat2-long2 representing a rectangle and time period can be YYYY or YYYY-mm, or YYYY-mm-dd. For example, java HealthOlap 35-115:45-125 Bike 2020-10 might use a where clause for the region, but do a group by with roll up on the remaining parameters to output: workout year month avg Bike 2020 10 99.30 Bike 2020 NULL 80.10 Bike NULL NULL 100.31 NULL NULL NULL 100.50 This completes the description of what your programs need to do for this homework, below is the grading scheme.
|