Chris Pollett
HW#5 --- last modified December 01 2020 15:19:57.

Due date: Dec 7

Files to be submitted:

Purpose: To familiarize oneself with data governance issues, and to gain experience with algorithms related to analyzing big data.

Related Course Outcomes: The main course outcomes covered by this assignment are:
CLO6 -- Be able to evaluate the data governance of a concrete hypothetical organization according to a data management framework such as TDQM or CMMI.
CLO7 -- Be able to code or analyze a common clustering algorithm.

Description: This homework consists of two parts: (a) An analysis of the data integration and process integration for a fictional e-business. (b) A short coding project to implement an agglomerative hierarchical clustering algorithm.

For the first part, I want you to imagine BulkCo, a fictional warehouse store where consumers and small businesses can buy items in bulk. For example, you can buy one-liter tubes of toothpaste. BulkCo has several databases.

Its supply chain management and sales infrastructure are managed in a legacy PigIron 300 database that dates to the 1970s. It keeps track of pallets of goods at the BulkCo stores. It ensures that no store ever has more than two pallets of the same product. When the number of pallets of an item goes down to one, a decision is made whether to continue the product and, if so, to reorder it from the supplier. This needs to be done before the last remaining pallet of the item is completely sold.

In addition to this system, BulkCo has an HR system for managing its employees and another system for managing its customers and their buying habits, as well as keeping track of observed barcodes for both pallets received and items sold. Both of these use a modern relational database.

Finally, to examine trends in purchase and supply time-series, to suggest what should be in ad flyers, what products should be discontinued, and to hint at what products might be worth stocking, BulkCo also has a data warehouse.
Given this setup, do the following exercises:
This concludes the data integration portion of the homework.

For the agglomerative hierarchical clustering algorithm portion of the homework I want you to write a program Cluster.java (use a different extension if you decide to use a different language). This program will be compiled from the command line with the syntax:

javac Cluster.java

and run with the syntax:

java Cluster some_folder_name some_number stop_list

To understand what Cluster does, recall that a 5 skipgram (t1 t2 _ t4 t5) is about a term t3 if there is some document in our document collection and some sentence in that document in which the phrase "t1 t2 t3 t4 t5" occurs. So in the phrase "the furry dog likes to chew a bone", the 5 skipgram (the furry _ likes to) is about dog. At the starts and ends of sentences we pad with a special * symbol so that each position in the sentence has a skipgram about it. For example, (* * _ furry dog) is about the term "the".

Define the distance between two terms t1 and t2 to be:

1 - [2*(the total number of skipgrams they share)/(the total number of skipgrams t1 is in + the total number of skipgrams t2 is in)].

So the distance between a term and itself is 0. Note that this notion of distance is not a mathematical metric, but it suffices for the algorithm. Define the distance between two clusters of terms to be the average distance between their members.

Your Cluster program should read the .txt files in some_folder_name and first compute the some_number terms that the most skipgrams are about in these text files and that do not appear in the file stop_list (a file of terms separated by single spaces). Use the distance function just given to perform hierarchical clustering on these terms and output the tree that results, suitably pretty printed. For example, on the command:

java Cluster animal_articles 3 my_stop_words

you might get a tree that looks like:

   +-cat
   |
 +-+-dog
 |
-+animal

I will be flexible on the exact format that you use to output your trees.
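As a rough illustration of the definitions above, here is a minimal Java sketch of skipgram extraction, the term distance, and average-distance agglomerative clustering. It is not a reference solution. The class name ClusterSketch and its helper names are hypothetical, and it assumes that sentences are already split, that tokens are lowercased and whitespace-separated, and that "number of skipgrams" counts distinct skipgrams. It omits reading the folder, picking the top some_number terms, and the stop list, and it prints the tree as nested parentheses rather than a drawn tree.

```java
import java.util.*;

/** Hypothetical sketch only; names and simplifying assumptions are not from the assignment. */
class ClusterSketch {
    /** For each term, collect the set of distinct 5 skipgrams that are about it. */
    static Map<String, Set<String>> skipgrams(List<String> sentences) {
        Map<String, Set<String>> about = new HashMap<>();
        for (String sentence : sentences) {
            if (sentence.trim().isEmpty()) continue;
            String[] w = sentence.trim().toLowerCase().split("\\s+");
            for (int i = 0; i < w.length; i++) {
                String gram = pad(w, i - 2) + " " + pad(w, i - 1) + " _ "
                            + pad(w, i + 1) + " " + pad(w, i + 2);
                about.computeIfAbsent(w[i], k -> new HashSet<>()).add(gram);
            }
        }
        return about;
    }

    /** Positions outside the sentence are padded with the special * symbol. */
    static String pad(String[] w, int i) {
        return (i < 0 || i >= w.length) ? "*" : w[i];
    }

    /** 1 - 2 * shared / (size of a + size of b), assuming distinct skipgrams are counted. */
    static double distance(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return 1.0 - 2.0 * shared.size() / (a.size() + b.size());
    }

    /** Average of the term distances over all pairs with one term from each cluster. */
    static double clusterDistance(List<String> c1, List<String> c2,
                                  Map<String, Set<String>> about) {
        double sum = 0;
        for (String s : c1)
            for (String t : c2) sum += distance(about.get(s), about.get(t));
        return sum / (c1.size() * c2.size());
    }

    /** Repeatedly merge the two closest clusters; returns the tree as nested parentheses. */
    static String cluster(List<String> terms, Map<String, Set<String>> about) {
        if (terms.isEmpty()) return "";
        List<List<String>> members = new ArrayList<>();
        List<String> trees = new ArrayList<>();
        for (String t : terms) {
            members.add(new ArrayList<>(List.of(t)));
            trees.add(t);
        }
        while (members.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < members.size(); i++)
                for (int j = i + 1; j < members.size(); j++) {
                    double d = clusterDistance(members.get(i), members.get(j), about);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // bi < bj, so removing index bj does not shift index bi.
            members.get(bi).addAll(members.remove(bj));
            trees.set(bi, "(" + trees.get(bi) + " " + trees.remove(bj) + ")");
        }
        return trees.get(0);
    }

    public static void main(String[] args) {
        List<String> sentences =
            List.of("the cat sat down", "the dog sat down", "an animal ran away fast");
        // Prints ((cat dog) animal)
        System.out.println(cluster(List.of("cat", "dog", "animal"), skipgrams(sentences)));
    }
}
```

In this toy input, cat and dog share their only skipgram (* the _ sat down), so their distance is 0 and they merge first. Recomputing every cluster distance on each pass is simple but slow, which is acceptable for a small some_number.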
I will also give two bonus points if you write a MapReduce program SkipDistance.java for Hadoop that computes the distance between each pair of terms in the some_folder_name corpus.
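One way to think about the bonus part is as a map step that emits (skipgram, term) pairs, followed by a reduce step that counts terms and term pairs per skipgram. The sketch below simulates that dataflow in plain Java. It is not Hadoop code and does not use the Hadoop API; a real job would move this logic into Mapper and Reducer classes. The class name SkipDistanceSketch, the method names, and the tab-joined pair keys are all hypothetical, and it again assumes pre-split sentences, lowercased whitespace-separated tokens, and counts of distinct skipgrams.

```java
import java.util.*;

/** Hypothetical plain-Java simulation of a map/shuffle/reduce dataflow; not Hadoop code. */
class SkipDistanceSketch {
    /** "Map": emit one {skipgram, term} pair for every word position in a sentence. */
    static List<String[]> map(String sentence) {
        List<String[]> out = new ArrayList<>();
        String[] w = sentence.trim().toLowerCase().split("\\s+");
        if (w[0].isEmpty()) return out; // blank sentence
        for (int i = 0; i < w.length; i++) {
            String gram = pad(w, i - 2) + " " + pad(w, i - 1) + " _ "
                        + pad(w, i + 1) + " " + pad(w, i + 2);
            out.add(new String[] {gram, w[i]});
        }
        return out;
    }

    static String pad(String[] w, int i) {
        return (i < 0 || i >= w.length) ? "*" : w[i];
    }

    /** "Shuffle": group the distinct terms under each skipgram key. */
    static Map<String, SortedSet<String>> shuffle(List<String[]> pairs) {
        Map<String, SortedSet<String>> grouped = new TreeMap<>();
        for (String[] p : pairs)
            grouped.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        return grouped;
    }

    /** "Reduce": for one skipgram, add 1 per term and 1 per unordered pair of terms. */
    static void reduce(SortedSet<String> terms, Map<String, Integer> counts) {
        List<String> ts = new ArrayList<>(terms);
        for (int i = 0; i < ts.size(); i++) {
            counts.merge(ts.get(i), 1, Integer::sum);
            for (int j = i + 1; j < ts.size(); j++)
                // Tab cannot occur inside a term because tokens are split on whitespace.
                counts.merge(ts.get(i) + "\t" + ts.get(j), 1, Integer::sum);
        }
    }

    /** Distances for pairs that share at least one skipgram; all other pairs are at distance 1. */
    static Map<String, Double> distances(List<String> sentences) {
        List<String[]> emitted = new ArrayList<>();
        for (String s : sentences) emitted.addAll(map(s));
        Map<String, Integer> counts = new TreeMap<>();
        for (SortedSet<String> terms : shuffle(emitted).values()) reduce(terms, counts);
        Map<String, Double> dist = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] pair = e.getKey().split("\t");
            if (pair.length == 2) {
                int total = counts.get(pair[0]) + counts.get(pair[1]);
                dist.put(e.getKey(), 1.0 - 2.0 * e.getValue() / total);
            }
        }
        return dist;
    }
}
```

Only pairs that share a skipgram ever reach the reduce step, so pairs absent from the result have distance 1 by the formula. Emitting every pair per skipgram can grow quadratically for very common skipgrams, which a real job would need to consider.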