Chris Pollett > Old Classes > CS257

HW#5 --- last modified December 01 2020 07:19:57.

Solution set.

Due date: Dec 7

Files to be submitted:
Hw5.zip

Purpose: To familiarize oneself with data governance issues, and to gain experience with algorithms related to analyzing big data.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO6 -- Be able to evaluate the data governance of a concrete hypothetical organization according to a data management framework such as TDQM or CMMI.

CLO7 -- Be able to code or analyze a common clustering algorithm.

Description:

This homework consists of two parts: (a) An analysis of the data integration and process integration for a fictional e-business. (b) A short coding project to implement an agglomerative hierarchical clustering algorithm.

For the first part, I want you to imagine BulkCo, a fictional warehouse store where consumers and small businesses can buy items in bulk. For example, you can buy one-liter tubes of toothpaste. BulkCo has several databases. Its supply chain management and sales infrastructure is managed in a legacy PigIron 300 database that dates to the 1970s. It keeps track of pallets of goods at the BulkCo stores. It ensures that no store ever has more than two pallets of the same product. When the number of pallets of an item goes down to one, a decision is made whether to continue the product and, if so, to reorder it from the supplier. This needs to be done before the last remaining pallet of the item is completely sold.

In addition to this system, BulkCo has an HR system for managing its employees and another system for managing its customers and their buying habits, as well as keeping track of observed barcodes for both pallets received and item sales. Both of these use a modern relational database. Finally, to examine trends in purchase and supply time-series, to suggest what should be in ad flyers and what products should be discontinued, and to hint at what products might be worth stocking, BulkCo also has a data warehouse.

Given this setup, do the following exercises:

1. For each of the data integration patterns Data Consolidation, Data Federation, Data Propagation, Change Data Capture, and Data Virtualization, suggest a concrete instance of that pattern and give a reason why one might need to employ it as part of data or process integration for this business.
2. Give an example business process for this business and a chart showing how this process could be implemented using the orchestration pattern.
3. Some suppliers have multiple locations, some businesses that use BulkCo have multiple accounts, and users might change their address or their credit cards. Describe a business process related to BulkCo that is different from the one in the previous exercise. For this other process, go through the four steps of the Total Data Quality Management cycle (define, measure, analyze, and improve) for the underlying data associated with this process, and explain what would happen at each step.

This concludes the data integration portion of the homework.

For the agglomerative hierarchical clustering algorithm portion of the homework I want you to write a program Cluster.java (use a different extension if you decide to use a different language). This program will be compiled from the command line with a syntax:

javac Cluster.java

and run with the syntax:

java Cluster some_folder_name some_number stop_list

To understand what Cluster does, recall that a 5 skipgram (t1 t2 _ t4 t5) is about a term t3 if there is some document in our document collection and some sentence in that document in which the phrase "t1 t2 t3 t4 t5" occurs. So in the sentence "the furry dog likes to chew a bone", the 5 skipgram (the furry _ likes to) is about dog. At the starts and ends of sentences we pad with a special * symbol so that each position in the sentence corresponds to a skipgram about it. For example, (* * _ furry dog) is about the term "the".

Define the distance between two terms t1 and t2 to be:

1 - [2*(the total number of skipgrams they share)/(the total number of skipgrams t1 is in + the total number of skipgrams t2 is in)]

So the distance between a term and itself is 0. Note that this notion of distance is not a mathematical metric, but it suffices for the algorithm. Define the distance between two clusters of terms to be the average distance between their members.

Your Cluster program should read the .txt files in some_folder_name and first compute the some_number terms that the most skipgrams are about in these text files and that do not appear in the file stop_list (a file of terms separated by single spaces). Use the distance function just given to perform hierarchical clustering on these terms and output the tree that results, suitably pretty printed. For example, on the command:
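As a rough illustration of the definitions above, the following Java sketch collects the skipgrams about each term, picks the top terms outside a stop list, and computes the term distance. It is only one possible reading of the assignment, not a reference solution. The class and method names (SkipgramSketch, addSentence, topTerms, distance) are hypothetical. It also assumes that skipgrams are counted as distinct strings, that "skipgrams t1 is in" means skipgrams about t1, that sentences are already split, and that tokens are lowercased and separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical.
class SkipgramSketch {

    // Records, for every term of one sentence, the 5 skipgram about it,
    // written as "t1 t2 _ t4 t5" with "*" padding at the sentence boundaries.
    static void addSentence(Map<String, Set<String>> about, String sentence) {
        String trimmed = sentence.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] words = trimmed.split("\\s+");
        String[] padded = new String[words.length + 4];
        Arrays.fill(padded, "*");
        System.arraycopy(words, 0, padded, 2, words.length);
        for (int i = 2; i < padded.length - 2; i++) {
            String gram = padded[i - 2] + " " + padded[i - 1] + " _ "
                    + padded[i + 1] + " " + padded[i + 2];
            about.computeIfAbsent(padded[i], k -> new HashSet<>()).add(gram);
        }
    }

    // Returns up to k terms with the most (distinct) skipgrams about them,
    // skipping stop-list terms. Ties are broken alphabetically (an assumption).
    static List<String> topTerms(Map<String, Set<String>> about,
                                 Set<String> stopList, int k) {
        List<String> terms = new ArrayList<>();
        for (String term : about.keySet()) {
            if (!stopList.contains(term)) {
                terms.add(term);
            }
        }
        terms.sort(Comparator.comparingInt((String t) -> about.get(t).size())
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.max(0, Math.min(k, terms.size()))));
    }

    // 1 - 2*|shared| / (|a| + |b|), where a and b are the skipgram sets of two terms.
    static double distance(Set<String> a, Set<String> b) {
        int total = a.size() + b.size();
        if (total == 0) {
            return 1.0; // assumption: terms with no skipgrams are maximally distant
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return 1.0 - 2.0 * shared.size() / total;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> about = new HashMap<>();
        addSentence(about, "the furry dog likes to chew a bone");
        addSentence(about, "the furry cat likes to sleep");
        addSentence(about, "a cat sat down");
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "to"));
        System.out.println(topTerms(about, stop, 3));
        System.out.println(distance(about.get("dog"), about.get("cat")));
    }
}
```

In this toy input, "dog" and "cat" share the skipgram (the furry _ likes to), and "cat" has one further skipgram of its own, so their distance comes out at roughly 0.33. A real submission would also need to read the .txt files, split them into sentences, and decide how to treat punctuation.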

java Cluster animal_articles 3 my_stop_words

you might get a tree that looks like:

   +-cat
   |
 +-+-dog
 |
-+animal

I will be flexible on the exact format that you use to output your trees. I will also give two bonus points if you write a MapReduce program SkipDistance.java for Hadoop that computes the distance between each pair of terms in the some_folder_name corpus.
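The clustering step itself can be sketched as follows. This is a minimal illustration under stated assumptions, not the expected solution: it takes a precomputed term-by-term distance matrix, repeatedly merges the two clusters with the smallest average pairwise distance, and prints the tree as nested parentheses rather than the ASCII layout shown earlier. The class name AverageLinkageSketch, its methods, and the distance values in main are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: all names and numbers here are hypothetical.
class AverageLinkageSketch {

    // A node of the cluster tree: either a single term or a merge of two nodes.
    static class Node {
        final List<Integer> members = new ArrayList<>(); // indices into the term array
        final String label;                              // nested-parentheses rendering

        Node(int index, String term) {
            members.add(index);
            label = term;
        }

        Node(Node left, Node right) {
            members.addAll(left.members);
            members.addAll(right.members);
            label = "(" + left.label + ", " + right.label + ")";
        }
    }

    // Average of the term distances over all pairs with one member from each cluster.
    static double averageDistance(Node a, Node b, double[][] dist) {
        double sum = 0.0;
        for (int i : a.members) {
            for (int j : b.members) {
                sum += dist[i][j];
            }
        }
        return sum / (a.members.size() * b.members.size());
    }

    // Merges the closest pair of clusters until one tree remains.
    static String cluster(String[] terms, double[][] dist) {
        if (terms.length == 0) {
            return "";
        }
        List<Node> active = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            active.add(new Node(i, terms[i]));
        }
        while (active.size() > 1) {
            int bestA = 0;
            int bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < active.size(); a++) {
                for (int b = a + 1; b < active.size(); b++) {
                    double d = averageDistance(active.get(a), active.get(b), dist);
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            Node merged = new Node(active.get(bestA), active.get(bestB));
            active.remove(bestB); // remove the larger index first so bestA stays valid
            active.remove(bestA);
            active.add(merged);
        }
        return active.get(0).label;
    }

    public static void main(String[] args) {
        String[] terms = {"cat", "dog", "animal"};
        double[][] dist = {
            {0.0, 0.2, 0.6},
            {0.2, 0.0, 0.7},
            {0.6, 0.7, 0.0}
        };
        System.out.println(cluster(terms, dist));
    }
}
```

With these made-up distances, cat and dog are merged first and animal joins last, which matches the shape of the example tree above. This naive search rescans all cluster pairs on every merge, which is fine for a small some_number but would need caching of cluster distances for larger inputs.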

Point Breakdown
Exercise 1 (0.5 per pattern + 0.5 overall writing)	3pts
Exercise 2 (0.5 process + 0.5 pattern)	1pt
Exercise 3 (0.5 for the description of each step)	2pts
Cluster.java program compiles and runs as described above using command line arguments	1pt
Cluster correctly computes the some_number terms that the most skipgrams are about and that are not in the stop_list file	1pt
Cluster correctly implements hierarchical clustering on these terms for the given distance function and pretty prints the resulting tree	2pts
Total	10pts